Monitoring your ClickHouse Cloud deployment
Overview
This guide provides enterprise teams with information on monitoring and observability capabilities for production deployments of ClickHouse Cloud. Enterprise customers frequently ask about out-of-the-box monitoring features, integration with existing observability stacks including tools like Datadog and AWS CloudWatch, and how ClickHouse's monitoring compares to self-hosted deployments.
Users can use the following methods to monitor their ClickHouse deployment:
| Section | Description | Wakes idle services? | Setup required |
|---|---|---|---|
| Cloud Console dashboards | Day-to-day monitoring with built-in dashboards for service health, resource utilization, and query performance | No | None |
| Notifications | Alerts for scaling events, errors, mutations, and billing | No | None (customizable) |
| Prometheus endpoint | Export metrics to Grafana, Datadog, or other Prometheus-compatible tools | No | API key + scraper config |
| System table queries | Deep debugging and custom analysis via direct SQL queries against system tables | Yes | SQL queries |
| Community and partner integrations | Datadog agent integration, community monitoring tools, and the Billing & Usage API | Varies | Tool-specific |
| Advanced dashboard reference | Detailed reference for each advanced dashboard visualization, including troubleshooting examples | No | None |
Quick start
Open the ClickHouse Cloud console to the Monitoring tab. This blog captures common things to watch out for when getting started.
For most users, the Cloud Console dashboards provide everything needed to monitor service health, resource utilization, and query performance without any configuration. If you need to integrate with an external monitoring stack, start with the Prometheus-compatible metrics endpoint.
System impact considerations
The above approaches use a mixture of either relying on Prometheus endpoints, being managed by ClickHouse Cloud, or querying system tables directly. The latter of these options relies on querying the production ClickHouse service, which adds query load to the system under observation and prevents ClickHouse Cloud instances from idling which can impact costs. Additionally, if the production system fails, monitoring may also be affected, since the two are coupled.
Querying system tables directly works well for deep introspection and debugging but is less appropriate for real-time production monitoring. The Cloud Console dashboards and the Prometheus endpoint both use pre-scraped metrics that do not wake idle services, making them better suited for ongoing production monitoring. Consider these trade-offs between detailed system analysis capabilities and operational overhead.