Monitoring in the ClickHouse Cloud Console

Services in ClickHouse Cloud come with out-of-the-box monitoring components that serve users with dashboards and notifications. By default, all users in the Cloud Console can access these dashboards.

Dashboards

Service health

The Service Health dashboard can be used to monitor the high-level health of a service. ClickHouse Cloud scrapes the metrics displayed on this dashboard from system tables and stores them, so they remain viewable even when a service is idle.

Resource utilization

The Infrastructure dashboard provides a detailed view of the resources being used by the ClickHouse process. ClickHouse Cloud scrapes the metrics displayed on this dashboard from system tables and stores them, so they remain viewable even when a service is idle.

Memory and CPU

The Allocated CPU and Allocated Memory graphs display the total compute resources available for each replica in your service. These allocations can be changed by using ClickHouse Cloud's scaling features.

The Memory Usage and CPU Usage graphs estimate how much CPU and memory are actually being used by ClickHouse processes in each replica, including queries as well as background processes such as merges.

Performance degradation

If memory or CPU utilization approaches the allocated memory or CPU, you may begin to experience performance degradation. To resolve this, we recommend:

  • Optimizing your queries
  • Changing the partitioning of your tables
  • Adding more compute resources to your service using scaling

These are the corresponding system table metrics displayed on these graphs:

| Graph | Corresponding metric name | Aggregation | Notes |
|---|---|---|---|
| Allocated memory | CGroupMemoryTotal | Max | |
| Allocated CPU | CGroupMaxCPU | Max | |
| Memory used | MemoryResident | Max | |
| CPU used | System CPU metric | Max | ClickHouseServer_UsageCores via Prometheus endpoint |
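
As a rough illustration of the "approaching the allocation" check described above, the following minimal sketch compares per-replica usage against the allocated resources. The `ReplicaSample` type, the sample values, and the 0.8 threshold are assumptions made for this example; they are not values used by the Cloud Console.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSample:
    """One hypothetical per-replica reading of the metrics listed above."""
    replica: str
    memory_resident: float      # MemoryResident, in bytes
    cgroup_memory_total: float  # CGroupMemoryTotal, in bytes
    cpu_used_cores: float       # CPU used, in cores
    cgroup_max_cpu: float       # CGroupMaxCPU, in cores


def utilization(sample):
    """Return (memory_ratio, cpu_ratio) as fractions of the allocation."""
    return (
        sample.memory_resident / sample.cgroup_memory_total,
        sample.cpu_used_cores / sample.cgroup_max_cpu,
    )


def replicas_near_limit(samples, threshold=0.8):
    """Names of replicas whose memory or CPU ratio is at or above threshold.

    The 0.8 default is an arbitrary assumption, not a ClickHouse Cloud value.
    """
    flagged = []
    for sample in samples:
        memory_ratio, cpu_ratio = utilization(sample)
        if memory_ratio >= threshold or cpu_ratio >= threshold:
            flagged.append(sample.replica)
    return flagged


samples = [
    ReplicaSample("replica-1", 6e9, 8e9, 1.0, 2.0),
    ReplicaSample("replica-2", 7e9, 8e9, 0.5, 2.0),
]
print(replicas_near_limit(samples))
```

A replica flagged this way is a candidate for the remedies listed above: query optimization, a different partitioning scheme, or scaling.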

Data transfer

Graphs display data ingress and egress from ClickHouse Cloud. Learn more about network data transfer.

Advanced dashboard

This dashboard is a modified version of the built-in advanced observability dashboard, with each series representing metrics per replica. This dashboard can be useful for monitoring and troubleshooting ClickHouse-specific issues.

Note

ClickHouse Cloud scrapes and stores metrics displayed on this dashboard from system tables so they can be viewed even when a service is idled. Accessing these metrics does not issue a query to the underlying service and will not wake idle services.

The table below maps each graph in the Advanced Dashboard to its corresponding ClickHouse metric, system table source, and aggregation type:

| Graph | Corresponding ClickHouse metric name | System table | Aggregation type |
|---|---|---|---|
| Queries/sec | ProfileEvent_Query | metric_log | Sum / bucketSizeSeconds |
| Queries running | CurrentMetric_Query | metric_log | Avg |
| Merges running | CurrentMetric_Merge | metric_log | Avg |
| Selected bytes/sec | ProfileEvent_SelectedBytes | metric_log | Sum / bucketSizeSeconds |
| IO Wait | ProfileEvent_OSIOWaitMicroseconds | metric_log | Sum / bucketSizeSeconds |
| S3 read wait | ProfileEvent_ReadBufferFromS3Microseconds | metric_log | Sum / bucketSizeSeconds |
| S3 read errors/sec | ProfileEvent_ReadBufferFromS3RequestsErrors | metric_log | Sum / bucketSizeSeconds |
| CPU wait | ProfileEvent_OSCPUWaitMicroseconds | metric_log | Sum / bucketSizeSeconds |
| OS CPU usage (userspace, normalized) | OSUserTimeNormalized | asynchronous_metric_log | |
| OS CPU usage (kernel, normalized) | OSSystemTimeNormalized | asynchronous_metric_log | |
| Read from disk | ProfileEvent_OSReadBytes | metric_log | Sum / bucketSizeSeconds |
| Read from filesystem | ProfileEvent_OSReadChars | metric_log | Sum / bucketSizeSeconds |
| Memory (tracked, bytes) | CurrentMetric_MemoryTracking | metric_log | |
| Total MergeTree parts | TotalPartsOfMergeTreeTables | asynchronous_metric_log | |
| Max parts for partition | MaxPartCountForPartition | asynchronous_metric_log | |
| Read from S3 | ProfileEvent_ReadBufferFromS3Bytes | metric_log | Sum / bucketSizeSeconds |
| Filesystem cache size | CurrentMetric_FilesystemCacheSize | metric_log | |
| Disk S3 write req/sec | ProfileEvent_DiskS3PutObject + ProfileEvent_DiskS3UploadPart + ProfileEvent_DiskS3CreateMultipartUpload + ProfileEvent_DiskS3CompleteMultipartUpload | metric_log | Sum / bucketSizeSeconds |
| Disk S3 read req/sec | ProfileEvent_DiskS3GetObject + ProfileEvent_DiskS3HeadObject + ProfileEvent_DiskS3ListObjects | metric_log | Sum / bucketSizeSeconds |
| FS cache hit rate | sum(ProfileEvent_CachedReadBufferReadFromCacheBytes) / (sum(ProfileEvent_CachedReadBufferReadFromCacheBytes) + sum(ProfileEvent_CachedReadBufferReadFromSourceBytes)) | metric_log | |
| Page cache hit rate | greatest(0, (sum(ProfileEvent_OSReadChars) - sum(ProfileEvent_OSReadBytes)) / (sum(ProfileEvent_OSReadChars) + sum(ProfileEvent_ReadBufferFromS3Bytes))) | metric_log | |
| Network receive bytes/sec | NetworkReceiveBytes | asynchronous_metric_log | Sum / bucketSizeSeconds |
| Network send bytes/sec | NetworkSendBytes | asynchronous_metric_log | Sum / bucketSizeSeconds |
| Concurrent TCP connections | CurrentMetric_TCPConnection | metric_log | |
| Concurrent MySQL connections | CurrentMetric_MySQLConnection | metric_log | |
| Concurrent HTTP connections | CurrentMetric_HTTPConnection | metric_log | |

For detailed information on each visualization and how to use them for troubleshooting, see the advanced dashboard documentation.
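
To make the aggregations in the table concrete, the sketch below restates "Sum / bucketSizeSeconds" and the two hit-rate expressions as plain Python functions. The function names and sample numbers are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch; the table does not say how the dashboard handles that case.

```python
def per_second_rate(bucket_values, bucket_size_seconds):
    """Sum / bucketSizeSeconds: turn the counter increments recorded in
    one time bucket into an average per-second rate."""
    if bucket_size_seconds <= 0:
        raise ValueError("bucket_size_seconds must be positive")
    return sum(bucket_values) / bucket_size_seconds


def fs_cache_hit_rate(from_cache_bytes, from_source_bytes):
    """Bytes served from the filesystem cache divided by all bytes read
    through it (cache plus source)."""
    total = from_cache_bytes + from_source_bytes
    if total == 0:
        return 0.0  # assumption: report 0 when nothing was read
    return from_cache_bytes / total


def page_cache_hit_rate(os_read_chars, os_read_bytes, s3_read_bytes):
    """greatest(0, (OSReadChars - OSReadBytes) / (OSReadChars + S3 bytes)).

    OSReadChars counts bytes requested from the filesystem, OSReadBytes
    counts bytes that actually came from disk; the difference was served
    by the OS page cache.
    """
    denominator = os_read_chars + s3_read_bytes
    if denominator == 0:
        return 0.0  # assumption: report 0 when nothing was read
    return max(0.0, (os_read_chars - os_read_bytes) / denominator)


# Hypothetical values for one 60-second bucket
print(per_second_rate([10, 20, 30], 60))
print(fs_cache_hit_rate(750, 250))
print(page_cache_hit_rate(1000, 400, 200))
```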

Query insights

The Query Insights feature makes ClickHouse's built-in query log easier to use through various visualizations and tables. ClickHouse's system.query_log table is a key source of information for query optimization, debugging, and monitoring overall cluster health and performance.

After selecting a service, the Monitoring navigation item in the left sidebar expands to reveal a Query insights sub-item.

Top-level metrics

The stat boxes at the top show basic query metrics for the selected time period. Beneath them, time-series charts show query volume, latency, and error rate broken down by query kind (select, insert, other). The latency chart can be adjusted to display p50, p90, and p99 latencies.

Recent queries

A table displays query log entries grouped by normalized query hash and user over the selected time window. Recent queries can be filtered and sorted by any available field, and the table can be configured to show or hide additional fields such as tables, p90 latency, and p99 latency.
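
The grouping described above can be approximated outside the console. The following minimal sketch groups query log entries by normalized query hash and user and computes p90 and p99 latencies. The dictionary keys mirror the `normalized_query_hash`, `user`, and `query_duration_ms` columns of system.query_log; the nearest-rank percentile method and the sample entries are assumptions of this example and may differ from how the console computes its values.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def group_queries(entries):
    """Group entries by (normalized_query_hash, user) and summarize latency."""
    durations = defaultdict(list)
    for entry in entries:
        key = (entry["normalized_query_hash"], entry["user"])
        durations[key].append(entry["query_duration_ms"])
    return {
        key: {
            "count": len(values),
            "p90_ms": percentile(values, 90),
            "p99_ms": percentile(values, 99),
        }
        for key, values in durations.items()
    }


# Hypothetical entries: ten runs of one query by one user, one run by another
entries = [
    {"normalized_query_hash": 1, "user": "alice", "query_duration_ms": ms}
    for ms in range(10, 101, 10)
]
entries.append({"normalized_query_hash": 1, "user": "bob", "query_duration_ms": 5})

for key, summary in group_queries(entries).items():
    print(key, summary)
```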

Query drill-down

Selecting a query from the recent queries table opens a flyout containing metrics and information specific to the selected query.

All metrics in the Query info tab are aggregates. To view metrics from individual runs, select the Query history tab.

From this pane, the Settings and Profile Events items for each query run can be expanded to reveal additional information.