Merge System Bottleneck Causing 35-Minute Write Lag at 55K rows/sec - Merge Throughput Unable to Keep Up with Insert Rate
Environment
ClickHouse Altinity Version: 25.3.6
Cluster: 3-node cluster (replication enabled)
Table Engine: ReplicatedMergeTree
Workload: High-volume time-series ingestion
Operating System: Rocky Linux 9
Memory: 128 GB, 32 GB, 32 GB (per node)
CPU: 16 cores per node
Kafka: 10 partitions, 10 Kafka connectors writing 100K-row batches from the Kafka topic to the database
We discovered a merge system performance bottleneck when ingesting high-volume time-series data into a replicated 3-node ClickHouse cluster.
The data pipeline stalled due to the merge system bottleneck, eventually leading to growing consumer lag and slower insertions:
| Metric | Value |
|---|---|
| Insert Rate | 51.6K rows/sec |
| Merge Rate | ~41K rows/sec |
| Deficit | 10K rows/sec backing up |
| Write Lag | 35 minutes |
| Kafka Backlog | 22M rows |
| System Status | Stable but at ceiling |
After 156 minutes: 485M rows ingested, but a 35-minute delay before data becomes queryable.
Key Findings
Performance Metrics
CPU Usage:
- Peak: 491% (4.9 cores equivalent)
- Active merge threads: 3.07 cores (19% utilization)
- Observation: NOT CPU-bound; the system has 16 cores but only ~3 are in use
Memory Usage:
- Peak: 6.49 GB RSS (20% of 32GB available)
- Status: NOT memory-constrained, plenty of headroom
I/O Performance:
- 306 parts being managed (stable, not growing)
- 32 concurrent merges (saturated at max limit)
- Replication factor 3 = 918 total part files across cluster
- Average part size: 1.59M rows per part
- Disk throughput: Bottleneck identified
Others:
- Insert Rate: 55.6K rows/sec (sustained for 156 minutes)
- Write Lag: 35 minutes (data queued in Kafka; consumer lag around 22-28M rows)
- Consumer Backlog: 22M rows accumulating
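For reference, a minimal sketch of system-table queries that can cross-check the part and merge counts above on a single node (using ClickHouse's system.merges and system.parts; exact column availability can vary slightly by version):

```sql
-- Active merges and the rows they are currently processing.
SELECT count() AS active_merges, sum(rows_read) AS rows_being_merged
FROM system.merges
WHERE table = 'tdf-data';

-- Active part count and average rows per part for the ingest table.
SELECT count() AS active_parts, round(avg(rows)) AS avg_rows_per_part
FROM system.parts
WHERE table = 'tdf-data' AND active;
```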
Merge Activity Progression
| Time | Total Rows | Rate | Write Lag | Merges | Parts |
|---|---|---|---|---|---|
| T+30m | ~150M | 52K/s | 5-10m | 10-15 | ~150 |
| T+60m | ~280M | 54K/s | 15-20m | 25-30 | ~250 |
| T+90m | ~330M | 52K/s | ~25m | 30-32 | ~290 |
| T+120m | 403.5M | 56K/s | ~32m | 32 | 305 |
| T+156m | 485.4M | 51.6K/s | ~35m | 32 | 306 |
Critical observation: After T+90min, merge count plateaus at 32 (max concurrent limit), part count stabilizes around 300-310, and write lag continues accumulating.
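One way to confirm that the ceiling being hit is the replicated merge queue is to break the queue down by entry type and look at the background merge pool metrics (a sketch; the BackgroundMerge* metric names assume a reasonably recent ClickHouse release):

```sql
-- Breakdown of the replication queue by entry type (MERGE_PARTS, GET_PART, ...).
SELECT type, count() AS entries
FROM system.replication_queue
WHERE table = 'tdf-data'
GROUP BY type;

-- Background merge pool usage vs. its configured size.
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'BackgroundMerge%';
```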
Root Cause
Replication serialization blocks merge completion:
Merge Operation Timeline:
1. Merge part locally → 100-500ms
2. Write merged part to disk → 200-800ms
3. Replicate to Node 2 → 1000-2000ms (network/disk wait)
4. Replicate to Node 3 → 1000-2000ms (sequential or parallel?)
5. Keeper coordination sync → 50-100ms
6. Acknowledge completion → 50-100ms
Total: ~2000-3500ms per merge operation
With 32 concurrent merges × 2.5 seconds average = 80 seconds per full merge cycle
Kafka batches arrive every 30 seconds
Result: Queue builds faster than merges complete → lag accumulates
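The per-step timings above are estimates; actual merge durations can be measured from the part log, roughly as follows (a sketch; requires system.part_log to be enabled in the server configuration):

```sql
-- Median and tail merge durations for the ingest table over the last hour.
SELECT
    quantile(0.5)(duration_ms) AS p50_ms,
    quantile(0.95)(duration_ms) AS p95_ms,
    count() AS merges_last_hour
FROM system.part_log
WHERE table = 'tdf-data'
  AND event_type = 'MergeParts'
  AND event_time > now() - INTERVAL 1 HOUR;
```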
Why CPU/Memory are Low but System is Bottlenecked:
- Disk I/O operations are blocking (waiting for network replication)
- CPU has nothing to do while waiting for disk/network
- Memory is allocated but unused (merge buffers waiting on I/O)
- Replication latency is the constraint, not compute resources
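A quick check that supports (or refutes) the replication-wait hypothesis is to look at in-flight part fetches from the other replicas (a sketch; column names may differ slightly between versions):

```sql
-- Parts currently being fetched from other replicas; long-running entries
-- with low progress would point at replication as the constraint.
SELECT table, elapsed, progress, result_part_name, total_size_bytes_compressed
FROM system.replicated_fetches
WHERE table = 'tdf-data';
```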
System Equilibrium Reached:
- Merge throughput: ~41K rows/sec (calculated from sustained 10K rows/sec deficit)
- Insert throughput: 55.6K rows/sec (from Kafka)
- Deficit: 10K rows/sec accumulates as write lag
- At 35 minutes: 10K × 2,100 sec = 21M rows in backlog (matches observed 22M)
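The deficit can also be estimated directly from the server's cumulative counters (a sketch; these counters are server-wide rather than per-table, so sampling twice and taking the difference gives approximate row rates):

```sql
-- Cumulative inserted vs. merged rows since server start; diff two samples
-- taken N seconds apart to get rows/sec and compare with the figures above.
SELECT event, value
FROM system.events
WHERE event IN ('InsertedRows', 'MergedRows');
```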
Current Table Schema
```sql
CREATE TABLE default.`tdf-data` ON CLUSTER 'clickhouse' (
    name String,
    value Variant(Float64, Int64, String, BOOLEAN),
    status BOOLEAN,
    timestamp DateTime64(6),
    received_at DateTime64(6) DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/tdf-data', '{replica}')
PARTITION BY toStartOfDay(timestamp)
PRIMARY KEY (name, toStartOfHour(timestamp))
ORDER BY (name, toStartOfHour(timestamp), timestamp)
TTL toDateTime(timestamp) + INTERVAL 1 DAY
SETTINGS
    max_partitions_per_insert_block = 100,
    max_replicated_merges_in_queue = 32,
    max_parts_in_total = 300,
    parts_to_delay_insert = 100,
    parts_to_throw_insert = 150,
    max_bytes_to_merge_at_max_space_in_pool = 536870912,
    max_bytes_to_merge_at_min_space_in_pool = 10737418240,
    max_insert_block_size = 100000,
    merge_max_block_size = 100000,
    max_delay_to_insert = 10,
    min_delay_to_insert_ms = 100,
    index_granularity = 32768,
    old_parts_lifetime = 120;
```

We have tried almost every configuration and have come to the conclusion that merging is too slow to keep up with the insertion rate, which in turn slows down insertions and delays results.
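For completeness, the effective values of these settings can be double-checked as follows (a sketch; server-wide MergeTree defaults live in system.merge_tree_settings, while the table-level overrides above appear in the SHOW CREATE TABLE output):

```sql
-- Server-wide defaults for the merge-related settings tuned above.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN (
    'max_replicated_merges_in_queue',
    'max_bytes_to_merge_at_max_space_in_pool',
    'parts_to_delay_insert',
    'parts_to_throw_insert'
);

-- Table-level overrides as actually applied.
SHOW CREATE TABLE default.`tdf-data`;
```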
Looking for someone to share their thoughts or a possible solution on this.