
ClickHouse Merge Performance Bottleneck causing slower insertions with High-Volume Time-Series Ingestion #1252

@bhaskarblur

Description


Merge System Bottleneck Causing 35-Minute Write Lag at 55K rows/sec - Merge Throughput Unable to Keep Up with Insert Rate

Environment

ClickHouse Altinity Version: 25.3.6
Cluster: 3-node cluster (replication enabled)
Table Engine: ReplicatedMergeTree
Workload: High-volume time-series ingestion
Operating System: Rocky Linux 9
Memory: 128 GB, 32 GB, 32 GB (one value per node)
CPU: 16 cores per node
Kafka: 10 partitions; 10 Kafka connectors writing 100K-row batches from the Kafka topic to the database

We discovered a merge system performance bottleneck when ingesting high-volume time-series data into a replicated 3-node ClickHouse cluster.

The data pipeline stalled due to the merge system bottleneck, which eventually increased consumer lag and slowed insertions:

| Metric | Value |
| --- | --- |
| Insert Rate | 51.6K rows/sec |
| Merge Rate | ~41K rows/sec |
| Deficit | 10K rows/sec backing up |
| Write Lag | 35 minutes |
| Kafka Backlog | 22M rows |
| System Status | Stable but at ceiling |

After 156 minutes: 485M rows ingested, but with a 35-minute delay before data becomes queryable.
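
Rates like these can be read from ClickHouse's cumulative profile counters; a minimal sketch (sample the query twice and divide the delta by the elapsed seconds to get rows/sec for inserts vs. merges):

```sql
-- Cumulative row counters since server start; sampling twice and dividing
-- the delta by the elapsed seconds gives the insert and merge row rates.
SELECT event, value
FROM system.events
WHERE event IN ('InsertedRows', 'MergedRows');
```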

Key Findings

Performance Metrics

CPU Usage:

  • Peak: 491% (4.9 cores equivalent)
  • Active merge threads: 3.07 cores (19% utilization)
  • Observation: NOT CPU-bound; the system has 16 cores but is only using ~3

Memory Usage:

  • Peak: 6.49 GB RSS (20% of 32GB available)
  • Status: NOT memory-constrained, plenty of headroom

I/O Performance:

  • 306 parts being managed (stable, not growing)
  • 32 concurrent merges (saturated at max limit)
  • Replication factor 3 = 918 total part files across cluster
  • Average part size: 1.59M rows per part
  • Disk throughput: Bottleneck identified
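
For reference, a sketch of the per-node queries that yield part and merge counts like the ones above (database and table names taken from the schema below):

```sql
-- Active data parts and average rows per part for the ingest table.
SELECT count() AS active_parts, round(avg(rows)) AS avg_rows_per_part
FROM system.parts
WHERE active AND database = 'default' AND table = 'tdf-data';

-- Merges currently executing on this replica.
SELECT count() AS running_merges
FROM system.merges
WHERE database = 'default' AND table = 'tdf-data';
```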

Others:

  • Insert Rate: 55.6K rows/sec (sustained for 156 minutes)
  • Write Lag: 35 minutes (data queued in Kafka consumer lag around 22-28M rows)
  • Consumer Backlog: 22M rows accumulating

Merge Activity Progression

| Time | Total Rows | Rate | Write Lag | Merges | Parts |
| --- | --- | --- | --- | --- | --- |
| T+30m | ~150M | 52K/s | 5-10m | 10-15 | ~150 |
| T+60m | ~280M | 54K/s | 15-20m | 25-30 | ~250 |
| T+90m | ~330M | 52K/s | ~25m | 30-32 | ~290 |
| T+120m | 403.5M | 56K/s | ~32m | 32 | 305 |
| T+156m | 485.4M | 51.6K/s | ~35m | 32 | 306 |

Critical observation: After T+90min, merge count plateaus at 32 (max concurrent limit), part count stabilizes around 300-310, and write lag continues accumulating.
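
One way to observe this plateau is to track the running-merge gauge over time; a sketch, assuming system.metric_log is enabled on the nodes:

```sql
-- Peak number of concurrently executing merges per 30-minute window,
-- to see where it hits the max_replicated_merges_in_queue ceiling of 32.
SELECT
    toStartOfInterval(event_time, INTERVAL 30 MINUTE) AS t,
    max(CurrentMetric_Merge) AS peak_concurrent_merges
FROM system.metric_log
GROUP BY t
ORDER BY t;
```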

Root Cause

Replication serialization blocks merge completion:

Merge Operation Timeline:
1. Merge part locally          → 100-500ms
2. Write merged part to disk   → 200-800ms
3. Replicate to Node 2         → 1000-2000ms (network/disk wait)
4. Replicate to Node 3         → 1000-2000ms (sequential or parallel?)
5. Keeper coordination sync    → 50-100ms
6. Acknowledge completion      → 50-100ms

Total: ~2000-3500ms per merge operation

With 32 concurrent merges × 2.5 seconds average = 80 seconds per full merge cycle
Kafka batches arrive every 30 seconds

Result: Queue builds faster than merges complete → lag accumulates
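
The per-step timings above are estimates; where system.part_log is enabled, the locally measured merge durations can be checked against them (duration_ms covers the local merge execution, not the downstream replica fetches). A sketch:

```sql
-- Distribution of local merge durations over the last 3 hours.
SELECT
    count()                      AS merges,
    quantile(0.5)(duration_ms)   AS p50_ms,
    quantile(0.95)(duration_ms)  AS p95_ms,
    round(avg(rows))             AS avg_rows_per_merge
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND database = 'default'
  AND table = 'tdf-data'
  AND event_time > now() - INTERVAL 3 HOUR;
```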

Why CPU/Memory are Low but System is Bottlenecked:

  • Disk I/O operations are blocking (waiting for network replication)
  • CPU has nothing to do while waiting for disk/network
  • Memory is allocated but unused (merge buffers waiting on I/O)
  • Replication latency is the constraint, not compute resources
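
To confirm that merges are waiting on replication rather than on local compute, the replication queue and in-flight fetches can be inspected on each replica; a sketch of such checks:

```sql
-- Queued replication entries (GET_PART, MERGE_PARTS, ...) and how many
-- are currently postponed, per entry type.
SELECT
    type,
    count()                        AS entries,
    countIf(postpone_reason != '') AS postponed,
    min(create_time)               AS oldest_entry
FROM system.replication_queue
WHERE database = 'default' AND table = 'tdf-data'
GROUP BY type;

-- Parts currently being fetched from other replicas.
SELECT count() AS in_flight_fetches
FROM system.replicated_fetches
WHERE database = 'default' AND table = 'tdf-data';
```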

System Equilibrium Reached:

  • Merge throughput: ~41K rows/sec (calculated from sustained 10K rows/sec deficit)
  • Insert throughput: 55.6K rows/sec (from Kafka)
  • Deficit: 10K rows/sec accumulates as write lag
  • At 35 minutes: 10K × 2,100 sec = 21M rows in backlog (matches observed 22M)

Current Table Schema

CREATE TABLE default.`tdf-data` ON CLUSTER 'clickhouse' (
    name String,
    value Variant(Float64,Int64,String,BOOLEAN),
    status BOOLEAN,
    timestamp DateTime64(6),
    received_at DateTime64(6) DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/tdf-data', '{replica}')
PARTITION BY toStartOfDay(timestamp)
PRIMARY KEY (name, toStartOfHour(timestamp))
ORDER BY (name, toStartOfHour(timestamp), timestamp)
TTL toDateTime(timestamp) + INTERVAL 1 DAY
SETTINGS
    max_partitions_per_insert_block = 100,
    max_replicated_merges_in_queue = 32,
    max_parts_in_total = 300,
    parts_to_delay_insert = 100,
    parts_to_throw_insert = 150,
    max_bytes_to_merge_at_max_space_in_pool = 536870912,
    max_bytes_to_merge_at_min_space_in_pool = 10737418240,
    max_insert_block_size = 100000,
    merge_max_block_size = 100000,
    max_delay_to_insert = 10,
    min_delay_to_insert_ms = 100,
    index_granularity = 32768,
    old_parts_lifetime = 120;
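
The settings above are table-level overrides; the server-wide MergeTree defaults that also govern merge throughput can be listed as a sanity check. A sketch:

```sql
-- Server-wide MergeTree defaults relevant to merge throughput
-- ('changed' = 1 means the value differs from the built-in default).
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN (
    'max_replicated_merges_in_queue',
    'max_bytes_to_merge_at_max_space_in_pool',
    'parts_to_delay_insert',
    'parts_to_throw_insert',
    'old_parts_lifetime'
);
```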

We have tried almost every configuration and concluded that merging is too slow to keep up with the insert rate, which in turn slows insertions and results in the write lag described above.

Looking for someone to share their thoughts or a solution on this.
