[Query Engine] Improve DistinctCountSmartHLL for dictionary-encoded columns #17411
+704
−16
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
ISSUE=#17336
For dictionary-encoded columns, DISTINCT_COUNT_SMART_HLL currently uses a RoaringBitmap to deduplicate dictionary IDs before feeding values into HLL. While efficient for low cardinality, this approach becomes CPU-intensive for high cardinality (hundreds of thousands to millions of distinct values), where RoaringBitmap insertions dominate query execution time and negate the benefits of HLL.
Proposal
Introduce a cardinality-aware execution path for DISTINCT_COUNT_HLL:
Observed improvements
Testing Done
Added JMH benchmark covering:
This JMH benchmark isolates server-side aggregation cost for the DistinctCountHLLAggregationFunction under controlled parameters: Each variation was run for 10 minutes
recordCount: {100K, 500K, 1M, 5M, 10M, 25M}
cardinalityRatioPercent: {1, 10, 30, 50, 80, 100} → Creates a record with configured cardinality
useRoaringBitMap/HLL -> Controls on to run the test with useRoaringBitMap or HLL
DictIds are pre-generated so benchmark timing includes only aggregation, not data generation.
Sample plots :

Flame graph after optimization : Aggregate doesn't dominate CPU

Benchmark Results (Average Latency, ms/op)
Record Count = 100,000
Record Count = 500,000
Record Count = 1,000,000
Record Count = 5,000,000
Record Count = 10,000,000
Record Count = 25,000,000
Recommendation:
Based on the micro-benchmark results across record counts and cardinalities, 100K distinct values is a good default threshold to start with for switching away from the RoaringBitmap path. At this scale, RoaringBitmap remains efficient for low-cardinality cases, while higher cardinalities already show clear benefits from using direct HLL updates. This threshold provides a safe balance between preserving deduplication benefits for low cardinality and avoiding excessive bitmap maintenance cost for high-cardinality workloads