
Conversation

@JiriCtvrtka (Contributor) commented Dec 12, 2025

PMM-14566

FB: Percona-Lab/pmm-submodules#4151

The selected time interval was 5 years, and the test instances had been running for around 12 hours. Version 2 had a different (simpler) load because of different scripts on our instances. Locally, both endpoints took around 0.3 seconds, even after running for days under load. All timings below are in seconds.

Version            | getFilters AVG (MIN - MAX) | getReport AVG (MIN - MAX)
-------------------|----------------------------|--------------------------
2.44.2             | 0.8 (0.5 - 1.9)            | 1.3 (0.8 - 2.2)
3.4.0              | 1.4 (0.7 - 1.9)            | 2.5 (1 - 5)
3.5.0              | 1.5 (0.7 - 2)              | 3 (1 - 6)
3.6.0 with indexes | 1 (0.6 - 1.9)              | 2 (0.9 - 5.5)

Since version 2.44.1, we have added many new columns and additional logic there, and PMM is also generally more resource-intensive. Because the metrics table is shared across all technologies, new MongoDB columns, for example, can also affect PostgreSQL and others.

Overall, it looks like adding indexes, fixing race conditions in HA, and other improvements in the current version helped reduce the average and bring it closer to the average in version 2.

No matter what I tried, I was not able to reproduce 12 seconds for getReport and 6 seconds for getFilters. From the HAR file attached by the customer, I can see that out of the total 12.1 seconds for getReport, the waiting time for the server response is only 6.28 seconds, while content download takes almost the entire other half. See the attached image.
[Screenshot: HAR timing breakdown for the getReport request, captured 2025-12-15]
So, in summary, there might be an issue with the network or with system resources. Let’s have the customer update to 3.6.0 and see how it differs for them.

Originally, I wanted to create a specialized view within the scope of this PR, both to improve performance and to address the problem of unrelated fields showing up for a specific technology, but it turned out to be a much larger task. It would require creating multiple views specialized for specific actions/technologies. A separate ticket would probably need to be created if we decide to go this way.

@JiriCtvrtka (Contributor, Author)

The down migration is empty because once the indexes are removed, nothing else needs to be done to revert the OPTIMIZE.

@JiriCtvrtka JiriCtvrtka marked this pull request as ready for review December 15, 2025 16:05
@JiriCtvrtka JiriCtvrtka requested a review from a team as a code owner December 15, 2025 16:05
@JiriCtvrtka JiriCtvrtka requested review from idoqo and maxkondr and removed request for a team December 15, 2025 16:05
@@ -0,0 +1 @@
OPTIMIZE TABLE metrics FINAL;
@JiriCtvrtka (Contributor, Author)

This applies the indexes immediately to the current (historical) data.

Member

Let's drop it because we are not able to reliably test it.

@JiriCtvrtka (Contributor, Author)

After a discussion with Alex, we agreed to remove the OPTIMIZE; the merge will happen in the background once the data has been in use for a while, or the user can run it manually.

@@ -0,0 +1 @@
ALTER TABLE metrics ADD INDEX idx_metrics_queryid queryid TYPE set(100);
Member

Any reason why we chose the specific number of granules for the index?

@JiriCtvrtka (Contributor, Author)

If I understand correctly, the set(N) parameter is the maximum number of distinct values stored per indexed block; a higher value lets the index skip more blocks but increases memory consumption. In this case, there should be a balance between performance and memory consumption. Feel free to suggest a different value.
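To make the trade-off measurable, one minimal check is to ask ClickHouse whether the index actually prunes granules for a representative lookup; a sketch (the queryid literal is a placeholder, not a real value):

-- Show which skip indexes are applied and how many granules they drop.
EXPLAIN indexes = 1
SELECT count()
FROM metrics
WHERE queryid = 'some-queryid-value';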

@JiriCtvrtka JiriCtvrtka requested a review from ademidoff January 5, 2026 08:09

@@ -0,0 +1 @@
ALTER TABLE metrics ADD INDEX idx_metrics_period_start period_start TYPE minmax;
Contributor

Suggested change
ALTER TABLE metrics ADD INDEX idx_metrics_period_start period_start TYPE minmax;
ALTER TABLE metrics ADD INDEX IF NOT EXISTS idx_metrics_period_start period_start TYPE minmax;

@JiriCtvrtka (Contributor, Author) commented Jan 6, 2026

All migrations are applied from scratch, so the index should not exist yet, but let's discuss this. The same applies to the other migrations.

Contributor

It is always a good approach to make changes idempotent. It costs nothing but prevents possible errors.

@@ -0,0 +1 @@
ALTER TABLE metrics ADD INDEX idx_metrics_service_id service_id TYPE set(100);
Contributor

Suggested change
ALTER TABLE metrics ADD INDEX idx_metrics_service_id service_id TYPE set(100);
ALTER TABLE metrics ADD INDEX IF NOT EXISTS idx_metrics_service_id service_id TYPE set(100);

@@ -0,0 +1 @@
ALTER TABLE metrics ADD INDEX idx_metrics_queryid queryid TYPE set(100);
Contributor

Suggested change
ALTER TABLE metrics ADD INDEX idx_metrics_queryid queryid TYPE set(100);
ALTER TABLE metrics ADD INDEX IF NOT EXISTS idx_metrics_queryid queryid TYPE set(100);

@@ -0,0 +1 @@
ALTER TABLE metrics DROP INDEX idx_metrics_queryid;
Contributor

Suggested change
ALTER TABLE metrics DROP INDEX idx_metrics_queryid;
ALTER TABLE metrics DROP INDEX IF EXISTS idx_metrics_queryid;

@@ -0,0 +1 @@
ALTER TABLE metrics DROP INDEX idx_metrics_period_start;
Contributor

Suggested change
ALTER TABLE metrics DROP INDEX idx_metrics_period_start;
ALTER TABLE metrics DROP INDEX IF EXISTS idx_metrics_period_start;

@@ -0,0 +1 @@
ALTER TABLE metrics DROP INDEX idx_metrics_service_id;
Contributor

Suggested change
ALTER TABLE metrics DROP INDEX idx_metrics_service_id;
ALTER TABLE metrics DROP INDEX IF EXISTS idx_metrics_service_id;

@@ -0,0 +1 @@
ALTER TABLE metrics ADD INDEX idx_metrics_queryid queryid TYPE set(100);
Contributor

What about materialising the index?

@JiriCtvrtka (Contributor, Author) commented Jan 6, 2026

Alex suggested not applying the new indexes to existing data, since we currently can't properly test the impact/safety of materializing them. See the discussion here: #4837 (comment)
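For reference, the materialization itself would be a one-liner; a sketch of what we decided not to run (untestable for us right now, as noted above):

-- Rebuild the skip index over already-written parts.
ALTER TABLE metrics MATERIALIZE INDEX idx_metrics_queryid;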

Our data TTL is 30 days, but it looks like our query that drops old partitions doesn’t work reliably. I created a ticket for this: https://perconadev.atlassian.net/browse/PMM-14670

This definitely happens in HA, and it's likely caused by our approach of taking the partition name, converting it to UInt32, and comparing it to a timestamp. I suspect this logic might be broken even without HA.
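A quick way to confirm stale data is still around would be to list active parts older than the TTL; a sketch, assuming the partition key includes a DateTime column so min_time/max_time are populated (add a database filter if needed):

-- List active parts whose newest row is past the 30-day TTL.
SELECT partition, name, max_time
FROM system.parts
WHERE table = 'metrics' AND active AND max_time < now() - INTERVAL 30 DAY;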

Because of that, my original idea was to run an OPTIMIZE to force merges (and indirectly rebuild indexes / clean up parts); see the suggested query here: 07ff5eb

but again, we cannot test it.

@maxkondr (Contributor) commented Jan 7, 2026

Creating an index with TYPE bloom_filter or TYPE set(n) on a DateTime column, for queries like SELECT ... WHERE start_time >= <..> AND start_time <= <...>, will have no effect, since set() and bloom_filter don't work with ranges (see https://clickhouse.com/docs/optimize/skipping-indexes#skip-index-functions):

In general, set indexes and Bloom filter based indexes (another type of set index) are both unordered and therefore do not work with ranges. In contrast, minmax indexes work particularly well with ranges since determining whether ranges intersect is very fast.

So for the period_start column, your choice of TYPE minmax is correct.
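For example, this is the kind of range filter a minmax index can prune (the 30-day window is a placeholder):

-- minmax stores per-block min/max of period_start, so whole blocks
-- outside the requested window are skipped.
SELECT count()
FROM metrics
WHERE period_start >= now() - INTERVAL 30 DAY AND period_start <= now();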
Next, here is what I found:

Summary of use cases:

- Use a set(N) index when a column has a limited number of discrete values within each data block (e.g., status codes, error types) and you need exact data skipping without any false positives.
- Use a bloom_filter index (including variants like ngrambf_v1 for partial text search) for columns with a large number of distinct values (e.g., email addresses, UUIDs, full text). It is a space-efficient way to filter "needle in a haystack" queries, even with the possibility of minor false positives.

Applied to our columns:

- service_id: in this case set() can really be useful.
- queryid: this column (I assume) has very high cardinality, so set will not give much benefit here; bloom_filter looks like the better choice.

BUT! Everything should be measured before making a final decision.
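If measurements point that way, a sketch of the alternative index for queryid (the 0.025 false-positive rate and GRANULARITY 4 are placeholder values to tune, not measured choices):

-- Bloom-filter skip index for a high-cardinality column.
ALTER TABLE metrics ADD INDEX IF NOT EXISTS idx_metrics_queryid queryid TYPE bloom_filter(0.025) GRANULARITY 4;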
