Skip to content

Conversation

@kaessert
Copy link
Contributor

The metrics documentation previously mixed metrics from different sources in a single table, making it unclear where each metric originates.

Split into four sections:

  • Crossplane core metrics: function runner, response cache, circuit breaker, and engine metrics emitted by the Crossplane pod
  • Provider metrics: crossplane_managed_resource_* metrics from crossplane-runtime, emitted by all providers
  • Upjet provider metrics: upjet_resource_* metrics only from Upjet-based providers (provider-aws, provider-azure, provider-gcp)
  • Controller-runtime and Kubernetes client metrics: external dependency metrics emitted by both Crossplane and providers

Additional changes:

  • Fixed metric name composition_run_function_seconds to function_run_function_seconds (matching actual code)
  • Added missing metrics: cache metrics, engine metrics, crossplane_managed_resource_drift_seconds
  • Added missing upjet metrics: cli_duration, active_cli_invocations, running_processes
  • Removed _bucket suffix from histogram metric names (added by Prometheus)

Applied to all doc versions: v1.20, v2.0-preview, v2.0, v2.1, master

@netlify
Copy link

netlify bot commented Nov 26, 2025

Deploy Preview for crossplane ready!

Name Link
🔨 Latest commit a46d215
🔍 Latest deploy log https://app.netlify.com/projects/crossplane/deploys/692f4ac7c870fe00084454cf
😎 Deploy Preview https://deploy-preview-1025--crossplane.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
Lighthouse
Lighthouse
1 paths audited
Performance: 95 (🟢 up 1 from production)
Accessibility: 90 (🔴 down 2 from production)
Best Practices: 92 (no change from production)
SEO: 100 (no change from production)
PWA: 70 (no change from production)
View the detailed breakdown and full score reports

To edit notification comments on pull requests, go to your Netlify project configuration.


## Upjet provider metrics

These metrics are only emitted by Upjet-based providers (such as provider-aws, provider-azure, provider-gcp).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provider-upjet-aws, provider-upjet-azure, provider-upjet-gcp, wonder if we want to use the repo links here

prometheus.io/scrape: "true"
```
## Crossplane core metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the beginning of the document we describe how to enable metrics in crossplane core and adding prometheus specifics to enable the scraping

wonder if it makes sense to describe the same for the providers - per default we have the http-prom port and you need a PodMonitor or the prometheus annotations..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that does sound useful - looks like we do have a bit of that already in the provider section, e.g.:

Providers expose metrics on the metrics port (default 8080). To scrape these metrics, configure a PodMonitor or add Prometheus annotations to the provider's DeploymentRuntimeConfig.

@haarchri
Copy link
Member

thanks for make this more clear - my approval is without binding we need @jbw976 for an additional look

Copy link
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @kaessert, thank you for taking the initiative to make these metrics easier to understand! I do have a couple comments for potential further improvements, let me know what you think of them. Feel free to ping directly so we can iterate fast too 😇

These metrics are emitted by the Crossplane pod itself.
{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description | Further Explanation |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "Further Explanation" column has no entries in it, so it's just an empty column and it causes some values from the previous column to wrap unnecessarily - should the column header just be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, i dropped it :)

Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`.

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description | Further Explanation |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, the "further explanation" column is empty and can be removed...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

These metrics come from the controller-runtime framework and Kubernetes client libraries. They are emitted by both Crossplane and providers.

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description | Further Explanation |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, even in a table that is using the "further explanation" column, i'm a bit skeptical on the value it's providing. It seems like most of the information there could just be included in the description column. What do you think, do you believe a dedicated column for "further explanation" is the right approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged into description

prometheus.io/scrape: "true"
```
## Crossplane core metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that does sound useful - looks like we do have a bit of that already in the provider section, e.g.:

Providers expose metrics on the metrics port (default 8080). To scrape these metrics, configure a PodMonitor or add Prometheus annotations to the provider's DeploymentRuntimeConfig.

| {{<hover label="circuit_breaker_opens_total" line="6">}}circuit_breaker_opens_total{{</hover>}} | Number of times the XR circuit breaker transitioned from closed to open | |
| {{<hover label="circuit_breaker_closes_total" line="7">}}circuit_breaker_closes_total{{</hover>}} | Number of times the XR circuit breaker transitioned from open to closed | |
| {{<hover label="circuit_breaker_events_total" line="8">}}circuit_breaker_events_total{{</hover>}} | Number of XR watch events handled by the circuit breaker, labelled by outcome | |
| {{<hover label="engine_controllers_started_total" line="9">}}engine_controllers_started_total{{</hover>}} | Total number of controllers started | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good find on getting these names correct! i screwed this up in the release notes for v2.0, so i just retroactively went back and updated that now. Thank you! 😇

https://github.com/crossplane/crossplane/releases/tag/v2.0.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome 😊

| {{<hover label="function_run_function_request_total" line="1">}}function_run_function_request_total{{</hover>}} | Total number of RunFunctionRequests sent | |
| {{<hover label="function_run_function_response_total" line="2">}}function_run_function_response_total{{</hover>}} | Total number of RunFunctionResponses received | |
| {{<hover label="function_run_function_seconds" line="3">}}function_run_function_seconds{{</hover>}} | Histogram of RunFunctionResponse latency (seconds) | |
| {{<hover label="function_run_function_response_cache_hits_total" line="4">}}function_run_function_response_cache_hits_total{{</hover>}} | Total number of RunFunctionResponse cache hits | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some more cache related metrics not listed here but are in the code in https://github.com/crossplane/crossplane/blob/release-2.1/internal/xfn/cached/cached_runner_metrics.go#L76-L134 - consider adding those too. did they show up for your in your testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

| {{<hover label="controller_runtime_active_workers" line="3">}}controller_runtime_active_workers{{</hover>}} | Number of used workers per controller | The number of threads processing jobs from the work queue. |
| {{<hover label="controller_runtime_max_concurrent_reconciles" line="4">}}controller_runtime_max_concurrent_reconciles{{</hover>}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. |
| {{<hover label="controller_runtime_reconcile_errors_total" line="5">}}controller_runtime_reconcile_errors_total{{</hover>}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. |
| {{<hover label="controller_runtime_reconcile_time_seconds_bucket" line="6">}}controller_runtime_reconcile_time_seconds_bucket{{</hover>}} | Length of time per reconciliation per controller | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you want to remove these _bucket suffixes also, to match what was in the PR body? (there's like 4 of them I think)

Removed _bucket suffix from histogram metric names (added by Prometheus)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Removed

@kaessert
Copy link
Contributor Author

kaessert commented Dec 2, 2025

@jbw976 that does sound useful - looks like we do have a bit of that already in the provider section - doesn't hurt to mention it here though or wdyt? 🤔

@kaessert kaessert force-pushed the fix-metrics-docs branch 3 times, most recently from 905ca02 to 39bc1b0 Compare December 2, 2025 09:04
@jbw976
Copy link
Member

jbw976 commented Dec 2, 2025

doesn't hurt to mention it here though or wdyt? 🤔

yeah for sure, it wouldn't hurt to mention it in the top section as well!

Copy link
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @kaessert, this is looking really good now! just another comment though on the older versions and how best we can approach those. What do you think?

prometheus.io/scrape: "true"
```
## Crossplane core metrics
Copy link
Member

@jbw976 jbw976 Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

come to think of it, i don't think we should update old versions like v1.20 and v2.0-preview, because a lot of these values had different names or didn't even exist in those previous versions. but let's also try not to overcomplicate it either, so what do you think of the following?

  • don't bother updating v2.0-preview, that release is a bit of a one-off and those docs aren't even exposed in the version picker anymore
  • don't bother updating v1.20, unless you really want to get every old name correct for crossplane_composition_watches_* etc. as shown in https://github.com/crossplane/crossplane/releases/tag/v2.0.0
  • remove circuit_breaker_* metrics from v2.0, those were added in v2.1
  • keep v2.1 and master as is 🎉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points!

  • v2.0-preview changes i threw away
  • put an effort to get every v1.20 old name correct -> we already went too far for shortcuts 😅
  • circuit_breaker_* was removed from everywhere except v2.1
  • v2.1 master kept as is ;)

@jbw976
Copy link
Member

jbw976 commented Dec 2, 2025

oh also looks like there are a number of Vale issues to fix for this PR too 😇
https://github.com/crossplane/docs/actions/runs/19853004247/job/56928239618?pr=1025

thanks for continuing to make progress @kaessert!

The metrics documentation previously mixed metrics from different sources
in a single table, making it unclear where each metric originates.

Split into four sections:
- Crossplane core metrics: function runner, response cache, circuit
  breaker, and engine metrics emitted by the Crossplane pod
- Provider metrics: crossplane_managed_resource_* metrics from
  crossplane-runtime, emitted by all providers
- Upjet provider metrics: upjet_resource_* metrics only from
  Upjet-based providers (provider-aws, provider-azure, provider-gcp)
- Controller-runtime and Kubernetes client metrics: external
  dependency metrics emitted by both Crossplane and providers

Additional changes:
- Fixed metric name composition_run_function_seconds to
  function_run_function_seconds (matching actual code)
- Added missing metrics: cache metrics, engine metrics,
  crossplane_managed_resource_drift_seconds
- Added missing upjet metrics: cli_duration, active_cli_invocations,
  running_processes
- Removed _bucket suffix from histogram metric names (added by Prometheus)

Applied to all doc versions: v1.20, v2.0-preview, v2.0, v2.1, master

Signed-off-by: Tobias Kässer <tobias.kasser@upbound.io>
@kaessert
Copy link
Contributor Author

kaessert commented Dec 2, 2025

@jbw976 I think we should be good now. Everything should be addressed and vale is passing for me at least locally too :)

Copy link
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already went too far for shortcuts

i love this sentiment 😂

this looks great to me @kaessert, thanks for getting this to the finish line!! 🏁

@jbw976 jbw976 merged commit da99598 into crossplane:master Dec 2, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants