
Conversation

@iblancasa
Contributor

Description

  • Added failure.reason and failure.permanent attributes in detailed mode to otelcol_exporter_send_failed_<signal> metrics

Suggested in #13957 (comment)

Link to tracking issue

Fixes #13956
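
For context, a minimal sketch (not the PR's actual code) of how attributes like these could be attached to the send-failed counter with the OpenTelemetry Go metric API. The recordSendFailed and classifyReason helpers, the meter name, and the exact reason strings are illustrative assumptions; only the counter name comes from the description above.

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// classifyReason is a hypothetical helper that maps an export error to a
// short, low-cardinality reason string; the real attribute values in the PR
// may differ.
func classifyReason(err error) string {
	if consumererror.IsPermanent(err) {
		return "permanent"
	}
	return "transient"
}

// recordSendFailed sketches how the send-failed counter could carry the new
// attributes in detailed mode; shown here for the spans signal.
func recordSendFailed(ctx context.Context, numSpans int64, err error) error {
	meter := otel.Meter("exporterhelper-sketch")
	counter, cerr := meter.Int64Counter("otelcol_exporter_send_failed_spans")
	if cerr != nil {
		return cerr
	}
	counter.Add(ctx, numSpans, metric.WithAttributes(
		attribute.String("failure.reason", classifyReason(err)),
		attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
	))
	return nil
}

func main() {
	permErr := consumererror.NewPermanent(errors.New("401 unauthorized"))
	_ = recordSendFailed(context.Background(), 10, permErr)
}
```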

@codspeed-hq

codspeed-hq bot commented Dec 2, 2025

CodSpeed Performance Report

Merging #14247 will not alter performance

Comparing iblancasa:13956-2 (8d444c6) with main (dc9c08d)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.


Summary

✅ 59 untouched
⏩ 20 skipped (baseline results were used instead)

@codecov

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 95.58824% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.19%. Comparing base (dc9c08d) to head (8d444c6).

Files with missing lines                                Patch %   Lines
...orter/exporterhelper/internal/obs_report_sender.go   94.11%    1 Missing and 1 partial ⚠️
exporter/exporterhelper/internal/retry_sender.go        90.90%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
+ Coverage   92.17%   92.19%   +0.01%     
==========================================
  Files         668      668              
  Lines       41492    41545      +53     
==========================================
+ Hits        38245    38301      +56     
+ Misses       2214     2210       -4     
- Partials     1033     1034       +1     


@axw
Contributor

@iblancasa I think this is a good direction; I've left a handful of suggestions.

Comment on lines 183 to 185
if strings.Contains(err.Error(), "no more retries left") {
return "retries_exhausted"
}
Contributor

Suggested change
if strings.Contains(err.Error(), "no more retries left") {
return "retries_exhausted"
}

Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.

Contributor Author

Yes, but you want to know when that point is reached, so you can count those moments, right? With that, you can create alerts based on the number of exhausted retries.

Contributor

Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.

Based on your collector configuration you can infer that those ones would have been retried; and as you noted in the linked comment, the metric will only be updated after all retries are exhausted.

Let's consider two scenarios:

  1. Export fails due to an authentication error, which gets classified as a permanent error.
  2. Export fails due to a network error, which gets classified as a temporary error.

In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.

In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
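
For reference, a small sketch of how these two classifications look in Go, assuming the consumererror package is what decides permanence; the scenario wiring below is illustrative, not the exporterhelper code itself.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Scenario 1: authentication failure wrapped as a permanent error.
	// Only a single attempt is made; failure.permanent would be true.
	authErr := consumererror.NewPermanent(status.Error(codes.Unauthenticated, "bad credentials"))

	// Scenario 2: network failure left as a plain (retryable) error.
	// Retries run first; the metric is updated only once they are exhausted,
	// with failure.permanent=false.
	netErr := status.Error(codes.Unavailable, "connection refused")

	fmt.Println("auth error permanent:", consumererror.IsPermanent(authErr)) // true
	fmt.Println("net error permanent: ", consumererror.IsPermanent(netErr))  // false
}
```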

Contributor Author

Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.

To distinguish these, users need to know:

  1. Which error types are retryable (no clear consensus; see Allow configuring retryable status codes #14228)
  2. Whether retries were actually attempted (can't tell from just the error type)

Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retryable, so no retries were attempted."

Contributor Author

I added a new failure.retries_exhausted attribute. With this, all the situations are covered. What do you think?

Contributor

Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retryable, so no retries were attempted."

I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.

In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it isn't already permanent and it is configured not to be retried. Does that sound sensible?
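
As an illustration of that idea, here is a hedged sketch of promoting a non-retried error to a permanent one; markPermanentIfNotRetried is a hypothetical helper, not the actual retry_sender code.

```go
package main

import (
	"errors"
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// markPermanentIfNotRetried is a hypothetical helper: when retries will not
// happen for this error, promote it to a permanent error so the metric would
// report failure.permanent=true.
func markPermanentIfNotRetried(err error, willRetry bool) error {
	if !willRetry && !consumererror.IsPermanent(err) {
		return consumererror.NewPermanent(err)
	}
	return err
}

func main() {
	err := errors.New("connection refused")
	fmt.Println(consumererror.IsPermanent(markPermanentIfNotRetried(err, false))) // true
	fmt.Println(consumererror.IsPermanent(markPermanentIfNotRetried(err, true)))  // false
}
```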

Contributor Author

I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.

Yes, but if I want to create alerts on retryable errors once the number of attempts is exhausted, I need to know which errors are retryable. That information is not here; there is no way to differentiate.

In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it isn't already permanent and it is configured not to be retried. Does that sound sensible?

What I meant by pointing to that issue is that there is no unified view of which errors should be retryable. If that information is not clear, it is not easy to create alerts for things like "this was retried but we exhausted the number of retries": you would first need to check whether the error is retryable and build your alerts around that, instead of just relying on the metric attributes. I think failure.retries_exhausted helps with that, completing the scenarios.

Contributor

Sorry if I'm being dense, I'm still not seeing it... Here's my thinking, please let me know which point(s) we disagree on:

  • If an error is retryable, that means the error would not have failure.permanent=true set (or would have failure.permanent=false set)
  • If an error is retryable and still results in a failure (counted in metrics), that implies that either retries were exhausted or retries are disabled (depends on pipeline configuration)
  • If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.
  • Hence, it's enough to know that the error is retryable/non-permanent.

Contributor Author
@iblancasa Dec 9, 2025

Sorry if I'm being dense, I'm still not seeing it...

It's fine. Very likely I'm not explaining it well (or I'm terribly mistaken 😅).

  • If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.

That's the thing: how can a user know that an error is retryable without needing to read the source code or wait until the collector emits some metrics with failure.permanent?

AFAIK, there is currently no way of knowing that. As a user, very likely I don't want to go that far; I just want to create some metrics or alerts to detect when a transient error is not recoverable. But I have no way to know whether an error is transient until I run the collector for some time and, after that, infer that this kind of error is one of the ones that gets retried.

@iblancasa
Contributor Author

Sorry for the force-push, but I ran into some CI issues after merging main into my branch and noticed that some of them were related to the merge.

@axw
Contributor

Sorry for the delay, I thought I had hit send already

Comment on lines 202 to 204
type httpStatusCoder interface {
HTTPStatusCode() int
}
Contributor

What implements this? I think we mostly go the other way: propagate gRPC status through errors and convert them to HTTP status codes.
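
For comparison, a sketch of the direction described here, assuming errors carry a gRPC status: extract it with status.FromError and translate it to an HTTP status code. httpStatusFromError and its partial mapping are illustrative, not an existing collector API.

```go
package main

import (
	"fmt"
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromError pulls the gRPC status out of the error and maps it to
// an HTTP status code. The mapping is intentionally partial.
func httpStatusFromError(err error) int {
	st, ok := status.FromError(err)
	if !ok {
		return http.StatusInternalServerError
	}
	switch st.Code() {
	case codes.OK:
		return http.StatusOK
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.NotFound:
		return http.StatusNotFound
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(httpStatusFromError(status.Error(codes.Unavailable, "upstream down"))) // 503
}
```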


Development

Successfully merging this pull request may close these issues.

[exporter/exporterhelper] Add exporter retry-drop metrics
