Add detailed failure attributes to exporter send_failed metrics #14247
Conversation
Force-pushed from 82b21b3 to 17eb1c3
CodSpeed Performance Report: merging #14247 will not alter performance.
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
+ Coverage   92.17%   92.19%   +0.01%
==========================================
  Files         668      668
  Lines       41492    41545      +53
==========================================
+ Hits        38245    38301      +56
+ Misses       2214     2210       -4
- Partials     1033     1034       +1
axw left a comment:
@iblancasa I think this is a good direction, left a handful of suggestions
if strings.Contains(err.Error(), "no more retries left") {
    return "retries_exhausted"
}
Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.
Yes, but you want to know when you reach that point, to count that moment, right? With that, you can create alerts based on the number of exhausted retries.
Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.
Based on your collector configuration you can infer that those ones would have been retried; and as you noted in the linked comment, the metric will only be updated after all retries are exhausted.
Let's consider two scenarios:
- Export fails due to an authentication error, which gets classified as a permanent error.
- Export fails due to a network error, which gets classified as a temporary error.
In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.
In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
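To make that concrete, here is a minimal sketch of how such a classification could look. The function name and attribute keys are illustrative assumptions rather than the PR's actual code, though consumererror.IsPermanent and the gRPC status extraction are existing APIs.

package main

import (
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"google.golang.org/grpc/status"
)

// failureAttributes is a hypothetical helper deriving the attributes
// discussed above: scenario (1) yields failure.permanent=true and
// error.type=Unauthenticated; scenario (2) yields failure.permanent=false
// and error.type=Unavailable, recorded only after retries are exhausted.
func failureAttributes(err error) []attribute.KeyValue {
	attrs := []attribute.KeyValue{
		attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
	}
	// error.type from the gRPC status carried in the error, if any.
	if s, ok := status.FromError(err); ok {
		attrs = append(attrs, attribute.String("error.type", s.Code().String()))
	}
	return attrs
}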
Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.
To distinguish these, users need to know:
- Which error types are retryable (no clear consensus; see "Allow configuring retryable status codes", #14228)
- Whether retries were actually attempted (can't tell from just the error type)
Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retry-able, so no retries were attempted."
I added a new failure.retries_exhausted attribute. With this, all the situations are covered. What do you think?
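For illustration, a rough sketch of how that attribute could be attached, assuming the same string match as the snippet under review (a typed sentinel error from the retry sender would be a more robust signal):

import (
	"strings"

	"go.opentelemetry.io/otel/attribute"
)

// withRetriesExhausted appends the proposed failure.retries_exhausted
// attribute. The string match mirrors the diff reviewed above and is an
// assumption about how exhaustion would be detected.
func withRetriesExhausted(attrs []attribute.KeyValue, err error) []attribute.KeyValue {
	exhausted := err != nil && strings.Contains(err.Error(), "no more retries left")
	return append(attrs, attribute.Bool("failure.retries_exhausted", exhausted))
}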
Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retry-able, so no retries were attempted."
I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.
In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it's not already, and it's configured to not be retried. Does that sound sensible?
I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.
Yes, but if I want to create alerts on retryable errors whose attempts have been exhausted, I need to know which errors are retryable. That information is not here, so you don't have a way to differentiate.
In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it's not already, and it's configured to not be retried. Does that sound sensible?
What I meant by pointing to that issue is that there is no unified view of which errors should be retryable. If that information is not clear, it is not easy to create alerts for things like "this was retried but we exhausted the number of retries". You would first need to check whether the error is retryable and then create your alerts, instead of just relying on the metric attributes. I think failure.retries_exhausted helps with that, completing the scenarios.
Sorry if I'm being dense, I'm still not seeing it... Here's my thinking, please let me know on which point(s) we disagree:
- If an error is retryable, that means the error would not have failure.permanent=true set (or would have failure.permanent=false set)
- If an error is retryable and still results in a failure (counted in metrics), that implies that either retries were exhausted or retries are disabled (depends on pipeline configuration)
- If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.
- Hence, it's enough to know that the error is retryable/non-permanent.
Sorry if I'm being dense, I'm still not seeing it...
It's fine. Very likely I'm not explaining it well (or I'm terribly mistaken 😅).
- If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.
That's the thing: how can a user know that an error is retryable without reading the source code or waiting until the collector emits some metrics with the failure.permanent attribute?
AFAIK, there is no current way of knowing that. As a user, I very likely don't want to go that far just to create some metrics or alerts that detect when a transient error is not recoverable. But I don't have a way to know whether an error is transient until I run the collector for some time and, after that, infer that this kind of error is one of the ones that gets retried.
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Sorry for the force-push, but I had some issues with the CI after merging.
axw left a comment:
Sorry for the delay, I thought I had hit send already
type httpStatusCoder interface {
    HTTPStatusCode() int
}
What implements this? I think mostly we go the other way, propagate gRPC status through errors, and convert them to HTTP status codes.
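For reference, a minimal sketch of that usual direction, where the gRPC status travels inside the error and is converted to an HTTP status only at the edge; the mapping below is hand-written for illustration and is not an existing collector helper:

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromError reads the gRPC code propagated through the error
// chain and converts it to an HTTP status code.
func httpStatusFromError(err error) int {
	switch status.Code(err) {
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	case codes.DeadlineExceeded:
		return http.StatusGatewayTimeout
	default:
		return http.StatusInternalServerError
	}
}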
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Description
Adds failure.reason and failure.permanent attributes in detailed mode to otelcol_exporter_send_failed_<signal> metrics.
Suggested here: #13957 (comment)
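As a rough sketch of what this could look like when the detailed metrics level is enabled (function and parameter names here are placeholders, not the actual telemetry-builder code):

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed stands in for the exporter helper that increments
// otelcol_exporter_send_failed_<signal>; the failure.* attributes are
// attached only when detailed metrics are enabled.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, items int64, detailed bool, reason string, err error) {
	var opts []metric.AddOption
	if detailed {
		opts = append(opts, metric.WithAttributes(
			attribute.String("failure.reason", reason), // e.g. "unauthenticated", "timeout"
			attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
		))
	}
	counter.Add(ctx, items, opts...)
}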
Link to tracking issue
Fixes #13956