Add detailed failure attributes to exporter send_failed metrics #14247
Conversation
Force-pushed from 82b21b3 to 17eb1c3
CodSpeed Performance Report: merging #14247 will not alter performance.
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##             main   #14247      +/-   ##
==========================================
+ Coverage   92.17%   92.19%   +0.01%
==========================================
  Files         668      668
  Lines       41492    41545      +53
==========================================
+ Hits        38245    38301      +56
+ Misses       2214     2210       -4
- Partials     1033     1034       +1
axw left a comment:
@iblancasa I think this is a good direction, left a handful of suggestions
if strings.Contains(err.Error(), "no more retries left") {
    return "retries_exhausted"
}
Based on #13957 (comment), I'm not sure that this one makes sense. Won't it be the case that any error that could be retried will always end up with this as the reason? Then you lose information about the underlying reason that caused the retries.
Yes, but you want to know when you reach that point, to count that moment, right? With that, you can create alerts based on the number of exhausted retries.
Can't you do that by filtering on non-permanent errors? Even after retries are exhausted, the failure should be considered non-permanent.
Based on your collector configuration you can infer that those ones would have been retried; and as you noted in the linked comment, the metric will only be updated after all retries are exhausted.
Let's consider two scenarios:
- Export fails due to an authentication error, which gets classified as a permanent error.
- Export fails due to a network error, which gets classified as a temporary error.
In (1), only a single attempt will be made, and the metric will be updated with error.type=Unauthenticated, failure.permanent=true.
In (2), multiple attempts will be made, and only after all retries are exhausted will the metric be updated with (say) error.type=Unavailable, failure.permanent=false.
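To make that concrete, here is a minimal sketch of how such a classification could look. The function name and attribute keys are illustrative assumptions rather than the PR's actual code, though consumererror.IsPermanent and the gRPC status extraction are existing APIs.

package main

import (
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"google.golang.org/grpc/status"
)

// failureAttributes is a hypothetical helper deriving the attributes
// discussed above: scenario (1) yields failure.permanent=true and
// error.type=Unauthenticated; scenario (2) yields failure.permanent=false
// and error.type=Unavailable, recorded only after retries are exhausted.
func failureAttributes(err error) []attribute.KeyValue {
	attrs := []attribute.KeyValue{
		attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
	}
	// error.type from the gRPC status carried in the error, if any.
	if s, ok := status.FromError(err); ok {
		attrs = append(attrs, attribute.String("error.type", s.Code().String()))
	}
	return attrs
}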
Thanks for the feedback. I understand the point about filtering on failure.permanent, but I have concerns about clarity for end users.
To distinguish these, users need to know:
- Which error types are retryable (no clear consensus; see "Allow configuring retryable status codes", #14228)
- Whether retries were actually attempted (can't tell from just the error type)
Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retry-able, so no retries were attempted."
I added a new failure.retries_exhausted attribute. With this, all the situations are covered. What do you think?
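For illustration, a rough sketch of how that attribute could be attached, assuming the same string match as the snippet under review (a typed sentinel error from the retry sender would be a more robust signal):

import (
	"strings"

	"go.opentelemetry.io/otel/attribute"
)

// withRetriesExhausted appends the proposed failure.retries_exhausted
// attribute. The string match mirrors the diff reviewed above and is an
// assumption about how exhaustion would be detected.
func withRetriesExhausted(attrs []attribute.KeyValue, err error) []attribute.KeyValue {
	exhausted := err != nil && strings.Contains(err.Error(), "no more retries left")
	return append(attrs, attribute.Bool("failure.retries_exhausted", exhausted))
}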
Without this context, users can't tell if failure.permanent=false means "retries were attempted and exhausted" or "error wasn't retry-able, so no retries were attempted."
I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.
In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it's not already, and it's configured to not be retried. Does that sound sensible?
I would expect in the latter case that failure.permanent=true would be set, i.e. non-retryable implies permanent.
Yes, but if I want to create alerts on retryable errors whose attempts have been exhausted, I need to know which errors are retryable. That information is not here, so you don't have a way to differentiate.
In the case of #14228, that would mean that the retry_sender code may mark an error as permanent if it's not already, and it's configured to not be retried. Does that sound sensible?
What I meant by pointing to that issue is that there is no unified view of which errors should be retryable. If that information is not clear, it is not easy to create alerts for things like "this was retried but we exhausted the number of retries". You would first need to check whether the error is retryable and then create your alerts, instead of just relying on the metric attributes. I think failure.retries_exhausted helps with that, completing the scenarios.
Sorry if I'm being dense, I'm still not seeing it... Here's my thinking, please let me know on which point(s) we disagree:
- If an error is retryable, that means the error would not have failure.permanent=true set (or would have failure.permanent=false set)
- If an error is retryable and still results in a failure (counted in metrics), that implies that either retries were exhausted or retries are disabled (depends on pipeline configuration)
- If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.
- Hence, it's enough to know that the error is retryable/non-permanent.
Sorry if I'm being dense, I'm still not seeing it...
It's fine. Very likely I'm not explaining it well (or I'm terribly mistaken 😅).
- If you know that retries are enabled in your pipeline, and you know that an error is retryable, then you can infer that the operation would have been retried and that retries would have been exhausted.
That's the thing: how can a user know that an error is retryable without reading the source code or waiting until the collector emits some metrics with the failure.permanent attribute?
AFAIK, there is no current way of knowing that. As a user, I very likely don't want to go that far just to create some metrics or alerts that detect when a transient error is not recoverable. But I don't have a way to know whether an error is transient until I run the collector for some time and, after that, infer that this kind of error is one of the ones that gets retried.
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Sorry for the force-push, but I had some issues with the CI after merging.
axw left a comment:
Sorry for the delay, I thought I had hit send already
type httpStatusCoder interface {
    HTTPStatusCode() int
}
What implements this? I think mostly we go the other way, propagate gRPC status through errors, and convert them to HTTP status codes.
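For reference, a minimal sketch of that usual direction, where the gRPC status travels inside the error and is converted to an HTTP status only at the edge; the mapping below is hand-written for illustration and is not an existing collector helper:

import (
	"net/http"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// httpStatusFromError reads the gRPC code propagated through the error
// chain and converts it to an HTTP status code.
func httpStatusFromError(err error) int {
	switch status.Code(err) {
	case codes.Unauthenticated:
		return http.StatusUnauthorized
	case codes.PermissionDenied:
		return http.StatusForbidden
	case codes.ResourceExhausted:
		return http.StatusTooManyRequests
	case codes.Unavailable:
		return http.StatusServiceUnavailable
	case codes.DeadlineExceeded:
		return http.StatusGatewayTimeout
	default:
		return http.StatusInternalServerError
	}
}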
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Signed-off-by: Israel Blancas <iblancasa@gmail.com>
Description
Adds failure.reason and failure.permanent attributes in detailed mode to otelcol_exporter_send_failed_<signal> metrics.
Suggested here: #13957 (comment)
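As a rough sketch of what this could look like when the detailed metrics level is enabled (function and parameter names here are placeholders, not the actual telemetry-builder code):

import (
	"context"

	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSendFailed stands in for the exporter helper that increments
// otelcol_exporter_send_failed_<signal>; the failure.* attributes are
// attached only when detailed metrics are enabled.
func recordSendFailed(ctx context.Context, counter metric.Int64Counter, items int64, detailed bool, reason string, err error) {
	var opts []metric.AddOption
	if detailed {
		opts = append(opts, metric.WithAttributes(
			attribute.String("failure.reason", reason), // e.g. "unauthenticated", "timeout"
			attribute.Bool("failure.permanent", consumererror.IsPermanent(err)),
		))
	}
	counter.Add(ctx, items, opts...)
}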
Link to tracking issue
Fixes #13956