Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956
This PR was marked stale due to lack of activity. It will be closed in 14 days.
The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream
component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.

Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. (If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose.)
If I understand correctly, this may be a breaking change for some components, if they are checking for types of errors using something other than `errors.As`. I think it's ok though, and those components should update to use `errors.As` instead anyway. However, we should be aware of this when implementing changes in case it is a widespread problem.
You're absolutely right. I think those components would already be broken anyway, because of the `permanentError` and `multiError` wrappers we already use.
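The compatibility concern in this thread can be illustrated with a small sketch. `timeoutError` is a hypothetical component-specific error type and both helper functions are made up for the example; only the `downstreamError` wrapper and `errors.As` come from the RFC text.

```go
package main

import (
	"errors"
	"fmt"
)

// timeoutError is a hypothetical component-specific error type.
type timeoutError struct{}

func (timeoutError) Error() string { return "timeout" }

// downstreamError mirrors the wrapper type named in the RFC text.
type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// isTimeoutByAssertion only inspects the outermost error value,
// so any wrapper hides the underlying type from it.
func isTimeoutByAssertion(err error) bool {
	_, ok := err.(timeoutError)
	return ok
}

// isTimeoutByAs walks the Unwrap chain, so it sees through wrappers.
func isTimeoutByAs(err error) bool {
	var te timeoutError
	return errors.As(err, &te)
}

func main() {
	wrapped := downstreamError{err: timeoutError{}}
	fmt.Println(isTimeoutByAssertion(wrapped), isTimeoutByAs(wrapped))
}
```

Under these assumptions the direct type assertion stops matching once the error is wrapped, while the `errors.As` check keeps working, which is why code relying on assertions would already misbehave with the existing wrappers.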
will we need / want to update this once #11085 is merged?
If #11085 gets merged before this PR, I'll update this paragraph to only include the parenthetical. If this PR gets merged first, I think presenting the two alternatives is probably good enough? But we could make a second amendment if we feel the need to.
One minor suggestion for simpler language, otherwise looks great, thank you!
corresponding to whether or not the corresponding function call returned an error, and whether the error originates from the next
component(s) in the pipeline, or from one further downstream.
Suggested change:
according to the corresponding function call returning a success, error, or propagating an error from further components downstream respectively.
Some simpler language to describe this.
Context
The Pipeline Component Telemetry RFC was recently accepted (#11406). The document states the following regarding error monitoring:
Observability requirements for stable pipeline components were also recently merged (#11772). The document states the following regarding error monitoring:
Because errors are typically propagated across `ConsumeX` calls in a pipeline (except for components with an internal queue like `processor/batch`), the error observability mechanism proposed by the RFC implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error. This does not match the goals set out in the observability requirements, and makes it much harder to tell from the emitted telemetry which component errors are coming from.
Description
This PR amends the Pipeline Component Telemetry RFC with the following:
- restrict the `outcome=failure` value to cases where the error comes from the very next component (the component on which `ConsumeX` was called);
- add a new value for the `outcome` attribute: `rejected`, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected");
The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported `downstream` struct, which upstream layers could look for with `errors.As` to determine whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors. Please check the diff for details.
Possible alternatives
There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:
- change the `Consumer` API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the `batch` processor);