Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

jade-guiton-dd · 2024-12-18T14:36:28Z

Context

The Pipeline Component Telemetry RFC was recently accepted (#11406). The document states the following regarding error monitoring:

For both [consumed and produced] metrics, an outcome attribute with possible values success and failure should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with outcome as failure when a call from the previous component the ConsumeX function returns an error, and success otherwise. Likewise, produced measurements will be recorded with outcome as failure when a call to the next consumer's ConsumeX function returns an error, and success otherwise.

Observability requirements for stable pipeline components were also recently merged (#11772). The document states the following regarding error monitoring:

The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:

only include errors internal to the component, or;

allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

Because errors are typically propagated across ConsumeX calls in a pipeline (except for components with an internal queue like processor/batch), the error observability mechanism proposed by the RFC implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error, which does not match the goals set out in the observability requirements, and makes it much harder to tell which component errors are coming from from the emitted telemetry.

Description

This PR amends the Pipeline Component Telemetry RFC with the following:

restrict the outcome=failure value to cases where the error comes from the very next component (the component on which ConsumeX was called);
add a third possible value for the outcome attribute: rejected, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected");
propose a mechanism to determine which of the two values should be used.

The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported downstream struct, which upstream layers could check for with errors.As to check whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors. Please check the diff for details.

Possible alternatives

There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:

loosen the observability requirements for stable components to not require distinguishing internal errors from downstream ones → makes it harder to identify the source of an error;
modify the way we use the Consumer API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the batch prcessor);
let component authors make their own custom telemetry to solve the problem → higher barrier to entry, especially for people wanting to opensource existing components.

codecov · 2024-12-18T15:01:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.67%. Comparing base (ced38e8) to head (5623b8b).

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #11956   +/-   ##
=======================================
  Coverage   91.67%   91.67%           
=======================================
  Files         455      455           
  Lines       24039    24039           
=======================================
  Hits        22038    22038           
  Misses       1629     1629           
  Partials      372      372

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2025-01-03T03:16:55Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

djaglowski · 2025-01-06T14:59:52Z

docs/rfcs/component-universal-telemetry.md

+The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream
+component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
+
+Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. (If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose.)


If I understand correctly, this may be a breaking change for some components, IF they are checking for types of errors using something other than errors.As. I think it's ok though and those components should update to use errors.As instead anyways. However, we should be aware of this when implementing changes in case it is a widespread problem.

You're absolutely right. I think those components would already be broken anyway, because of the permanentError and multiError wrappers we already use.

will we need / want to update this once #11085 is merged?

If #11085 gets merged before this PR, I'll update this paragraph to only include the parenthetical. If this PR gets merged first, I think presenting the two alternatives is probably good enough? But we could make a second amendment if we feel the need to.

jaronoff97

One minor suggestion for simpler language, otherwise looks great, thank you!

jaronoff97 · 2025-01-07T14:38:47Z

docs/rfcs/component-universal-telemetry.md

+corresponding to whether or not the corresponding function call returned an error, and whether the error originates from the next
+component(s) in the pipeline, or from one further downstream.


Suggested change

corresponding to whether or not the corresponding function call returned an error, and whether the error originates from the next

component(s) in the pipeline, or from one further downstream.

according to the corresponding function call returning a success, error, or propagating an error from further components downstream respectively.

Some simpler language to describe this.

jaronoff97 · 2025-01-07T14:40:25Z

docs/rfcs/component-universal-telemetry.md

+The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream
+component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
+
+Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. (If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose.)


will we need / want to update this once #11085 is merged?

Updated Pipeline Instrumentation RFC to add a "rejected" outcome

9b2707a

jade-guiton-dd added Skip Changelog PRs that do not require a CHANGELOG.md entry Skip Contrib Tests labels Dec 18, 2024

jade-guiton-dd requested a review from djaglowski December 18, 2024 14:37

github-actions bot added the Stale label Jan 3, 2025

jade-guiton-dd removed the Stale label Jan 6, 2025

djaglowski approved these changes Jan 6, 2025

View reviewed changes

jade-guiton-dd marked this pull request as ready for review January 7, 2025 09:58

jade-guiton-dd requested a review from a team as a code owner January 7, 2025 09:58

jade-guiton-dd requested a review from mx-psi January 7, 2025 09:58

Merge branch 'main' into pipeline-instr-outcomes

5623b8b

mx-psi approved these changes Jan 7, 2025

View reviewed changes

jade-guiton-dd mentioned this pull request Jan 7, 2025

RFC - Pipeline Component Telemetry #11406

Merged

jaronoff97 approved these changes Jan 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

jade-guiton-dd commented Dec 18, 2024 •

edited

Loading

codecov bot commented Dec 18, 2024 •

edited

Loading

github-actions bot commented Jan 3, 2025

djaglowski Jan 6, 2025

jade-guiton-dd Jan 6, 2025

jaronoff97 Jan 7, 2025

jade-guiton-dd Jan 7, 2025 •

edited

Loading

jaronoff97 left a comment

jaronoff97 Jan 7, 2025

jaronoff97 Jan 7, 2025

		corresponding to whether or not the corresponding function call returned an error, and whether the error originates from the next
		component(s) in the pipeline, or from one further downstream.

	corresponding to whether or not the corresponding function call returned an error, and whether the error originates from the next
	component(s) in the pipeline, or from one further downstream.
	according to the corresponding function call returning a success, error, or propagating an error from further components downstream respectively.

Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

Are you sure you want to change the base?

Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956

Conversation

jade-guiton-dd commented Dec 18, 2024 • edited Loading

Context

Description

Possible alternatives

codecov bot commented Dec 18, 2024 • edited Loading

Codecov Report

github-actions bot commented Jan 3, 2025

djaglowski Jan 6, 2025

Choose a reason for hiding this comment

jade-guiton-dd Jan 6, 2025

Choose a reason for hiding this comment

jaronoff97 Jan 7, 2025

Choose a reason for hiding this comment

jade-guiton-dd Jan 7, 2025 • edited Loading

Choose a reason for hiding this comment

jaronoff97 left a comment

Choose a reason for hiding this comment

jaronoff97 Jan 7, 2025

Choose a reason for hiding this comment

jaronoff97 Jan 7, 2025

Choose a reason for hiding this comment

jade-guiton-dd commented Dec 18, 2024 •

edited

Loading

codecov bot commented Dec 18, 2024 •

edited

Loading

jade-guiton-dd Jan 7, 2025 •

edited

Loading