TBS: Request for Policy to Enable Sampling of Slower Traces in Tail-Based Sampling #15041

Open
yh-kwak opened this issue Dec 24, 2024 · 6 comments

yh-kwak commented Dec 24, 2024

Hello,

We actively use Elastic Observability to monitor several distributed services in production. Cost optimization is an essential priority in operating such observability platforms, and sampling is an effective tool to achieve this.

When choosing an observability platform, users naturally focus on how efficiently they can optimize costs while quickly identifying problematic traces.

Problem Statement

Currently, the sampling methods provided by Elastic Observability are quite generic and basic. We aim to use this tool more effectively to identify and resolve issues.

As of the current version (8.17.0), the only Tail-Based Sampling (TBS) policy condition that reflects how a trace actually behaved is trace.outcome, which is useful but limited. The other conditions match static, known attributes and do little to help capture interesting traces.
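For readers unfamiliar with how TBS policies are applied, here is a minimal Python sketch of first-match policy evaluation. It assumes the documented behavior that policies are checked in order and that all conditions on a policy must hold. The `Policy` class and both function names are hypothetical, made up for illustration, and this is not the apm-server implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical model of one TBS policy; unset conditions match anything."""
    sample_rate: float
    service_name: Optional[str] = None
    service_environment: Optional[str] = None
    trace_name: Optional[str] = None
    trace_outcome: Optional[str] = None


def matches(policy, trace):
    """Return True when every condition set on the policy equals the trace's value."""
    conditions = {
        "service.name": policy.service_name,
        "service.environment": policy.service_environment,
        "trace.name": policy.trace_name,
        "trace.outcome": policy.trace_outcome,
    }
    return all(
        expected is None or trace.get(field) == expected
        for field, expected in conditions.items()
    )


def sample_rate_for(policies, trace):
    """Return the sample rate of the first matching policy, or None if none match."""
    for policy in policies:
        if matches(policy, trace):
            return policy.sample_rate
    return None  # assumption of this sketch: no catch-all policy means no rate


policies = [
    Policy(sample_rate=1.0, trace_outcome="failure"),  # keep every failed trace
    Policy(sample_rate=0.1),                           # catch-all for the rest
]
print(sample_rate_for(policies, {"service.name": "checkout", "trace.outcome": "failure"}))
```

Note that none of the four conditions looks at duration, which is the gap this request is about.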

For customers like us who use APM for monitoring, there are common patterns we observe, such as tracking failed transactions or slower traces. Adding the ability to sample traces based on trace duration or the duration of a specific transaction (root span) exceeding a certain threshold would be extremely beneficial. This would help us identify and resolve problematic traces more effectively.

Request

We propose adding a policy to Tail-Based Sampling that enables sampling of slower traces based on their duration.

Given that the duration of a trace or transaction is available in the transaction.duration.us field, implementing this feature in the TBS mechanism should not require overly complex logic. Elastic Observability’s official documentation even describes duration-weighted sampling, which reinforces its importance:

Unlike head-based sampling, each trace does not have an equal probability of being sampled. Because slower traces are more interesting than faster ones, tail-based sampling uses weighted random sampling — so traces with a longer root transaction duration are more likely to be sampled than traces with a fast root transaction duration.
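To make "weighted random sampling" concrete, here is a minimal Python sketch of one well-known technique, weighted reservoir sampling (Algorithm A-Res by Efraimidis and Spirakis), using the root transaction duration as the weight. Whether apm-server uses exactly this algorithm is an assumption on my part, and the function name and input shape are hypothetical.

```python
import heapq
import random


def weighted_reservoir_sample(traces, k, rng=None):
    """Pick up to k trace IDs, favoring longer root transaction durations.

    traces: iterable of (trace_id, duration_us) pairs; the duration is the weight.
    """
    if k <= 0:
        return []
    if rng is None:
        rng = random.Random()
    heap = []  # min-heap of (key, trace_id); the smallest key is evicted first
    for trace_id, duration_us in traces:
        if duration_us <= 0:
            continue  # a non-positive weight can never be selected
        # A-Res key: u ** (1 / weight) with u uniform in [0, 1).
        # Larger weights push the key toward 1, so slow traces tend to survive.
        key = rng.random() ** (1.0 / duration_us)
        if len(heap) < k:
            heapq.heappush(heap, (key, trace_id))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, trace_id))
    return [trace_id for _, trace_id in heap]


# With k=1, a 9 ms trace should win about 90% of the time against a 1 ms trace.
rng = random.Random(0)
traces = [("fast", 1_000), ("slow", 9_000)]
wins = sum(weighted_reservoir_sample(traces, 1, rng) == ["slow"] for _ in range(2000))
print(f"slow trace kept in {wins} of 2000 runs")
```

With this kind of scheme a slow trace is only more likely to be kept, never guaranteed, which is why an explicit duration threshold would still add something.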

We are confident that this feature would be extremely valuable not just for us but for many other users as well.

Additional Notes

  • If this feature is already planned, we would appreciate an estimated timeline or version.
  • If not, we hope it can be positively considered for inclusion in future roadmaps.

Thank you for your support and consideration!

@carsonip
Member

carsonip commented Jan 6, 2025

Thanks for the detailed feature request.

I am under the impression that we're already doing TBS based on transaction duration (see code). It is implicit and should work out of the box: slower traces are more likely to be sampled, without additional configuration.

Are you aware of this, and do you still find the need to specify transaction duration in sampling policies?

@yh-kwak
Author

yh-kwak commented Jan 6, 2025

@carsonip I was not aware of this feature. It’s difficult to understand how the sampling weight is determined from the code alone. If there is any documentation explaining this feature, I would appreciate it if you could share the link.

Separately from this feature, users would like to have more explicit control over sampling. Even if a transaction is slow, the threshold for what is considered “slow” can vary depending on the service. Therefore, it would be helpful if users could specify a precise time duration for sampling.

Thank you for your attention to this feature request. I look forward to your response.

@BruceGao19

Sorry, any update on this? We would like the same thing. For example, if a trace takes longer than 10 seconds, we would prefer a tail sampling rate of 1 so that the slow trace is recorded for later analysis. Thanks.
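As a rough illustration of the behavior being asked for, here is a minimal Python sketch of a duration-threshold rule with optional per-service overrides. This does not exist in apm-server today; the function name, parameters, and the 10-second default are all hypothetical.

```python
DEFAULT_THRESHOLD_US = 10_000_000  # 10 seconds in microseconds, per the example above


def tail_sample_rate(service_name, root_duration_us, base_rate, thresholds_us=None):
    """Return 1.0 for traces slower than the service's threshold, else base_rate.

    thresholds_us: optional dict mapping a service name to its own "slow"
    threshold in microseconds; other services use DEFAULT_THRESHOLD_US.
    """
    thresholds_us = thresholds_us or {}
    threshold = thresholds_us.get(service_name, DEFAULT_THRESHOLD_US)
    if root_duration_us > threshold:
        return 1.0  # always keep traces that exceed the threshold
    return base_rate


print(tail_sample_rate("checkout", 12_000_000, 0.1))               # slower than 10 s
print(tail_sample_rate("search", 600_000, 0.1, {"search": 500_000}))  # custom 0.5 s limit
```

The per-service dictionary reflects the point raised earlier in the thread that what counts as "slow" differs between services.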

@simitt
Contributor

simitt commented Jan 9, 2025

@BruceGao19, @yh-kwak thanks for the writeup and feature request. We understand that in some cases more configuration would make the feature more flexible, but we also want to strike a balance and avoid over-complicating it.
We don't have any plans at the moment to add support for this, but pinging @mlunadia for future consideration.

@stevejsyu

Hi @carsonip,

Based on the comment above, could you explain how slow transactions are weighted in the sampling process? I couldn't find any documentation explaining this, so I would greatly appreciate a more detailed explanation and a link to the relevant documentation, if available.

cc: @yh-kwak

@simitt
Contributor

simitt commented Jan 17, 2025

There is a high-level explanation available at https://github.com/elastic/apm-server/blob/main/dev_docs/tbs.md#weighted-sampling.
It's not part of the public tail-based sampling docs because the concrete implementation cannot be customized and might change at some point.
