Outlier Detection Success Rate Ejection Threshold Setting #37723

Open
MengyingLiDD opened this issue Dec 18, 2024 · 6 comments
Labels
area/outlier_detection, enhancement (Feature requests. Not bugs or questions.)

Comments

@MengyingLiDD

MengyingLiDD commented Dec 18, 2024

Description:
We currently use failure percentage mode for externally generated errors and success rate mode for locally generated errors in our service mesh's default outlier detection policy to identify bad pods. For externally generated errors, our default policy uses a failure percentage threshold of 99%, and it works very well. For locally generated errors, however, we have observed a significant number of false positives: pods with success rates as high as 99% are being ejected.
To mitigate this, we increased the standard deviation factor from the default value of 1.9 to 5. This reduced the frequency of false positives, but the issue persists. Increasing the standard deviation factor further is not a viable option for us, as it could lead to more false negatives.
In our environment, the issue occurs more frequently under the following conditions, which result in very low standard deviations:

  • Services with a large number of hosts/pods.
  • High RPS per pod (e.g., ~1k RPS per pod).
  • High availability requirements (e.g., four 9s SLA).
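
To illustrate why a very low standard deviation produces these ejections, here is a minimal sketch of the documented success rate threshold, `mean - (stdev * factor)`. The fleet numbers are hypothetical, and the use of a population standard deviation is an assumption about how the detector computes it.

```python
from statistics import mean, pstdev


def success_rate_ejection_threshold(success_rates, stdev_factor):
    """Return mean - stdev_factor * stdev for the given per-host success rates.

    Assumption: population standard deviation over all qualifying hosts.
    """
    return mean(success_rates) - stdev_factor * pstdev(success_rates)


# Hypothetical fleet: 99 hosts at 99.99% success and one host at 99.0%.
rates = [99.99] * 99 + [99.0]

# Factor of 5, i.e. the raised value described above.
threshold = success_rate_ejection_threshold(rates, stdev_factor=5.0)

# Hosts whose success rate falls below the threshold are ejection candidates.
ejected = [r for r in rates if r < threshold]

print(f"threshold={threshold:.4f} ejected={ejected}")
```

Because almost every host sits at the same high success rate, the standard deviation is tiny, so the threshold stays well above 99% and the 99.0% host is flagged even with a factor of 5.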

We believe the current standard deviation-based approach would be more useful for high-RPS, highly available microservices if it had an additional threshold: outlier ejection should occur only if a pod's success rate falls below a static, user-configurable threshold. In our case, we’d like to eject a host only when its success rate is lower than 90% AND it is X standard deviations below the average.
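
The combined condition could be sketched as follows. This is an illustration of the proposal only, not existing Envoy behavior; `should_eject` and `min_ejection_threshold` are hypothetical names, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def should_eject(host_rate, all_rates, stdev_factor, min_ejection_threshold):
    """Proposed rule: eject only if the host is a statistical outlier AND
    its success rate is below a static, user-configured floor."""
    statistical_threshold = mean(all_rates) - stdev_factor * pstdev(all_rates)
    is_outlier = host_rate < statistical_threshold
    below_floor = host_rate < min_ejection_threshold
    return is_outlier and below_floor


# Hypothetical fleets: one host slightly degraded, one host clearly bad.
mostly_healthy = [99.99] * 99 + [99.0]
one_bad = [99.99] * 99 + [85.0]

# 99.0% is a statistical outlier but stays above the 90% floor: not ejected.
print(should_eject(99.0, mostly_healthy, stdev_factor=5.0, min_ejection_threshold=90.0))

# 85.0% is both an outlier and below the floor: ejected.
print(should_eject(85.0, one_bad, stdev_factor=5.0, min_ejection_threshold=90.0))
```

With the static floor in place, the standard deviation factor no longer has to carry the whole burden of suppressing false positives.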
We'd like to get the community's thoughts on this. If we agree this is a reasonable ask, we'd be happy to contribute to upstream. Thanks!

Related issues: #18752


@MengyingLiDD MengyingLiDD added the enhancement and triage labels Dec 18, 2024
@adisuissa adisuissa added the area/outlier_detection label and removed the triage label Dec 18, 2024
@adisuissa
Contributor

cc @alyssawilk who may know who this request should be routed to

@alyssawilk
Contributor

cc @cpakulski, who is the active dev with the most history here (though it's been a while and may be swapped out =P), and @Pawan-Bishnoi, who has touched it most recently

@cpakulski
Contributor

@MengyingLiDD Would you mind giving an example of a config that includes the proposed new parameter?

Thanks.

@MengyingLiDD MengyingLiDD changed the title from "Outlier Detection Success Rate Ejection Threshold Setting" to "False Positives in Outlier Detection with Success Rate Mode" Dec 24, 2024
@MengyingLiDD
Author

@cpakulski Of course. I've also updated the description a bit. I want to have an extra parameter like success_rate_minimal_ejection_threshold (name subject to discussion).

Here's an example:

    outlier_detection:
      interval: 60s
      base_ejection_time: 120s
      max_ejection_percent: 10
      max_ejection_time: 300s
      max_ejection_time_jitter: 1s
      split_external_local_origin_errors: True
      enforcing_consecutive_gateway_failure: 0
      enforcing_consecutive_local_origin_failure: 0
      enforcing_consecutive_5xx: 0
      enforcing_success_rate: 0
      enforcing_local_origin_success_rate: 100
      enforcing_failure_percentage: 100
      enforcing_failure_percentage_local_origin: 0
      success_rate_minimum_hosts: 6
      success_rate_request_volume: 10
      success_rate_stdev_factor: 5000
      success_rate_minimal_ejection_threshold: 90 # proposed new parameter: in success rate mode, eject a host only when its success rate is below 90%
      failure_percentage_minimum_hosts: 6
      failure_percentage_request_volume: 10
      failure_percentage_threshold: 99
      consecutive_gateway_failure: 1000000
      consecutive_local_origin_failure: 100000
      consecutive_5xx: 100000
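
A note on units, since the description mentions a factor of 5 while the config shows 5000: Envoy's documentation states that `success_rate_stdev_factor` is divided by a thousand to obtain the actual factor, with a default of 1900. The short sketch below shows that conversion; `effective_stdev_factor` is a hypothetical helper name, and `success_rate_minimal_ejection_threshold` is the proposed field, not an existing one.

```python
# Raw values copied from the example config in this comment.
raw_config = {
    "success_rate_stdev_factor": 5000,
    # Proposed field from this issue; it does not exist in Envoy today.
    "success_rate_minimal_ejection_threshold": 90,
}


def effective_stdev_factor(raw_factor: int) -> float:
    """Convert the configured integer (in thousandths) to the real factor."""
    return raw_factor / 1000.0


factor = effective_stdev_factor(raw_config["success_rate_stdev_factor"])
default_factor = effective_stdev_factor(1900)

print(f"configured factor={factor}, default factor={default_factor}")
```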
      

@MengyingLiDD MengyingLiDD changed the title from "False Positives in Outlier Detection with Success Rate Mode" to "Outlier Detection Success Rate Ejection Threshold Setting" Dec 24, 2024
@MengyingLiDD
Author

@cpakulski any update?

@cpakulski
Contributor

@MengyingLiDD Thanks for providing the example. I need to go to your original description to understand the need for this parameter.
