Outlier Detection Success Rate Ejection Threshold Setting #37723
Comments
cc @alyssawilk who may know who this request should be routed to
cc @cpakulski is the active dev with the most history here (though it's been a while and may be swapped out =P) and @Pawan-Bishnoi has touched it most recently
@MengyingLiDD Would you mind giving an example of a config that includes the proposed new parameter? Thanks.
@cpakulski Of course. I've also updated the description a bit. I want to have an extra parameter like success_rate_minimal_ejection_threshold (name subject to discussion). Here's an example:
@cpakulski any update?
@MengyingLiDD Thanks for providing the example. I need to go to your original description to understand the need for this parameter.
Description:
We are currently using the failure percentage mode for externally generated errors, and the success rate mode for locally generated errors in our service mesh's default outlier detection policy to identify bad pods. For externally generated errors, we use 99% for the error rate threshold as the default policy, and it works very well. However, for locally generated errors, we have observed a significant number of false positives, where pods with success rates as high as 99% are being ejected.
To mitigate this, we increased the standard deviation factor from the default value of 1.9 to 5. While this reduced the frequency of false positives, the issue persists. Increasing the standard deviation factor further is not a viable option for us, as it could lead to more false negatives.
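To make the false-positive case concrete, here is a minimal sketch of the success rate ejection rule as we understand it: a host is ejected when its success rate falls below `mean - stdev_factor * stdev`. The host count and success rates are made-up illustration values, the use of the population standard deviation is an assumption, and Envoy's minimum-hosts and request-volume checks are ignored.

```python
from statistics import fmean, pstdev


def ejection_threshold(success_rates, stdev_factor):
    """Return the success rate below which a host would be ejected.

    Assumes threshold = mean - stdev_factor * stdev, using the population
    standard deviation over all hosts (an assumption for this sketch).
    """
    return fmean(success_rates) - stdev_factor * pstdev(success_rates)


# Hypothetical cluster: 49 hosts at 100% success and one host at 99%.
rates = [100.0] * 49 + [99.0]

# Even with the factor raised to 5, the very small spread between hosts
# puts the threshold above 99%, so the 99% host is still ejected.
threshold = ejection_threshold(rates, stdev_factor=5.0)
ejected = [r for r in rates if r < threshold]

print(f"threshold={threshold:.2f} ejected={ejected}")
```

With nearly identical hosts the standard deviation is tiny, so the threshold sits just below the mean regardless of how healthy the "outlier" is in absolute terms.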
In our environment, the issue occurs more frequently under the following conditions, which result in very low standard deviations:
We believe the current standard deviation-based approach could be more useful for high-RPS, highly available microservices if it had an additional threshold: outlier ejection should occur only if a pod's success rate also falls below a static, user-configurable value. In our case, we'd like to eject a host only when its success rate is lower than 90% AND it is X standard deviations below the average.
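The proposed rule can be sketched as follows. This is only an illustration, not an implementation against Envoy's code: `hosts_to_eject` is a hypothetical helper, `minimal_ejection_threshold` stands in for the parameter name floated in this thread, and the population standard deviation and sample success rates are assumptions.

```python
from statistics import fmean, pstdev


def hosts_to_eject(success_rates, stdev_factor, minimal_ejection_threshold):
    """Return indices of hosts that satisfy BOTH ejection conditions.

    A host is ejected only if its success rate is below
    mean - stdev_factor * stdev AND below the static floor.
    Requiring both is the same as comparing against the lower of the two.
    """
    stdev_threshold = fmean(success_rates) - stdev_factor * pstdev(success_rates)
    effective_threshold = min(stdev_threshold, minimal_ejection_threshold)
    return [i for i, rate in enumerate(success_rates) if rate < effective_threshold]


# Hypothetical cluster: a 99% host is a statistical outlier but stays in rotation
# because it is above the 90% floor.
healthy_enough = [100.0] * 49 + [99.0]
print(hosts_to_eject(healthy_enough, stdev_factor=5.0, minimal_ejection_threshold=90.0))

# A host at 80% is both a statistical outlier and below the floor, so it is ejected.
truly_bad = [100.0] * 49 + [80.0]
print(hosts_to_eject(truly_bad, stdev_factor=5.0, minimal_ejection_threshold=90.0))
```

Taking the minimum of the two thresholds keeps the existing behavior whenever the floor is set to 100, so the extra parameter could default to a no-op.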
We'd like to get the community's thoughts on this. If we agree this is a reasonable ask, we'd be happy to contribute to upstream. Thanks!
[optional Relevant Links:]
Related issues: #18752