Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change handling of graceful case of LoadStatsReporting onRemoteClose #37076

Merged
merged 7 commits into from
Nov 20, 2024

Conversation

briansonnenberg
Copy link
Contributor

Reopening this PR

Currently, we treat all remote grpc stream closes as errors, and log warnings and increment failure metrics for every instance. A remote grpc stream close with a 0 code is not a failure, but instead a graceful termination, so it should be a lower log level and should not increment failure metrics.

Commit Message:
Additional Description:
Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:

…tsReporting, and reduce retry rate.

Signed-off-by: Brian Sonnenberg <[email protected]>
Signed-off-by: Brian Sonnenberg <[email protected]>
Signed-off-by: Brian Sonnenberg <[email protected]>
Signed-off-by: Brian Sonnenberg <[email protected]>
Signed-off-by: Brian Sonnenberg <[email protected]>
Copy link
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait-any

@@ -26,6 +26,7 @@ LoadStatsReporter::LoadStatsReporter(const LocalInfo::LocalInfo& local_info,
}

void LoadStatsReporter::setRetryTimer() {
ENVOY_LOG(info, "Load reporter stats stream/connection will retry in {} ms.", RETRY_DELAY_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a counter for retry attempts? Otherwise this change will cause a loss of visibility.

Copy link
Contributor Author

@briansonnenberg briansonnenberg Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the "errors" stat tracks all instances of onRemoteClose (which is the caller of retries), regardless of graceful or not. Do you think we should add a separate stat for graceful closes specifically, or just continue including the graceful close in the errors stat? I suppose from a client perspective, a graceful close could be considered an error. I do think the warning log level should be reduced regardless.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be good to add a stat for establishing a new stream. Otherwise it would be difficult to diagnose the situation where the server keeps closing new streams for some reason.

Copy link
Contributor

@yanavlasov yanavlasov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait-any

@@ -26,6 +26,7 @@ LoadStatsReporter::LoadStatsReporter(const LocalInfo::LocalInfo& local_info,
}

void LoadStatsReporter::setRetryTimer() {
ENVOY_LOG(info, "Load reporter stats stream/connection will retry in {} ms.", RETRY_DELAY_MS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be good to add a stat for establishing a new stream. Otherwise it would be difficult to diagnose the situation where the server keeps closing new streams for some reason.

Signed-off-by: Brian Sonnenberg <[email protected]>
yanavlasov
yanavlasov previously approved these changes Nov 19, 2024
@yanavlasov yanavlasov enabled auto-merge (squash) November 19, 2024 01:35
@yanavlasov
Copy link
Contributor

@briansonnenberg please fix format errors. https://github.com/envoyproxy/envoy/actions/runs/11900732486/job/33162213299#step:16:598

/wait

Signed-off-by: Brian Sonnenberg <[email protected]>
auto-merge was automatically disabled November 19, 2024 18:47

Head branch was pushed to by a user without write access

@briansonnenberg
Copy link
Contributor Author

Looks like some unrelated test is failing the CI?

IpVersionsClientType/NetworkExtensionDiscoveryIntegrationTest

https://github.com/envoyproxy/envoy/actions/runs/11919711638/job/33220044570

@yanavlasov yanavlasov enabled auto-merge (squash) November 20, 2024 19:55
@yanavlasov yanavlasov merged commit 4e8706d into envoyproxy:main Nov 20, 2024
26 of 28 checks passed
@briansonnenberg briansonnenberg deleted the brian_log_fix branch November 20, 2024 23:04
@adisuissa
Copy link
Contributor

Fixes #34305.

@ImHackedBy-OctoRepo-HELP

This comment was marked as abuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants