Change handling of graceful case of LoadStatsReporting onRemoteClose #37076

Merged (7 commits) on Nov 20, 2024.
changelogs/current.yaml (3 additions, 0 deletions)

@@ -71,6 +71,9 @@ minor_behavior_changes:

bug_fixes:
# *Changes expected to improve the state of the world and are unlikely to have negative effects*
- area: lrs
change: |
Fixes errors stat being incremented and warning log spamming for LoadStatsReporting graceful stream close.
- area: tls
change: |
Support operations on IP SANs when the IP version is not supported by the host operating system, for example
Expand Down
source/common/upstream/load_stats_reporter.cc (14 additions, 5 deletions)

@@ -20,12 +20,16 @@ LoadStatsReporter::LoadStatsReporter(const LocalInfo::LocalInfo& local_info,
time_source_(dispatcher.timeSource()) {
request_.mutable_node()->MergeFrom(local_info.node());
request_.mutable_node()->add_client_features("envoy.lrs.supports_send_all_clusters");
retry_timer_ = dispatcher.createTimer([this]() -> void { establishNewStream(); });
retry_timer_ = dispatcher.createTimer([this]() -> void {
stats_.retries_.inc();
establishNewStream();
});
response_timer_ = dispatcher.createTimer([this]() -> void { sendLoadStatsRequest(); });
establishNewStream();
}

void LoadStatsReporter::setRetryTimer() {
ENVOY_LOG(info, "Load reporter stats stream/connection will retry in {} ms.", RETRY_DELAY_MS);
Contributor:
Is there a counter for retry attempts? Otherwise this change will cause a loss of visibility.

Contributor Author (@briansonnenberg), Nov 12, 2024:
Currently, the "errors" stat tracks all instances of onRemoteClose (which is the caller of retries), regardless of graceful or not. Do you think we should add a separate stat for graceful closes specifically, or just continue including the graceful close in the errors stat? I suppose from a client perspective, a graceful close could be considered an error. I do think the warning log level should be reduced regardless.

Contributor:
I think it may be good to add a stat for establishing a new stream. Otherwise it would be difficult to diagnose the situation where the server keeps closing new streams for some reason.

retry_timer_->enableTimer(std::chrono::milliseconds(RETRY_DELAY_MS));
}

@@ -150,8 +154,6 @@ void LoadStatsReporter::sendLoadStatsRequest() {
}

void LoadStatsReporter::handleFailure() {
ENVOY_LOG(warn, "Load reporter stats stream/connection failure, will retry in {} ms.",
RETRY_DELAY_MS);
stats_.errors_.inc();
setRetryTimer();
}
@@ -243,10 +245,17 @@ void LoadStatsReporter::onReceiveTrailingMetadata(Http::ResponseTrailerMapPtr&&
}

void LoadStatsReporter::onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) {
ENVOY_LOG(warn, "{} gRPC config stream closed: {}, {}", service_method_.name(), status, message);
response_timer_->disableTimer();
stream_ = nullptr;
handleFailure();
if (status != Grpc::Status::WellKnownGrpcStatus::Ok) {
ENVOY_LOG(warn, "{} gRPC config stream closed: {}, {}", service_method_.name(), status,
message);
handleFailure();
} else {
ENVOY_LOG(debug, "{} gRPC config stream closed gracefully, {}", service_method_.name(),
message);
setRetryTimer();
}
}

} // namespace Upstream
source/common/upstream/load_stats_reporter.h (3 additions, 1 deletion)

@@ -19,7 +19,8 @@ namespace Upstream {
#define ALL_LOAD_REPORTER_STATS(COUNTER) \
COUNTER(requests) \
COUNTER(responses) \
COUNTER(errors)
COUNTER(errors) \
COUNTER(retries)

/**
* Struct definition for all load reporter stats. @see stats_macros.h
@@ -43,6 +44,7 @@ class LoadStatsReporter
std::unique_ptr<envoy::service::load_stats::v3::LoadStatsResponse>&& message) override;
void onReceiveTrailingMetadata(Http::ResponseTrailerMapPtr&& metadata) override;
void onRemoteClose(Grpc::Status::GrpcStatus status, const std::string& message) override;
const LoadReporterStats& getStats() { return stats_; };

// TODO(htuch): Make this configurable or some static.
const uint32_t RETRY_DELAY_MS = 5000;
test/common/upstream/load_stats_reporter_test.cc (16 additions, 0 deletions)

@@ -369,8 +369,24 @@ TEST_F(LoadStatsReporterTest, RemoteStreamClose) {
EXPECT_CALL(*async_client_, startRaw(_, _, _, _)).WillOnce(Return(&async_stream_));
expectSendMessage({});
retry_timer_cb_();
EXPECT_EQ(load_stats_reporter_->getStats().errors_.value(), 1);
EXPECT_EQ(load_stats_reporter_->getStats().retries_.value(), 1);
}

// Validate that errors stat is not incremented for a graceful stream termination.
TEST_F(LoadStatsReporterTest, RemoteStreamGracefulClose) {
EXPECT_CALL(*async_client_, startRaw(_, _, _, _)).WillOnce(Return(&async_stream_));
expectSendMessage({});
createLoadStatsReporter();
EXPECT_CALL(*response_timer_, disableTimer());
EXPECT_CALL(*retry_timer_, enableTimer(_, _));
load_stats_reporter_->onRemoteClose(Grpc::Status::WellKnownGrpcStatus::Ok, "");
EXPECT_CALL(*async_client_, startRaw(_, _, _, _)).WillOnce(Return(&async_stream_));
expectSendMessage({});
retry_timer_cb_();
EXPECT_EQ(load_stats_reporter_->getStats().errors_.value(), 0);
EXPECT_EQ(load_stats_reporter_->getStats().retries_.value(), 1);
}
} // namespace
} // namespace Upstream
} // namespace Envoy