Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide reason string in forced disconnects to allow Zeek to identify backpressure overflow, plus two fixes #436

Merged
merged 3 commits into from
Dec 6, 2024

Conversation

ckreibich
Copy link
Member

@Neverlord could you let me know if this looks okay to you? The reason-string support should hopefully be straightforward.

More interesting is ecb9aea, which I encountered while testing cluster recovery after forced disconnects: around half of the time the while-loop in connector::run_impl() ended up in a situation where it kept trying to read but failed, without realizing it had encountered an error. I'm simply adding a check of the requested-event poll mask, which seems to be resolving it.

I also noticed that Broker's master currently doesn't build with Zeek. I narrowed this down to a part of 4d9aa3b, which I'm simply reverting here. It doesn't seem to have any impact on clang-tidy, but let me know if I'm missing anything.

Thanks!

Copy link
Member

@Neverlord Neverlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits.

Comment on lines 67 to 69
std::string removed_reason() const noexcept {
return removed_reason_;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be noexcept because it can thow bad_alloc. However, there's no reason to copy the string in the first place:

Suggested change
std::string removed_reason() const noexcept {
return removed_reason_;
}
const std::string& removed_reason() const noexcept {
return removed_reason_;
}

Comment on lines 125 to 128
void peering::force_disconnect(const std::string& reason) {
if (!removed_) {
removed_ = true;
removed_reason_ = reason;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always forces copies, better:

Suggested change
void peering::force_disconnect(const std::string& reason) {
if (!removed_) {
removed_ = true;
removed_reason_ = reason;
void peering::force_disconnect(std::string reason) {
if (!removed_) {
removed_ = true;
removed_reason_ = std::move(reason);

Comment on lines 97 to 100
std::string msg{"removed connection to remote peer"};
std::string reason = ptr_->removed_reason();
if (!reason.empty())
msg += " (" + reason + ")";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a lot of unnecessary temporaries (also should use auto).

Suggested change
std::string msg{"removed connection to remote peer"};
std::string reason = ptr_->removed_reason();
if (!reason.empty())
msg += " (" + reason + ")";
auto msg = "removed connection to remote peer"s;
if (const auto& reason = ptr_->removed_reason(); !reason.empty()) {
msg += " (";
msg += reason;
msg += ')';
}

@@ -38,7 +38,7 @@ public:

/// Forces the peering to shut down its connection without performing the BYE
/// handshake.
void force_disconnect();
void force_disconnect(const std::string& reason = "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void force_disconnect(const std::string& reason = "");
void force_disconnect(std::string reason);

Not a fan of the empty default. There are two more places where we call force_disconnect that would currently use the default:

  • core_actor_state::finalize_shutdown: can pass something like "shutting down"
  • peering::schedule_bye_timeout: can pass something like "timeout during graceful disconnect"

I think differentiating those with dedicated reasons would make sense as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was just trying to keep this minimal. I like your suggestions, will do. Thanks!

@ckreibich ckreibich force-pushed the topic/christian/disconnect-slow-peers branch from ecb9aea to 41bf6ea Compare December 5, 2024 22:38
Copy link
Member

@Neverlord Neverlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one ill-placed const.

@@ -119,18 +126,20 @@ void peering::on_bye_ack() {
bye_timeout_.dispose();
}

void peering::force_disconnect() {
void peering::force_disconnect(const std::string reason) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void peering::force_disconnect(const std::string reason) {
void peering::force_disconnect(std::string reason) {

Adding const here turns the move below to a copy.

@@ -38,7 +38,7 @@ public:

/// Forces the peering to shut down its connection without performing the BYE
/// handshake.
void force_disconnect();
void force_disconnect(const std::string reason);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sink argument.

Suggested change
void force_disconnect(const std::string reason);
void force_disconnect(std::string reason);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, these were silly — thank you. (And, you're welcome to commit to my branches in such cases!)

@ckreibich ckreibich force-pushed the topic/christian/disconnect-slow-peers branch from 41bf6ea to 94e93c4 Compare December 6, 2024 18:37
This includes providing a reason for the new backpressure overflow,
so Zeek can pick up on them.
When connect_manager::continue_reading() encounters a read_result::stop, it
clears the read mask, indicating no more reads should happen, but this flag
isn't checked in connector::run_impl()'s while loop. This could lead to infinite
100% CPU utilization in a tight loop. We now simply consult the read mask in
addition to the other checks.
@ckreibich ckreibich force-pushed the topic/christian/disconnect-slow-peers branch from 94e93c4 to e2d1d41 Compare December 6, 2024 19:14
@ckreibich
Copy link
Member Author

Rebased once more just to get it to run with latest CI updates.

Copy link
Member

@timwoj timwoj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@ckreibich ckreibich merged commit 28cdb75 into master Dec 6, 2024
20 checks passed
@ckreibich ckreibich deleted the topic/christian/disconnect-slow-peers branch December 6, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants