
spv: delay reconnect attempts #2442

Closed
wants to merge 2 commits

Conversation

@norwnd norwnd commented Oct 15, 2024

Closes #2433.

I didn't test it, but it's simple enough that I hope it works.

@norwnd norwnd force-pushed the spv-delay-reconnect-attempts branch from 4e54ee9 to 7b6f069 on October 15, 2024 21:50
Comment on lines +483 to +485
// connectAndRunPeer returns backoff flag indicating whether it is advised to
// wait a bit before attempting to connect to this peer again.
func (s *Syncer) connectAndRunPeer(ctx context.Context, raddr string, persistent bool) (backoff bool) {
norwnd (Author)

Essentially, the backoff flag helps the caller decide whether to retry immediately or wait a bit (it seems there are cases for both).

Comment on lines +580 to +587
default:
}

backoff := s.connectAndRunPeer(ctx, raddr, true)
if backoff {
// Delay next connect attempt to save resources and not overwhelm peers,
// it's unlikely to succeed anyway if we do retry immediately.
time.Sleep(5 * time.Second)
Member
This was already sleeping 5 seconds. The change would make it spin more? Also blocking shutdown here. I don't think anything needs to change here.

@norwnd norwnd (Author) Oct 17, 2024

> This was already sleeping 5 seconds. The change would make it spin more?

I believe it was sleeping in a previous version of the code; it then got updated/rewritten, and I'm essentially adding that sleep back here.

> Also blocking shutdown here.

Shutdown on context cancellation? In that case, I believe connectAndRunPeer will return false for backoff (telling us there is no need to back off).

I'm using Sleep instead of time.After because I think we want a conditional, general-purpose backoff strategy here. Off the top of my head:

we want to back off when:

  • there is an error while establishing the connection

we don't want to back off when:

  • we were connected, but the connection broke at some point (meaning it won't hurt to retry the reconnect immediately, once)
  • the wallet shuts down (ctx is canceled)

That's the general idea. It might be overcomplicating things, and I might not have implemented it 100% correctly, but I don't see obvious mistakes with it; perhaps you can point them out.
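A minimal sketch of that classification, assuming a hypothetical shouldBackoff helper; the PR itself folds this logic into connectAndRunPeer's backoff return value:

package spvbackoff

import (
	"context"
	"errors"
)

// shouldBackoff is a hypothetical helper (not part of this PR) sketching the
// classification above: back off only when establishing the connection fails,
// not when an established connection later breaks or the wallet shuts down.
func shouldBackoff(ctx context.Context, connectErr error) bool {
	if ctx.Err() != nil || errors.Is(connectErr, context.Canceled) {
		// Shutdown in progress (or the dial was canceled); never sleep
		// on the way out.
		return false
	}
	// A non-nil error here means the dial/handshake failed, so wait before
	// retrying. A nil error means the peer ran and disconnected later, and
	// an immediate single retry is cheap.
	return connectErr != nil
}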

Member
These are persistent peers, which I don't think many people set. I think it can stay simple. It won't spin up the CPU with the sleep that was already there.

sleep and what was already there do the same thing, except one does not block shutdown. Not blocking shutdown is the correct thing to do here.

@norwnd norwnd (Author) Oct 18, 2024

> Not blocking shutdown is the correct thing to do here.

I don't disagree with that. I've adjusted the way backoff is returned from connectAndRunPeer, and now I believe it will never sleep when ctx is canceled (which is what happens on shutdown, for example).

> sleep and what was already there do the same thing, except one does not block shutdown.

We could just ignore the backoff value in connectToPersistent and wait with time.After (like it was before), but why not use the backoff flag with sleep (assuming I've implemented it correctly), to stay uniform with how it is now done in connectToCandidates?

We could also remove backoff entirely and do it all with an unconditional time.After, like in the past. But that way we'll sleep for no reason when switching from one peer to another in connectToCandidates, for example.

Comment on lines +621 to +626
backoff := s.connectAndRunPeer(ctx, raddr, false)
if backoff {
// Delay next connect attempt to save resources and not overwhelm peers,
// it's unlikely to succeed anyway if we do retry immediately.
time.Sleep(5 * time.Second)
}
@JoeGruffins JoeGruffins (Member) Oct 17, 2024

This is blocking. If you want to sleep, you need to use a pattern like the one above:

select {
case <-ctx.Done():
	return
case <-time.After(5 * time.Second):
}

I think a better way to solve this is to detect that we are not connected and pause this loop until we think we are connected. Unsure how to go about that, though. Maybe a combination of sniffing out the "network is unreachable" error and sleeps is the easiest (which is basically this), unless there's a better way to know whether we are connected to the network.

Member
Yeah, maybe what you are doing is best. If you could try to sniff out that particular error and change the sleeps, I think this is OK. Of course, jrick may have better ideas.

@norwnd norwnd (Author) Oct 17, 2024

> If you could try to sniff out that particular error and change the sleeps, I think this is OK

Yeah, essentially it would be best to classify all the errors that can happen in connectAndRunPeer into two groups (backoff == true/false), as I expanded on above.

@davecgh davecgh (Member) Oct 19, 2024

It seems to me like this should mirror the backoff functionality for permanent (persistent) peers in dcrd, which is an actual backoff versus a static retry every 5 seconds.

The code structure is obviously a bit different here, but the key point is that it keeps a retry count and increases the backoff timeout by the configured number of seconds (5 seconds here) times the retry count, clamped to a maximum so it doesn't grow indefinitely (that max is 5 minutes in dcrd).

In other words, it'll sleep 5 seconds, then 10 seconds, then 15 seconds, and continue increasing by 5 seconds all the way to 5 minutes.
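A minimal sketch of that linear scheme; the constant and function names here are illustrative, not dcrd's actual identifiers:

package spvbackoff

import "time"

const (
	retryInterval = 5 * time.Second // base delay added per failed attempt
	maxRetryDelay = 5 * time.Minute // clamp so the delay never grows past this
)

// persistentRetryDelay returns the wait before the next connection attempt to
// a persistent peer that has already failed retryCount times:
// 5s, 10s, 15s, ... capped at 5m.
func persistentRetryDelay(retryCount int) time.Duration {
	d := time.Duration(retryCount) * retryInterval
	if d > maxRetryDelay {
		d = maxRetryDelay
	}
	return d
}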

Member

Another thing that dcrd does is keep a global failed-attempts counter for non-persistent peers that is reset to 0 any time there is a successful connection. Whenever that counter reaches or exceeds a maximum number of failed attempts (dcrd uses 25), it imposes a static delay equal to the retry duration (5 seconds) on all non-persistent connection attempts too.

That ensures when the network goes down, it doesn't go nuts trying to connect thousands of times per second due to the connection attempts immediately failing.
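A rough sketch of such a global counter; all names and the throttling shape are assumptions for illustration, not dcrd's actual code:

package spvbackoff

import (
	"sync/atomic"
	"time"
)

const (
	maxFailedAttempts = 25              // threshold before throttling kicks in
	retryDuration     = 5 * time.Second // static delay once throttled
)

// failedAttempts counts consecutive failed non-persistent connection
// attempts; it is reset to zero on any successful connection.
var failedAttempts atomic.Uint64

func recordConnResult(err error) {
	if err != nil {
		failedAttempts.Add(1)
		return
	}
	failedAttempts.Store(0)
}

// throttleDelay returns the delay to impose before the next non-persistent
// connection attempt: zero until the failure threshold is reached.
func throttleDelay() time.Duration {
	if failedAttempts.Load() >= maxFailedAttempts {
		return retryDuration
	}
	return 0
}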

norwnd (Author)

> In other words, it'll sleep 5 seconds, then 10 seconds, then 15 seconds, and continue increasing by 5 seconds all the way to 5 minutes.

Yep, essentially this protects us from overwhelming our peer(s), so that they don't ban us (or simply end up with too many connections/requests, which slows them down)?

And yes, if the timeouts keep increasing, at some point we want to reset them (or something similar) so we don't starve ourselves of connections for too long.

I certainly can implement that, but it's probably better to get some feedback on this from @jrick first, to see whether I implemented the backoff return value (from connectAndRunPeer) correctly.

jrick (Member) commented Jan 27, 2025

Let me know if #2465 fixes this issue; I suspect that is how we will want to approach this.

jrick (Member) commented Jan 27, 2025

I think #2465 will turn out to be a partial fix to this problem. If all peers are lost, the syncer will be destroyed and it will wait 5s in the main package. But if at least one peer remains connected, it may spin on connecting to other non-permanent peers if there are few enough of them that 8 valid peers can't be found and they are not accepting connections.

Keeping a map of peer addresses to an incrementing backoff timeout is how we should deal with that remaining case.
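A hypothetical sketch of such a per-address map; the step and cap values are placeholders, not a decided policy:

package spvbackoff

import (
	"sync"
	"time"
)

const (
	backoffStep = 5 * time.Second  // added per consecutive failure
	backoffMax  = 90 * time.Second // upper bound on the per-address delay
)

// addrBackoff tracks an increasing reconnect delay per peer address.
type addrBackoff struct {
	mu     sync.Mutex
	delays map[string]time.Duration
}

func newAddrBackoff() *addrBackoff {
	return &addrBackoff{delays: make(map[string]time.Duration)}
}

// next returns how long to wait before redialing raddr and bumps the stored
// delay for the following failure.
func (b *addrBackoff) next(raddr string) time.Duration {
	b.mu.Lock()
	defer b.mu.Unlock()
	d := b.delays[raddr]
	if nd := d + backoffStep; nd <= backoffMax {
		b.delays[raddr] = nd
	} else {
		b.delays[raddr] = backoffMax
	}
	return d
}

// reset clears the backoff for raddr after a successful connection.
func (b *addrBackoff) reset(raddr string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.delays, raddr)
}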

jrick (Member) commented Jan 27, 2025

An exponential increase to the backoff timeout, capped at 1m30s, looks appropriate: https://go.dev/play/p/bXfC0dS-j6A
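One possible shape of that curve, sketched with an assumed 5-second base and doubling factor (not necessarily what the linked playground uses):

package spvbackoff

import "time"

// expBackoff returns an exponentially increasing delay for the nth failed
// attempt, capped at 1m30s. This only illustrates the shape described above;
// it is not the code behind the playground link.
func expBackoff(attempt int) time.Duration {
	const (
		base     = 5 * time.Second
		maxDelay = 90 * time.Second
	)
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d // 5s, 10s, 20s, 40s, 1m20s, 1m30s, 1m30s, ...
}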

jrick (Member) commented Jan 27, 2025

Or an exponentially decaying increase capped to 1m30s: https://go.dev/play/p/sL-APIY_rSf
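And one possible reading of a decaying increase, where each retry adds half the remaining distance to the 1m30s cap (again a sketch, not the playground code):

package spvbackoff

import "time"

// decayedBackoff sketches one reading of an "exponentially decaying
// increase": each retry adds half of the remaining distance to the 1m30s
// cap, so the delay climbs quickly at first and then levels off just under
// the cap. Illustrative only.
func decayedBackoff(prev time.Duration) time.Duration {
	const maxDelay = 90 * time.Second
	next := prev + (maxDelay-prev)/2 // 45s, 1m7.5s, 1m18.75s, ...
	if next > maxDelay {
		next = maxDelay
	}
	return next
}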

jrick (Member) commented Jan 27, 2025

Please also test #2466 which addresses the backoff/spinning issue for candidate peers.

norwnd (Author) commented Jan 27, 2025

> Please also test #2466 which addresses the backoff/spinning issue for candidate peers.

I don't really have a separate setup for "testing" dcrwallet, and I'm not sure whether it would be trivial to pull this change directly into Bison Wallet (which is what I typically test with), so it's probably better to wait for the next Bison Wallet release with the upgraded dcrwallet for me to tell whether #2433 is gone or not.

Appreciate you looking into it; closing this PR in favor of the ones you've opened.

@norwnd norwnd closed this Jan 27, 2025
Successfully merging this pull request may close these issues.

logs: spamming of "SPV: Peering attempt failed"