fixed race condition in tests assuming TEST_EVENT_OBSERVER_SKIP_RETRY… #5669

rdeioris · 2025-01-08T12:53:51Z

Description

This patch fixes a race condition in the event_dispatcher tests, specifically:

event_dispatcher::test::test_process_pending_payloads
event_dispatcher::test::test_send_payload_timeout
event_dispatcher::test::test_send_payload_with_db

The TEST_EVENT_OBSERVER_SKIP_RETRY global mutex is used to enable or disable the retry system for events (which stores them in the db so that sending them to the observer can be retried later). The race shows up when these tests are executed in parallel.

This global is only used in the tests and is assumed to be off. The problem affects tests that run while another test has set TEST_EVENT_OBSERVER_SKIP_RETRY to true and that do not lock it before reading.

The patch simply enforces locking (where missing) before sending the payload in those tests.
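
A minimal sketch of the hazard, not the actual stacks-core code: `SKIP_RETRY`, `would_store_for_retry` and the two test bodies below are hypothetical stand-ins. It shows how a test that assumes the flag is off depends on whatever another test left behind, while a test that locks and sets the flag first does not.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the test-only global discussed above.
static SKIP_RETRY: Mutex<bool> = Mutex::new(false);

/// Hypothetical stand-in for the code under test: returns true when a failed
/// payload would be stored for a later retry.
fn would_store_for_retry() -> bool {
    // The flag is locked only for the duration of this read.
    let skip = *SKIP_RETRY.lock().unwrap();
    !skip
}

/// A test body that assumes the flag is off without touching it.
fn test_assuming_default() -> bool {
    would_store_for_retry()
}

/// A test body that locks the flag and sets it explicitly before the call.
fn test_setting_flag(skip: bool) -> bool {
    // The guard is a temporary, so it is dropped at the end of this statement.
    *SKIP_RETRY.lock().unwrap() = skip;
    would_store_for_retry()
}
```

Setting the flag explicitly removes the dependency on leftover state, but in this sketch the guard is released before the call, so a concurrently running test could still flip the flag in between.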

Applicable issues

Additional info (benefits, drawbacks, caveats)

This was originally part (in a more complex form) of #5570.

Checklist

Test coverage for new or modified code paths
Changelog is updated
Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
New clarity functions have corresponding PR in clarity-benchmarking repo
New integration test(s) added to bitcoin-tests.yml

obycode · 2025-01-10T21:52:59Z

Ah, interesting. I hadn't considered this. Doesn't this change still have the potential for flakiness though? Do we need to run these tests with --test-threads=1 instead?

jbencin

So these tests are running in different threads in the same memory space?

If that's the case, I agree with Brice: doing this could just cause flakiness in the test that sets TEST_EVENT_OBSERVER_SKIP_RETRY to true.

Seems like you'd need a structure like this:

static TEST_EVENT_OBSERVER_SKIP_RETRY: std::sync::Mutex<HashMap<TestId, bool>> = std::sync::Mutex::new(HashMap::new());

To keep track of which test set the variable, so none of them interfere with each other.
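
A sketch of how such a per-test table might look on stable Rust, under a few assumptions: the key is a plain `String` chosen by the caller (a placeholder for a test identifier), and the names `SKIP_RETRY_BY_TEST`, `set_skip_retry` and `skip_retry` are hypothetical. Since `HashMap::new()` is not a `const fn`, it cannot initialize a static directly, so the map is wrapped in `LazyLock` (Rust 1.80 or later).

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Hypothetical per-test flag table, built lazily on first access.
static SKIP_RETRY_BY_TEST: LazyLock<Mutex<HashMap<String, bool>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Set the flag for one test, identified by a caller-chosen key.
fn set_skip_retry(test_key: &str, skip: bool) {
    SKIP_RETRY_BY_TEST
        .lock()
        .unwrap()
        .insert(test_key.to_string(), skip);
}

/// Read the flag for one test; tests that never set it get `false`.
fn skip_retry(test_key: &str) -> bool {
    SKIP_RETRY_BY_TEST
        .lock()
        .unwrap()
        .get(test_key)
        .copied()
        .unwrap_or(false)
}
```

The code under test would still need a way to learn which key applies to the current call, which this sketch leaves open.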

rdeioris · 2025-01-12T17:33:11Z

> So these tests are running in different threads in the same memory space?
>
> If that's the case, I agree with Brice: doing this could just cause flakiness in the test that sets TEST_EVENT_OBSERVER_SKIP_RETRY to true.
>
> Seems like you'd need a structure like this:
>
> static TEST_EVENT_OBSERVER_SKIP_RETRY: std::sync::Mutex<HashMap<TestId, bool>> = std::sync::Mutex::new(HashMap::new());
>
> To keep track of which test set the variable, so none of them interfere with each other.

The point of the patch is to simplify the previous attempt in #5570, where I used thread locals. Using a HashMap for this seems overkill, IMHO. As this pattern (having a global lazy static for hijacking test-specific behaviours) is pretty common in the codebase, maybe we should agree on a "blessed" approach for it (and probably the elephant in the room is that we should avoid it, to reduce the amount of test-only code that diverges from the base codepath).

rdeioris · 2025-01-12T17:37:03Z

> Ah, interesting. I hadn't considered this. Doesn't this change still have the potential for flakiness though? Do we need to run these tests with --test-threads=1 instead?

Actually, there are very few parts of the code where this specific logic applies, and they are mostly test-specific code. Before the patch I used to run them single-threaded to make them pass, but I think it is worth supporting the default Rust test behaviour.

rdeioris · 2025-01-22T17:35:53Z

@jbencin @obycode given that we now have LazyLock<TestFlag<T>> used in various areas, I have converted the patch to use it

obycode

I think that this change to use TestFlag is good, but I still don't think this really solves the problem that you described.

rdeioris · 2025-01-22T18:36:43Z

> I think that this change to use TestFlag is good, but I still don't think this really solves the problem that you described.

Well, the tests are now serialized, so the problem is basically masked (with each test resetting the state). If we want to fix it in a "more elegant" way we need to change the whole approach, but given that this could potentially help other areas of the code, maybe it is worth discussing a common solution in the naka meetings?

obycode · 2025-01-22T19:10:44Z

> Well, the tests are now serialized, so the problem is basically masked (with each test resetting the state). If we want to fix it in a "more elegant" way we need to change the whole approach, but given that this could potentially help other areas of the code, maybe it is worth discussing a common solution in the naka meetings?

How are they serialized?

rdeioris · 2025-01-22T19:51:00Z

@obycode I took a closer look at the LazyLock<TestFlag<T>> implementation and it definitely looks like it will not help with the issue. Given that TEST_EVENT_OBSERVER_SKIP_RETRY is only used here, and with a simple pattern, would it be preferable to just get rid of it and add an argument to the send/process payload functions? That way every test gets its own behaviour.
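
As a rough sketch of that idea (the `Outcome` enum, the function name and both parameters are hypothetical, and the real send/process functions certainly look different), the retry behaviour becomes an argument chosen by each caller instead of a value read from shared state:

```rust
/// Outcome of a delivery attempt in this sketch.
#[derive(Debug, PartialEq)]
enum Outcome {
    Delivered,
    StoredForRetry,
    Dropped,
}

/// Hypothetical stand-in for a send function: the retry behaviour is chosen
/// by the caller through `skip_retry` instead of being read from a global.
fn send_payload(delivery_ok: bool, skip_retry: bool) -> Outcome {
    if delivery_ok {
        Outcome::Delivered
    } else if skip_retry {
        // The caller asked for no retry, so the failed payload is not kept.
        Outcome::Dropped
    } else {
        Outcome::StoredForRetry
    }
}
```

Two tests calling `send_payload(false, true)` and `send_payload(false, false)` in parallel cannot affect each other, because no state is shared between the calls.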

obycode · 2025-01-22T20:34:56Z

Good point. Looks like something like that could be sufficient here. Just make sure that it cannot be enabled when not in test mode. Thanks!

rdeioris · 2025-01-23T09:47:44Z

Ok, back to square 1 :( Although adding test-only parameters was pretty easy, the code became a mess (basically all of the functions get filled with #[cfg(test)] everywhere).

Thread locals (my initial approach) are flawed again, as stacks-core has various threads in place, each of which will end up with its own copy of the variable.
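
The thread-local limitation can be reproduced with the standard library alone. In this sketch (the `SKIP_RETRY` name and the helper function are hypothetical), a value set on the calling thread is not seen by a thread it spawns:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Hypothetical per-thread flag; every thread starts from `false`.
    static SKIP_RETRY: Cell<bool> = Cell::new(false);
}

/// Sets the flag on the calling thread, then reads it from both the calling
/// thread and a freshly spawned one. Returns `(caller_value, child_value)`.
fn flag_seen_by_caller_and_child() -> (bool, bool) {
    SKIP_RETRY.with(|flag| flag.set(true));
    let caller_value = SKIP_RETRY.with(|flag| flag.get());
    // The child thread gets its own, freshly initialized copy of the flag.
    let child_value = thread::spawn(|| SKIP_RETRY.with(|flag| flag.get()))
        .join()
        .unwrap();
    (caller_value, child_value)
}
```

The caller observes `true` while the spawned thread still observes the initial `false`.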

The @jbencin approach definitely looks like the saner one (using a HashMap with the TestId). Unfortunately TestId is an experimental API (https://doc.rust-lang.org/test/struct.TestId.html), so we need some other "key" (actually, I am not even sure that TestId is copied to child threads).

The Rust test runner sets the name of the thread to the test name, so we could possibly use std::thread::current().name() as the key of the HashMap. The problem, again, is that we have various threads running in stacks-core and their names will not match the test one.
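
A small sketch of the naming problem, using only the standard library. The outer thread is named explicitly here to imitate what the test runner does, and `names_seen` is a hypothetical helper; a worker spawned from it without an explicit name reports no name at all:

```rust
use std::thread;

/// Runs a closure on a thread named like a test, and reports the thread name
/// seen by that thread and by a worker it spawns.
fn names_seen(test_name: &str) -> (Option<String>, Option<String>) {
    thread::Builder::new()
        .name(test_name.to_string())
        .spawn(|| {
            let own = thread::current().name().map(str::to_owned);
            // A worker spawned without an explicit name does not inherit one.
            let worker = thread::spawn(|| thread::current().name().map(str::to_owned))
                .join()
                .unwrap();
            (own, worker)
        })
        .unwrap()
        .join()
        .unwrap()
}
```

A lookup keyed on the thread name would therefore succeed on the test thread and fail on any worker thread started by the code under test.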

The only thing I can think of is to truly serialize those kinds of tests by having a global lock that we acquire at the beginning of each "critical test" and release at the end. Any thoughts/ideas?
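
One way such a global lock might look, as a sketch with hypothetical names (`CRITICAL_TEST_LOCK`, `serialize_test`, `critical_test`) and a stand-in flag. Each critical test holds the guard for its whole body, so the flag cannot be changed by another critical test in the meantime:

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical lock shared by every test that touches the global flag.
static CRITICAL_TEST_LOCK: Mutex<()> = Mutex::new(());
// Hypothetical stand-in for the test-only global flag.
static SKIP_RETRY: Mutex<bool> = Mutex::new(false);

/// Acquire the serialization lock. A test that panicked while holding it
/// poisons the mutex; the guard is recovered so later tests still run.
fn serialize_test() -> MutexGuard<'static, ()> {
    CRITICAL_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Shape of a "critical" test body: take the lock, set the flag, run,
/// and reset the flag before the guard is dropped on return.
fn critical_test(skip: bool) -> bool {
    let _guard = serialize_test();
    *SKIP_RETRY.lock().unwrap() = skip;
    let observed = *SKIP_RETRY.lock().unwrap();
    *SKIP_RETRY.lock().unwrap() = false;
    observed
}
```

This only protects tests that remember to call `serialize_test()`; a test that touches the flag without taking the lock can still interfere.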

aldur · 2025-01-23T11:25:02Z

If all we care about is test serialization (possibly for a subset of tests) we could also give nextest a shot.

sBTC uses it. This issue proposed it.

obycode · 2025-01-23T14:50:10Z

We also could use something like https://crates.io/crates/serial_test to specify individual tests that need to be serialized.

rdeioris · 2025-01-24T07:09:06Z

Updated the patch with serial_test. It worked well. I have kept the LazyLock<TestFlag<T>> usage for consistency with the rest of the codebase.

testnet/stacks-node/Cargo.toml

obycode

👍

jbencin

LazyLock<TestFlag<T>> is really overkill here. It's a Mutex inside an Arc inside a LazyLock (which is a spinlock I think). You can have a mutable thread-safe variable with just a static mutex.
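
To make the comparison concrete, here is a sketch of both shapes. `LayeredFlag` is a hypothetical wrapper written only from the description in this comment, not the real `TestFlag` definition. The plain form relies on `Mutex::new` being a `const fn` (Rust 1.63 or later), so it can initialize a static directly.

```rust
use std::sync::{Arc, LazyLock, Mutex};

// Layered form as described in the comment above; this wrapper is a
// hypothetical stand-in, not the real `TestFlag` definition.
struct LayeredFlag<T>(Arc<Mutex<T>>);

static LAYERED: LazyLock<LayeredFlag<bool>> =
    LazyLock::new(|| LayeredFlag(Arc::new(Mutex::new(false))));

// Plain form: a static mutex with no lazy initialization and no `Arc`.
static PLAIN: Mutex<bool> = Mutex::new(false);

/// Write the same value through both forms.
fn set_both(value: bool) {
    *LAYERED.0.lock().unwrap() = value;
    *PLAIN.lock().unwrap() = value;
}

/// Read the current value from both forms.
fn read_both() -> (bool, bool) {
    (*LAYERED.0.lock().unwrap(), *PLAIN.lock().unwrap())
}
```

Both forms give a mutable, thread-safe global; the plain one reaches it with fewer layers.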

I know we use this pattern a lot and it should be refactored in a separate PR, so I'm going to approve this one.

blockstack-devops · 2025-02-04T00:17:33Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

fixed race condition in tests assuming TEST_EVENT_OBSERVER_SKIP_RETRY…

154ffba

… is disabled

rdeioris requested a review from a team as a code owner January 8, 2025 12:53

stacks-fmt

e8f003a

aldur requested review from obycode and jbencin January 10, 2025 15:44

jbencin reviewed Jan 10, 2025

rdeioris added 2 commits January 22, 2025 18:31

use LazyLock + TestFlag

4585774

Merge branch 'develop' into fix/event_observer_skip_retry_tests

333f943

obycode reviewed Jan 22, 2025

added serial_test to dependancies, use test serialization for event_d…

fe244d9

…ispatcher tests

rdeioris requested a review from a team as a code owner January 24, 2025 07:07

fixed formatting

0aae8cb

obycode reviewed Jan 24, 2025

testnet/stacks-node/Cargo.toml (outdated, resolved)

moved serial_test to dev_dependencies

64733db

obycode approved these changes Jan 24, 2025

jbencin approved these changes Jan 24, 2025

obycode added this pull request to the merge queue Jan 27, 2025

Merged via the queue into develop with commit dd635f0 Jan 27, 2025
178 of 181 checks passed

obycode deleted the fix/event_observer_skip_retry_tests branch January 27, 2025 21:56

blockstack-devops added the locked label Feb 4, 2025

stacks-network locked as resolved and limited conversation to collaborators Feb 4, 2025
