fixed race condition in tests assuming TEST_EVENT_OBSERVER_SKIP_RETRY… #5669

Merged
merged 7 commits into develop from fix/event_observer_skip_retry_tests
Jan 27, 2025

Conversation

@rdeioris rdeioris commented Jan 8, 2025

Description

This patch fixes a race condition in the event_dispatcher tests, specifically:

event_dispatcher::test::test_process_pending_payloads
event_dispatcher::test::test_send_payload_timeout
event_dispatcher::test::test_send_payload_with_db

The TEST_EVENT_OBSERVER_SKIP_RETRY global mutex enables or disables the retry system for events, which stores them in the db so that sending them to the observer can be retried later. The race appears when these tests are executed in parallel.

This global is used only in tests and is assumed to be off. The problem affects tests that run while another test has set TEST_EVENT_OBSERVER_SKIP_RETRY to true and that do not lock it before reading.

The patch simply enforces locking (where missing) before sending the payload in those tests.
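The locking described above can be sketched as follows. This is a minimal illustration, not the stacks-core code: `SKIP_RETRY`, `send_payload` and `test_body_expecting_retry` are hypothetical names, and a plain `Mutex<bool>` stands in for the real global. The test takes the lock first and keeps the guard while sending, so no other test can flip the flag in between.

```rust
use std::sync::{Mutex, MutexGuard};

// Hypothetical stand-in for the test-only global; `false` means retries are enabled.
static SKIP_RETRY: Mutex<bool> = Mutex::new(false);

// Hypothetical sender: it receives the already-held guard, so the flag it
// reads cannot be changed by another test while the call is in progress.
fn send_payload(flag: &MutexGuard<'_, bool>, payload: &str) -> String {
    if **flag {
        format!("sent {payload} without retry")
    } else {
        format!("sent {payload} with retry")
    }
}

// Body of a test that expects the retry path.
fn test_body_expecting_retry() -> String {
    // Lock first, then reset the flag, then send while still holding the guard.
    let mut guard = SKIP_RETRY.lock().unwrap();
    *guard = false;
    send_payload(&guard, "block")
}
```

Holding the guard for the whole test body effectively serializes every test that touches the flag.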

Applicable issues

Additional info (benefits, drawbacks, caveats)

This was originally part (in a more complex form) of #5570.

Checklist

  • Test coverage for new or modified code paths
  • Changelog is updated
  • Required documentation changes (e.g., docs/rpc/openapi.yaml and rpc-endpoints.md for v2 endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo
  • New integration test(s) added to bitcoin-tests.yml

@rdeioris rdeioris requested a review from a team as a code owner January 8, 2025 12:53
@aldur aldur requested review from obycode and jbencin January 10, 2025 15:44

obycode commented Jan 10, 2025

Ah, interesting. I hadn't considered this. Doesn't this change still have the potential for flakiness though? Do we need to run these tests with --test-threads=1 instead?

@jbencin jbencin left a comment

So these tests are running in different threads in the same memory space?

If that's the case, I agree with Brice: doing this could just cause flakiness in the test setting TEST_EVENT_OBSERVER_SKIP_RETRY to true

Seems like you'd need a structure like this:

static TEST_EVENT_OBSERVER_SKIP_RETRY: std::sync::Mutex<HashMap<TestId, bool>> = std::sync::Mutex::new(HashMap::new());

To keep track of which test set the variable, so none of them interfere with each other
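A compilable variant of this idea might look like the sketch below. Two things are assumptions here: the key is a plain `String` (a hypothetical per-test identifier) rather than `TestId`, and the map is wrapped in `LazyLock` because `HashMap::new()` is not a `const fn` and so cannot initialize a `static` directly. The names `SKIP_RETRY_BY_TEST`, `set_skip_retry` and `skip_retry` are illustrative only.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// One flag per test, keyed by a hypothetical per-test identifier.
// The map is built lazily on first access.
static SKIP_RETRY_BY_TEST: LazyLock<Mutex<HashMap<String, bool>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

// Records the flag for a single test without touching other tests' entries.
fn set_skip_retry(test_key: &str, value: bool) {
    SKIP_RETRY_BY_TEST
        .lock()
        .unwrap()
        .insert(test_key.to_string(), value);
}

// Reads the flag for a single test; tests that never set it see `false`.
fn skip_retry(test_key: &str) -> bool {
    SKIP_RETRY_BY_TEST
        .lock()
        .unwrap()
        .get(test_key)
        .copied()
        .unwrap_or(false)
}
```

The open question with this design is how the code under test learns which key to look up.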

@rdeioris
Contributor Author

So these tests are running in different threads in the same memory space?

If that's the case, I agree with Brice: doing this could just cause flakiness in the test setting TEST_EVENT_OBSERVER_SKIP_RETRY to true

Seems like you'd need a structure like this:

static TEST_EVENT_OBSERVER_SKIP_RETRY: std::sync::Mutex<HashMap<TestId, bool>> = std::sync::Mutex::new(HashMap::new());

To keep track of which test set the variable, so none of them interfere with each other

The point of the patch is to simplify the previous attempt in #5570, where I used thread locals. Using a HashMap for this seems overkill, IMHO. As this pattern (having a global lazy static for hijacking test-specific behaviours) is pretty common in the codebase, maybe we should agree on a "blessed" approach for it. The elephant in the room is probably that we should avoid the pattern altogether, to reduce the amount of test-only code that diverges from the base codepath.

@rdeioris
Contributor Author

Ah, interesting. I hadn't considered this. Doesn't this change still have the potential for flakiness though? Do we need to run these tests with --test-threads=1 instead?

Actually, this specific logic applies to very few parts of the code, and those are mostly test-specific. Before the patch I ran these tests single-threaded to make them pass, but I think it is worth supporting the default Rust test behaviour.

rdeioris commented Jan 22, 2025

@jbencin @obycode given that we now have LazyLock<TestFlag<T>> used in various areas, I have converted the patch to use it

@obycode obycode left a comment

I think that this change to use TestFlag is good, but I still don't think this really solves the problem that you described.

@rdeioris
Contributor Author

I think that this change to use TestFlag is good, but I still don't think this really solves the problem that you described.

Well, the tests are now serialized, so the problem is basically masked (with each test resetting the state). If we want to fix it in a "more elegant" way we need to change the whole approach, but given that this could potentially help other areas of the code, maybe it is worth discussing a common solution in the naka meetings?

obycode commented Jan 22, 2025

Well, the tests are now serialized, so the problem is basically masked (with each test resetting the state). If we want to fix it in a "more elegant" way we need to change the whole approach, but given that this could potentially help other areas of the code, maybe it is worth discussing a common solution in the naka meetings?

How are they serialized?

rdeioris commented Jan 22, 2025

@obycode I took a closer look at the LazyLock<TestFlag<T>> implementation, and it definitely looks like it will not help with the issue. Given that TEST_EVENT_OBSERVER_SKIP_RETRY is only used here and with a simple pattern, would it be preferable to just get rid of it and add an argument to the send/process payload functions? That way every test gets its own behaviour.

obycode commented Jan 22, 2025

Good point. Looks like something like that could be sufficient here. Just make sure that it cannot be enabled when not in test mode. Thanks!

@rdeioris
Contributor Author

Ok, back to square 1 :( Although adding test-only parameters was pretty easy, the code became a mess (basically all of the functions get filled with #[cfg(test)] everywhere).

Thread locals (my initial approach) are flawed again, as stacks-core has various threads in place that would each end up with their own copy of the variable.

The @jbencin approach definitely looks like the sanest one (using a HashMap with the TestId). Unfortunately TestId is an experimental API (https://doc.rust-lang.org/test/struct.TestId.html), so we need some other "key". (Actually, I am not even sure that TestId is copied to child threads.)

The Rust test runner sets the name of the thread to the test name, so we could possibly use std::thread::current().name() as the key of the HashMap. The problem, again, is that we have various threads running in stacks-core, and their names will not match the test name.
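The thread-name limitation can be seen in a small standalone sketch (the helper names are hypothetical): a thread started with `std::thread::spawn` has no name unless one is set explicitly through `std::thread::Builder`, so a worker thread started by the code under test does not carry the name of the test that launched it.

```rust
use std::thread;

// Returns the name seen from inside a thread spawned without an explicit name.
fn name_in_unnamed_child() -> Option<String> {
    thread::spawn(|| thread::current().name().map(str::to_owned))
        .join()
        .unwrap()
}

// Returns the name seen from inside a thread spawned with the given name.
fn name_in_named_child(name: &str) -> Option<String> {
    thread::Builder::new()
        .name(name.to_owned())
        .spawn(|| thread::current().name().map(str::to_owned))
        .unwrap()
        .join()
        .unwrap()
}
```

A lookup keyed on the thread name would therefore only work on the test's own thread, or on child threads that were deliberately given the same name.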

The only thing I can think of is to truly serialize those kinds of tests by having a global lock that we acquire at the beginning of each "critical test" and release at the end. Any thoughts/ideas?
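A global test lock of this kind could be sketched as below. All names (`CRITICAL_TEST_LOCK`, `serialize_critical_test`, `run_critical_test`, `SKIP_RETRY`) are hypothetical, and the flag is again a plain `Mutex<bool>` stand-in. Each critical test holds the returned guard for its whole body, so only one of them runs at a time even with the default parallel test runner.

```rust
use std::sync::{Mutex, MutexGuard, PoisonError};

// Held for the whole duration of every "critical test".
static CRITICAL_TEST_LOCK: Mutex<()> = Mutex::new(());

// Hypothetical stand-in for the shared test-only flag.
static SKIP_RETRY: Mutex<bool> = Mutex::new(false);

// Acquires the global test lock. A test that panics while holding it poisons
// the mutex, so recover the guard instead of failing every later test.
fn serialize_critical_test() -> MutexGuard<'static, ()> {
    CRITICAL_TEST_LOCK
        .lock()
        .unwrap_or_else(PoisonError::into_inner)
}

// Body of a critical test: set the flag, observe it, reset it.
fn run_critical_test(skip_retry: bool) -> bool {
    let _serial = serialize_critical_test(); // released when the test body ends
    *SKIP_RETRY.lock().unwrap() = skip_retry;
    let observed = *SKIP_RETRY.lock().unwrap();
    *SKIP_RETRY.lock().unwrap() = false; // leave the default state behind
    observed
}
```

The cost is that every test touching the flag must remember to take the lock; a test that forgets can still race with the others.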

aldur commented Jan 23, 2025

If all we care about is test serialization (possibly for a subset of tests) we could also give nextest a shot.

sBTC uses it. This issue proposed it.

obycode commented Jan 23, 2025

We also could use something like https://crates.io/crates/serial_test to specify individual tests that need to be serialized.

@rdeioris rdeioris requested a review from a team as a code owner January 24, 2025 07:07
@rdeioris
Contributor Author

Updated the patch to use serial_test. It worked well. I have kept the LazyLock<TestFlag<T>> usage for consistency with the rest of the codebase.

@obycode obycode left a comment

👍

@jbencin jbencin left a comment

LazyLock<TestFlag<T>> is really overkill here. It's a Mutex inside an Arc inside a LazyLock (which is a spinlock I think). You can have a mutable thread-safe variable with just a static mutex.

I know we use this pattern a lot and it should be refactored in a separate PR, so I'm going to approve this one
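The simpler alternative mentioned in this review, a bare static mutex, might look like the following sketch. The names are hypothetical; it relies only on `Mutex::new` being a `const fn`, which lets the static be initialized without a lazy wrapper or an `Arc`.

```rust
use std::sync::Mutex;

// A mutable, thread-safe global with no lazy wrapper and no reference counting.
static SKIP_RETRY: Mutex<bool> = Mutex::new(false);

// Sets the flag; the temporary guard is dropped at the end of the statement.
fn set_skip_retry(value: bool) {
    *SKIP_RETRY.lock().unwrap() = value;
}

// Reads the current value of the flag.
fn skip_retry() -> bool {
    *SKIP_RETRY.lock().unwrap()
}
```

This only simplifies the storage of the flag; tests sharing it still need to be serialized to avoid interfering with each other.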

@obycode obycode added this pull request to the merge queue Jan 27, 2025
Merged via the queue into develop with commit dd635f0 Jan 27, 2025
178 of 181 checks passed
@obycode obycode deleted the fix/event_observer_skip_retry_tests branch January 27, 2025 21:56
@stacks-network stacks-network locked as resolved and limited conversation to collaborators Feb 4, 2025