Refactor: cleanup nakamoto/signer test commit logic #5770

kantai · 2025-01-29T21:29:52Z

I opened two PRs to resolve some flakiness in two different tests (#5733 and #5769). In both cases, the test flake was caused by the "wait for a block commit" logic. The issue in one is that the block commit being waited on wasn't submitted for the correct burn block and in the other, the commit was pointing at the wrong stacks tip. There is a function which attempts to do the "right thing" when mining a tenure:

pub fn next_block_and_wait_for_commits(
    btc_controller: &mut BitcoinRegtestController,
    timeout_secs: u64,
    coord_channels: &[&Arc<Mutex<CoordinatorChannels>>],
    commits_submitted: &[&Arc<AtomicU64>],
    wait_for_stacks_block: bool,
) -> Result<(), String> {
    let commits_submitted: Vec<_> = commits_submitted.to_vec();
    let blocks_processed_before: Vec<_> = coord_channels
        .iter()
        .map(|x| {
            x.lock()
                .expect("Mutex poisoned")
                .get_stacks_blocks_processed()
        })
        .collect();
    let commits_before: Vec<_> = commits_submitted
        .iter()
        .map(|x| x.load(Ordering::SeqCst))
        .collect();

    let mut block_processed_time: Vec<Option<Instant>> = vec![None; commits_before.len()];
    let mut commit_sent_time: Vec<Option<Instant>> = vec![None; commits_before.len()];
    next_block_and(btc_controller, timeout_secs, || {
        for i in 0..commits_submitted.len() {
            let commits_sent = commits_submitted[i].load(Ordering::SeqCst);
            let blocks_processed = coord_channels[i]
                .lock()
                .expect("Mutex poisoned")
                .get_stacks_blocks_processed();
            let now = Instant::now();
            if blocks_processed > blocks_processed_before[i] && block_processed_time[i].is_none() {
                block_processed_time[i].replace(now);
            }
            if commits_sent > commits_before[i] && commit_sent_time[i].is_none() {
                commit_sent_time[i].replace(now);
            }
        }

        if !wait_for_stacks_block {
            for i in 0..commits_submitted.len() {
                // just wait for the commit
                let commits_sent = commits_submitted[i].load(Ordering::SeqCst);
                if commits_sent <= commits_before[i] {
                    return Ok(false);
                }

                // if two commits have been sent, one of them must have been after
                if commits_sent >= commits_before[i] + 1 {
                    continue;
                }
                return Ok(false);
            }
            return Ok(true);
        }

        // waiting for both commit and stacks block
        for i in 0..commits_submitted.len() {
            let blocks_processed = coord_channels[i]
                .lock()
                .expect("Mutex poisoned")
                .get_stacks_blocks_processed();
            let commits_sent = commits_submitted[i].load(Ordering::SeqCst);

            if blocks_processed > blocks_processed_before[i] {
                // either we don't care about the stacks block count, or the block count advanced.
                // Check the block-commits.
                let block_processed_time = block_processed_time[i]
                    .as_ref()
                    .ok_or("TEST-ERROR: Processed block time wasn't set")?;
                if commits_sent <= commits_before[i] {
                    return Ok(false);
                }
                let commit_sent_time = commit_sent_time[i]
                    .as_ref()
                    .ok_or("TEST-ERROR: Processed commit time wasn't set")?;
                // try to ensure the commit was sent after the block was processed
                if commit_sent_time > block_processed_time {
                    continue;
                }
                // if two commits have been sent, one of them must have been after
                if commits_sent >= commits_before[i] + 2 {
                    continue;
                }
                // otherwise, just timeout if the commit was sent and its been long enough
                //  for a new commit pass to have occurred
                if block_processed_time.elapsed() > Duration::from_secs(10) {
                    continue;
                }
                return Ok(false);
            } else {
                return Ok(false);
            }
        }
        Ok(true)
    })
}

That logic is prone to flakiness (the Duration::from_secs(10) being a dead give away). But many tests don't even rely on that function, instead embedding some of the commit waiting logic themselves. So, a refactor here would probably replace that function with one that relied on the new Counters variables that actually capture the committed to stacks height and burn heights, and then we'd also need to go through the other tests and figure out some common logic that could be factored (or just be lazy about it, and fix their logic whenever flakiness manifests).

The text was updated successfully, but these errors were encountered:

github-project-automation bot added this to Stacks Core Eng Jan 29, 2025

github-project-automation bot moved this to Status: 🆕 New in Stacks Core Eng Jan 29, 2025

kantai added refactoring testing flaky labels Jan 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: cleanup nakamoto/signer test commit logic #5770

Refactor: cleanup nakamoto/signer test commit logic #5770

kantai commented Jan 29, 2025

Refactor: cleanup nakamoto/signer test commit logic #5770

Refactor: cleanup nakamoto/signer test commit logic #5770

Comments

kantai commented Jan 29, 2025