The Go tests job is flaky #32627

Closed
github-actions bot opened this issue Oct 2, 2024 · 5 comments

Comments

@github-actions
Contributor

github-actions bot commented Oct 2, 2024

The Go tests job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/go_tests.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=10&var-Workflow=Go%20tests

@damccorm
Contributor

damccorm commented Oct 3, 2024

@lostluck @damondouglas would you mind taking a look at this one?

@damondouglas damondouglas self-assigned this Oct 3, 2024
@lostluck
Contributor

lostluck commented Oct 3, 2024

There are a bunch that were failing due to staticcheck being out of date, but that's already fixed by #32614. This is probably why this issue was filed, since that made all runs fail for a day or two.

At least one run was failing due to timing out after 10m. We can always extend that.

https://github.com/apache/beam/actions/runs/11144947852/job/30973377990

TestMatchAll/Error_-_no_matches_for_glob_without_wildcard: I haven't seen this one before. It's supposed to fail with an error and didn't for some reason. Not sure what happened there. Worth looking into where prism dropped the ball here.

https://github.com/apache/beam/actions/runs/11136655545/job/30948804004

--- FAIL: TestElementChan (0.00s)
    --- FAIL: TestElementChan/FillBufferThenAbortThenRead (0.00s)
        datamgr_test.go:412: got sum 13, count 13, want sum 20, count 20

This one is known to be flaky. It's trying to test the harness/DataManager logic (e.g. how data gets to the actual DataSource) for dealing with certain weirder failure conditions that can't be exercised at a higher abstraction level. This leads to inconsistent data.

I can't remember the specifics, but those tests could probably be rewritten to not rely on the precise counts. If that isn't possible, they should be deleted, since it's unlikely we'd take another look at them.
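
As a rough illustration of "not relying on the precise counts", a test could assert an invariant instead of exact totals. The sketch below is hypothetical and is not taken from datamgr_test.go: it assumes elements arrive in the order they were sent and that reading only part of the data is acceptable once an abort races with the reader. `checkPrefix` is a made-up helper name.

```go
package main

import "fmt"

// checkPrefix is a hypothetical helper, not part of Beam.
// It returns an error unless got is a (possibly shorter) prefix of sent,
// so a test passes whether the reader saw 13 or 20 of the 20 elements,
// but still fails on reordered, corrupted, or extra elements.
func checkPrefix(got, sent []int) error {
	if len(got) > len(sent) {
		return fmt.Errorf("got %d elements, but only %d were sent", len(got), len(sent))
	}
	for i, v := range got {
		if v != sent[i] {
			return fmt.Errorf("element %d = %d, want %d", i, v, sent[i])
		}
	}
	return nil
}
```

Whether a prefix is really the right invariant depends on what the DataManager guarantees under abort, which I haven't verified.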

https://github.com/apache/beam/actions/runs/11050905935/job/30699726834

--- FAIL: TestElementChan (0.00s)
    --- FAIL: TestElementChan/SomeTimersAndADataThenReaderThenCleanup (0.00s)
        datamgr_test.go:412: got sum 3, count 2, want sum 6, count 3

https://github.com/apache/beam/actions/runs/10926875777/job/30331762994

--- FAIL: TestServer_RunThenCancel (0.00s)
    server_test.go:142: server.GetState() = CANCELLING, want CANCELLED

Neat. This is the cancellation test.


Recommended actions:

We should bump the test timeout in the coverage action to 25m (it defaults to 10m).
And then investigate the ElementChan and cancellation flakes a bit.

@kennknowles
Member

@damondouglas I noticed after I opened my PR that you hold the mutex on this. Apologies. Hopefully my trivial change does not negatively impact you. If you actually aren't active on it, you could release it.

@damondouglas
Contributor

damondouglas commented Nov 26, 2024

I had assigned myself from the last interrupts and then got pulled into another area. I'll take a look. If I can't get any insight by next week, I will release the ticket but either way submit any notes on my findings / solution here.

@kennknowles
Member

Increasing the timeout appears to have deflaked it.

@github-actions github-actions bot added this to the 2.62.0 Release milestone Dec 2, 2024
4 participants