Job timeouts in "CI - sharktank / Data-dependent Tests" after updating IREE versions #888
Comments
The test that is timing out may be https://github.com/nod-ai/shark-ai/blob/main/sharktank/tests/models/flux/flux_test.py. If that is runnable locally, someone could try debugging it using different versions of the IREE packages.
To avoid using only device 0 locally, I made the change in #891, which lets me specify another device for the tests. With it I am hitting another, earlier error coming from HIP's
The native stack trace is
Interestingly, this happens only when I run multiple tests one after another. If I run just the offending test, all is well.
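One way to narrow down an order-dependent failure like this is to run the offending test right after each other test in turn and see which predecessor triggers it. The following is a minimal sketch under stated assumptions, not something used in this thread: `pytest_fails` and `find_triggering_predecessor` are hypothetical helper names, the test IDs are whatever pytest node IDs you pass in, and it assumes pytest runs IDs in the order given on the command line.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def pytest_fails(test_ids: Sequence[str], timeout_s: float = 600.0) -> bool:
    """Run the given test IDs in one pytest process; True on failure or hang."""
    cmd = [sys.executable, "-m", "pytest", "-x", *test_ids]
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode != 0
    except subprocess.TimeoutExpired:
        # Treat a hang as a failure so that it is reported like any other.
        return True


def find_triggering_predecessor(
    tests: Sequence[str],
    offending: str,
    fails: Callable[[Sequence[str]], bool] = pytest_fails,
) -> Optional[str]:
    """Return the first test that makes `offending` fail when run just before it."""
    for candidate in tests:
        if candidate == offending:
            continue
        # Run exactly two tests: the candidate first, then the offending one.
        if fails([candidate, offending]):
            return candidate
    return None
```

If no single predecessor reproduces the problem, the trigger may need a longer sequence of tests, and this pairwise search will return `None`.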
The offending IREE change seems to be iree-org/iree#19826. Without it I don't get the error and the CLIP tests run fine. The particular order of IREE runtime calls when the tests are ordered in this way seems to cause the problem.
@AWoloszyn, have you encountered something like this before?
The non-reproducibility of the hang was caused by a modification I made to be able to specify a concrete device for the tests. Right now the tests run on device 0. I had missed a place where device 0 was still created but unused in the After making sure that only 1 device is used, I was able to hit the hang at
stack trace:
This is on top of iree-org/iree@1bf7249.
I was able to reproduce the hang more reliably on top of this branch, which has some commits that are not yet merged into main.
Would it be okay to disable the test that hangs and continue to update the IREE versions in shark-ai, then fix the test (with either changes to shark-ai or IREE) later? Or do we want this to block the updates until it is resolved?
What's the latest status here? shark-ai is still pinned to IREE versions from two weeks ago, and our next stable release across all projects is scheduled for 1 week from now. We would like to cut a release candidate by ~Wednesday of this week, using all the latest code. We need to continue updating and get visibility into any new issues with the release ASAP.
I'm not exactly up to date on all the things @sogartar mentioned, but I'm getting various problems that look like iree-shortfin compatibility issues. I have 0 familiarity with the IREE codebase, so all I can do is paste a catalogue of the things I'm encountering. Some of these replicate on iree 0120 too, though. Here's where I'm tracking them: #904
Recent attempts to update the versions of IREE used in shark-ai have resulted in 6 hour job timeouts in the "CI - sharktank / Data-dependent Tests" job, which is defined in this file: https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-sharktank.yml.
First observed on version 3.2.0rc20250124 with #867.
Logs with IREE version 3.2.0rc20250129 can be seen on #879, for example https://github.com/nod-ai/shark-ai/actions/runs/13035503409/job/36364799644?pr=879
Suggested actions
Other details
I added pytest-timeout in #868 and that seemed to stop tests as expected with a 10 second timeout, but a 600 second timeout is clearly not working. The runner itself could be unhealthy, or the tests could be stalled in a way that avoids the timeout (pytest-xdist and pytest-timeout are sometimes not compatible).
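If the in-process timeout cannot be trusted, one option is an outer watchdog that does not depend on pytest plugins at all. The sketch below is an illustration under stated assumptions, not what the CI workflow currently does: `run_with_hard_timeout` is a hypothetical helper that runs any command in a child process and kills it once a wall-clock limit expires, using only the Python standard library.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_hard_timeout(cmd: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run `cmd` and return its exit code, or None if it was killed on timeout."""
    try:
        # subprocess.run kills the direct child and waits for it on timeout.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: wrap a pytest invocation with a 10 minute limit.
    code = run_with_hard_timeout([sys.executable, "-m", "pytest", *sys.argv[1:]], 600.0)
    if code is None:
        print("pytest exceeded the hard timeout and was killed", file=sys.stderr)
        sys.exit(1)
    sys.exit(code)
```

Note that only the direct child is killed here, so worker processes spawned by pytest-xdist could outlive it; handling those would need process-group management, which this sketch leaves out. Separately, pytest's built-in `faulthandler_timeout` ini option can dump the Python tracebacks of all threads when a test runs too long, which may help show where a stall occurs.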