Job timeouts in "CI - sharktank / Data-dependent Tests" after updating IREE versions #888

Open · ScottTodd opened this issue Jan 30, 2025 · 9 comments
Labels: bug (Something isn't working)

@ScottTodd (Member)

Recent attempts to update the versions of IREE used in shark-ai have resulted in 6 hour job timeouts in the "CI - sharktank / Data-dependent Tests" job, source in this file: https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/ci-sharktank.yml.

First observed on version 3.2.0rc20250124 with #867.

Logs with IREE version 3.2.0rc20250129 can be seen on #879, for example https://github.com/nod-ai/shark-ai/actions/runs/13035503409/job/36364799644?pr=879

  source ${VENV_DIR}/bin/activate
  pytest \
    --with-clip-data \
    --with-flux-data \
    --with-t5-data \
    --with-vae-data \
    sharktank/tests/models/clip/clip_test.py \
    sharktank/tests/models/t5/t5_test.py \
    sharktank/tests/models/flux/flux_test.py \
    sharktank/tests/models/vae/vae_test.py \
    --durations=0 \
    --timeout=600
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    VENV_DIR: /shark-dev/actions-runner-llama/_work/shark-ai/shark-ai/.venv
    HF_HOME: /data/huggingface
    pythonLocation: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64
    PKG_CONFIG_PATH: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64/lib/pkgconfig
    Python_ROOT_DIR: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64
    Python2_ROOT_DIR: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64
    Python3_ROOT_DIR: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64
    LD_LIBRARY_PATH: /shark-dev/actions-runner-llama/_work/_tool/Python/3.11.11/x64/lib
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.0.0, pluggy-1.5.0
rootdir: /shark-dev/actions-runner-llama/_work/shark-ai/shark-ai/sharktank
configfile: pyproject.toml
plugins: metadata-3.1.1, timeout-2.3.1, html-4.1.1, anyio-4.8.0, xdist-3.5.0
timeout: 600.0s
timeout method: signal
timeout func_only: False
collected 53 items

sharktank/tests/models/clip/clip_test.py ....................            [ 37%]
sharktank/tests/models/t5/t5_test.py .xx.x..x.x..........                [ 75%]
Error: The operation was canceled.

Suggested actions

  • Figure out which test is never completing
  • Figure out why that test regressed
  • Make it so the test job hits a reasonable timeout (30 minutes?) reliably (see the timeout sketch after this list)
  • Make the test job log what is happening (may need to switch off pytest-xdist)
  • Add upstream tests and behavior changes to IREE so this doesn't happen again
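
A minimal sketch for the timeout item (not verified on these runners): in addition to setting GitHub Actions' job- or step-level timeout-minutes, the pytest invocation itself could be bounded with coreutils timeout so a hung test can never run into the 6 hour default:

  # Sketch only: kill the whole pytest step after 30 minutes even if
  # pytest-timeout never fires; escalate to SIGKILL 60 seconds later if needed.
  timeout --kill-after=60 30m pytest \
    --with-clip-data --with-flux-data --with-t5-data --with-vae-data \
    sharktank/tests/models/clip/clip_test.py \
    sharktank/tests/models/t5/t5_test.py \
    sharktank/tests/models/flux/flux_test.py \
    sharktank/tests/models/vae/vae_test.py \
    --durations=0 --timeout=600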

Other details

I added pytest-timeout in #868 and that seemed to stop tests as expected with a 10 second timeout, but the 600 second timeout is clearly not firing here. The runner itself could be unhealthy, or the tests could be stalled in a way that avoids the timeout (pytest-xdist and pytest-timeout are sometimes not compatible).
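
One likely explanation (a guess, not verified here): pytest-timeout's default "signal" method relies on a SIGALRM handler that only runs once control returns to Python bytecode, so it cannot interrupt a test that is blocked inside a native call such as a futex wait in the IREE runtime. A sketch of a more robust invocation, using the "thread" method (a watchdog thread that dumps all stacks and kills the worker) and disabling the xdist plugin that the job currently loads:

  pytest \
    --timeout=600 \
    --timeout-method=thread \
    -p no:xdist \
    --with-flux-data \
    sharktank/tests/models/flux/flux_test.py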

ScottTodd added the bug label on Jan 30, 2025
@ScottTodd (Member, Author)

The test that is timing out may be https://github.com/nod-ai/shark-ai/blob/main/sharktank/tests/models/flux/flux_test.py. If that is runnable locally, someone could try debugging it using different versions of the IREE packages.
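
A possible local bisection loop, as a sketch (it assumes the nightly package index and package names that the IREE project publishes, plus access to the same Hugging Face data the CI job uses via HF_HOME):

  # First-bad nightly from this issue is 3.2.0rc20250124; repeat with earlier
  # and later versions to narrow down the offending change.
  python -m pip install --pre --upgrade \
    --find-links https://iree.dev/pip-release-links.html \
    "iree-base-compiler==3.2.0rc20250124" "iree-base-runtime==3.2.0rc20250124"
  pytest --with-flux-data --timeout=600 --durations=0 \
    sharktank/tests/models/flux/flux_test.py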

@sogartar (Contributor) commented Jan 31, 2025

To avoid always using device 0 locally, I made change #891, which lets me specify a different device for the tests. With it I am hitting another, earlier error coming from HIP's hipHostFree.

Memobj map does not have ptr: 0x0
sharktank/tests/models/clip/clip_test.py::ClipTextIreeTest::testCompareLargeIreeBf16AgainstTorchEagerF32 PASSED
sharktank/tests/models/clip/clip_test.py::ClipTextIreeTest::testCompareLargeIreeF32AgainstTorchEagerF32 PASSED
sharktank/tests/models/clip/clip_test.py::ClipTextIreeTest::testCompareToyModelIreeBf16AgainstTorchEagerF32 PASSED
sharktank/tests/models/clip/clip_test.py::ClipTextIreeTest::testCompareToyModelIreeF32AgainstTorchEagerF32 :0:/long_pathname_so_that_rpms_can_package_the_debug_info/src/clr/rocclr/device/device.cpp:327 : 801744095788d us:  Memobj map does not have ptr: 0x0
Fatal Python error: Aborted

Current thread 0x00007f1dbefaf000 (most recent call first):
  File "/home/bpetkant/ws/sharktank/repo/sharktank/tests/models/clip/clip_test.py", line 198 in runTestCompareIreeAgainstTorchEagerWithInputTokens
  File "/home/bpetkant/ws/sharktank/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
  File "/home/bpetkant/ws/sharktank/repo/sharktank/tests/models/clip/clip_test.py", line 233 in runTestCompareRandomModelIreeAgainstTorch
  File "/home/bpetkant/ws/sharktank/repo/sharktank/tests/models/clip/clip_test.py", line 247 in runTestCompareToyModelIreeAgainstTorch
  File "/home/bpetkant/ws/sharktank/repo/sharktank/tests/models/clip/clip_test.py", line 131 in testCompareToyModelIreeF32AgainstTorchEagerF32

The native stack trace is:

    ...
    frame #9: 0x00007ffed88aea7d libamdhip64.so`___lldb_unnamed_symbol3509 + 653
    frame #10: 0x00007fff373d1ff8 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_free(hip_symbols=0x00000000264b5470, buffer_type=<unavailable>, device_ptr=<unavailable>, host_ptr=<unavailable>) at hip_allocator.c:401:7
    frame #11: 0x00007fff373d1f57 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_release_callback(user_data=0x0000000022b1f3e0, buffer=0x0000000022b1ebe0) at hip_allocator.c:579:3
    frame #12: 0x00007fff373d232d _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_buffer_destroy(base_buffer=0x0000000022b1ebe0) at hip_buffer.c:110:5
    frame #13: 0x00007fff373cd8f8 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_device_destroy [inlined] iree_hal_resource_release(any_resource=<unavailable>) at resource.h:90:5
    frame #14: 0x00007fff373cd8e8 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_device_destroy(base_device=0x0000000022b19dc0) at hip_device.c:529:5
    frame #15: 0x00007fff3734ea45 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalDevice, iree_hal_device_t>::~ApiRefCounted() [inlined] iree::python::ApiPtrAdapter<iree_hal_device_t>::Release(d=<unavailable>) at hal.h:57:47
    frame #16: 0x00007fff3734ea40 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalDevice, iree_hal_device_t>::~ApiRefCounted() [inlined] iree::python::ApiRefCounted<iree::python::HalDevice, iree_hal_device_t>::Release(this=<unavailable>) at binding.h:109:7
    frame #17: 0x00007fff3734ea38 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::ApiRefCounted<iree::python::HalDevice, iree_hal_device_t>::~ApiRefCounted(this=<unavailable>) at binding.h:61:22
    frame #18: 0x00007fff373811f3 _runtime.cpython-311-x86_64-linux-gnu.so`nanobind::detail::inst_dealloc(self=0x00007fff292e7ed0) at nb_type.cpp:241:13
    ...

Interestingly, this happens only when I run multiple tests one after another; if I run just the offending test, everything is fine.

@sogartar (Contributor) commented Jan 31, 2025

The offending IREE change seems to be iree-org/iree#19826. Without it I don't get the error and the CLIP tests run fine.

The particular sequence of IREE runtime calls produced by running the tests in this order seems to cause the problem.
On top of #891, the failing test command is:

IREE_DEVICE="hip://1" \
pytest \
    --verbose \
    --capture=no \
    --with-clip-data \
    --durations=0 \
    sharktank/tests/models/clip/clip_test.py

@sogartar (Contributor)

@AWoloszyn, have you encountered something like this before?

@sogartar (Contributor)

The earlier non-reproducibility of the hang was caused by a modification I made to be able to specify a concrete device for the tests; right now the tests run on device 0. I had missed a place where device 0 was still created, but left unused, in the iree.runtime.FunctionInvoker. The error above still seems to be a real bug, though.

After ensuring only one device is used, I was able to hit the hang at

sharktank/tests/models/flux/flux_test.py::FluxTest::testCompareDevIreeF32AgainstHuggingFaceF32

stack trace:

    ....
    frame #1: 0x00007f8f2031758e _runtime.cpython-311-x86_64-linux-gnu.so`iree_notification_commit_wait [inlined] iree_futex_wait(address=0x000000004a2ca904, expected_value=1, deadline_ns=9223372036854775807) at synchronization.c:178:12
    frame #2: 0x00007f8f20317534 _runtime.cpython-311-x86_64-linux-gnu.so`iree_notification_commit_wait(notification=0x000000004a2ca900, wait_token=1, spin_ns=0, deadline_ns=9223372036854775807) at synchronization.c:816:11
    frame #3: 0x00007f8f202d463a _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_hip_semaphore_wait(base_semaphore=0x000000004a2ca4e0, value=1, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at event_semaphore.c:851:11
    frame #4: 0x00007f8f202969a5 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_semaphore_list_wait [inlined] iree_hal_semaphore_wait(semaphore=<unavailable>, value=<unavailable>, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at semaphore.c:77:7
    frame #5: 0x00007f8f20296998 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_semaphore_list_wait(semaphore_list=iree_hal_semaphore_list_t @ 0x00007ffd5f2a6270, timeout=(type = IREE_TIMEOUT_ABSOLUTE, nanos = 9223372036854775807)) at semaphore.c:201:14
    frame #6: 0x00007f8f202964a2 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_fence_wait(fence=<unavailable>, timeout=<unavailable>) at fence.c:251:26
    frame #7: 0x00007f8f202a2363 _runtime.cpython-311-x86_64-linux-gnu.so`iree_hal_module_fence_await(stack=0x00007ffd5f2a67a0, module=<unavailable>, state=0x0000000038a16620, args=<unavailable>, rets=0x00007ffd5f2a64c0) at module.c:1777:25
    frame #8: 0x00007f8f20313cbd _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_native_module_issue_call(module=0x0000000031e7fdb0, stack=0x00007ffd5f2a67a0, callee_frame=0x000000003f1695e0, flags=<unavailable>, args_storage=<unavailable>, rets_storage=(data = "", data_length = 4)) at native_module.c:364:7
    frame #9: 0x00007f8f20313bb4 _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_native_module_begin_call(self=0x0000000031e7fdb0, stack=0x00007ffd5f2a67a0, call=iree_vm_function_call_t @ 0x00007ffd5f2a6400) at native_module.c:420:10
    frame #10: 0x00007f8f202ae7b3 _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_bytecode_issue_import_call(stack=0x00007ffd5f2a67a0, call=iree_vm_function_call_t @ 0x00007ffd5f2a6490, cconv_results=<unavailable>, dst_reg_list=0x00007f8e1405123a, out_caller_frame=<unavailable>, out_caller_registers=<unavailable>) at dispatch.c:452:7
    frame #11: 0x00007f8f202ae17d _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_bytecode_call_import_variadic(stack=0x00007ffd5f2a67a0, module_state=0x0000000038a166c0, import_ordinal=<unavailable>, caller_registers=iree_vm_registers_t @ 0x00005565461860a0, segment_size_list=0x00007f8e1405122e, src_reg_list=0x00007f8e14051234, dst_reg_list=0x00007f8e1405123a, out_caller_frame=0x00007ffd5f2a65d0, out_caller_registers=0x00007ffd5f2a65b8) at dispatch.c:609:10
    frame #12: 0x00007f8f202aa2bc _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_bytecode_dispatch(stack=0x00007ffd5f2a67a0, module=0x0000000038a16320, current_frame=0x000000003f165660, regs=iree_vm_registers_t @ 0x00007ffd5f2a65b8, call_results=(data = 0x0000000000000000, data_length = 0)) at dispatch.c:1721:5
    frame #13: 0x00007f8f202a6b75 _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_bytecode_dispatch_begin(stack=0x00007ffd5f2a67a0, module=0x0000000038a16320, call=iree_vm_function_call_t @ 0x00007ffd5f2a66d0, cconv_arguments=(data = "v_v", size = 0), cconv_results=<unavailable>) at dispatch.c:636:26
    frame #14: 0x00007f8f202a632d _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_bytecode_module_begin_call(self=0x0000000038a16320, stack=0x00007ffd5f2a67a0, call=iree_vm_function_call_t @ 0x00007ffd5f2a6770) at module.c:845:10
    frame #15: 0x00007f8f2030e89d _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_context_register_modules [inlined] iree_vm_context_run_function(context=<unavailable>, stack=0x00007ffd5f2a67a0, module=0x0000000038a16320, function_name=(data = 0x0000000000000000, size = 6)) at context.c:91:12
    frame #16: 0x00007f8f2030e818 _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_context_register_modules(context=0x0000000031e825c0, module_count=3, modules=0x000000003d82b7f0) at context.c:596:14
    frame #17: 0x00007f8f2030e3b4 _runtime.cpython-311-x86_64-linux-gnu.so`iree_vm_context_create_with_modules(instance=0x00000000422ba9e0, flags=0, module_count=3, modules=0x000000003d82b7f0, allocator=iree_allocator_t @ 0x0000556546186260, out_context=0x00007ffd5f2a8930) at context.c:340:7
    frame #18: 0x00007f8f2026e498 _runtime.cpython-311-x86_64-linux-gnu.so`iree::python::VmContext::Create(instance=0x00007f8e6a028428, modules= Has Value=true ) at vm.cc:194:19
    frame #19: 0x00007f8f20276f17 _runtime.cpython-311-x86_64-linux-gnu.so`_object* nanobind::detail::func_create<false, true, iree::python::SetupVmBindings(nanobind::module_)::$_9, void, iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v>(iree::python::SetupVmBindings(nanobind::module_)::$_9&&, void (*)(iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) [inlined] iree::python::SetupVmBindings(nanobind::module_)::$_9::operator()(this=<unavailable>, self=0x00007f8d91648b28, instance=<unavailable>, modules= Has Value=true ) const at vm.cc:928:21
    frame #20: 0x00007f8f20276f03 _runtime.cpython-311-x86_64-linux-gnu.so`_object* nanobind::detail::func_create<false, true, iree::python::SetupVmBindings(nanobind::module_)::$_9, void, iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v>(iree::python::SetupVmBindings(nanobind::module_)::$_9&&, void (*)(iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) [inlined] _object* nanobind::detail::func_create<false, true, iree::python::SetupVmBindings(nanobind::module_)::$_9, void, iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v>(this=<unavailable>, p=<unavailable>, args=<unavailable>, args_flags=<unavailable>, policy=<unavailable>, cleanup=<unavailable>)::$_9&&, void (*)(iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const at nb_func.h:264:13
    frame #21: 0x00007f8f20276e5a _runtime.cpython-311-x86_64-linux-gnu.so`_object* nanobind::detail::func_create<false, true, iree::python::SetupVmBindings(nanobind::module_)::$_9, void, iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>, 0ul, 1ul, 2ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::arg, nanobind::arg_v>(iree::python::SetupVmBindings(nanobind::module_)::$_9&&, void (*)(iree::python::VmContext*, iree::python::VmInstance*, std::optional<std::vector<iree::python::VmModule*, std::allocator<iree::python::VmModule*>>>), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::arg const&, nanobind::arg_v const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(p=<unavailable>, args=<unavailable>, args_flags=<unavailable>, policy=<unavailable>, cleanup=<unavailable>) at nb_func.h:219:14
    frame #22: 0x00007f8f2027c668 _runtime.cpython-311-x86_64-linux-gnu.so`nanobind::detail::nb_func_vectorcall_complex(self=0x00007f8f2072abd0, args_in=0x00007f8d91649cb8, nargsf=9223372036854775809, kwargs_in=0x00007f8da07c19c0) at nb_func.cpp:822:26
    ....

This is on top of iree-org/iree@1bf7249.
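
For reference, stacks like the ones above can be captured from a hung pytest process without waiting for the CI timeout. A sketch, assuming py-spy and gdb are installed and ptrace is permitted; the pgrep pattern is only illustrative:

  PID=$(pgrep -f "pytest .*flux_test")
  py-spy dump --pid "$PID" --native               # Python frames plus native frames
  gdb -p "$PID" -batch -ex "thread apply all bt"  # native backtrace of all threads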

@sogartar (Contributor)

I was able to reproduce the hang more reliably on top of this branch, which has some commits not yet merged into main.
The hang happens on top of iree-org/iree@1bf7249 and on its parent commit as well.

export IREE_DEVICE="hip://5" # Pick your own device to not interfere with others.
export HF_HOME="/data/huggingface" # Downloads Hugging Face models here.

pytest \
    --verbose \
    --capture=no \
    --with-flux-data \
    --with-t5-data \
    sharktank/tests/models/t5/t5_test.py \
    sharktank/tests/models/flux/flux_test.py

@ScottTodd (Member, Author)

Would it be okay to disable the test that hangs and continue to update the IREE versions in shark-ai, then fix the test (with either changes to shark-ai or IREE) later? Or do we want this to block the updates until it is resolved?

@ScottTodd (Member, Author)

What's the latest status here? shark-ai is still pinned to IREE versions from two weeks ago, and our next stable release across all projects is scheduled for one week from now. We would like to cut a release candidate by around Wednesday of this week, using all the latest code. We need to continue updating and get visibility into any new issues with the release ASAP.

@renxida (Contributor) commented Feb 3, 2025

I'm not fully up to date on everything @sogartar mentioned.

But I'm seeing various problems that look like IREE/shortfin compatibility issues. I have no familiarity with the IREE codebase, so all I can do is paste a catalogue of the things I'm encountering. Some of these reproduce on iree 0120 too, though.

Here's where I'm tracking them: #904
