
Revert "Temporarily skip CUDA 11 wheel CI" #601

Open
wants to merge 1 commit into base: branch-25.02
Conversation

bdice
Contributor

@bdice bdice commented Jan 22, 2025

Reverts #599 now that rapidsai/raft#2548 has landed.

@bdice bdice requested a review from a team as a code owner January 22, 2025 11:34
@bdice bdice requested a review from jameslamb January 22, 2025 11:34
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.30%. Comparing base (9b7bb97) to head (0702f89).

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-25.02     #601   +/-   ##
=============================================
  Coverage         72.30%   72.30%           
=============================================
  Files                14       14           
  Lines                65       65           
=============================================
  Hits                 47       47           
  Misses               18       18           

☔ View full report in Codecov by Sentry.

@jameslamb jameslamb added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 22, 2025
@jameslamb
Member

good news: the wheel tests that had been failing because of the cuBLAS issues are passing!

bad news: 1 wheel test is failing:

=========================== short test summary info ============================
FAILED python/cuvs/cuvs/test/test_distance.py::test_distance[float16-F-True-euclidean-50-100] - assert False
 +  where False = <function allclose at 0xfffee65add70>(array([[0.        , 2.94198351, 2.11872091, ..., 2.73895706, 2.80186958,\n        2.62724569],\n       [2.94198351, 0.  ...272 ],\n       [2.62724569, 2.84470779, 2.48090272, ..., 2.65241563, 2.7694272 ,\n        0.        ]], shape=(100, 100)), array([[0.       , 2.939494 , 2.1176343, ..., 2.738613 , 2.8034577,\n        2.625    ],\n       [2.939494 , 0.       , ...     [2.625    , 2.8449516, 2.4811792, ..., 2.6516504, 2.769815 ,\n        0.       ]], shape=(100, 100), dtype=float32), atol=0.1, rtol=0.1)
 +    where <function allclose at 0xfffee65add70> = np.allclose
====== 1 failed, 1917 passed, 116 skipped, 2 xfailed in 105.19s (0:01:45) ======

(build link)

That looks like a numerical-precision thing (which can sometimes show up as a flaky test), but I observed it on consecutive runs.
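For context, the failing assertion in the log compares distances computed from float16 inputs against a float32 NumPy reference via `np.allclose(..., atol=0.1, rtol=0.1)`. The sketch below reproduces that comparison pattern in plain NumPy, with the cuVS call replaced by a float16-accumulating NumPy computation so it is self-contained; the shapes follow the failing parametrization `test_distance[float16-F-True-euclidean-50-100]`, but the random data and the float16 accumulation path are assumptions for illustration, not the actual kernel.

```python
# Hypothetical sketch of the check the failing test performs: pairwise
# euclidean distances from float16 inputs vs. a float32 reference,
# compared with loose tolerances (atol=0.1, rtol=0.1) as in the CI log.
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((100, 50)).astype(np.float16)  # 100 points, 50 dims

# Reference distances computed in float32.
x32 = x.astype(np.float32)
sq = (x32[:, None, :] - x32[None, :, :]) ** 2
expected = np.sqrt(sq.sum(axis=-1))

# A lower-precision path: accumulate the squared differences in float16,
# roughly mimicking what a half-precision kernel might do.
sq16 = (x[:, None, :] - x[None, :, :]) ** 2
got = np.sqrt(sq16.sum(axis=-1, dtype=np.float16)).astype(np.float32)

# The test's assertion; with float16 accumulation this sits close to the
# tolerance boundary, which is the kind of borderline result the log shows.
print(np.allclose(got, expected, atol=0.1, rtol=0.1))
```

Because the error sits near the tolerance boundary rather than fluctuating randomly, a failure like this tends to reproduce on every run, consistent with it failing on consecutive reruns.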

@bdice
Contributor Author

bdice commented Jan 22, 2025

#596 looks like it could be related to the precision error. @rhdong Can you confirm if your PR is expected to fix this failure?

@rhdong
Member

rhdong commented Jan 22, 2025

> good news: the wheel tests that had been failing because of the cuBLAS issues are passing!
>
> bad news: 1 wheel test is failing:
>
> =========================== short test summary info ============================
> FAILED python/cuvs/cuvs/test/test_distance.py::test_distance[float16-F-True-euclidean-50-100] - assert False
>  +  where False = <function allclose at 0xfffee65add70>(array([[0.        , 2.94198351, 2.11872091, ..., 2.73895706, 2.80186958,\n        2.62724569],\n       [2.94198351, 0.  ...272 ],\n       [2.62724569, 2.84470779, 2.48090272, ..., 2.65241563, 2.7694272 ,\n        0.        ]], shape=(100, 100)), array([[0.       , 2.939494 , 2.1176343, ..., 2.738613 , 2.8034577,\n        2.625    ],\n       [2.939494 , 0.       , ...     [2.625    , 2.8449516, 2.4811792, ..., 2.6516504, 2.769815 ,\n        0.       ]], shape=(100, 100), dtype=float32), atol=0.1, rtol=0.1)
>  +    where <function allclose at 0xfffee65add70> = np.allclose
> ====== 1 failed, 1917 passed, 116 skipped, 2 xfailed in 105.19s (0:01:45) ======
>
> (build link)
>
> That looks like a numerical-precision thing (which can sometimes show up as a flaky test), but I observed it on consecutive runs.

Hi @jameslamb, that PR should resolve this issue. Please rerun your tests and ignore the failure for now.

@vyasr
Contributor

vyasr commented Jan 22, 2025

How many times should we try a rerun? Looks like it's failed three times now.

@cjnolet
Member

cjnolet commented Jan 22, 2025

@vyasr @jameslamb cuVS CI started failing when the script that runs the Python tests was fixed. I'm not sure exactly which tests were or weren't running before that, because I had verified myself that some Python tests were running in CI prior to the fix. However, I suspect these particular tests hadn't been running since around October, and that's why we're only now seeing the failures.

One failure seems related to cuBLAS; the other seems related to precision, or to a bug in a distance function/computation.
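As a minimal illustration of the precision hypothesis (this is not the cuVS kernel, just an assumed float16 accumulation for comparison): half precision carries roughly three decimal digits, so squaring and summing coordinates in float16 accumulates rounding error that float32 or float64 arithmetic largely avoids.

```python
# Illustrative only: compare machine epsilon across dtypes, then show the
# relative error of a float16 squared-distance accumulation over 50 dims
# against a float64 reference. Data is random, not from the failing test.
import numpy as np

# Machine epsilon: the relative rounding error of a single operation.
print(np.finfo(np.float16).eps)   # 2**-10 ~= 9.77e-4
print(np.finfo(np.float32).eps)   # 2**-23 ~= 1.19e-7

rng = np.random.default_rng(0)
d = rng.random(50)                # one pair's coordinate differences
ref = np.sqrt(np.sum(d.astype(np.float64) ** 2))               # reference
half = np.sqrt(np.sum(d.astype(np.float16) ** 2, dtype=np.float16))
print(abs(float(half) - ref) / ref)  # relative error of the float16 path
```

A per-operation error of ~1e-3 compounded over a 50-term sum can plausibly land near a 10% combined `atol`/`rtol` boundary for small distances, which would explain a deterministic rather than flaky failure.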

@jameslamb
Member

Oh wow! Thanks for that context.

> One failure seems related to cuBLAS,

Take a look at "cuVS CI failures" in rapidsai/build-planning#137. If what you're referring to is the same as those logs, then that issue is now fixed.

> another seems related to precision or a bug in a distance function/computation

Ok yep, that's the one we're running into here, I think: #601 (comment)

@rhdong
Member

rhdong commented Jan 22, 2025

> How many times should we try a rerun? Looks like it's failed three times now.

Well... it's like drawing consecutive Aces in a poker game. I just reran it; let's see. Also, #596 is close to passing all CI tests, so at worst we can count on merging that one first.

@vyasr
Contributor

vyasr commented Jan 22, 2025

Ha, yes. At this point I think we'll probably wind up waiting for #596 to finish CI, but since the wheel tests are fast there's no harm in attempting a rerun and seeing what happens.
