[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF #6232
Comments
So cuml/python/cuml/cuml/preprocessing/LabelEncoder.py Lines 235 to 241 in 8753e76
The slowest step is equivalent to doing:
%load_ext cudf.pandas
import pandas as pd
import cudf
ser = pd.Series(["a" * 1_000_000])
cudf.Series(ser)  # the slow step
We've generally said that using cudf and pandas together in a script that loads cudf.pandas is out of scope for us (see our known limitations). I'm not sure that we want to try to relax that. I haven't thought about the problems in a while, but I recall there being quite a few. Do you feel better about that possibility now, Matt?

Perhaps what we really want, especially given the various other issues we've seen regarding cuml+cudf.pandas compatibility, is to help cuml develop better internal decision-making on whether to use pandas or cudf based on whether the accelerator is active. That is what rapidsai/cudf#17524 was opened for as well. CC @galipremsagar
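The kind of check a downstream library would need for that decision could be sketched as follows. This is a minimal sketch, not cuML's or cudf's implementation: `accelerator_active` is a hypothetical helper name, and the boolean `LOADED` attribute on the `cudf.pandas` module is an assumption about how the accelerator reports its state. If the module was never imported or the attribute is missing, the check simply reports `False`.

```python
import sys


def accelerator_active() -> bool:
    """Best-effort check for whether cudf.pandas has been loaded.

    Assumption: the ``cudf.pandas`` module exposes a boolean ``LOADED`` flag.
    If the module was never imported, or the flag is absent, return False.
    """
    module = sys.modules.get("cudf.pandas")
    return bool(getattr(module, "LOADED", False))
```

A library could call this once at the top of a code path and branch on the result, instead of constructing cudf objects from proxied pandas objects.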
I'll work on reviving rapidsai/cudf#17629 & include a fix to this issue that should fix rapidsai/cudf#17524 and remove this slowdown.
Yeah, I suspect this slowdown affects quite a few RAPIDS libraries that are built on top of cudf and also support pandas objects, so I think rapidsai/cudf#17629 would be a quicker win than going around to each library and auditing its use of pandas with cudf (though I think the latter is the cleaner solution in the end). I'm not sure whether this causes circular dependency issues, but I think it would be worthwhile for cudf classic to use the APIs developed in rapidsai/cudf#17629. That way other RAPIDS libraries don't have to worry about "what if the user has cudf.pandas enabled": cudf classic would always extract the fast object whenever a RAPIDS library passes a cudf.pandas object to a cudf classic API.
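The "always extract the fast object" idea can be illustrated with a small duck-typed helper. This is a sketch under stated assumptions, not the API from rapidsai/cudf#17629: `extract_fast_object` is a hypothetical name, and the `as_gpu_object()` accessor on proxy objects is assumed for illustration. Anything that lacks the accessor passes through unchanged.

```python
from typing import Any


def extract_fast_object(obj: Any) -> Any:
    """Return the GPU-backed object behind a cudf.pandas proxy, if any.

    Assumption: proxy objects expose an ``as_gpu_object()`` method.
    Objects without that method are returned unchanged.
    """
    accessor = getattr(obj, "as_gpu_object", None)
    return accessor() if callable(accessor) else obj
```

Calling such a helper at the entry point of a cudf classic API would mean that callers never need to know whether the accelerator is enabled.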
That is my plan too. I'm working on it now.
rapidsai/cudf#17629 will fix this issue. Before fix:
After fix:
Now we are at 13x slowdown. But if we alter the ordering of
So at this point 13x slowdown between
Describe the bug
When label encoding a column of strings, cuML LabelEncoder is 200x slower if we input a cuDF-Pandas dataframe than if we input a pure cuDF dataframe.
Steps/Code to reproduce bug
Expected behavior
We would expect cuDF-Pandas to have the same speed as pure cuDF, not be 200x slower.
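A slowdown factor like the one reported here can be measured with a small timing harness. The sketch below is generic and is not the reproduction code from this issue: `best_time` and `slowdown` are hypothetical helper names, and the two callables stand in for whatever cuDF-Pandas and pure-cuDF code paths are being compared.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(slow_fn: Callable[[], object],
             fast_fn: Callable[[], object],
             repeats: int = 5) -> float:
    """Return how many times slower slow_fn is than fast_fn."""
    fast = max(best_time(fast_fn, repeats), 1e-12)  # guard against division by zero
    return best_time(slow_fn, repeats) / fast
```

Taking the best of several runs reduces the influence of one-off costs such as warm-up on the first call.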
Environment details (please complete the following information):
RAPIDS 24.12