[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF #6232

Open
cdeotte opened this issue Jan 17, 2025 · 6 comments · May be fixed by rapidsai/cudf#17629
@cdeotte

cdeotte commented Jan 17, 2025

Describe the bug
When label-encoding a column of strings, cuML's LabelEncoder is about 200x slower if the input is a cuDF-Pandas DataFrame than if it is a pure cuDF DataFrame.

Steps/Code to reproduce bug

%load_ext cudf.pandas

from time import time
import random, string
import pandas as pd, numpy as np, cudf
from cuml.preprocessing import LabelEncoder

def generate_unique_strings(count, length):
    chars = string.ascii_letters + string.digits
    unique_strings = set()

    while len(unique_strings) < count:
        new_string = ''.join(random.choices(chars, k=length))
        unique_strings.add(new_string)

    return list(unique_strings)

unique_strings = generate_unique_strings(1000, 13)
strings = np.random.choice(unique_strings,1_000_000,replace=True)

df_cudf_pandas = pd.DataFrame(strings)
LE = LabelEncoder()
start = time()
df_cudf_pandas[0] = LE.fit_transform(df_cudf_pandas[0])
elapsed_cudf_pandas = time()-start
print(f"cuML Label Encoder with cuDF-Pandas took {elapsed_cudf_pandas:.5f} seconds")

df_cudf = cudf.DataFrame(strings)
LE = LabelEncoder()
start = time()
df_cudf[0] = LE.fit_transform(df_cudf[0])
elapsed_cudf = time()-start
print(f"cuML Label Encoder with pure cuDF took {elapsed_cudf:.5f} seconds")

slowdown = elapsed_cudf_pandas / elapsed_cudf
print(f"cuML Label Encoder is slowdowned by a factor of {slowdown:.0f}x using cuDF-Pandas")

Expected behavior
We would expect cuDF-Pandas to run at the same speed as pure cuDF, not 200x slower.

Environment details (please complete the following information):
RAPIDS 24.12

@cdeotte cdeotte added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jan 17, 2025
@mroeschke
Contributor

So fit_transform is equivalent to

y = cudf.Series(y)
self.dtype = y.dtype if y.dtype != cp.dtype("O") else str
y = y.astype("category")
self.classes_ = y.cat.categories
return y.cat.codes

The slowest step is equivalent to doing

%load_ext cudf.pandas

import pandas as pd
import cudf

ser = pd.Series(["a" * 1_000_000])
cudf.Series(ser) # the slow step
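
For anyone unfamiliar with the categorical steps in the fit_transform snippet above, here is a plain-Python sketch of what astype("category") followed by cat.categories and cat.codes computes. It assumes the categories are the sorted unique values, as pandas does for sortable data. The function name label_encode is made up for illustration; this is not cuML's or cuDF's implementation.

```python
# Plain-Python illustration only; label_encode is a hypothetical helper,
# not part of cuML, cuDF, or pandas.
from typing import Hashable, Sequence


def label_encode(values: Sequence[Hashable]) -> tuple[list, list[int]]:
    """Return (classes, codes), mirroring cat.categories and cat.codes."""
    # Assumption: categories are the sorted unique values.
    classes = sorted(set(values))
    # Map each class to its position in the sorted list.
    index = {label: code for code, label in enumerate(classes)}
    # Replace every value by the integer code of its class.
    codes = [index[value] for value in values]
    return classes, codes


classes, codes = label_encode(["b", "a", "c", "a"])
print(classes)  # ['a', 'b', 'c']
print(codes)    # [1, 0, 2, 0]
```

The encoding itself is cheap, which is consistent with the observation that the cost sits in the cudf.Series(...) conversion rather than in the categorical steps.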

@vyasr
Contributor

vyasr commented Jan 24, 2025

We've generally said that using cudf and pandas together in a script that loads cudf.pandas is out of scope for us (see our known limitations). I'm not sure that we want to try and relax that. I haven't thought about the problems in a while, but I recall there being quite a few. Do you feel better about that possibility now, Matt?

Perhaps what we really want, especially given the various other issues that we've seen regarding cuml+cudf.pandas compatibility, is to help cuml develop better internal decision-making about whether to use pandas or cudf, based on whether the accelerator is active. That is what rapidsai/cudf#17524 was opened for as well. CC @galipremsagar

@galipremsagar
Contributor

galipremsagar commented Jan 24, 2025

I'll work on reviving rapidsai/cudf#17629 and include a fix for this issue; that should fix rapidsai/cudf#17524 and remove this slowdown.

@mroeschke
Contributor

mroeschke commented Jan 24, 2025

> I'm not sure that we want to try and relax that. I haven't thought about the problems in a while, but I recall there being quite a few. Do you feel better about that possibility now, Matt?

Yeah, I suspect this slowdown affects quite a few RAPIDS libraries that are built on top of cudf and also support pandas objects, so I think rapidsai/cudf#17629 would be a quicker win than going around to each library and auditing how it mixes pandas with cudf (though I think the latter is the cleaner solution in the end).

I'm not sure if this causes circular dependency issues, but I think it would be worthwhile for cudf classic to use the APIs developed in rapidsai/cudf#17629, so that other RAPIDS libraries don't have to worry about "what if the user has cudf.pandas enabled" and cudf classic always extracts the fast object when a RAPIDS library passes a cudf.pandas object to one of its APIs.
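
To make the "always extract the fast object" idea concrete, here is a rough sketch of the pattern. Everything in it is hypothetical: FastSlowProxy, fast_obj, unwrap_fast and classic_constructor are invented names for illustration and are not the APIs from rapidsai/cudf#17629 or anything that exists in cudf today.

```python
# Hypothetical sketch of the "unwrap at the API boundary" pattern.
# None of these names are real cudf or cudf.pandas APIs.
class FastSlowProxy:
    """Toy stand-in for a cudf.pandas proxy object."""

    def __init__(self, fast_obj=None, slow_obj=None):
        self.fast_obj = fast_obj  # GPU-backed object, if one exists
        self.slow_obj = slow_obj  # CPU-backed fallback object


def unwrap_fast(obj):
    """Return the fast object behind a proxy, or obj itself otherwise."""
    if isinstance(obj, FastSlowProxy) and obj.fast_obj is not None:
        return obj.fast_obj
    return obj


def classic_constructor(data):
    """Toy stand-in for a cudf classic entry point such as a constructor."""
    # Unwrapping here means callers never need to check for proxies.
    data = unwrap_fast(data)
    return data


proxy = FastSlowProxy(fast_obj="gpu-series", slow_obj="cpu-series")
print(classic_constructor(proxy))      # gpu-series
print(classic_constructor([1, 2, 3]))  # [1, 2, 3]
```

The point of doing the unwrap inside the classic API is that downstream libraries can keep passing whatever object the user gave them.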

@galipremsagar
Contributor

> I'm not sure if this causes circular dependency issues, but I think it would be worthwhile for cudf classic to use the APIs developed in rapidsai/cudf#17629, so that other RAPIDS libraries don't have to worry about "what if the user has cudf.pandas enabled" and cudf classic always extracts the fast object when a RAPIDS library passes a cudf.pandas object to one of its APIs.

That is my plan too. I'm working on it now.

@galipremsagar
Contributor

rapidsai/cudf#17629 will fix this issue.

Before fix:

(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python -m cudf.pandas new.py
cuML Label Encoder with cuDF-Pandas took 2.00794 seconds
cuML Label Encoder with pure cuDF took 0.00703 seconds
cuML Label Encoder is slowdowned by a factor of 286x using cuDF-Pandas

After fix:

(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python -m cudf.pandas new.py
cuML Label Encoder with cuDF-Pandas took 0.09284 seconds
cuML Label Encoder with pure cuDF took 0.00742 seconds
cuML Label Encoder is slowdowned by a factor of 13x using cuDF-Pandas

Now we are at a 13x slowdown. But if we swap the order in which the cudf and cudf.pandas steps run, here is the perf difference with the fix:

(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python -m cudf.pandas new.py
cuML Label Encoder with cuDF-Pandas took 0.09284 seconds
cuML Label Encoder with pure cuDF took 0.00742 seconds
cuML Label Encoder is slowdowned by a factor of 13x using cuDF-Pandas
(cudfdev) pgali@dgx19:/datasets/pgali/cudf$ python -m cudf.pandas new.py
cuML Label Encoder with pure cuDF took 0.09928 seconds
cuML Label Encoder with cuDF-Pandas took 0.00810 seconds
cuML Label Encoder is slowdowned by a factor of 0x using cuDF-Pandas

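So at this point the 13x slowdown between cudf.pandas and cudf is just noise: for some APIs the first execution is always slower, whichever code path runs first. (The reported 0x in the swapped run is the ratio 0.00810 / 0.09928 ≈ 0.08, rounded to zero decimals by the script.)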
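
One way to keep first-call overhead out of a comparison like this is to run each path once untimed before measuring. The helper below is a minimal sketch of that idea using only the standard library; time_after_warmup and its parameters are names made up here, and it is not what the repro script does.

```python
# Minimal sketch: discard warm-up calls, then report the best of several runs.
# time_after_warmup is a hypothetical helper, not part of any RAPIDS library.
from time import perf_counter


def time_after_warmup(fn, warmup=1, repeats=5):
    """Call fn() `warmup` times untimed, then return the fastest of `repeats` timed calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as lazy initialization
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return min(timings)


# Example with a CPU-only stand-in workload.
elapsed = time_after_warmup(lambda: sorted(range(100_000), reverse=True))
print(f"best of 5 after warm-up: {elapsed:.5f} seconds")
```

In the repro, fn would wrap the LabelEncoder().fit_transform(...) call for each input type, with a fresh input per call if the call mutates its argument.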
