Add public APIs to Access Underlying `cudf` and `pandas` Objects from `cudf.pandas` Proxy Objects #17629

galipremsagar · 2024-12-19T09:27:43Z

Description

Fixes: #17524
Fixes: rapidsai/cuml#6232
This PR introduces methods to access the real underlying cudf and pandas objects from cudf.pandas proxy objects. These methods ensure compatibility with libraries that are cudf-aware or pandas-aware.

This PR also speeds up cudf.pandas workflows. Timings from the script posted in rapidsai/cuml#6232:

branch-25.02:

cuML Label Encoder with cuDF-Pandas took 2.00794 seconds

This PR:

cuML Label Encoder with cuDF-Pandas took 0.09284 seconds

Changes:

Added get_gpu_object() and get_cpu_object() methods.
Updated faq.md with a section explaining how to use these methods.
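
As a rough illustration of the pattern these methods expose, here is a minimal pure-Python sketch. `ToyProxy`, its attributes, and `library_function` are hypothetical stand-ins, not the cudf.pandas implementation; only the method names get_gpu_object() and get_cpu_object() come from the list above. A tuple plays the role of the "GPU" object and a list the role of the "CPU" object.

```python
class ToyProxy:
    """Hypothetical stand-in for a cudf.pandas proxy object (illustration only)."""

    def __init__(self, wrapped, is_fast):
        self._wrapped = wrapped    # the real object currently held by the proxy
        self._is_fast = is_fast    # True if the wrapped object is the "GPU" flavor

    def get_gpu_object(self):
        # Hand back the wrapped object if it is already "fast", else convert it.
        return self._wrapped if self._is_fast else tuple(self._wrapped)

    def get_cpu_object(self):
        # Hand back the wrapped object if it is already "slow", else convert it.
        return list(self._wrapped) if self._is_fast else self._wrapped


def library_function(obj):
    """A hypothetical proxy-aware library call: unwrap once, then use the real object."""
    if isinstance(obj, ToyProxy):
        obj = obj.get_gpu_object()
    return sum(obj)


proxy = ToyProxy([1, 2, 3], is_fast=False)
total = library_function(proxy)
```

The point of the sketch is that a library unwraps the proxy once up front and then works on the real object, rather than paying proxy dispatch overhead on every call.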

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Matt711

Overall, this looks good to me. But I think we should add a bullet to the "Are there any limitations?" section of faq.md describing the implications for users of cudf.pandas and third-party libraries that are "cudf aware." For example, they could get a cupy array (not a numpy array) when working with xgboost. What do you think?

python/cudf/cudf/pandas/fast_slow_proxy.py

galipremsagar · 2025-01-25T00:35:35Z

@vyasr @bdice @Matt711 This is ready for another round of reviews.

bdice

How we document this for users is probably the most important aspect of this PR. I gave some suggestions; let me know what you think.

docs/cudf/source/cudf_pandas/faq.md

Co-authored-by: Bradley Dice <[email protected]>

mroeschke · 2025-01-25T01:43:06Z

python/cudf/cudf/core/dataframe.py

+                and (index is None or index_extracted)
+                and (columns is None or columns_extracted)
+            ) and (dtype is None and copy is None):
+                self.__dict__.update(data.__dict__)


Could we use _mimic_inplace instead?

mroeschke · 2025-01-25T01:44:38Z

python/cudf/cudf/utils/utils.py

@@ -451,3 +451,17 @@ def _datetime_timedelta_find_and_replace(
    except TypeError:
        result_col = original_column.copy(deep=True)
    return result_col  # type: ignore
+
+
+def _extract_from_proxy(proxy, fast=True):


Suggested change

def _extract_from_proxy(proxy, fast=True):

def _extract_from_proxy(proxy: Any, fast: bool=True) -> tuple[Any, bool]:
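
For readers following along, a helper with the suggested signature could look roughly like the sketch below. This is an assumption about its behavior, not the code in this PR: the body of `_extract_from_proxy` is not shown in the hunk above, so `extract_from_proxy_sketch` and `ToyProxy` are hypothetical names, and the accessor names are taken from the PR description. The boolean in the returned tuple is read here as "was anything extracted", in line with the `index_extracted` and `columns_extracted` flags in the dataframe.py hunk.

```python
from __future__ import annotations

from typing import Any


class ToyProxy:
    """Hypothetical proxy exposing the accessors named in the PR description."""

    def __init__(self, data):
        self._data = list(data)

    def get_gpu_object(self):
        return tuple(self._data)   # a tuple stands in for the "fast" object

    def get_cpu_object(self):
        return list(self._data)    # a list stands in for the "slow" object


def extract_from_proxy_sketch(proxy: Any, fast: bool = True) -> tuple[Any, bool]:
    """Return (underlying_object, True) for proxies, (input, False) otherwise."""
    method_name = "get_gpu_object" if fast else "get_cpu_object"
    method = getattr(proxy, method_name, None)
    if callable(method):
        return method(), True
    # Not a proxy (assumed): pass the input through and report no extraction.
    return proxy, False


fast_result = extract_from_proxy_sketch(ToyProxy([1, 2, 3]))
slow_result = extract_from_proxy_sketch(ToyProxy([1, 2, 3]), fast=False)
plain_result = extract_from_proxy_sketch([1, 2, 3])
```

Returning the flag alongside the object lets a caller decide whether a shortcut is safe without re-inspecting the input.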

mroeschke · 2025-01-25T01:56:01Z

python/cudf/cudf/pandas/_wrappers/pandas.py

+def _Series_dtype(self):
+    # Fast-path to extract dtype from the current
+    # object without round-tripping through the slow<->fast
+    return self._fsproxy_wrapped.dtype


I'm a little nervous about starting to do this, because we need to be sure the dtypes of the fast and slow objects are equal, correct? E.g., if _fsproxy_wrapped is a pandas.Series with the extension dtype pandas.Float64Dtype, would that break anything?
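
The concern can be pictured with a toy model. Everything below is hypothetical: `FakeSeries` and `ToySeriesProxy` are stand-ins, and the dtype strings are chosen only to show how a fast path that reads the dtype straight off the wrapped object would report whatever that object happens to carry. It does not show what cudf or pandas actually return.

```python
from dataclasses import dataclass


@dataclass
class FakeSeries:
    """Hypothetical stand-in for either a fast or a slow series."""
    values: list
    dtype: str


class ToySeriesProxy:
    """Hypothetical proxy whose dtype uses a fast path over the wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    @property
    def dtype(self):
        # Fast path: no slow<->fast conversion, just forward the wrapped dtype.
        return self._wrapped.dtype


# Same logical data, but the two wrapped objects label their dtype differently
# (assumed labels, for illustration only).
on_fast = ToySeriesProxy(FakeSeries([1.0, 2.0], dtype="float64"))
on_slow = ToySeriesProxy(FakeSeries([1.0, 2.0], dtype="Float64"))

dtypes_agree = on_fast.dtype == on_slow.dtype
```

If the two sides can disagree like this, the value a user sees from `.dtype` would depend on which object the proxy currently wraps, which is the situation the question above asks about.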

Add a public api to get fast slow objects

b5eea1f

galipremsagar added the improvement and non-breaking labels Dec 19, 2024

galipremsagar self-assigned this Dec 19, 2024

galipremsagar requested a review from a team as a code owner December 19, 2024 09:27

galipremsagar requested review from bdice and Matt711 December 19, 2024 09:27

github-actions bot added the Python and cudf.pandas labels Dec 19, 2024

Matt711 requested changes Dec 19, 2024

bdice reviewed Dec 19, 2024

galipremsagar mentioned this pull request Jan 24, 2025

[BUG] - cuML LabelEncoder is 200x slower with cuDF-Pandas vs cuDF rapidsai/cuml#6232

Open

galipremsagar added 6 commits January 24, 2025 11:19

Merge remote-tracking branch 'upstream/branch-25.02' into 17524

7bc76e5

update names and add fast paths

3cdfe94

centralize logic

34375dc

fix

31f9e99

cleanup

72ba73f

Merge branch 'branch-25.02' into 17524

3fd679f

galipremsagar added the 3 - Ready for Review label Jan 25, 2025

galipremsagar requested review from Matt711, vyasr and bdice January 25, 2025 00:51

bdice reviewed Jan 25, 2025

Apply suggestions from code review

37764c2

Co-authored-by: Bradley Dice <[email protected]>

bdice approved these changes Jan 25, 2025

mroeschke reviewed Jan 25, 2025
