Add public APIs to Access Underlying `cudf` and `pandas` Objects from `cudf.pandas` Proxy Objects #17629
base: branch-25.02
Overall, this looks good to me, but I think we should add a bullet to the "Are there any limitations?" section of `faq.md`. It should describe the implications for users of `cudf.pandas` and third-party libraries that are "cudf aware." For example, they could get a cupy array (not a numpy array) when working with xgboost. What do you think?
How we document this for users is probably the most important aspect of this PR. I gave some suggestions, let me know what you think.
Co-authored-by: Bradley Dice <[email protected]>
and (index is None or index_extracted)
and (columns is None or columns_extracted)
) and (dtype is None and copy is None):
    self.__dict__.update(data.__dict__)
Could we use `_mimic_inplace` instead?
@@ -451,3 +451,17 @@ def _datetime_timedelta_find_and_replace(
    except TypeError:
        result_col = original_column.copy(deep=True)
    return result_col  # type: ignore


def _extract_from_proxy(proxy, fast=True):
Suggested change:
- def _extract_from_proxy(proxy, fast=True):
+ def _extract_from_proxy(proxy: Any, fast: bool=True) -> tuple[Any, bool]:
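To make the suggested signature concrete, here is a minimal sketch of a helper with the same shape. It is not the PR's implementation: `ToyProxy` and `extract_from_proxy` are hypothetical names, and the meaning of the returned `bool` (whether anything was actually unwrapped) is an assumption based on the `index_extracted` and `columns_extracted` flags in the diff above.

```python
from typing import Any


class ToyProxy:
    """Hypothetical stand-in for a cudf.pandas proxy; not the real class."""

    def __init__(self, fast: Any = None, slow: Any = None) -> None:
        self._fast = fast
        self._slow = slow


def extract_from_proxy(proxy: Any, fast: bool = True) -> tuple[Any, bool]:
    """Return (obj, extracted).

    Toy proxies are unwrapped to their fast or slow object and flagged True;
    anything else is passed through unchanged and flagged False.
    """
    if isinstance(proxy, ToyProxy):
        return (proxy._fast if fast else proxy._slow), True
    return proxy, False


wrapped = ToyProxy(fast="gpu-frame", slow="cpu-frame")
print(extract_from_proxy(wrapped))              # ('gpu-frame', True)
print(extract_from_proxy(wrapped, fast=False))  # ('cpu-frame', True)
print(extract_from_proxy([1, 2]))               # ([1, 2], False)
```

With the return annotation spelled out, a caller can see at a glance that it must unpack two values and check the flag before trusting the first.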
def _Series_dtype(self):
    # Fast-path to extract dtype from the current
    # object without round-tripping through the slow<->fast
    return self._fsproxy_wrapped.dtype
I'm a little nervous to start doing this because we need to be sure the dtypes of the fast and slow objects are equal, correct? E.g., if `_fsproxy_wrapped` is a `pandas.Series` with the extension dtype `pandas.Float64Dtype`, would that break anything?
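The concern can be illustrated with a toy model. Everything below is hypothetical: `FastSeries`, `SlowSeries`, and `ToySeriesProxy` are stand-ins, and the dtype labels are plain strings chosen for illustration rather than real cudf or pandas dtypes. Only the attribute name `_fsproxy_wrapped` comes from the diff.

```python
class FastSeries:
    """Hypothetical GPU-side series with a made-up dtype label."""
    dtype = "float64"


class SlowSeries:
    """Hypothetical CPU-side series; the label stands in for an extension dtype."""
    dtype = "Float64"


class ToySeriesProxy:
    """Toy proxy that wraps either a fast or a slow object."""

    def __init__(self, wrapped):
        self._fsproxy_wrapped = wrapped

    @property
    def dtype(self):
        # Fast path: read dtype straight off whatever is wrapped right now,
        # with no conversion between the fast and slow representations.
        return self._fsproxy_wrapped.dtype


def dtypes_agree(fast, slow) -> bool:
    """True when the fast path reports the same dtype for both wrapped objects."""
    return ToySeriesProxy(fast).dtype == ToySeriesProxy(slow).dtype


print(dtypes_agree(FastSeries(), SlowSeries()))  # False
print(dtypes_agree(FastSeries(), FastSeries()))  # True
```

In this sketch the answer a caller gets depends on which object the proxy happens to wrap, which is the situation the question above asks about.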
Description
Fixes: #17524
Fixes: rapidsai/cuml#6232
This PR introduces methods to access the real underlying `cudf` and `pandas` objects from `cudf.pandas` proxy objects. These methods ensure compatibility with libraries that are `cudf` or `pandas` aware.

This PR also gives a performance boost to `cudf-pandas` workflows. Speeds from the script posted in rapidsai/cuml#6232:

`branch-25.02`:

`This PR`:

Changes:
- `get_gpu_object()` and `get_cpu_object()` methods.
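As a rough illustration of how a "cudf aware" or "pandas aware" library might use such accessors, here is a toy sketch. The method names `get_gpu_object()` and `get_cpu_object()` are taken from the description above; `GpuFrame`, `CpuFrame`, `ToyFrameProxy`, and `unwrap_for_library` are hypothetical and do not come from cudf.

```python
class GpuFrame:
    """Hypothetical placeholder for a GPU-backed frame."""


class CpuFrame:
    """Hypothetical placeholder for a CPU-backed frame."""


class ToyFrameProxy:
    """Toy proxy exposing the two accessor names listed in the description."""

    def __init__(self, gpu: GpuFrame, cpu: CpuFrame) -> None:
        self._gpu = gpu
        self._cpu = cpu

    def get_gpu_object(self) -> GpuFrame:
        return self._gpu

    def get_cpu_object(self) -> CpuFrame:
        return self._cpu


def unwrap_for_library(obj, prefer_gpu: bool = True):
    """Return the underlying object if obj has the accessor, else obj itself."""
    name = "get_gpu_object" if prefer_gpu else "get_cpu_object"
    accessor = getattr(obj, name, None)
    return accessor() if callable(accessor) else obj


proxy = ToyFrameProxy(GpuFrame(), CpuFrame())
print(type(unwrap_for_library(proxy)).__name__)                    # GpuFrame
print(type(unwrap_for_library(proxy, prefer_gpu=False)).__name__)  # CpuFrame
print(unwrap_for_library("not a proxy"))                           # not a proxy
```

Duck-typing on the accessor name keeps the consuming library free of a hard import on the proxy module, at the cost of silently passing through objects that lack the method.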