[NIGHTLY] v25.02.00
Pre-release
rapids-bot released this 21 Nov 23:18
🔗 Links
🚨 Breaking Changes
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
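To make the `segmented_reduce` change (#17437) concrete, here is a minimal pure-Python model of a segmented reduction. It is a sketch, not the libcudf API: the function name, the list-based signature, and the decision to raise when input is non-empty but offsets are empty are all assumptions made for illustration. The only behavior taken from the note above is that empty input combined with empty offsets yields an empty result.

```python
from typing import Callable, List, Sequence


def segmented_reduce_model(
    values: Sequence[int],
    offsets: Sequence[int],
    op: Callable[[Sequence[int]], int] = sum,
) -> List[int]:
    """Reduce each segment values[offsets[i]:offsets[i + 1]] with `op`.

    Hypothetical stand-in for illustration; not the cudf implementation.
    """
    if len(offsets) == 0:
        # Modeled on #17437: no offsets and no input means no segments,
        # so the result is empty rather than an error.
        if len(values) != 0:
            # Assumption: this sketch rejects the inconsistent case.
            raise ValueError("non-empty input requires offsets")
        return []
    # n + 1 offsets describe n segments.
    return [op(values[lo:hi]) for lo, hi in zip(offsets[:-1], offsets[1:])]


print(segmented_reduce_model([1, 2, 3, 4, 5], [0, 2, 5]))  # [3, 12]
print(segmented_reduce_model([], []))                      # []
```

Callers that previously special-cased empty inputs before calling the reduction may be able to drop that guard, though that depends on the calling code.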
🐛 Bug Fixes
- Resolve race-condition in `disable_module_accelerator` (#17811) @galipremsagar
- Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
- Disable intended disabled ORC tests (#17790) @davidwendt
- Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
- Fix various `.str` methods for pandas compatibility (#17782) @mroeschke
- Fix `count` API issue about ignoring nan values (#17779) @galipremsagar
- Add `numba` pinning to `cudf` repo (#17777) @galipremsagar
- Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
- allow deselecting nvcomp wheels (#17774) @jameslamb
- Use the `aligned_resource_adaptor` to allocate bloom filter device buffers (#17758) @mhaseeb123
- Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
- Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
- [BUG] xfail Polars excel test (#17731) @Matt711
- Require to implement `AutoCloseable` for the classes derived from `HostUDFWrapper` (#17727) @ttnghia
- Remove jlowe as a java committer since he retired (#17725) @tgravescs
- Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
- Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
- Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
- Fix formatting in logging (#17680) @vuule
- convert all nulls to nans in a specific scenario (#17677) @galipremsagar
- Define cudf repr methods on the Column (#17675) @mroeschke
- Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
- Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
- Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
- Fix a minor potential i32 overflow in `thrust::transform_exclusive_scan` in PQ reader preprocessing (#17617) @mhaseeb123
- Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
- Fix `dask_cudf.read_csv` (#17612) @rjzamora
- Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
- Correctly accept a `pandas.CategoricalDtype(pandas.IntervalDtype(...), ...)` type (#17604) @mroeschke
- Add ability to modify and propagate `names` of `columns` object (#17597) @galipremsagar
- Ignore NaN correctly in .quantile (#17593) @mroeschke
- Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
- Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
- Specify a version for rapids_logger dependency (#17573) @jlowe
- Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
- [JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
- Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
- Fix nvcc-imposed UB in `constexpr` functions (#17534) @vuule
- Add anonymous namespace to libcudf test source (#17529) @davidwendt
- Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
- Fix libcudf compile error when logging is disabled (#17512) @davidwendt
- Fix Dask-cuDF `clip` APIs (#17509) @rjzamora
- Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
- Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
- Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
- Workaround for a misaligned access in `read_csv` on some CUDA versions (#17477) @vuule
- Fix some possible thread-id overflow calculations (#17473) @davidwendt
- Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
- Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
- Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
- Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
- Fix Debug-mode failing Arrow test (#17405) @zeroshade
- Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann
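Several of the fixes above (#17473, #17617, #17633) concern integer overflow in index or size calculations. The sketch below models the general class of problem in plain Python by emulating unsigned 32-bit wraparound. It is not the libcudf code: the function names and the specific block and thread values are hypothetical, chosen only so that the 32-bit product exceeds its range.

```python
U32_MASK = 0xFFFFFFFF  # unsigned 32-bit arithmetic wraps modulo 2**32


def global_thread_id_u32(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Emulate computing the global thread index in 32-bit unsigned math."""
    return (block_idx * block_dim + thread_idx) & U32_MASK


def global_thread_id_i64(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Same calculation after widening; 64 bits is ample for these inputs."""
    return block_idx * block_dim + thread_idx


# 2**22 blocks of 1024 threads: the product is exactly 2**32.
block_idx, block_dim, thread_idx = 2**22, 1024, 5

print(global_thread_id_u32(block_idx, block_dim, thread_idx))  # 5 (wrapped)
print(global_thread_id_i64(block_idx, block_dim, thread_idx))  # 4294967301
```

The usual remedy for this kind of bug is to widen the operands before multiplying, so the intermediate product never passes through the narrower type.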
📖 Documentation
- Cross-link cudf.pandas profiler documentation. (#17668) @bdice
- Document interpreter install command for cudf.pandas (#17358) @bdice
- add comment to Series.tolist method (#17350) @tequilayu
🚀 New Features
- Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
- Support `dask_expr` migration into `dask.dataframe` (#17704) @rjzamora
- Make tests build without relaxed constexpr (#17691) @PointKernel
- Set default logger level to warn (#17684) @vyasr
- Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
- Control pinned memory use with environment variables (#17657) @vuule
- Host compression (#17656) @vuule
- Enable text build without relying on relaxed constexpr (#17647) @PointKernel
- Implement `HOST_UDF` aggregation for reduction and segmented reduction (#17645) @ttnghia
- Add JSON reader options structs to pylibcudf (#17614) @Matt711
- Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
- Add JSON Writer options classes to pylibcudf (#17606) @Matt711
- Add ORC reader options structs to pylibcudf (#17601) @Matt711
- Add Avro Reader options classes to pylibcudf (#17599) @Matt711
- Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
- Implement `HOST_UDF` aggregation for groupby (#17592) @ttnghia
- Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
- Add partition-wise `Select` support to cuDF-Polars (#17495) @rjzamora
- Add multi-partition `Scan` support to cuDF-Polars (#17494) @rjzamora
- Migrate `cudf::io::merge_row_group_metadata` to pylibcudf (#17491) @Matt711
- Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
- Add multi-partition `DataFrameScan` support to cuDF-Polars (#17441) @rjzamora
- Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
- Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
- Add CSV Reader options classes to pylibcudf (#17412) @Matt711
- Add support for `pylibcudf.DataType` serialization (#17352) @pentschev
- Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
- Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
- Expose stream-ordering to groupby APIs (#17324) @shrshi
- Migrate ORC Writer to pylibcudf (#17310) @Matt711
- Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123
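The bloom filter feature (#17289) uses per-row-group filters to skip row groups that cannot contain a queried value. The following is a rough pure-Python sketch of that idea only. Everything in it is an assumption for illustration: the `BloomFilter` class, the BLAKE2b-based hashing, the bit-array size, and the `prune_row_groups` helper are hypothetical and do not reflect the Parquet bloom filter format or the cuDF reader.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Toy bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: object) -> Iterator[int]:
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            # Use the seed as a salt to derive independent hash functions.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, value: object) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: object) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def build_filter(values: Iterable[object]) -> BloomFilter:
    bloom = BloomFilter()
    for value in values:
        bloom.add(value)
    return bloom


def prune_row_groups(filters: List[BloomFilter], literal: object) -> List[int]:
    """Return indices of row groups that might contain `literal`."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(literal)]


row_groups = [[1, 2, 3], [10, 20, 30]]
filters = [build_filter(group) for group in row_groups]
print(prune_row_groups(filters, 20))  # always includes 1; 0 only on a false positive
```

Because a bloom filter can return false positives, pruning of this kind can only rule row groups out. Any row group that survives still has to be read and filtered exactly.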
🛠️ Improvements
- Remove incorrect calls to set architectures (#17813) @vyasr
- Add support for `pyarrow-19` (#17794) @galipremsagar
- Reduce libcudf memcheck tests output (#17791) @davidwendt
- Make cudf build with latest CCCL (#17788) @miscco
- Update how to manage host UDF instance (#17770) @res-life
- Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
- Standardize methods used from `cudf.core._internals` (#17765) @mroeschke
- Deprecate dataframe protocol (#17736) @vyasr
- Add parquet reader long row test (#17735) @pmattione-nvidia
- Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
- Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
- Bounding pool size in multi-batch JSON reader (#17724) @shrshi
- Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
- Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
- Add more aggregation methods in pylibcudf (#17717) @mroeschke
- Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
- Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
- Add pylibcudf.null_mask.null_count (#17711) @mroeschke
- Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
- Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
- Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
- Fix parquet reader list bug (#17699) @pmattione-nvidia
- Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
- Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
- Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
- Use latest ci-conda images (#17690) @bdice
- Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
- remove find_package(Python) in libcudf build (#17683) @jameslamb
- Fix build metrics report format with long placehold filenames (#17679) @davidwendt
- Use rapids-cmake for the logger (#17674) @vyasr
- Java Parquet reads via multiple host buffers (#17673) @jlowe
- Remove cudf._libs.types.pyx (#17665) @mroeschke
- Add support for `Groupby.cumprod` (#17661) @galipremsagar
- Implement `.dt.total_seconds` (#17659) @galipremsagar
- Avoid shallow copies in groupby methods (#17646) @mroeschke
- Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
- Add seed parameter to hash_character_ngrams (#17643) @davidwendt
- Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
- Remove pragma GCC diagnostic from source files (#17637) @davidwendt
- Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
- Support compression= in DataFrame.to_json (#17634) @mroeschke
- Bump Polars version to <1.18 (#17632) @Matt711
- Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
- Use PyNVML 12 (#17627) @jakirkham
- Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
- Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
- Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
- Clean up namespaces and improve compression-related headers (#17621) @vuule
- Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
- Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
- update telemetry actions to fluent-bit friendly style (#17615) @msarahan
- Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
- Bump the oldest `pyarrow` version to `14.0.2` in test matrix (#17611) @galipremsagar
- Use `[[nodiscard]]` attribute before `__device__` (#17608) @vuule
- Use `host_vector` in `flatten_single_pass_aggs` (#17605) @vuule
- Stop memory_resource.hpp from including itself (#17603) @vyasr
- Replace the outdated cuco window concept with buckets (#17602) @PointKernel
- Check if nightlies have succeeded recently enough (#17596) @vyasr
- Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
- A couple of fixes in rapids-logger usage (#17588) @vyasr
- Simplify expression transformer in Parquet predicate pushdown with `ast::tree` (#17587) @mhaseeb123
- Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
- Use cuda-python `cuda.bindings` import names. (#17585) @bdice
- Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
- Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
- Remove unused code of json schema in JSON reader (#17581) @karthikeyann
- Expose Scalar's constructor and `Scalar#getScalarHandle()` to public (#17580) @ttnghia
- Allow large strings in nvtext benchmarks (#17579) @davidwendt
- Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
- Use batched memcpy when writing ORC statistics (#17572) @vuule
- Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
- Update version references in workflow (#17568) @AyodeAwe
- Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
- Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
- Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
- Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
- Replace direct `cudaMemcpyAsync` calls with utility functions (within `/include`) (#17557) @vuule
- Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
- gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
- Replace direct `cudaMemcpyAsync` calls with utility functions (within `/src`) (#17550) @vuule
- Remove unused `BufferArrayFromVector` (#17549) @Matt711
- Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
- Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
- Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
- Mark more constexpr functions as device-available (#17545) @vyasr
- Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
- Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
- Add XXHash_32 hasher (#17533) @PointKernel
- Remove unused masked keyword in column_empty (#17530) @mroeschke
- Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
- [JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
- Force Thrust to use 32-bit offset type. (#17523) @bdice
- Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
- Replaces uses of `cudf._lib.Column.from_unique_ptr` with `pylibcudf.Column.from_libcudf` (#17517) @Matt711
- Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
- Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
- Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
- Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
- Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
- Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
- Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
- Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
- Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
- skip most CI on devcontainer-only changes (#17465) @jameslamb
- Set build type for all examples (#17463) @vyasr
- Update the hook versions in pre-commit (#17462) @wence-
- Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
- Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
- Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
- Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
- Clean up xxhash_64 implementations (#17455) @PointKernel
- Update Hadoop dependency in Java pom (#17454) @jlowe
- Adapt to rmm logger changes (#17451) @vyasr
- Require approval to run CI on draft PRs (#17450) @bdice
- Expose stream-ordering in nvtext API (#17446) @shrshi
- Use exec_policy_nosync in write_json (#17445) @karthikeyann
- Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
- Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
- Expose stream-ordering in replace API (#17436) @shrshi
- Expose stream-ordering in copying APIs (#17435) @shrshi
- Expose stream-ordering in column view APIs (#17434) @shrshi
- Apply clang-tidy autofixes from new rules (#17431) @vyasr
- Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
- Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
- Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
- Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
- Remove the unused detail `int_fastdiv.h` header (#17426) @PointKernel
- Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
- Remove cudf._lib.quantile (#17424) @mroeschke
- Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
- Rework minhash APIs for deprecation cycle (#17421) @davidwendt
- Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
- Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
- Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
- Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
- Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
- Run clang-tidy checks in PR CI (#17407) @bdice
- Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
- Expose stream-ordering to strings attribute APIs (#17398) @shrshi
- Expose stream-ordering to interop APIs (#17397) @shrshi
- Remove unused type aliases (#17396) @PointKernel
- Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
- Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
- Change indices for dictionary column to signed integer type (#17390) @davidwendt
- Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
- Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
- Remove unused IO utilities from cudf python (#17374) @Matt711
- Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
- Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
- Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
- Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
- Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
- Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
- Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
- Move make_strings_column benchmark to nvbench (#17340) @davidwendt
- Improve strings contains/find performance for smaller strings (#17330) @davidwendt
- Use rapids-logger to generate the cudf logger (#17307) @vyasr
- Mukernels strings (#17286) @pmattione-nvidia
- Add write_parquet to pylibcudf (#17263) @mroeschke
- Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
- Add breaking change workflow trigger (#17248) @AyodeAwe
- Precompute AST arity (#17234) @bdice
- Update to CCCL 2.7.0-rc2. (#17233) @bdice
- Make `column_empty` mask buffer creation consistent with libcudf (#16715) @mroeschke