Release [NIGHTLY] v25.02.00 · rapidsai/cudf

🔗 Links

🚨 Breaking Changes

Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Change indices for dictionary column to signed integer type (#17390) @davidwendt

🐛 Bug Fixes

Resolve race-condition in disable_module_accelerator (#17811) @galipremsagar
Make Series(dtype=object) raise in mode.pandas_compat with non string data (#17804) @mroeschke
Disable intended disabled ORC tests (#17790) @davidwendt
Fix empty DataFrame construction not returning RangeIndex columns (#17784) @mroeschke
Fix various .str methods for pandas compatibility (#17782) @mroeschke
Fix count API issue about ignoring nan values (#17779) @galipremsagar
Add numba pinning to cudf repo (#17777) @galipremsagar
Allow .sort_values(na_position=) to include NaNs in mode.pandas_compatible (#17776) @mroeschke
allow deselecting nvcomp wheels (#17774) @jameslamb
Use the aligned_resource_adaptor to allocate bloom filter device buffers (#17758) @mhaseeb123
Avoid instantiating bloom filter query function for nested and bool types (#17753) @mhaseeb123
Fix DataFrame.merge(Series, how="left"/"right") on column and index not resulting in a RangeIndex (#17739) @mroeschke
[BUG] xfail Polars excel test (#17731) @Matt711
Require to implement AutoCloseable for the classes derived from HostUDFWrapper (#17727) @ttnghia
Remove jlowe as a java committer since he retired (#17725) @tgravescs
Prevent use of invalid grid sizes in ORC reader and writer (#17709) @vuule
Fix writing of compressed ORC files with large stripe footers (#17700) @vuule
Fix cudf.polars sum of empty not equalling zero (#17685) @mroeschke
Fix formatting in logging (#17680) @vuule
convert all nulls to nans in a specific scenario (#17677) @galipremsagar
Define cudf repr methods on the Column (#17675) @mroeschke
Fix groupby.len with null values in cudf.polars (#17671) @mroeschke
Fix: DataFrameGroupBy.get_group was raising with length>1 tuples (#17653) @MarcoGorelli
Fix possible int overflow in compute_mixed_join_output_size (#17633) @davidwendt
Fix a minor potential i32 overflow in thrust::transform_exclusive_scan in PQ reader preprocessing (#17617) @mhaseeb123
Fix failing xgboost test in the cudf.pandas third-party integration tests (#17616) @Matt711
Fix dask_cudf.read_csv (#17612) @rjzamora
Fix memcheck error in ReplaceTest.NormalizeNansAndZerosMutable gtest (#17610) @davidwendt
Correctly accept a pandas.CategoricalDtype(pandas.IntervalDtype(...), ...) type (#17604) @mroeschke
Add ability to modify and propagate names of columns object (#17597) @galipremsagar
Ignore NaN correctly in .quantile (#17593) @mroeschke
Fix groupby argmin/max gather of sorted-order indices (#17591) @davidwendt
Fix ctest fail running libcudf tests in a Debug build (#17576) @davidwendt
Specify a version for rapids_logger dependency (#17573) @jlowe
Fix the ORC decoding bug for the timestamp data (#17570) @kingcrimsontianyu
[JNI] remove rmm argument to set rw access for fabric handles (#17553) @abellina
Document undefined behavior in div_rounding_up_safe (#17542) @davidwendt
Fix nvcc-imposed UB in constexpr functions (#17534) @vuule
Add anonymous namespace to libcudf test source (#17529) @davidwendt
Propagate failures in pandas integration tests and Skip failing tests (#17521) @Matt711
Fix libcudf compile error when logging is disabled (#17512) @davidwendt
Fix Dask-cuDF clip APIs (#17509) @rjzamora
Fix pylibcudf to_arrow with multiple nested data types (#17504) @mroeschke
Fix groupby(as_index=False).size not resetting index (#17499) @mroeschke
Revert "Temporarily skip tests due to dask/distributed#8953" (#17492) @Matt711
Workaround for a misaligned access in read_csv on some CUDA versions (#17477) @vuule
Fix some possible thread-id overflow calculations (#17473) @davidwendt
Temporarily skip tests due to dask/distributed#8953 (#17472) @wence-
Detect mismatches in begin and end tokens returned by JSON tokenizer FST (#17471) @shrshi
Support dask>=2024.11.2 in Dask cuDF (#17439) @rjzamora
Fix write_json failure for zero columns in table/struct (#17414) @karthikeyann
Fix Debug-mode failing Arrow test (#17405) @zeroshade
Fix all null list column with missing child column in JSON reader (#17348) @karthikeyann

📖 Documentation

Cross-link cudf.pandas profiler documentation. (#17668) @bdice
Document interpreter install command for cudf.pandas (#17358) @bdice
add comment to Series.tolist method (#17350) @tequilayu

🚀 New Features

Add public interop functions between pylibcudf and cudf classic (#17730) @Matt711
Support dask_expr migration into dask.dataframe (#17704) @rjzamora
Make tests build without relaxed constexpr (#17691) @PointKernel
Set default logger level to warn (#17684) @vyasr
Support multithreaded reading of compressed buffers in JSON reader (#17670) @shrshi
Control pinned memory use with environment variables (#17657) @vuule
Host compression (#17656) @vuule
Enable text build without relying on relaxed constexpr (#17647) @PointKernel
Implement HOST_UDF aggregation for reduction and segmented reduction (#17645) @ttnghia
Add JSON reader options structs to pylibcudf (#17614) @Matt711
Refactor distinct hash join to handle multiple probes with the same build table (#17609) @PointKernel
Add JSON Writer options classes to pylibcudf (#17606) @Matt711
Add ORC reader options structs to pylibcudf (#17601) @Matt711
Add Avro Reader options classes to pylibcudf (#17599) @Matt711
Enable binaryop build without relying on relaxed constexpr (#17598) @PointKernel
Implement HOST_UDF aggregation for groupby (#17592) @ttnghia
Plumb pylibcudf.io.parquet options classes through cudf python (#17506) @Matt711
Add partition-wise Select support to cuDF-Polars (#17495) @rjzamora
Add multi-partition Scan support to cuDF-Polars (#17494) @rjzamora
Migrate cudf::io::merge_row_group_metadata to pylibcudf (#17491) @Matt711
Add Parquet Reader options classes to pylibcudf (#17464) @Matt711
Add multi-partition DataFrameScan support to cuDF-Polars (#17441) @rjzamora
Return empty result for segmented_reduce if input and offsets are both empty (#17437) @davidwendt
Abstract polars function expression nodes to ensure they are serializable (#17418) @pentschev
Add CSV Reader options classes to pylibcudf (#17412) @Matt711
Add support for pylibcudf.DataType serialization (#17352) @pentschev
Enable rounding for Decimal32 and Decimal64 in cuDF (#17332) @a-hirota
Remove upper bounds on cuda-python to allow 12.6.2 and 11.8.5 (#17326) @bdice
Expose stream-ordering to groupby APIs (#17324) @shrshi
Migrate ORC Writer to pylibcudf (#17310) @Matt711
Support reading bloom filters from Parquet files and filter row groups using them (#17289) @mhaseeb123

🛠️ Improvements

Remove incorrect calls to set architectures (#17813) @vyasr
Add support for pyarrow-19 (#17794) @galipremsagar
Reduce libcudf memcheck tests output (#17791) @davidwendt
Make cudf build with latest CCCL (#17788) @miscco
Update how to manage host UDF instance (#17770) @res-life
Add getInts api for HostMemoryBuffer and UnsafeMemoryAccessor (#17767) @liurenjie1024
Standardize methods used from cudf.core._internals (#17765) @mroeschke
Deprecate dataframe protocol (#17736) @vyasr
Add parquet reader long row test (#17735) @pmattione-nvidia
Update kvikio call due to upstream changes (#17733) @kingcrimsontianyu
Delay setting MultiIndex.level/codes until needed (#17728) @mroeschke
Bounding pool size in multi-batch JSON reader (#17724) @shrshi
Use GCC 13 in CUDA 12 conda builds. (#17721) @bdice
Update minimal sphinx theme version so that we can use parallel doc builds (#17719) @vyasr
Add more aggregation methods in pylibcudf (#17717) @mroeschke
Make cudf._lib.string_udf work with pylibcudf Columns instead of cudf._lib Columns (#17715) @mroeschke
Add special orc test data: timestamp interspersed with null values (#17713) @kingcrimsontianyu
Add pylibcudf.null_mask.null_count (#17711) @mroeschke
Ensure pyarrow.Scalar to pylibcudf.Scalar is cached (#17707) @mroeschke
Adapt cudf numba config for numba 0.61 removal (#17705) @mroeschke
Remove cudf._lib.scalar in favor of pylibcudf (#17701) @mroeschke
Fix parquet reader list bug (#17699) @pmattione-nvidia
Migrated Dynamic AST Expression Trees in Benchmarks and Tests to use AST Tree (#17697) @lamarrr
Skip polars test that can generate timezones that chrono_tz doesn't know (#17694) @wence-
Use 64-bit offsets only if the current strings column output chunk size exceeds threshold (#17693) @mhaseeb123
Use latest ci-conda images (#17690) @bdice
Convert cudf.Scalar usage to pylibcudf and pyarrow usage (#17686) @mroeschke
remove find_package(Python) in libcudf build (#17683) @jameslamb
Fix build metrics report format with long placeholder filenames (#17679) @davidwendt
Use rapids-cmake for the logger (#17674) @vyasr
Java Parquet reads via multiple host buffers (#17673) @jlowe
Remove cudf._lib.types.pyx (#17665) @mroeschke
Add support for Groupby.cumprod (#17661) @galipremsagar
Implement .dt.total_seconds (#17659) @galipremsagar
Avoid shallow copies in groupby methods (#17646) @mroeschke
Avoid double MultiIndex factorization in groupby index result (#17644) @mroeschke
Add seed parameter to hash_character_ngrams (#17643) @davidwendt
Fix possible overflow in WriteCoalescingCallbackWrapper::TearDown (#17642) @davidwendt
Remove pragma GCC diagnostic from source files (#17637) @davidwendt
Move unnecessary utilities from cudf._lib.scalar (#17636) @mroeschke
Support compression= in DataFrame.to_json (#17634) @mroeschke
Bump Polars version to <1.18 (#17632) @Matt711
Use Numba Config to turn on Pynvjitlink Features (#17628) @isVoid
Use PyNVML 12 (#17627) @jakirkham
Remove cudf._lib.utils in favor of python APIs (#17625) @mroeschke
Performance improvements and simplifications for fixed size row-based rolling windows (#17623) @wence-
Fix return types for MurmurHash3_x86_32 template specializations (#17622) @davidwendt
Clean up namespaces and improve compression-related headers (#17621) @vuule
Use more pylibcudf.types instead of cudf._lib.types (#17619) @mroeschke
Remove patch that is only needed for clang-tidy to run on test files (#17618) @vyasr
update telemetry actions to fluent-bit friendly style (#17615) @msarahan
Introduce some simple benchmarks for rolling window aggregations (#17613) @wence-
Bump the oldest pyarrow version to 14.0.2 in test matrix (#17611) @galipremsagar
Use [[nodiscard]] attribute before __device__ (#17608) @vuule
Use host_vector in flatten_single_pass_aggs (#17605) @vuule
Stop memory_resource.hpp from including itself (#17603) @vyasr
Replace the outdated cuco window concept with buckets (#17602) @PointKernel
Check if nightlies have succeeded recently enough (#17596) @vyasr
Deprecate cudf::grouped_time_range_rolling_window (#17589) @wence-
A couple of fixes in rapids-logger usage (#17588) @vyasr
Simplify expression transformer in Parquet predicate pushdown with ast::tree (#17587) @mhaseeb123
Remove unused functionality in cudf._lib.utils.pyx (#17586) @mroeschke
Use cuda-python cuda.bindings import names. (#17585) @bdice
Use no-sync copy for fixed-width types in cudf::concatenate (#17584) @davidwendt
Remove cudf._lib.groupby in favor of inlining pylibcudf (#17582) @mroeschke
Remove unused code of json schema in JSON reader (#17581) @karthikeyann
Expose Scalar's constructor and Scalar#getScalarHandle() to public (#17580) @ttnghia
Allow large strings in nvtext benchmarks (#17579) @davidwendt
Remove cudf._lib.reduce in favor of inlining pylibcudf (#17574) @mroeschke
Use batched memcpy when writing ORC statistics (#17572) @vuule
Allow large strings in nvbench strings benchmarks (#17571) @davidwendt
Update version references in workflow (#17568) @AyodeAwe
Enable all json reader options in pylibcudf read_json (#17563) @karthikeyann
Remove cudf._lib.parquet in favor of inlining pylibcudf (#17562) @mroeschke
Fix CMake format in cudf/_lib/CMakeLists.txt (#17559) @mroeschke
Remove "legacy" Dask DataFrame support from Dask cuDF (#17558) @rjzamora
Replace direct cudaMemcpyAsync calls with utility functions (within /include) (#17557) @vuule
Remove cudf._lib.interop in favor of inlining pylibcudf (#17555) @mroeschke
gate telemetry dispatch calls on TELEMETRY_ENABLED env var (#17551) @msarahan
Replace direct cudaMemcpyAsync calls with utility functions (within /src) (#17550) @vuule
Remove unused BufferArrayFromVector (#17549) @Matt711
Move cudf._lib.copying to cudf.core._internals (#17548) @mroeschke
Update cuda-python lower bounds to 12.6.2 / 11.8.5 (#17547) @bdice
Fix typos, rename types, and add null_probability benchmark axis for distinct (#17546) @PointKernel
Mark more constexpr functions as device-available (#17545) @vyasr
Use cooperative-groups instead of cub warp-reduce for strings contains (#17540) @davidwendt
Remove cudf._lib.nvtext in favor of inlining pylibcudf (#17535) @mroeschke
Add XXHash_32 hasher (#17533) @PointKernel
Remove unused masked keyword in column_empty (#17530) @mroeschke
Remove Thrust patch in favor of CMake definition for Thrust 32-bit offset types. (#17527) @bdice
[JNI] Enables fabric handles for CUDA async memory pools (#17526) @abellina
Force Thrust to use 32-bit offset type. (#17523) @bdice
Replace cudf::detail::copy_if logic with thrust::copy_if and gather (#17520) @davidwendt
Replaces uses of cudf._lib.Column.from_unique_ptr with pylibcudf.Column.from_libcudf (#17517) @Matt711
Move cudf._lib.aggregation to cudf.core._internals (#17516) @mroeschke
Migrate copy_column and Column.from_scalar to pylibcudf (#17513) @Matt711
Remove cudf._lib.transform in favor of inlining pylibcudf (#17505) @mroeschke
Remove cudf._lib.string.convert/split in favor of inlining pylibcudf (#17496) @mroeschke
Move cudf._lib.sort to cudf.core._internals (#17488) @mroeschke
Remove cudf._lib.csv in favor of inlining pylibcudf (#17485) @mroeschke
Update PyTorch to >=2.4.0 to get fix for CUDA array interface bug, and drop CUDA 11 PyTorch tests. (#17475) @bdice
Remove cudf._lib.binops in favor of inlining pylibcudf (#17468) @mroeschke
Remove cudf._lib.orc in favor of inlining pylibcudf (#17466) @mroeschke
skip most CI on devcontainer-only changes (#17465) @jameslamb
Set build type for all examples (#17463) @vyasr
Update the hook versions in pre-commit (#17462) @wence-
Remove cudf._lib.string_casting in favor of inlining pylibcudf (#17460) @mroeschke
Remove cudf._lib.filling in favor of inlining pylibcudf (#17459) @mroeschke
Update MurmurHash3_x64_128 to use the cuco equivalent implementation (#17457) @PointKernel
Move cudf._lib.stream_compaction to cudf.core._internals (#17456) @mroeschke
Clean up xxhash_64 implementations (#17455) @PointKernel
Update Hadoop dependency in Java pom (#17454) @jlowe
Adapt to rmm logger changes (#17451) @vyasr
Require approval to run CI on draft PRs (#17450) @bdice
Expose stream-ordering in nvtext API (#17446) @shrshi
Use exec_policy_nosync in write_json (#17445) @karthikeyann
Remove cudf._lib.json in favor of inlining pylibcudf (#17443) @mroeschke
Remove cudf._lib.null_mask in favor of inlining pylibcudf (#17440) @mroeschke
Expose stream-ordering in replace API (#17436) @shrshi
Expose stream-ordering in copying APIs (#17435) @shrshi
Expose stream-ordering in column view APIs (#17434) @shrshi
Apply clang-tidy autofixes from new rules (#17431) @vyasr
Remove cudf._lib.round in favor of inlining pylibcudf (#17430) @mroeschke
Update MurmurHash3_x86_32 to use the cuco equivalent implementation (#17429) @PointKernel
Remove cudf._lib.replace in favor of inlining pylibcudf (#17428) @mroeschke
Remove nvtx/ranges.hpp include from cuda.cuh (#17427) @davidwendt
Remove the unused detail int_fastdiv.h header (#17426) @PointKernel
Remove cudf._lib.lists in favor of inlining pylibcudf (#17425) @mroeschke
Remove cudf._lib.quantile (#17424) @mroeschke
Remove cudf._lib.rolling in favor of inlining pylibcudf (#17423) @mroeschke
Rework minhash APIs for deprecation cycle (#17421) @davidwendt
Use thread_index_type in binary-ops jit kernel.cu (#17420) @davidwendt
Change binops for-each kernel to thrust::for_each_n (#17419) @davidwendt
Move cudf._lib.search to cudf.core._internals (#17411) @mroeschke
Use grid_1d utilities in copy_range.cuh (#17409) @davidwendt
Remove cudf._lib.text in favor of inlining pylibcudf (#17408) @mroeschke
Run clang-tidy checks in PR CI (#17407) @bdice
Update strings/text source to use grid_1d for thread/block/stride calculations (#17404) @davidwendt
Expose stream-ordering to strings attribute APIs (#17398) @shrshi
Expose stream-ordering to interop APIs (#17397) @shrshi
Remove unused type aliases (#17396) @PointKernel
Remove some cudf._lib.strings files in favor of inlining pylibcudf (#17394) @mroeschke
Update xxhash_64 to utilize the cuco equivalent implementation (#17393) @PointKernel
Change indices for dictionary column to signed integer type (#17390) @davidwendt
Return categorical values in to_numpy/to_cupy (#17388) @mroeschke
Forward-merge branch-24.12 to branch-25.02 (#17379) @bdice
Remove unused IO utilities from cudf python (#17374) @Matt711
Remove cudf._lib.datetime in favor of inlining pylibcudf (#17372) @mroeschke
Remove cudf._lib.join in favor of inlining pylibcudf (#17371) @mroeschke
Remove cudf._lib.merge in favor of inlining pylibcudf (#17370) @mroeschke
Remove cudf._lib.partitioning in favor of inlining pylibcudf (#17369) @mroeschke
Remove cudf._lib.reshape in favor of inlining pylibcudf (#17368) @mroeschke
Remove cudf._lib.timezone in favor of inlining pylibcudf (#17366) @mroeschke
Remove cudf._lib.transpose in favor of inlining pylibcudf (#17365) @mroeschke
Move make_strings_column benchmark to nvbench (#17340) @davidwendt
Improve strings contains/find performance for smaller strings (#17330) @davidwendt
Use rapids-logger to generate the cudf logger (#17307) @vyasr
Mukernels strings (#17286) @pmattione-nvidia
Add write_parquet to pylibcudf (#17263) @mroeschke
Single-partition Dask executor for cuDF-Polars (#17262) @rjzamora
Add breaking change workflow trigger (#17248) @AyodeAwe
Precompute AST arity (#17234) @bdice
Update to CCCL 2.7.0-rc2. (#17233) @bdice
Make column_empty mask buffer creation consistent with libcudf (#16715) @mroeschke
