Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Host compression #17656

Open
wants to merge 41 commits into
base: branch-25.02
Choose a base branch
from
Open

Host compression #17656

wants to merge 41 commits into from

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Dec 23, 2024

Description

Add compression APIs to make the nvCOMP use transparent.
Remove direct dependency on nvCOMP in the ORC and Parquet writers.
Add multi-threaded host-side compression; currently off by default, can only be enabled via LIBCUDF_USE_HOST_COMPRESSION environment variable.

Currently the host compression adds D2H + H2D transfers. Avoiding the H2D transfer requires large changes to the writers.

Also moved handling of the AUTO compression type to the options classes, which should own such defaults (translate AUTO to SNAPPY in this case).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Dec 23, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 23, 2024
@vuule vuule added feature request New feature or request non-breaking Non-breaking change labels Dec 23, 2024
@@ -342,7 +342,7 @@ class writer::impl {
// Writer options.
stripe_size_limits const _max_stripe_size;
size_type const _row_index_stride;
CompressionKind const _compression_kind;
compression_type const _compression;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping this as cudf's compression type leads to much fewer conversions

@vuule vuule marked this pull request as ready for review January 8, 2025 19:03
@vuule vuule requested a review from a team as a code owner January 8, 2025 19:03
@vuule vuule requested review from mythrocks and PointKernel January 8, 2025 19:03
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review -

cpp/src/io/comp/comp.cpp Show resolved Hide resolved
cpp/src/io/comp/comp.cpp Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I like how it identifies common utilities between ORC and Parquet to reduce duplication and improve consistency for ease of use.

void set_compression(compression_type comp)
{
_compression = comp;
if (comp == compression_type::AUTO) { _compression = compression_type::SNAPPY; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my education, Is it common sense that AUTO should be just SNAPPY?

stream.synchronize();

std::vector<std::future<size_t>> tasks;
auto streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());

can this be const?

* @param compression Compression type
* @returns required alignment
*/
[[nodiscard]] size_t compress_required_block_alignment(compression_type compression);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I have a better name, but just sharing my first impression: "blocks" seems to be an ORC-specific term, and using it in Parquet code feels a bit odd.

Comment on lines +107 to +110
[[nodiscard]] writer_compression_statistics collect_compression_statistics(
device_span<device_span<uint8_t const> const> inputs,
device_span<compression_result const> results,
rmm::cuda_stream_view stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the definition for this function is in comp/statistics.cu, should we move it to comp/comp.cu for consistency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

3 participants