Host compression #17656
base: branch-25.02
Co-authored-by: Bradley Dice <[email protected]>
…into comp-headers-cleanup
…high-lvl-comp-api
…o high-lvl-comp-api
@@ -342,7 +342,7 @@ class writer::impl {
   // Writer options.
   stripe_size_limits const _max_stripe_size;
   size_type const _row_index_stride;
-  CompressionKind const _compression_kind;
+  compression_type const _compression;
Keeping this as cudf's compression type leads to far fewer conversions.
Partial review -
Looks good.
I like how it identifies common utilities between ORC and Parquet to reduce duplication and improve consistency for ease of use.
void set_compression(compression_type comp)
{
  _compression = comp;
  if (comp == compression_type::AUTO) { _compression = compression_type::SNAPPY; }
For my education: is it common practice for AUTO to simply mean SNAPPY?
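As a minimal sketch of the pattern under discussion, the options class can resolve AUTO once, at the point where the value is set, so that downstream writer code only ever sees a concrete codec. The enum and class below are hypothetical stand-ins written for illustration; they are not cudf's actual definitions, and the SNAPPY default is an assumption taken from the setter in this diff.

```cpp
// Hypothetical stand-in for cudf's compression_type; only a few values are listed.
enum class compression_type { NONE, AUTO, SNAPPY, ZSTD };

// Hypothetical options class: AUTO is translated when the option is set,
// so the getter never returns AUTO.
class writer_options_sketch {
 public:
  void set_compression(compression_type comp)
  {
    // Assumption: AUTO maps to SNAPPY, mirroring the setter shown in the diff.
    _compression = (comp == compression_type::AUTO) ? compression_type::SNAPPY : comp;
  }

  [[nodiscard]] compression_type get_compression() const { return _compression; }

 private:
  compression_type _compression = compression_type::SNAPPY;
};
```

With this shape, a writer that reads `get_compression()` does not need its own AUTO branch, which is the kind of conversion the review is trying to avoid.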
stream.synchronize();

std::vector<std::future<size_t>> tasks;
auto streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
- auto streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
+ auto const streams = cudf::detail::fork_streams(stream, h_comp_pool().get_thread_count());
Can this be const?
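For readers unfamiliar with the `std::vector<std::future<size_t>>` pattern in the hunk above, here is a self-contained sketch of the general idea: submit one task per chunk, then collect the compressed sizes in order. It uses `std::async` instead of the PR's thread pool and CUDA streams, and the "compressor" is a placeholder run-length size estimate; `rle_compressed_size` and `compress_all` are hypothetical names, not part of libcudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Placeholder codec (assumption, not a real compressor): each run of equal
// bytes, capped at 255, costs 2 output bytes (count + value).
std::size_t rle_compressed_size(std::vector<std::uint8_t> const& input)
{
  std::size_t out = 0;
  std::size_t i   = 0;
  while (i < input.size()) {
    std::size_t run = 1;
    while (i + run < input.size() && input[i + run] == input[i] && run < 255) { ++run; }
    out += 2;
    i += run;
  }
  return out;
}

// Launches one asynchronous task per chunk and gathers the results in input order.
std::vector<std::size_t> compress_all(std::vector<std::vector<std::uint8_t>> const& chunks)
{
  std::vector<std::future<std::size_t>> tasks;
  tasks.reserve(chunks.size());
  for (auto const& chunk : chunks) {
    // chunk is captured by reference; chunks outlives every task because
    // all futures are joined below before this function returns.
    tasks.emplace_back(
      std::async(std::launch::async, [&chunk] { return rle_compressed_size(chunk); }));
  }

  std::vector<std::size_t> sizes;
  sizes.reserve(tasks.size());
  for (auto& task : tasks) { sizes.push_back(task.get()); }
  return sizes;
}
```

A fixed-size pool, as used in the PR, bounds the number of threads, whereas `std::async` here may start one thread per chunk; the sketch trades that control for brevity.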
 * @param compression Compression type
 * @returns required alignment
 */
[[nodiscard]] size_t compress_required_block_alignment(compression_type compression);
Not that I have a better name, but just sharing my first impression: "blocks" seems to be an ORC-specific term, and using it in Parquet code feels a bit odd.
[[nodiscard]] writer_compression_statistics collect_compression_statistics(
  device_span<device_span<uint8_t const> const> inputs,
  device_span<compression_result const> results,
  rmm::cuda_stream_view stream);
Looks like the definition for this function is in comp/statistics.cu; should we move it to comp/comp.cu for consistency?
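To make the role of the function under discussion more concrete, the following host-only sketch shows one plausible way per-chunk results could be folded into aggregate statistics. Every type and field name here (`status_sketch`, `result_sketch`, `stats_sketch`, `collect_stats`) is hypothetical; the real function operates on device spans and a CUDA stream, and its exact accounting is not shown in this PR excerpt.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-chunk outcome; not cudf's actual status enum.
enum class status_sketch { SUCCESS, FAILURE, SKIPPED };

// Hypothetical per-chunk result: bytes produced and how the attempt ended.
struct result_sketch {
  std::size_t bytes_written;
  status_sketch status;
};

// Hypothetical aggregate: input bytes bucketed by outcome, plus total output bytes.
struct stats_sketch {
  std::size_t compressed_input_bytes  = 0;
  std::size_t failed_bytes            = 0;
  std::size_t skipped_bytes           = 0;
  std::size_t compressed_output_bytes = 0;
};

// Buckets each chunk's input size by its result status.
// Assumes input_sizes[i] corresponds to results[i]; extra entries are ignored.
stats_sketch collect_stats(std::vector<std::size_t> const& input_sizes,
                           std::vector<result_sketch> const& results)
{
  stats_sketch stats;
  for (std::size_t i = 0; i < results.size() && i < input_sizes.size(); ++i) {
    switch (results[i].status) {
      case status_sketch::SUCCESS:
        stats.compressed_input_bytes += input_sizes[i];
        stats.compressed_output_bytes += results[i].bytes_written;
        break;
      case status_sketch::FAILURE: stats.failed_bytes += input_sizes[i]; break;
      case status_sketch::SKIPPED: stats.skipped_bytes += input_sizes[i]; break;
    }
  }
  return stats;
}
```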
Description
Add compression APIs to make the nvCOMP use transparent.
Remove direct dependency on nvCOMP in the ORC and Parquet writers.
Add multi-threaded host-side compression. It is currently off by default and can only be enabled via the LIBCUDF_USE_HOST_COMPRESSION environment variable.
Currently, host compression adds D2H + H2D transfers. Avoiding the H2D transfer requires large changes to the writers.
Also moved handling of the AUTO compression type to the options classes, which should own such defaults (translate AUTO to SNAPPY in this case).
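The environment-variable toggle mentioned in the description could be read along the following lines. This is a sketch under stated assumptions: the accepted values ("1", "ON", "on") and the helper names `parse_flag` and `use_host_compression` are hypothetical, and the PR excerpt does not show how libcudf actually parses LIBCUDF_USE_HOST_COMPRESSION.

```cpp
#include <cstdlib>
#include <string>

// Interprets a raw environment value as a boolean flag.
// Assumption: "1", "ON" and "on" enable the feature; unset or anything else disables it.
bool parse_flag(char const* value)
{
  if (value == nullptr) { return false; }
  std::string const v{value};
  return v == "1" || v == "ON" || v == "on";
}

// Off by default: returns true only when the variable is set to an enabling value.
bool use_host_compression()
{
  return parse_flag(std::getenv("LIBCUDF_USE_HOST_COMPRESSION"));
}
```

Keeping the parsing separate from the `std::getenv` call makes the default-off behavior easy to check without touching the process environment.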