
Parallel hashing for 6-9x speedup #24

Closed · wants to merge 1 commit

Conversation

@glycerine

Fixes #22 and #23.

@glycerine
Author

Nice to see: on my non-AVX512 AMD Threadripper, this is almost as fast as the b3sum reference implementation written in Rust, generally 1-2% slower.

On my AVX512-enabled Mac, this Go version is faster than b3sum.

  Tests are in parallel_test.go.

  The defaults give about a 6x speedup
  when hashing large files on my box.
@glycerine
Author

Ah. It turns out I was choking off the available parallelism. With a buffered channel, we are strictly faster than the Rust version.
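For illustration, here is a minimal sketch of the dispatch pattern being described -- workers pulling chunks from a buffered channel. All names are hypothetical; this is not the PR's actual code. With an unbuffered channel the producer blocks until a worker is free; buffering lets the reader stay ahead of the workers and keep every core busy.

```go
package main

import (
	"fmt"
	"sync"
)

// hashChunk stands in for the real per-chunk compression work.
func hashChunk(chunk []byte) int { return len(chunk) }

func main() {
	const workers = 8

	// Buffered: the producer can queue chunks without waiting for a
	// free worker. An unbuffered channel here would serialize the
	// producer and the consumers, choking off parallelism.
	jobs := make(chan []byte, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range jobs {
				_ = hashChunk(chunk)
			}
		}()
	}

	// Stand-in for reading sequential chunks of a large file.
	for i := 0; i < 64; i++ {
		jobs <- make([]byte, 1024)
	}
	close(jobs)
	wg.Wait()
	fmt.Println("all chunks hashed")
}
```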

@glycerine
Author

I appreciate that authors have limited time to review such changes, and may wish to keep their libraries tightly focused on single-core performance.

For users like myself who need the fastest possible file hashing -- using all available cores -- my fork's master branch, https://github.com/glycerine/blake3, has these changes applied. My b3 tool makes them available in a b3sum-like command-line utility: https://github.com/glycerine/b3

@lukechampine
Owner

Thanks for the contribution! The API and overall style are markedly different from the rest of the repo, so I don't think I can merge this as-is -- but I will push a commit that simplifies parallel hashing, with you as a co-author, if that's acceptable.

@glycerine
Author

glycerine commented Feb 4, 2025

@lukechampine Feel free to re-mold as you like.

I'll point out that I find getting a Hasher back after a parallel file scan incredibly useful; it greatly expands the usefulness of the functionality. For example, if I want to track a file's modification time (or other metadata) as well as its content, I can just Write the few bytes of the timestamp onto the existing hasher and call Sum again. No need to repeat the expensive part.
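A minimal sketch of that pattern, using the upstream lukechampine.com/blake3 API as a stand-in (the fork's parallel-scan function that returns a live *blake3.Hasher is assumed here, not shown). Sum does not consume the hasher's state, so you can keep writing and re-finalizing:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"

	"lukechampine.com/blake3"
)

func main() {
	// Stand-in for the fork's parallel file scan: hash the file
	// contents, keeping the Hasher alive afterwards.
	h := blake3.New(32, nil)
	data, err := os.ReadFile("big.file")
	if err != nil {
		panic(err)
	}
	h.Write(data)
	fmt.Printf("content only:    %x\n", h.Sum(nil))

	// Append a few metadata bytes (here, the modification time) and
	// finalize again -- the expensive content pass is not repeated.
	info, err := os.Stat("big.file")
	if err != nil {
		panic(err)
	}
	var ts [8]byte
	binary.LittleEndian.PutUint64(ts[:], uint64(info.ModTime().UnixNano()))
	h.Write(ts[:])
	fmt.Printf("content+modtime: %x\n", h.Sum(nil))
}
```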

@lukechampine
Owner

Superseded by #25 -- thank you!
