Hash tables that store dense metadata in a separate array can usually harness compiler intrinsics that correspond to bit-scan instructions to scan for the next occupied bucket many buckets at a time. Specifically, we can scan a group of 64 / bits_of_metadata_per_bucket buckets at once in the ideal case that the starting position is the beginning of a bucket group. This technique is used, for example, by absl, Martinus’ robin_hood, and Verstable/CC (which can sadly only scan four buckets at a time). Because khashl uses only one bit of metadata per bucket, it seems like the perfect candidate for this optimization.
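To make the group-scan idea concrete, here is a toy example (not khashl code) using GCC/Clang’s __builtin_ctzll: with one metadata bit per bucket, a single bit-scan covers a group of 64 buckets at once.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy illustration, not library code: with one metadata bit per bucket, one
 * bit-scan examines a group of 64 buckets at once. __builtin_ctzll (GCC/Clang)
 * returns the index of the lowest set bit, i.e. the first occupied bucket in
 * the group; an all-zero word means all 64 buckets can be skipped. */
int main(void)
{
    uint64_t group = 1ULL << 40; /* only bucket 40 of this group is occupied */
    if (group)
        printf("first occupied bucket in group: %d\n", (int)__builtin_ctzll(group));
    else
        printf("group is empty, skip all 64 buckets\n");
    return 0;
}
```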
I’ve created a proof of concept for GCC and Clang on little-endian architectures here. The repository is private, but I’ve invited you to it. Here are the results for a uint32_t-key, uint32_t-value table displayed in graph form, courtesy of Google Sheets:
(The horizontal axis also corresponds to the number of keys in the table.)
Evidently, the gains are pretty massive. When the table is almost empty, the intrinsics-accelerated version is about 10 times faster. At a load factor of 0.5, it’s about twice as fast. The gains taper off at about 0.75, which is coincidentally khashl’s default maximum load factor.
Implementing this change involves a few steps:
Using a 64-bit integer, rather than a 32-bit integer, as the datatype stored by the used array. This change isn’t strictly necessary, but it allows us to scan up to 64 buckets, rather than 32, at a time. In practice, the performance difference would probably only be apparent when the load factor is extremely low and many 32-bucket groups are entirely vacant.
Allocating an extra unit of metadata at the end of the metadata array and setting its first bit. This acts as a sentinel that automatically halts the scan at kh_end. (A less performant option would be to manually check whether we’ve reached the end of the metadata array and, if so, return an end iterator.)
Actually implementing the accelerated iteration functions for GCC/Clang and MSVC, taking endianness into account, and implementing fallbacks for other compilers. A rough sketch of what this could look like follows this list.
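Here is a rough sketch of how these pieces could fit together. The names (kl_ctz64, kl_next_used) are hypothetical and this is an illustration rather than the actual patch; it assumes the used array holds one bit per bucket packed into native 64-bit words plus the sentinel word described above, which conveniently sidesteps the endianness question.

```c
#include <stdint.h>

/* Illustrative sketch only; kl_ctz64 and kl_next_used are hypothetical names,
 * not khashl's API. The `used` array is assumed to hold (n_buckets + 63) / 64
 * words of one bit per bucket, followed by one extra sentinel word whose
 * lowest bit is set, so the scan never needs an explicit bounds check and
 * always stops at or before kh_end. */
#if defined(__GNUC__) || defined(__clang__)
#define kl_ctz64(x) ((unsigned)__builtin_ctzll(x))
#else
static inline unsigned kl_ctz64(uint64_t x) /* portable fallback: one bit at a time */
{
    unsigned n = 0;
    while (!(x & 1)) x >>= 1, ++n;
    return n;
}
#endif

/* Return the index of the first occupied bucket at or after bucket i. */
static inline uint64_t kl_next_used(const uint64_t *used, uint64_t i)
{
    uint64_t wi = i >> 6;
    uint64_t w = used[wi] >> (i & 63); /* bits for bucket i and the buckets above it */
    if (w) return i + kl_ctz64(w);
    do ++wi; while (!used[wi]);        /* the sentinel word guarantees termination */
    return (wi << 6) + kl_ctz64(used[wi]);
}
```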
If you’re interested, I could probably do these things and make a pull request. Testing on a big-endian machine is pretty much impossible these days, though.
khash (as opposed to khashl) could, I think, also benefit from this optimization, although I'm not sure whether you’re even still developing it (in parallel with khashl).
Thank you so much! This makes a lot of sense. According to the commit history, I haven't touched khash.h for six years. I have been mostly using khashl in my recent projects, though most users are still on the older khash.
I will incorporate your improvements at some point, along with a fix to the missing malloc checks you mentioned in #5. Thanks again!
BTW, you can now change the maximum load factor of khashl at compile time by defining kh_max_count before inclusion, like:
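Something along these lines (the exact expression khashl expects for kh_max_count is assumed here; this example targets a 0.5 maximum load factor):

```c
/* Assumed usage: kh_max_count(cap) maps a capacity to the maximum number of
 * keys allowed before rehashing; the exact expression khashl expects is an
 * assumption. Redefining it to cap/2 before inclusion would give a 0.5
 * maximum load factor instead of the default 0.75. */
#define kh_max_count(cap) ((cap) / 2)
#include "khashl.h"
```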