Summary:
We've seen high CPU usage in two hash functions: std::_Hash_bytes and multifeed/common/Hash.h::hashBytesImpl.
For the record, std::_Hash_bytes compiles to ~60 instructions on aarch64 and ~100 instructions on AMD64: https://godbolt.org/z/xeoqf1aaE
hashBytesImpl compiles to slightly over 100 instructions on aarch64 and slightly over 160 instructions on AMD64: https://godbolt.org/z/bTroqGE7o
The diff adds three new hash functions: rapidhash, rapidhashMicro and rapidhashNano.
RapidhashNano is designed for situations where keeping a small code size is a top priority.
Clang-19 compiles it to less than 100 instructions without stack usage, both on x86-64 and aarch64.
It is the fastest for inputs up to 48 bytes, but may be considerably slower for larger inputs.
RapidhashMicro is designed for situations where cache misses cause a noticeable performance penalty.
Clang-19 compiles it to ~140 instructions without stack usage, both on x86-64 and aarch64.
It is faster for inputs up to 512 bytes, and only 15%-20% slower for inputs above 1 KB.
rapidhash provides formidable speed across all input sizes.
Clang-19 compiles it to ~185 instructions, both on x86-64 and aarch64.
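The size thresholds quoted above can be sketched as a small selection helper. This is only an illustration: the enum and function names are hypothetical and not part of the diff, the heuristic looks at input size alone (it ignores code-size and cache-pressure concerns), and the choice of the full variant between 512 bytes and 1 KB is an assumption, since the summary gives no figure for that range.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; the diff does not define them.
enum class RapidhashVariant { Nano, Micro, Full };

// Size-only heuristic built from the thresholds in the summary:
//   up to 48 bytes  -> rapidhashNano
//   up to 512 bytes -> rapidhashMicro
//   larger inputs   -> rapidhash (assumed for the 512 B to 1 KB gap as well)
constexpr RapidhashVariant pickVariantBySize(std::size_t typicalInputBytes) {
  if (typicalInputBytes <= 48) {
    return RapidhashVariant::Nano;
  }
  if (typicalInputBytes <= 512) {
    return RapidhashVariant::Micro;
  }
  return RapidhashVariant::Full;
}

static_assert(pickVariantBySize(16) == RapidhashVariant::Nano,
              "short keys map to the Nano variant");
static_assert(pickVariantBySize(4096) == RapidhashVariant::Full,
              "large inputs map to the full variant");
```

In practice the typical input size would come from profiling the workload rather than from a fixed constant.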
Benchmark results on BGM: P1826606121, and Grace: P1826591223
On AMD64, RapidhashNano should be strictly better than both std::_Hash_bytes and hashBytesImpl.
On aarch64, std::_Hash_bytes compiles to fewer instructions. RapidhashNano should still be faster in most situations, given its much higher throughput. It should also be strictly better than hashBytesImpl.
In many situations, RapidhashMicro should be a better choice, due to its higher throughput. This diff allows us to analyze workloads on a case-by-case basis.
rapidhash seems to be the fastest high-quality hash function for aarch64 systems. It may still find use in large-input cases.
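For a case-by-case analysis, a rough timing harness along the following lines can give a first impression before running a proper benchmark. It is a minimal sketch, not the benchmark used for the results above: `nsPerHash` and `printBaseline` are hypothetical names, and the baseline uses std::hash<std::string_view>, which on libstdc++ is expected to end up in std::_Hash_bytes. Any callable taking a std::string_view and returning an integer can be passed in its place.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <string_view>

// Average wall-clock nanoseconds per call of hashFn on a `size`-byte input.
// Hypothetical helper for illustration; coarse, with no warm-up or statistics.
template <typename HashFn>
double nsPerHash(HashFn hashFn, std::size_t size, std::size_t iterations) {
  if (iterations == 0) {
    return 0.0;
  }
  std::string input(size, 'x');
  volatile std::uint64_t sink = 0;  // keeps the hash calls from being optimized away
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    if (!input.empty()) {
      input[0] = static_cast<char>(i);  // vary the input between calls
    }
    sink = sink + static_cast<std::uint64_t>(hashFn(std::string_view(input)));
  }
  const auto end = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = end - start;
  return elapsed.count() / static_cast<double>(iterations);
}

// Prints the std::hash baseline at a few sizes around the thresholds
// mentioned in the summary (48 bytes, 512 bytes, above 1 KB).
inline void printBaseline() {
  const std::size_t sizes[] = {8, 48, 512, 4096};
  for (std::size_t size : sizes) {
    const double ns = nsPerHash(std::hash<std::string_view>{}, size, 1000000);
    std::printf("%zu bytes: %.2f ns/hash\n", size, ns);
  }
}
```

Timings from a loop like this are sensitive to inlining, CPU frequency scaling and cache state, so they should be treated as a sanity check rather than a replacement for the benchmark pastes referenced above.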
Folly's benchmark results have been updated to include runs from Bergamo and Neoverse-V2.
Differential Revision: D66326393