Improve performance for Vector64.ExtractMostSignificantBits #115288
Unanswered
rindlespot asked this question in Ideas
Replies: 1 comment 1 reply
-
It's applicable to most |
Summary
The performance of Vector64.ExtractMostSignificantBits is very poor compared to the other vector classes. Despite carrying the [Intrinsic] attribute, the method is only hardware-accelerated on ARM; there is no accelerated path on x86/x64. The existing implementation could still serve as a fallback, but there is room for improvement there too.
Current Implementation
Proposed Change
Benefits
Using intrinsics instead of 'base code in a loop' greatly improves the performance of this method. The 16-bit movemasks require a slightly more advanced instruction set (Ssse3). As Ssse3 was introduced back in 2006, most modern computers should support it.
This proposed change also improves the performance of the 'fallback' code. Comparing the existing net9.0 code with the updated net10.0 code gives us (times in seconds):
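To make the idea concrete, here is a minimal sketch of how a movemask-based path might look. It is not the code from the proposal: the class and method names (`MoveMaskSketch`, `ExtractMsbBytes`, `ExtractMsbUInt16`) are hypothetical, and the shuffle-based 16-bit approach is an assumption about one way Ssse3 could be used here. The scalar loops only stand in for a generic fallback.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class; the names are illustrative, not from the proposal.
public static class MoveMaskSketch
{
    // Collects the most significant bit of each of the 8 bytes.
    public static uint ExtractMsbBytes(Vector64<byte> value)
    {
        if (Sse2.IsSupported)
        {
            // ToVector128 zeroes the upper 64 bits, so mask bits 8..15 are 0.
            Vector128<byte> wide = value.ToVector128();
            return (uint)Sse2.MoveMask(wide);
        }

        // Simple per-element loop used when Sse2 is unavailable.
        uint mask = 0;
        for (int i = 0; i < Vector64<byte>.Count; i++)
        {
            mask |= (uint)(value.GetElement(i) >> 7) << i;
        }
        return mask;
    }

    // Collects the most significant bit of each of the 4 ushort elements.
    public static uint ExtractMsbUInt16(Vector64<ushort> value)
    {
        if (Ssse3.IsSupported)
        {
            Vector128<byte> wide = value.AsByte().ToVector128();

            // Gather the high byte of each 16-bit lane (byte indices 1, 3, 5, 7)
            // into bytes 0..3; an index of 0x80 makes the shuffle write zero.
            Vector128<byte> gather = Vector128.Create(
                (byte)1, (byte)3, (byte)5, (byte)7,
                (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80,
                (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80,
                (byte)0x80, (byte)0x80, (byte)0x80, (byte)0x80);

            return (uint)Sse2.MoveMask(Ssse3.Shuffle(wide, gather));
        }

        // Simple per-element loop used when Ssse3 is unavailable.
        uint mask = 0;
        for (int i = 0; i < Vector64<ushort>.Count; i++)
        {
            mask |= (uint)(value.GetElement(i) >> 15) << i;
        }
        return mask;
    }
}
```

For example, `MoveMaskSketch.ExtractMsbBytes(Vector64.Create((byte)0x80, 0, 0xFF, 0, 0, 0, 0, 0x81))` should return `0x85`, matching `Vector64.ExtractMostSignificantBits` on the same input. Because the `IsSupported` checks are treated as constants by the JIT, only one branch of each method should survive in the generated code.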
Potential Considerations
While my test code (which is what I used to produce the stats) exercises all the applicable permutations, I don't have a real-world workload to plug this into to check the performance. It's possible that (somehow) this could produce worse results under certain circumstances. Running Vector64Tests.cs shows no changes, but that's hardly real-world code either.
Promoting a 64-bit value to 128 bits to calculate the move mask might seem counterintuitive. It's true there is an instruction intended to operate on 64-bit values (exposed by the library as ParallelBitExtract). However, it requires Bmi2, which is a much newer instruction set than Sse2, meaning fewer computers support it. What's more, it has known performance issues on Zen1/Zen2 processors. And lastly, my tests show that it's slower than the 128-bit alternatives (at least on my machine).
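For comparison, the ParallelBitExtract alternative mentioned above could be sketched roughly as follows. This is an illustrative assumption of how the instruction might be applied to the byte case, not code from the proposal; `PextSketch` and `ExtractMsbBytes` are hypothetical names, and the sketch simply defers to the built-in method when Bmi2 is not available in 64-bit mode.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class; the names are illustrative, not from the proposal.
public static class PextSketch
{
    // Collects the most significant bit of each of the 8 bytes using PEXT.
    public static uint ExtractMsbBytes(Vector64<byte> value)
    {
        if (Bmi2.X64.IsSupported)
        {
            // Reinterpret the 8 bytes as a single 64-bit scalar.
            ulong bits = value.AsUInt64().ToScalar();

            // PEXT packs the bits selected by the mask into the low bits of
            // the result, lowest selected bit first, so byte 0 maps to bit 0.
            return (uint)Bmi2.X64.ParallelBitExtract(bits, 0x8080808080808080UL);
        }

        // Without Bmi2 in 64-bit mode, defer to the built-in implementation.
        return value.ExtractMostSignificantBits();
    }
}
```

The mask `0x8080808080808080` selects bit 7 of every byte, so a single instruction produces the 8-bit result. The trade-off is the Bmi2 requirement and the Zen1/Zen2 behaviour described above.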
Conclusion
It's a small, self-contained change. While there are more "lines of code" here than the original implementation, they mostly drop out at JIT compile time, leaving behind fewer asm instructions. I expect it to be faster in all circumstances.
I have other performance changes in mind for Vector64.cs, but let's see how this one is received first.