Improve performance for Vector64.ExtractMostSignificantBits #115288

rindlespot · 2025-05-04T21:08:54Z

Summary

The performance of Vector64.ExtractMostSignificantBits is very poor compared to the other vector classes. Despite having the [Intrinsic] attribute, the method has no hardware support enabled on x86 or x64 (only on ARM). And while the existing implementation could serve as a fallback, there's room for improvement there too.

Current Implementation

public static uint ExtractMostSignificantBits<T>(this Vector64<T> vector)
{
    uint result = 0;

    for (int index = 0; index < Vector64<T>.Count; index++)
    {
        uint value = Scalar<T>.ExtractMostSignificantBit(vector.GetElementUnsafe(index));
        result |= (value << index);
    }

    return result;
}
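
For readers unfamiliar with the method, here is a small sketch of what it returns when called through the public API. The sample values and variable names are only illustrative: bit i of the result is the top bit of element i.

```csharp
using System;
using System.Runtime.Intrinsics;

// Elements 0, 2 and 7 have their top bit set (0x80 or above).
Vector64<byte> bytes = Vector64.Create((byte)0x80, 0x7F, 0xFF, 0x00, 0x01, 0x40, 0x10, 0x90);
uint byteMask = bytes.ExtractMostSignificantBits();
Console.WriteLine($"0x{byteMask:X2}"); // 0x85 (bits 0, 2 and 7)

// For signed integers the most significant bit is the sign bit,
// so only element 0 (-1) contributes a bit here.
Vector64<int> ints = Vector64.Create(-1, 1);
uint intMask = ints.ExtractMostSignificantBits();
Console.WriteLine(intMask); // 1
```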

Proposed Change

public static uint ExtractMostSignificantBits<T>(this Vector64<T> vector)
{
    // All x64 implementations support Sse2 by definition. x86 is
    // more variable.  We must be careful here not to call the
    // Vector128 code if it can't use intrinsics, since it falls
    // back to calling Vector64.

    if (Sse2.IsSupported)
    {
        Vector128<ulong> calc = Vector128.Create(vector._00, 0ul);

        // The Sse2 check above as well as all these typeof checks are
        // resolved at JIT compile time.  Likewise the AsXXX is also
        // compiled away (it's just a cast), leaving nothing but the
        // call to the 128bit intrinsic.  Very efficient.

        if (typeof(T) == typeof(byte) || typeof(T) == typeof(sbyte))
            return calc.AsByte().ExtractMostSignificantBits();

        else if (typeof(T) == typeof(int) || typeof(T) == typeof(uint) || typeof(T) == typeof(float))
            return calc.AsUInt32().ExtractMostSignificantBits();

        else if (typeof(T) == typeof(long) || typeof(T) == typeof(ulong) || typeof(T) == typeof(double))
            return calc.AsUInt64().ExtractMostSignificantBits();

        else if (typeof(T) == typeof(nint) || typeof(T) == typeof(nuint))
            return calc.AsNUInt().ExtractMostSignificantBits();

        else if (typeof(T) == typeof(short) || typeof(T) == typeof(ushort))
            // Turns out there is no 16bit mask intrinsic.  But if it
            // can, the compiler simulates one using pshufb from Ssse3.
            if (Ssse3.IsSupported)
                return calc.AsUInt16().ExtractMostSignificantBits();
    }

    int size = (64 / Vector64<T>.Count) - 1;

    // Yes, this same mask works for all types.
    ulong u = vector._00 & 0x8080808080808080;
    ulong result = 0;

    // Because Count is a compile-time constant too, .NET will sometimes
    // unroll this loop.
    for (int index = 0; index < Vector64<T>.Count; index++)
    {
        u >>= size;
        result |= u;
    }

    return (uint)result & 0xff;
}
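
The fallback path can be tried outside the runtime with a standalone sketch. Since the _00 field is internal, this version takes the raw 64 bits and the element count as parameters; ShiftOrMask and NaiveMask are hypothetical names, and the check assumes the little-endian layout where element i occupies the i-th group of bits. It compares the shift-and-OR loop against a plain per-element loop on random inputs.

```csharp
using System;

// Hypothetical stand-in for the proposed fallback: 'bits' plays the role of
// vector._00 and 'count' the role of Vector64<T>.Count (8, 4, 2 or 1).
static uint ShiftOrMask(ulong bits, int count)
{
    int size = (64 / count) - 1;
    ulong u = bits & 0x8080808080808080;
    ulong result = 0;

    for (int index = 0; index < count; index++)
    {
        u >>= size;      // each pass moves the next element's top bit one slot lower
        result |= u;
    }

    // Stray bits from the shared mask land above bit 7 and are dropped here.
    return (uint)result & 0xff;
}

// Reference implementation: test the top bit of each element directly.
static uint NaiveMask(ulong bits, int count)
{
    int width = 64 / count;
    uint result = 0;

    for (int index = 0; index < count; index++)
    {
        ulong top = (bits >> (index * width + width - 1)) & 1;
        result |= (uint)top << index;
    }

    return result;
}

var rng = new Random(12345);
byte[] buffer = new byte[8];
bool allMatch = true;

foreach (int count in new[] { 8, 4, 2, 1 })
{
    for (int i = 0; i < 100_000; i++)
    {
        rng.NextBytes(buffer);
        ulong bits = BitConverter.ToUInt64(buffer, 0);

        if (ShiftOrMask(bits, count) != NaiveMask(bits, count))
            allMatch = false;
    }
}

Console.WriteLine($"shift-and-OR matches the per-element loop: {allMatch}");
```

The single 0x80 mask is enough because any surviving bit that is not an element's top bit ends up either shifted out or above bit 7, where the final & 0xff discards it.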

Benefits

Using intrinsics instead of 'base code in a loop' greatly improves the performance of this method. The 16bit movemasks require a slightly more advanced instruction set (Ssse3). As this was introduced back in 2006, most modern computers should have this support.

This proposed change also improves the performance of the 'fallback' code. Comparing the existing net9.0 code with the updated net10.0 code gives us (times in seconds):

x86:
net9.0: 26.146
net10.0 using intrinsics: 15.931 (saves 39%)
net10.0 using fallback: 17.316 (saves 34%)

x64:
net9.0: 16.168
net10.0 using intrinsics: 5.236 (saves 68%)
net10.0 no Ssse3: 5.567 (saves 66%)

Potential Considerations

While my test code (which is what I used to produce the stats) exercises all the applicable permutations, I don't have a real-world test to plug this into to check the performance. It's possible that (somehow) this could produce worse results under certain circumstances. Running Vector64Tests.cs shows no changes, but that's hardly real world code either.

Promoting a 64bit value to 128 bits to calculate the move mask might seem counterintuitive. It's true there is an instruction intended to operate on 64bit values (exposed by the library as ParallelBitExtract). However it requires Bmi2, which is a much newer instruction set than Sse2, meaning fewer computers support it. What's more, it has known performance issues on Zen1/Zen2 computers. And lastly, my tests show that it's slower than the 128 bit alternatives (at least on my machine).
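
For comparison, the Bmi2 alternative mentioned above might look roughly like the following sketch. MaskViaPext is a hypothetical helper, it only covers byte-sized elements (other element widths would need a different mask), and it must be guarded because Bmi2.X64 is not available on every machine.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: gathers the top bit of each of the 8 bytes into the
// low 8 bits of the result using PEXT. Only valid when Bmi2.X64.IsSupported.
static uint MaskViaPext(ulong bits)
{
    return (uint)Bmi2.X64.ParallelBitExtract(bits, 0x8080808080808080);
}

// Bytes from least to most significant: 80 01 00 FF 00 7F 00 80,
// so elements 0, 3 and 7 have their top bit set.
ulong sample = 0x80007F00FF000180;

if (Bmi2.X64.IsSupported)
{
    Console.WriteLine($"0x{MaskViaPext(sample):X2}"); // 0x89 (bits 0, 3 and 7)
}
else
{
    Console.WriteLine("Bmi2.X64 is not supported on this machine.");
}
```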

Conclusion

It's a small, self-contained change. While there are more "lines of code" here than the original implementation, they mostly drop out at JIT compile time, leaving behind fewer asm instructions. I expect it to be faster in all circumstances.

I have other performance changes in mind for Vector64.cs, but let's see how this one is received first.

huoyaoyuan (Collaborator) · 2025-05-05T06:40:10Z

It's applicable to most Vector64 operations to wrap the operand into Vector128. I don't think any specialization is worth it.

1 reply

tannergooding May 5, 2025
Collaborator

This.

You should, in general, be checking Vector64.IsHardwareAccelerated (or the corresponding IsHardwareAccelerated for other vector sizes) prior to its use.

If it returns false, then it is likely going to be executing a less efficient path for most, if not all, APIs.
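
One way a caller might act on this advice is sketched below. SignMask is a hypothetical helper, not part of the library, and the scalar branch assumes a little-endian layout; the point is only that the Vector64 path is taken when Vector64.IsHardwareAccelerated reports true, with Vector128 and scalar code as alternatives otherwise.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical caller-side helper: returns the top bit of each of the
// 8 bytes packed into 'packed', choosing a path based on what the
// current hardware accelerates.
static uint SignMask(ulong packed)
{
    if (Vector64.IsHardwareAccelerated)
    {
        return Vector64.Create(packed).AsByte().ExtractMostSignificantBits();
    }

    if (Vector128.IsHardwareAccelerated)
    {
        // CreateScalar zeroes the upper 64 bits, so the upper 8 mask bits are 0.
        return Vector128.CreateScalar(packed).AsByte().ExtractMostSignificantBits();
    }

    // Plain scalar path (assumes byte i sits at bits [8i, 8i+7]).
    uint result = 0;
    for (int i = 0; i < 8; i++)
    {
        result |= (uint)((packed >> (i * 8 + 7)) & 1) << i;
    }
    return result;
}

Console.WriteLine($"Vector64 accelerated:  {Vector64.IsHardwareAccelerated}");
Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"0x{SignMask(0x80007F00FF000180):X2}");
```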
