Unnecessary vmovaps emitted depending on order of addends? #114884

poizan42 · 2025-04-21T23:55:14Z


While benchmarking some SIMD code I noticed something peculiar about the codegen (so far only checked in .NET 10 Preview 3 and the "trunk" version used by godbolt).

The following

    public static uint SumWidened(Vector256<byte> a, Vector256<byte> b)
    {
        var zero = Vector256<byte>.Zero;
        Vector256<ushort> sadLo = Avx2.SumAbsoluteDifferences(a, zero);
        Vector256<ushort> sadHi = Avx2.SumAbsoluteDifferences(b, zero);
        var sadSum = Avx2.Add(sadLo.AsUInt32(), sadHi.AsUInt32());
        Vector128<uint> s1 = sadSum.GetUpper() + sadSum.GetLower();
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1, control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        return sum;
    }

    public static uint SumWidened2(Vector256<byte> a, Vector256<byte> b)
    {
        var zero = Vector256<byte>.Zero;
        Vector256<ushort> sadLo = Avx2.SumAbsoluteDifferences(a, zero);
        Vector256<ushort> sadHi = Avx2.SumAbsoluteDifferences(b, zero);
        var sadSum = Avx2.Add(sadLo.AsUInt32(), sadHi.AsUInt32());
        Vector128<uint> s1 = sadSum.GetLower() + sadSum.GetUpper(); // <-- SWAPPED
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1, control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        return sum;
    }

Emits the following

// coreclr trunk-20250421+a4848033324be5a3e0f7f0a901e222cdcadce07b

Program:.ctor():this (FullOpts):
       ret      

Program:SumWidened(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte]):uint (FullOpts):
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymm1, ymmword ptr [rsp+0x08]
       vpsadbw  ymm1, ymm1, ymm0
       vmovups  ymm2, ymmword ptr [rsp+0x28]
       vpsadbw  ymm0, ymm2, ymm0
       vpaddd   ymm0, ymm0, ymm1
       vextracti128 xmm1, ymm0, 1
       vpaddd   xmm0, xmm0, xmm1
       vpshufd  xmm1, xmm0, 78
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0
       vzeroupper 
       ret      

Program:SumWidened2(System.Runtime.Intrinsics.Vector256`1[ubyte],System.Runtime.Intrinsics.Vector256`1[ubyte]):uint (FullOpts):
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymm1, ymmword ptr [rsp+0x08]
       vpsadbw  ymm1, ymm1, ymm0
       vmovups  ymm2, ymmword ptr [rsp+0x28]
       vpsadbw  ymm0, ymm2, ymm0
       vpaddd   ymm0, ymm0, ymm1
       vmovaps  ymm1, ymm0 ; <--- Mov needed due to changed register allocation
       vextracti128 xmm0, ymm0, 1
       vpaddd   xmm0, xmm0, xmm1
       vpshufd  xmm1, xmm0, 78
       vpaddd   xmm0, xmm1, xmm0
       vmovd    eax, xmm0
       vzeroupper 
       ret
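As a side note on the listings above, the `78` immediate on `vpshufd` is the `control` constant from the C# source. The following is a minimal scalar sketch of how that byte selects lanes in a 4 x 32-bit shuffle. `ShuffleControlDemo` and its `Shuffle` helper are hypothetical names used only for illustration; they model the lane selection and are not how the JIT or the hardware implements it.

```csharp
using System;

public static class ShuffleControlDemo
{
    // Same constant expression as `control` in the question.
    public const byte Control =
        (1 << 6) |
        (0 << 4) |
        (3 << 2) |
        2;

    // Scalar model of a 4-lane shuffle: destination lane i takes the
    // source lane named by bits [2*i+1 : 2*i] of the control byte.
    public static uint[] Shuffle(uint[] src, byte control)
    {
        var dst = new uint[4];
        for (int i = 0; i < 4; i++)
        {
            dst[i] = src[(control >> (2 * i)) & 3];
        }
        return dst;
    }

    public static void Main()
    {
        Console.WriteLine(Control); // prints 78

        // Lanes are selected as [2, 3, 0, 1], i.e. the two 64-bit halves swap.
        uint[] swapped = Shuffle(new uint[] { 10, 11, 12, 13 }, Control);
        Console.WriteLine(string.Join(", ", swapped)); // prints 12, 13, 10, 11
    }
}
```

Swapping the 64-bit halves and adding is what folds the two partial sums in lanes 0 and 2 into lane 0, which `s2[0]` then reads.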

I would have expected the JIT to figure out the optimal register allocation, but maybe it doesn't understand that Vector256 addition is commutative? Is this something that should be reported as a potential codegen improvement, or is there a deeper reason for this?
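To make the commutativity point concrete, here is a small sketch that reduces the same vector with both addend orders and compares the results. It assumes .NET 7 or later and uses the portable `Vector128`/`Vector256` APIs instead of the `Avx2`/`Sse2` intrinsics from the question, so it should also run where AVX2 is unavailable. `AddendOrderCheck`, `ReduceUpperFirst`, `ReduceLowerFirst`, and `Finish` are hypothetical names. It only checks that the two orders agree on the value; it says nothing about which registers the JIT picks.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class AddendOrderCheck
{
    // Mirrors SumWidened: upper half first.
    public static uint ReduceUpperFirst(Vector256<uint> v)
    {
        Vector128<uint> s1 = v.GetUpper() + v.GetLower();
        return Finish(s1);
    }

    // Mirrors SumWidened2: lower half first.
    public static uint ReduceLowerFirst(Vector256<uint> v)
    {
        Vector128<uint> s1 = v.GetLower() + v.GetUpper();
        return Finish(s1);
    }

    // Portable stand-in for Sse2.Shuffle(s1, 78) followed by the add:
    // indices [2, 3, 0, 1] swap the two 64-bit halves.
    private static uint Finish(Vector128<uint> s1)
    {
        Vector128<uint> swapped = Vector128.Shuffle(s1, Vector128.Create(2u, 3u, 0u, 1u));
        return (s1 + swapped).GetElement(0);
    }

    public static void Main()
    {
        // Partial sums sit in the even lanes, as they do after vpsadbw.
        Vector256<uint> v = Vector256.Create(1u, 0u, 2u, 0u, 3u, 0u, 4u, 0u);

        Console.WriteLine(ReduceUpperFirst(v)); // prints 10
        Console.WriteLine(ReduceLowerFirst(v)); // prints 10
    }
}
```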

Also shouldn't it use vmovdqa and not vmovaps? I have seen some vague claims that it may incur a "cross-domain" penalty if integer operations are performed afterwards, but I couldn't quickly find it in the Intel® 64 and IA-32 Architectures Software Developer’s Manuals, so I don't know if that is true. But if it isn't, then why would there be any need for different mov instruction variants for integers and singles/doubles? Maybe it's only relevant when loading from memory and not copying from a register?

tannergooding · 2025-04-22T16:08:52Z

Collaborator

> Also shouldn't it use vmovdqa and not vmovaps? I have seen some vague claims that it may incur a "cross-domain" penalty if integer operations are performed afterwards, but I couldn't quickly find it in the Intel® 64 and IA-32 Architectures Software Developer’s Manuals, so I don't know if that is true. But if it isn't, then why would there be any need for different mov instruction variants for integers and singles/doubles?

Moves haven't had a cross-domain penalty for well over 15 years at this point; they just touch raw bits (much like shuffles and a few other instructions). In typical scenarios they don't even use an execution port and are fully elided by the register renamer as part of instruction decoding producing the microcode stream.

The different moves mainly exist for historical reasons and because they can make a difference for things like embedded masking, embedded broadcast, or even the atomicity guarantees a given piece of hardware may want to make. Some of AVX512 correspondingly "renamed" various instructions to just indicate bit width instead of "floating-point vs integer" (but sometimes both still exist to maintain historical compat).

> I would have expected that the JIT could figure the optimal register allocation out

This is a relatively minor edge case that is unlikely to impact perf (as it's fully elided on most computers from the past 15 years). It's just a scenario that hasn't been specially handled yet since the pattern is a bit atypical.

2 replies

poizan42 Apr 23, 2025
Author

I think that was mostly what I suspected.

> This is a relatively minor edge case that is unlikely to impact perf (as it's fully elided on most computers from the past 15 years). It's just a scenario that hasn't been specially handled yet since the pattern is a bit atypical.

I couldn't measure any performance impact on any of the machines I benchmarked on. All else being equal, smaller code size is still better, but I guess it doesn't really matter unless it is a very simple fix to RyuJIT. Do you think it makes sense to create an issue for this?

saucecontrol Apr 24, 2025

I believe #433 covers what you're seeing. In both cases, the extra movaps is codegen for GetLower. I plan on getting in a PR to deal with it soon.
