Unnecessary vmovaps emitted depending on order of addends? #114884
While doing some benchmarks on some SIMD code, I noticed something peculiar about the codegen (only checked in .NET 10.0 Preview 3 and the "trunk" version used by godbolt for now). The following two methods differ only in the order of the addends on the marked line, yet they emit different code (one of them contains an extra vmovaps):

    public static uint SumWidened(Vector256<byte> a, Vector256<byte> b)
    {
        var zero = Vector256<byte>.Zero;
        Vector256<ushort> sadLo = Avx2.SumAbsoluteDifferences(a, zero);
        Vector256<ushort> sadHi = Avx2.SumAbsoluteDifferences(b, zero);
        var sadSum = Avx2.Add(sadLo.AsUInt32(), sadHi.AsUInt32());
        Vector128<uint> s1 = sadSum.GetUpper() + sadSum.GetLower();
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1, control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        return sum;
    }

    public static uint SumWidened2(Vector256<byte> a, Vector256<byte> b)
    {
        var zero = Vector256<byte>.Zero;
        Vector256<ushort> sadLo = Avx2.SumAbsoluteDifferences(a, zero);
        Vector256<ushort> sadHi = Avx2.SumAbsoluteDifferences(b, zero);
        var sadSum = Avx2.Add(sadLo.AsUInt32(), sadHi.AsUInt32());
        Vector128<uint> s1 = sadSum.GetLower() + sadSum.GetUpper(); // <-- SWAPPED
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1, control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        return sum;
    }

I would have expected the JIT to figure out the optimal register allocation, but maybe it doesn't understand that Vector256 addition is commutative? Is this something that should be reported as a potential codegen improvement, or is there a deeper reason for this? Also shouldn't it use
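Since only the operand order differs, both orderings must return the same value, so either one can be used safely. A minimal sketch of that check follows, assuming .NET 7 or later for the `+` operator on `Vector128<T>`; the class and method names (`OrderCheck`, `SumUpperFirst`, `SumLowerFirst`) are hypothetical and not part of the original post.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class OrderCheck
{
    // Hypothetical helper: adds the upper half to the lower half (upper operand first).
    public static uint SumUpperFirst(Vector256<uint> v)
    {
        Vector128<uint> s = v.GetUpper() + v.GetLower();
        return s.GetElement(0);
    }

    // Hypothetical helper: same addition with the operands swapped.
    public static uint SumLowerFirst(Vector256<uint> v)
    {
        Vector128<uint> s = v.GetLower() + v.GetUpper();
        return s.GetElement(0);
    }

    public static void Main()
    {
        Vector256<uint> v = Vector256.Create(1u, 2u, 3u, 4u, 10u, 20u, 30u, 40u);

        // Element 0 of the lower half is 1, element 0 of the upper half is 10.
        Console.WriteLine(SumUpperFirst(v) == SumLowerFirst(v));
    }
}
```

This only confirms that the results match; whether the two forms compile to the same instructions still has to be checked in the JIT disassembly.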
Replies: 1 comment 2 replies
Moves haven't had a cross-domain penalty for well over 15 years at this point; they just touch raw bits (much like shuffles and a few other instructions). They don't even use an execution port in typical scenarios and are fully elided by the register renamer as part of instruction decoding, which produces the microcode stream. The different moves mainly exist for historical reasons and because they can make a difference for things like embedded masking, embedded broadcast, or even atomicity guarantees a given piece of hardware may want to make. Some of AVX512 correspondingly "renamed" various instructions to indicate just the bit width instead of "floating-point vs integer" (but sometimes both still exist to maintain historical compat).
This is a relatively minor edge case that is unlikely to impact perf (as it's fully elided on most computers from the past 15 years). It's just a scenario that hasn't been specially handled yet, since the pattern is a bit atypical.