DiamonDinoia
Contributor

Dear @serge-sans-paille,

I made a rough implementation of masked load/store. Before I hash it out, can I have some early feedback?

Thanks,
Marco

@serge-sans-paille
Contributor

Interesting. Before going into the details, what are the memory effects of a masked load, in terms of memory read and value stored? Are the masked elements set to 0, or to an undefined value? AVX and similar seem to set them to zero.

I'm not sold on the common implementation, which looks quite heavy on scalar operations. I can see that we can't do a plain load followed by an AND, because that could access unallocated memory. If the mask were constant, we could statically optimize some common patterns, but with a dynamic mask as you propose...
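To make the trade-off concrete, here is a minimal scalar sketch of a dynamic-mask load. The name `masked_load_scalar` and the `std::array` stand-in for a batch are hypothetical, not xsimd's API or the PR's code. It assumes masked-off lanes are zeroed, as the AVX masked-load instructions do; it reads only the selected lanes, which is what keeps it safe near the end of an allocation and also what makes it lane-by-lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical scalar fallback for a masked load (illustration only).
// Assumption: lanes whose mask entry is false are set to zero and are
// never read from memory.
template <class T, std::size_t N>
std::array<T, N> masked_load_scalar(const T* src, const std::array<bool, N>& mask)
{
    assert(src != nullptr);
    std::array<T, N> out{}; // value-initialized: masked-off lanes stay 0
    for (std::size_t i = 0; i < N; ++i)
    {
        if (mask[i])
        {
            out[i] = src[i]; // only selected lanes touch memory
        }
    }
    return out;
}
```

A full-width load followed by an AND would instead read all `N` lanes, including any that lie past the end of the buffer.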

@DiamonDinoia
Contributor Author

DiamonDinoia commented Aug 22, 2025

Some thoughts:

  1. Undefined for masked values, since depending on the operation 0 or 1 might be the correct value. In that case users could use the mask itself to initialize the values. Also, imagine I want to populate the even elements from one memory location and the odd ones from another. Masked loads (I think) are faster than a gather.
  2. We could remove the dynamic mask entirely. I added it for completeness.
  3. We could do it à la VCL and have a load partial and a store partial that only optimize for head and tail. I preferred this solution as I'm assuming xsimd users know the performance implications of the API.
  4. For now this is fast only on AVX and AVX2. On SSE, even though it is heavy on scalar operations, it is slow only when reading bytes or shorts. We could optimize these cases.
  5. I'm not sure about SVE/NEON.
  6. With static masks, if the first and the last elements are read, it is possible to do load + AND.
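Point 6 can be sketched in portable C++ as follows. `static_mask` and `load_static_mask` are hypothetical names for illustration, not xsimd types, and plain loops stand in for the vector load and the AND. The sketch assumes `src` points into a single contiguous object, so that if the first and last lanes are readable, every lane in between is readable too.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical compile-time mask (illustration only, not an xsimd type).
template <bool... Bits>
struct static_mask
{
    static_assert(sizeof...(Bits) > 0, "mask must have at least one lane");
    static constexpr std::size_t size = sizeof...(Bits);

    static constexpr bool get(std::size_t i)
    {
        constexpr bool bits[] = {Bits...};
        return bits[i];
    }
};

template <class T, class Mask>
std::array<T, Mask::size> load_static_mask(const T* src, Mask)
{
    assert(src != nullptr);
    constexpr std::size_t N = Mask::size;
    std::array<T, N> out{};
    if (Mask::get(0) && Mask::get(N - 1))
    {
        // First and last lanes are requested, so [src, src + N) is readable:
        // do a full load (stand-in for one vector load) ...
        for (std::size_t i = 0; i < N; ++i) out[i] = src[i];
        // ... then clear the unselected lanes (stand-in for AND with a constant).
        for (std::size_t i = 0; i < N; ++i)
            if (!Mask::get(i)) out[i] = T{};
    }
    else
    {
        // Otherwise touch only the selected lanes.
        for (std::size_t i = 0; i < N; ++i)
            if (Mask::get(i)) out[i] = src[i];
    }
    return out;
}
```

Because the condition depends only on the mask type, a compiler can fold the branch away. A real implementation would likely select the path at compile time.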

In general, I use these operations when I want to vectorize an inner loop whose trip count is not a multiple of the SIMD width. This is a small inner loop nested in a loop that is executed many times. Depending on the operations, padding is sometimes slower than masking.
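As a rough illustration of that use case, the sketch below runs an `axpy`-style loop over `n` elements and handles the remainder with partial load/store helpers. The width `W`, the `vec` alias, and the `load_partial` / `store_partial` helpers are all assumptions made for this example, loosely modeled on the head/tail idea mentioned above. They are scalar stand-ins, not xsimd or VCL functions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t W = 4;      // assumed SIMD width in lanes
using vec = std::array<float, W>; // stand-in for a SIMD register

// Hypothetical helper: read the first n lanes, zero the rest.
inline vec load_partial(const float* p, std::size_t n)
{
    assert(n <= W);
    vec v{};
    std::copy(p, p + n, v.begin());
    return v;
}

// Hypothetical helper: write only the first n lanes.
inline void store_partial(float* p, const vec& v, std::size_t n)
{
    assert(n <= W);
    std::copy(v.begin(), v.begin() + n, p);
}

// y[i] += a * x[i] for i in [0, n); n need not be a multiple of W.
inline void axpy(float a, const float* x, float* y, std::size_t n)
{
    std::size_t i = 0;
    for (; i + W <= n; i += W) // full-width body
    {
        vec vx = load_partial(x + i, W);
        vec vy = load_partial(y + i, W);
        for (std::size_t l = 0; l < W; ++l) vy[l] += a * vx[l];
        store_partial(y + i, vy, W);
    }
    const std::size_t tail = n - i; // 0 .. W-1 leftover elements
    if (tail != 0)
    {
        vec vx = load_partial(x + i, tail);
        vec vy = load_partial(y + i, tail);
        for (std::size_t l = 0; l < W; ++l) vy[l] += a * vx[l];
        store_partial(y + i, vy, tail); // lanes past the tail are not written
    }
}
```

The tail iteration reuses the same arithmetic as the body, so no padding of `x` or `y` is needed and memory past `n` is never read or written.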
