Defining DLPack types with lane_count > 1
#1129
-
Goal

I'd like to pass SIMD vector types (ARM Neon half-precision vectors) from C++ into Python.

Problem Statement & My Attempts

I'm hoping to define some DLPack types so I can bind code that uses Neon intrinsics. …

System Information

…
Minimum "working" example#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <arm_neon.h>
// Type shorthand for intrinsics
using fp16x8 = float16x8_t;
inline fp16x8 splat(double x) { return vdupq_n_f16(x); };
// Define DLPack types =================== The issue is in here? =======
namespace nb = nanobind;
namespace nanobind::detail {
template <> struct dtype_traits<__fp16> {
static constexpr dlpack::dtype value{
(uint8_t)dlpack::dtype_code::Float,
16, 1 /* Size in bits, SIMD lanes*/
};
static constexpr auto name = const_name("float16");
};
} // namespace nanobind::detail
int destruct_count = 0;
using Matrix3Xh = nb::ndarray<__fp16, nb::numpy, nb::shape<3, -1>>;
NB_MODULE(_c, m) {
m.def("get_fp16_from_python", [](Matrix3Xh inpt) {
for (int i = 0; i < inpt.shape(1); i++)
printf("[%f %f %f]\n", inpt(0, i), inpt(1, i), inpt(2, i));
}
);
m.def("ret_numpy_half", []() {
fp16x8 *f = new fp16x8[3]{{splat(1)}, {splat(2)}, {splat(3)}};
size_t shape[2] = {3, 8};
nb::capsule deleter(f, [](void *data) noexcept {
destruct_count++;
delete[] (__fp16 *)data;
});
return nb::ndarray<nb::numpy, __fp16, nb::shape<3, -1>>(f, 2, shape, deleter);
});
}
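For reference, here is a hypothetical variant of the above that actually sets lane_count > 1, as the title asks. The dtype_traits specialization for float16x8_t and the "float16x8" name are illustrative, not from the original post; the three initializers mirror DLPack's {code, bits, lanes} layout, and whether anything on the Python side accepts lanes != 1 is exactly what the replies below address.

```cpp
// Hypothetical sketch: declare the Neon vector type itself as a DLPack
// dtype with eight 16-bit lanes. Whether any DLPack consumer (NumPy etc.)
// accepts lanes != 1 is a separate question.
namespace nanobind::detail {
template <> struct dtype_traits<float16x8_t> {
    static constexpr dlpack::dtype value{
        (uint8_t) dlpack::dtype_code::Float,
        16,  // bits per lane
        8    // SIMD lanes
    };
    static constexpr auto name = const_name("float16x8");
};
} // namespace nanobind::detail
```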
-
Are there other frameworks that even set a lane width other than 1?
-
Of course, I don't know your requirements, but allow me to suggest specifying one lane in your dtype. DLPack is a format for data interchange, and the choices you make in using it should be as independent of implementation as possible. Since, as you mentioned, most (or all) production codes use one lane, doing the same would make it easier for users of your module to also work with other modules.

In my FFT module, for example, I specify one lane in my dtype. Nevertheless, my implementation for Intel Xeon AVX512_FP16 uses instructions that compute with a SIMD width of 32. On ARM hardware with 256-bit SVE instructions (e.g., Amazon Graviton3, SiPearl Rhea1), the SIMD width is 16. This is all transparent to the user, since Python just sees an array of half-precision floats. In some hypothetical future, Apple may support 256-bit SVE. It would be nice if you could add such support to your module without users having to adjust their code. BTW, the CPU in Europe's exascale Jupiter uses Rhea1: …

Also, I would suggest using …
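To make the suggested pattern concrete, here is a minimal sketch of a kernel that keeps a scalar, one-lane __fp16 dtype at the Python boundary while using the full Neon width internally. The function scale_inplace and its signature are invented for illustration, and the fp16 arithmetic intrinsics assume an ARMv8.2-A target with FP16 support.

```cpp
// Minimal sketch (illustrative, not from the thread): the dtype exposed to
// Python is plain one-lane __fp16, but the loop body computes on
// float16x8_t vectors, eight lanes at a time.
// (Relies on the one-lane dtype_traits<__fp16> specialization from the MWE.)
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <arm_neon.h>

namespace nb = nanobind;

void scale_inplace(nb::ndarray<__fp16, nb::ndim<1>, nb::c_contig> a, float s) {
    __fp16 *p = a.data();
    size_t n = a.shape(0);
    float16x8_t vs = vdupq_n_f16((__fp16) s);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   // vectorized main loop, 8 lanes per step
        vst1q_f16(p + i, vmulq_f16(vld1q_f16(p + i), vs));
    for (; i < n; ++i)           // scalar tail
        p[i] = (__fp16) ((float) p[i] * s);
}

// Bound the usual way; Python just sees a float16 array:
//   m.def("scale_inplace", &scale_inplace);
```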
-
The conversation here has moved beyond nanobind specifically, so I'll mark this as completed. For anyone else hoping to pass SIMD types into Python -- probably don't. Beyond portability issues and the lack of support on the Python side, the performance is not better than the standard, single-lane approach.

While I would like to see future progress in DLPack with better support for SIMD types, as it stands right now nanobind's implementation covers what I now feel is the correct pattern. Please see the helpful comments from hpkfft for more details.