Defining DLPack types with lane_count > 1
#1129
-
Goal

I'd like to pass SIMD vector types (ARM Neon half-precision vectors) from C++ into Python.

Problem Statement & My Attempts

I'm hoping to define some DLPack types so I can bind code that uses Neon intrinsics. …

System Information

…
Minimum "working" example#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <arm_neon.h>
// Type shorthand for intrinsics
using fp16x8 = float16x8_t;
inline fp16x8 splat(double x) { return vdupq_n_f16(x); };
// Define DLPack types =================== The issue is in here? =======
namespace nb = nanobind;
namespace nanobind::detail {
template <> struct dtype_traits<__fp16> {
static constexpr dlpack::dtype value{
(uint8_t)dlpack::dtype_code::Float,
16, 1 /* Size in bits, SIMD lanes*/
};
static constexpr auto name = const_name("float16");
};
} // namespace nanobind::detail
int destruct_count = 0;
using Matrix3Xh = nb::ndarray<__fp16, nb::numpy, nb::shape<3, -1>>;
NB_MODULE(_c, m) {
m.def("get_fp16_from_python", [](Matrix3Xh inpt) {
for (int i = 0; i < inpt.shape(1); i++)
printf("[%f %f %f]\n", inpt(0, i), inpt(1, i), inpt(2, i));
}
);
m.def("ret_numpy_half", []() {
fp16x8 *f = new fp16x8[3]{{splat(1)}, {splat(2)}, {splat(3)}};
size_t shape[2] = {3, 8};
nb::capsule deleter(f, [](void *data) noexcept {
destruct_count++;
delete[] (__fp16 *)data;
});
return nb::ndarray<nb::numpy, __fp16, nb::shape<3, -1>>(f, 2, shape, deleter);
});
}
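For reference, here is a hypothetical variant of the above that actually sets lane_count > 1, as the title asks. The dtype_traits specialization for float16x8_t and the "float16x8" name are illustrative, not from the original post; the three initializers mirror DLPack's {code, bits, lanes} layout, and whether anything on the Python side accepts lanes != 1 is exactly what the replies below address.

```cpp
// Hypothetical sketch: declare the Neon vector type itself as a DLPack
// dtype with eight 16-bit lanes. Whether any DLPack consumer (NumPy etc.)
// accepts lanes != 1 is a separate question.
namespace nanobind::detail {
template <> struct dtype_traits<float16x8_t> {
    static constexpr dlpack::dtype value{
        (uint8_t) dlpack::dtype_code::Float,
        16,  // bits per lane
        8    // SIMD lanes
    };
    static constexpr auto name = const_name("float16x8");
};
} // namespace nanobind::detail
```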
-
Are there other frameworks that even set a lane width other than 1?
-
Of course, I don't know your requirements, but allow me to suggest specifying one lane in your dtype. DLPack is a format for data interchange, and the choices you make in using it should be as independent of implementation as possible. Since, as you mentioned, most (or all) production codes use one lane, doing the same would make it easier for users of your module to also work with other modules.

In my FFT module, for example, I specify one lane in my dtype. Nevertheless, my implementation for Intel Xeon AVX512_FP16 uses instructions that compute with a SIMD width of 32. On ARM hardware with 256-bit SVE instructions (e.g., Amazon Graviton3, SiPearl Rhea1), the SIMD width is 16. This is all transparent to the user, since Python just sees an array of half-precision floats. In some hypothetical future, Apple may support 256-bit SVE. It would be nice if you could add such support to your module without users having to adjust their code. BTW, the CPU in Europe's exascale Jupiter uses Rhea1: …

Also, I would suggest using …
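To make the suggested pattern concrete, here is a minimal sketch of a kernel that keeps a scalar, one-lane __fp16 dtype at the Python boundary while using the full Neon width internally. The function scale_inplace and its signature are invented for illustration, and the fp16 arithmetic intrinsics assume an ARMv8.2-A target with FP16 support.

```cpp
// Minimal sketch (illustrative, not from the thread): the dtype exposed to
// Python is plain one-lane __fp16, but the loop body computes on
// float16x8_t vectors, eight lanes at a time.
// (Relies on the one-lane dtype_traits<__fp16> specialization from the MWE.)
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <arm_neon.h>

namespace nb = nanobind;

void scale_inplace(nb::ndarray<__fp16, nb::ndim<1>, nb::c_contig> a, float s) {
    __fp16 *p = a.data();
    size_t n = a.shape(0);
    float16x8_t vs = vdupq_n_f16((__fp16) s);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   // vectorized main loop, 8 lanes per step
        vst1q_f16(p + i, vmulq_f16(vld1q_f16(p + i), vs));
    for (; i < n; ++i)           // scalar tail
        p[i] = (__fp16) ((float) p[i] * s);
}

// Bound the usual way; Python just sees a float16 array:
//   m.def("scale_inplace", &scale_inplace);
```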
-
The conversation here has moved beyond nanobind specifically, so I'll mark this as completed. For anyone else hoping to pass SIMD types into Python -- probably don't. Beyond portability issues and the lack of support on the Python side, the performance is not better than the standard, single-lane approach.

While I would like to see future progress in DLPack with better support for SIMD types, as it stands right now nanobind's implementation covers what I now feel is the correct pattern. Please see the helpful comments from hpkfft for more details.