[BUG]: CudaException: 'too many resources requested for launch' after upgrading to 1.5.2 #1323
The changes in v1.5.2 can be seen here: They are all relatively minor updates. The largest change was probably the functionality to use the more performant LibDevice calls instead of the standard XMath functions. Do any of the changes look related to your program?
Thanks for the quick response. We make heavy use of MathF.* related functions, but I don't see exactly how that could suddenly cause resource/memory issues. This commit looks like other CUDA SDKs might now be used, but I'm not sure whether that could cause our issues. Is there a flag to disable the changes related to the LibDevice calls? That would be a fast way to narrow down our problem.
hi @afmg-aherzog,
Cuda implements its math functions using the LibDevice library. For technical reasons, it was not possible to use LibDevice in ILGPU until v1.4. Prior to this, ILGPU provided its own (less performant) implementation of several math functions. The "too many resources requested" likely comes from the ILGPU implementation of the math functions. You will likely get a better result from enabling LibDevice support when creating your ILGPU context. The downside is that the target device needs a copy of NVVM and LibDevice.

In v1.4, LibDevice support in ILGPU was standalone - you would only get access if you called LibDevice functions directly. In v1.5.2, when you enable LibDevice support, all the standard math functions use the LibDevice implementations instead. In v2.x, we plan to deprecate the ILGPU math implementations and always use LibDevice.
ILGPU achieved LibDevice support by using the Cuda NVVM library. This commit allowed older versions of NVVM to be used.
If you are not enabling LibDevice when creating your ILGPU context, it should not take effect.
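For reference, a minimal sketch of what enabling LibDevice at context creation looks like. The builder method names follow the ILGPU API as documented, but treat the exact calls as an assumption that may differ between versions:

```csharp
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

// Sketch: opt in to LibDevice support when building the ILGPU context.
// This requires the Cuda SDK's NVVM and libdevice binaries to be
// locatable on the machine running the program.
using var context = Context.Create(builder => builder
    .Cuda()
    .LibDevice());

using var accelerator = context.CreateCudaAccelerator(0);

// Kernels compiled through this accelerator should now lower math
// calls to LibDevice intrinsics instead of ILGPU's own fallback
// implementations.
```

Without the `LibDevice()` call, the context behaves as described above: only the ILGPU-provided math implementations are used.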
Does this mean our users would actually need to install NVVM and LibDevice? Reading your explanation, I don't quite understand how "just upgrading" from 1.5.1 to 1.5.2 can cause new resource problems. I will try using the LibDevice implementation, but I suspect that is not the issue here.
I looked into the problem again. The first thing I have to solve is: after upgrading from 1.5.1 to 1.5.2, logic that just worked (same data, same setup, same buffer sizes) is now throwing CUDA exceptions with `too many resources requested for launch`. This problem is not observed when using an OpenCL device with the same input data/buffers.
NVVM and LibDevice are part of the Cuda SDK, and may be distributed with your application.
I am unable to determine a precise cause of your issue, so I would like to ask some questions, and get you to perform some tests.
```csharp
var kernel = accelerator.LoadAutoGroupedStreamKernel(....);
var ptx = (kernel.GetKernel()?.CompiledKernel as PTXCompiledKernel)?.PTXAssembly;
var arch = (accelerator as CudaAccelerator)?.Architecture;
var isa = (accelerator as CudaAccelerator)?.InstructionSet;
```
UPDATE: Steps I have already taken to check for issues:
Following are the beginnings of the compiled kernels in both versions and the related C# logic. If this is not enough, I can start reducing this to a minimal sample, but I would like to keep that as a last option. It might also be interesting that this is not affecting all kernels, but only two of them. The different parameter structs are listed below; the PTX shown is for the first parameter struct. The parameter objects look like this:

```csharp
public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);

public readonly record struct KernelParameters_NotWorking2(
    ArrayView<CustomParameterStruct1> ArrayView1,
    ArrayView<CustomParameterStruct2> ArrayView2,
    CustomParameterStruct3 CustomStruct1,
    float FloatParam1,
    float FloatParam2,
    int IntParam1,
    int IntParam2);

public readonly record struct KernelParameters_Working(
    int NumberOfThreads,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2);
```

The kernel is loaded like so:

```csharp
_kernelField = _accelerator.LoadKernel<KernelParameters_NotWorking1>(KernelName);
```

PTX Assembly 1.5.1
PTX Assembly 1.5.2
The kernel parameters are marshaled as a single buffer. Between the v1.5.1 PTX and the v1.5.2 PTX, the size of this buffer changed (FYI @m4rs-mt). However, I'm not sure that this would explain the too many resources issue.
The only other difference is the ISA version, which went from v8.2 to v8.7.
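Not suggested in the thread, but a common mitigation for `too many resources requested for launch` is to lower the group size, so that the kernel's per-thread register and shared-memory demand fits within the device limits. A sketch using ILGPU's explicitly grouped kernels (`MyKernel`, `view`, `length`, and the group size of 128 are placeholder assumptions, not names from this issue):

```csharp
using ILGPU;
using ILGPU.Runtime;

// Explicitly grouped kernel: indexing is done via the Grid/Group
// statics instead of an implicit index parameter.
static void MyKernel(ArrayView<float> view)
{
    var i = Grid.GlobalIndex.X;
    if (i < view.Length)
        view[i] *= 2.0f;
}

// Load with an explicit launch configuration rather than
// LoadAutoGroupedStreamKernel, so the group size can be reduced
// if the auto-selected one requests too many resources.
var kernel = accelerator.LoadStreamKernel<ArrayView<float>>(MyKernel);
int groupSize = 128; // try smaller values, e.g. 128 or 64
int numGroups = (length + groupSize - 1) / groupSize;
kernel(new KernelConfig(numGroups, groupSize), view);
```

This only works around the symptom; it does not explain why the resource demand changed between versions.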
It seems super weird to me that the size of the single buffer changes, as we did not change anything on the C# side of things.
The difference in buffer size from 136 bytes to 132 bytes is due to padding in the parameter structure. The structure is defined as:

```csharp
public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);
```

An ArrayView is stored as two 64-bit values (a 64-bit pointer and a 64-bit length), for 16 bytes in total. The field offsets would be:
In v1.5.1, the alignment is 8 bytes, so the structure is padded to 136 bytes.
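To make the padding arithmetic concrete, here is a small self-contained sketch that walks the fields using the 16-byte ArrayView representation described above, and shows how a trailing alignment of 8 versus 4 bytes yields 136 versus 132 bytes. The field sizes mirror the struct above; the two alignment values are taken from this discussion, not computed from ILGPU itself:

```csharp
using System;

public class ParamLayout
{
    // (size, alignment) per field, mirroring KernelParameters_NotWorking1.
    // An ArrayView marshals as pointer + length: 16 bytes, 8-byte aligned.
    static readonly (int Size, int Align)[] Fields =
    {
        (4, 4),            // int RequiredThreadCount
        (16, 8), (16, 8), (16, 8), (16, 8), (16, 8), (16, 8), (16, 8),
                           // seven ArrayView<> fields
        (4, 4),            // int IntParam1
        (4, 4),            // float FloatParam1
        (1, 1), (1, 1),    // byte Flag1, byte Flag2
    };

    public static int SizeWithTrailingAlignment(int structAlign)
    {
        int offset = 0;
        foreach (var (size, align) in Fields)
        {
            // Pad to the field's own alignment, then place the field.
            offset = (offset + align - 1) / align * align;
            offset += size;
        }
        // Pad the whole structure to the given trailing alignment.
        return (offset + structAlign - 1) / structAlign * structAlign;
    }

    public static void Main()
    {
        Console.WriteLine(SizeWithTrailingAlignment(8)); // 136 (v1.5.1)
        Console.WriteLine(SizeWithTrailingAlignment(4)); // 132 (v1.5.2)
    }
}
```

The raw payload ends at byte 130 (4-byte int, padding to 8, seven 16-byte views, two 4-byte fields, two bytes), so only the final round-up differs between the two versions.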
@afmg-aherzog, as discussed with @MoFtZ during our recent community meeting, we've confirmed that the invalid data alignment issue is present in ILGPU v1.5.2. We'll share a patch within the next few days and plan to release v1.5.3 shortly afterward.
Am I correct in thinking that the currently available 1.5.2 still has the issue described here?
@afmg-aherzog yes, version 1.5.2 is affected by this issue. It will be addressed in v1.5.3.
Describe the bug
After upgrading ILGPU from 1.5.1 to 1.5.2, the `Invoke` call on a kernel fails with `CudaException: too many resources requested for launch`. The number of kernels and groups is configured by my own logic. The version change was the only change and always causes this issue. Was there any change in how CUDA is configured in the 1.5.2 release?
Constructing a minimal example is not easy, as the software is quite complicated.
Environment
Steps to reproduce
Expected behavior
Kernel runs like it does with 1.5.1.
Additional context
No response