[BUG]: CudaException: 'too many resources requested for launch' after upgrading to 1.5.2 #1323


Open
afmg-aherzog opened this issue Mar 23, 2025 · 14 comments

@afmg-aherzog

Describe the bug

After upgrading ILGPU from 1.5.1 to 1.5.2, the Invoke call on a kernel fails with CudaException: too many resources requested for launch. The kernel and group configuration is computed by my own logic. The version change was the only change, and it always causes this issue.

Was there any change in how CUDA is configured in the 1.5.2 release?

Constructing a minimal example is not easy, as the software is quite complicated.

Environment

  • ILGPU version: 1.5.2
  • .NET version: .NET 9
  • Operating system: Windows 11
  • Hardware: NVIDIA GeForce RTX 3080

Steps to reproduce

  1. Create a manual KernelConfig. We use 256 threads and calculate the right group size depending on the input.
  2. Invoke the kernel.
  3. The exception is thrown.

Expected behavior

Kernel runs like it does with 1.5.1.

Additional context

No response

@MoFtZ
Collaborator

MoFtZ commented Mar 24, 2025

hi @afmg-aherzog

The changes in v1.5.2 can be seen here:
v1.5.1...v1.5.2

They are all relatively minor updates. The largest change was probably the functionality to use the more performant LibDevice calls instead of the standard XMath functions.

Do any of the changes look related to your program?

@afmg-aherzog
Author

Thanks for the quick response.

We make heavy use of MathF.*-related functions, but I don't see exactly how that could suddenly cause resource/memory issues.

This commit looks like other CUDA SDKs might now be used, but I'm not sure whether that could cause our issues.

Is there a flag to disable the changes related to the LibDevice calls? That would be a fast way to narrow down our problem.

@samy2

samy2 commented Mar 24, 2025

hi @afmg-aherzog,
LibDevice is not used by default for compatibility reasons. You can enable it explicitly, for example:

var context = Context.Create(builder => builder.Optimize(OptimizationLevel.O2).EnableAlgorithms().LibDevice().Cuda());

@MoFtZ
Collaborator

MoFtZ commented Mar 25, 2025

We make heavy use of MathF.*-related functions, but I don't see exactly how that could suddenly cause resource/memory issues.

CUDA implements its math functions using the LibDevice library. For technical reasons, it was not possible to use LibDevice in ILGPU until v1.4. Prior to this, ILGPU provided its own (less performant) implementation of several math functions.

The "too many resources requested" likely comes from the ILGPU implementation of the math functions. You will likely get a better result from enabling LibDevice support when creating your ILGPU context. The downside is that the target device needs a copy of NVVM and LibDevice.

In v1.4, LibDevice support in ILGPU was standalone - you would only get access if you called LibDevice functions directly.

In v1.5.2, when you enable LibDevice support, all supported Math and MathF functions will be switched from the ILGPU implementation to the LibDevice implementation.

In v2.x, we plan to deprecate the ILGPU math implementations and always use LibDevice.

This commit looks like other CUDA SDKs might now be used, but I'm not sure whether that could cause our issues.

ILGPU achieved LibDevice support by using the Cuda NVVM library. This commit allowed older versions of NVVM to be used.

Is there a flag to disable the changes related to the LibDevice calls? That would be a fast way to narrow down our problem.

If you are not enabling LibDevice when creating your ILGPU context, it should not take effect.

@afmg-aherzog
Author

afmg-aherzog commented Mar 25, 2025

The "too many resources requested" likely comes from the ILGPU implementation of the math functions. You will likely get a better result from enabling LibDevice support when creating your ILGPU context. The downside is that the target device needs a copy of NVVM and LibDevice.

Does this mean our user would actually need to install NVVM and LibDevice?

Reading your explanations, I don't quite understand how "just upgrading" to 1.5.2 from 1.5.1 can cause new resource problems.

I will try using the LibDevice implementation, but I suspect that should not be the issue here.

@afmg-aherzog
Author

I looked into the problem again. The first thing I have to solve is:

After upgrading to 1.5.2 from 1.5.1, logic that just worked (same data, same setup, same buffer sizes) is now throwing CUDA exceptions with too many resources requested for launch. I would assume this is due to an implicit change in configuration. Can you give me any pointers?

This problem is not observed when using an OpenCL device with the same input data/buffers.

@MoFtZ
Collaborator

MoFtZ commented Mar 31, 2025

Does this mean our user would actually need to install NVVM and LibDevice?

NVVM and LibDevice are part of the Cuda SDK, and may be distributed with your application.

Reading your explanations, I don't quite understand how "just upgrading" to 1.5.2 from 1.5.1 can cause new resource problems.
I would assume this is due to an implicit change in configuration. Can you give me any pointers?

I am unable to determine a precise cause of your issue, so I would like to ask some questions, and get you to perform some tests.

  1. I'm assuming you are using ILGPU v1.5.2 and a matching ILGPU.Algorithms v1.5.2. Is that correct?

  2. Can you please compare the generated PTX on ILGPU v1.5.1 and v1.5.2?
    I am expecting them to be mostly the same, with minor differences in variable names.
    e.g.

var kernel = accelerator.LoadAutoGroupedStreamKernel(....);
var ptx = (kernel.GetKernel()?.CompiledKernel as PTXCompiledKernel)?.PTXAssembly;

  3. Can you please check the Instruction Set used with the CudaAccelerator.
    ILGPU will try to use the newest Instruction Set, which may cause the Cuda compiler to optimize differently.

var arch = (accelerator as CudaAccelerator)?.Architecture;
var isa = (accelerator as CudaAccelerator)?.InstructionSet;

  4. Are you able to provide sample code that reproduces the issue? Or are you able to compile ILGPU from source code? It only requires the .NET SDKs.

UPDATE: Steps I have already taken to check for issues:

  1. Generated and compared the PTX on a few random samples - no differences found.
  2. Compared the MSIL of the PTXMath class, in case the .NET compiler produced different output - no differences found.
  3. Compared the commit changes in v1.5.2. The most likely culprit is the PTX Intrinsics change, but when I checked the PTX, I did not see any differences. I'm having you check as part of (2).
  4. The next most likely culprit is that ILGPU now supports more instruction sets, and the newer value is causing the Cuda driver to behave differently - I'm having you check this as part of (3).

@afmg-aherzog
Author

Below is the beginning of the compiled kernel in both versions, plus the related C# logic. If this is not enough, I can start reducing this to a minimal sample, but I would like to hold off on that as a last option.
To me it seems like the struct we use as a bundle for the kernel's input parameters changes in size and alignment. Do we need to explicitly configure that?

It might also be interesting that this is not affecting all kernels, only two of them. The different parameter structs are listed below. The PTX shown is for the first parameter struct.

The parameter objects look like this:

public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);
	
public readonly record struct KernelParameters_NotWorking2(
    ArrayView<CustomParameterStruct1> ArrayView1,
    ArrayView<CustomParameterStruct2> ArrayView2,
    CustomParameterStruct3 CustomStruct1,
    float FloatParam1,
    float FloatParam2,
    int IntParam1,
    int IntParam2);

public readonly record struct KernelParameters_Working(
    int NumberOfThreads,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2);

The kernel is loaded like so:

_kernelField = _accelerator.LoadKernel<KernelParameters_NotWorking1>(KernelName);

PTX Assembly 1.5.1

//
// Generated by ILGPU v1.5.1
//

.version 8.2
.target sm_86
.address_size 64


.visible .entry Kernel_NotWorking1(
	.param .align 8 .b8 _p_49893[136]
)
{


	.reg .pred	%p<68>;
	.reg .b16	%rs<3>;
	.reg .b32	%r<139>;
	.reg .b64	%rd<38>;
	.reg .f32	%f<206>;

	ld.param.b32	%r1, [_p_49893];
	ld.param.b64	%rd15, [_p_49893+8];
	cvta.to.global.u64	%rd1, %rd15;
	ld.param.b64	%rd2, [_p_49893+16];
	ld.param.b64	%rd15, [_p_49893+24];
	cvta.to.global.u64	%rd3, %rd15;
	ld.param.b64	%rd4, [_p_49893+32];
	ld.param.b64	%rd15, [_p_49893+40];
	cvta.to.global.u64	%rd5, %rd15;
	ld.param.b64	%rd6, [_p_49893+48];
	ld.param.b64	%rd15, [_p_49893+56];
	cvta.to.global.u64	%rd7, %rd15;
	ld.param.b64	%rd8, [_p_49893+64];
	ld.param.b64	%rd15, [_p_49893+72];
	cvta.to.global.u64	%rd9, %rd15;
	ld.param.b64	%rd10, [_p_49893+80];
	ld.param.b64	%rd15, [_p_49893+88];
	cvta.to.global.u64	%rd11, %rd15;
	ld.param.b64	%rd12, [_p_49893+96];
	ld.param.b64	%rd15, [_p_49893+104];
	cvta.to.global.u64	%rd13, %rd15;
	ld.param.b64	%rd14, [_p_49893+112];
	ld.param.b32	%r2, [_p_49893+120];
	ld.param.f32	%f19, [_p_49893+124];
	ld.param.b8	%rs1, [_p_49893+128];
	ld.param.b8	%rs2, [_p_49893+129];

PTX Assembly 1.5.2

//
// Generated by ILGPU v1.5.2
//

.version 8.7
.target sm_86
.address_size 64


.visible .entry Kernel_NotWorking1(
	.param .align 4 .b8 _p_49897[132]
)
{


	.reg .pred	%p<68>;
	.reg .b16	%rs<3>;
	.reg .b32	%r<139>;
	.reg .b64	%rd<38>;
	.reg .f32	%f<206>;

	ld.param.b32	%r1, [_p_49897];
	ld.param.b64	%rd15, [_p_49897+8];
	cvta.to.global.u64	%rd1, %rd15;
	ld.param.b64	%rd2, [_p_49897+16];
	ld.param.b64	%rd15, [_p_49897+24];
	cvta.to.global.u64	%rd3, %rd15;
	ld.param.b64	%rd4, [_p_49897+32];
	ld.param.b64	%rd15, [_p_49897+40];
	cvta.to.global.u64	%rd5, %rd15;
	ld.param.b64	%rd6, [_p_49897+48];
	ld.param.b64	%rd15, [_p_49897+56];
	cvta.to.global.u64	%rd7, %rd15;
	ld.param.b64	%rd8, [_p_49897+64];
	ld.param.b64	%rd15, [_p_49897+72];
	cvta.to.global.u64	%rd9, %rd15;
	ld.param.b64	%rd10, [_p_49897+80];
	ld.param.b64	%rd15, [_p_49897+88];
	cvta.to.global.u64	%rd11, %rd15;
	ld.param.b64	%rd12, [_p_49897+96];
	ld.param.b64	%rd15, [_p_49897+104];
	cvta.to.global.u64	%rd13, %rd15;
	ld.param.b64	%rd14, [_p_49897+112];
	ld.param.b32	%r2, [_p_49897+120];
	ld.param.f32	%f19, [_p_49897+124];
	ld.param.b8	%rs1, [_p_49897+128];
	ld.param.b8	%rs2, [_p_49897+129];

@MoFtZ
Collaborator

MoFtZ commented Apr 2, 2025

The kernel parameters are marshaled as a single buffer.

From the v1.5.1 PTX, .param .align 8 .b8 _p_49893[136] says that a parameter of 136 bytes should be aligned to 8 bytes.
From the v1.5.2 PTX, .param .align 4 .b8 _p_49897[132] says that a parameter of 132 bytes should be aligned to 4 bytes.

FYI @m4rs-mt
The alignment difference is potentially a regression from this v1.5.2 PR, which should have builder.Alignment = Math.Max(Alignment, builder.Alignment); rather than builder.Alignment = Alignment;.

However, I'm not sure that this would explain the too many resources issue.
Looking at the rest of the PTX, the number of registers pre-allocated is the same:

.reg .pred	%p<68>;
.reg .b16	%rs<3>;
.reg .b32	%r<139>;
.reg .b64	%rd<38>;
.reg .f32	%f<206>;

The only other difference is the ISA version, which went from v8.2 to v8.7.
Not sure if this affects the Cuda compiler's logic - it may use different optimizations.
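For context on where the error itself comes from: the CUDA driver rejects a launch whose combined per-block resources exceed the device limits, most commonly registers per thread times threads per block. A rough sketch of that check, assuming the sm_86 limit of 65536 registers per block (the register counts below are illustrative, not taken from this kernel):

```python
# Back-of-the-envelope check for CUDA's "too many resources requested
# for launch": the launch fails when the block's total register demand
# exceeds the per-block register file (65536 registers on sm_86).
MAX_REGS_PER_BLOCK_SM86 = 65536

def launch_fits(regs_per_thread: int, threads_per_block: int) -> bool:
    # The driver performs (roughly) this multiplication at launch time.
    return regs_per_thread * threads_per_block <= MAX_REGS_PER_BLOCK_SM86

# With 256 threads per block, the kernel may use at most 256 regs/thread.
print(launch_fits(255, 256))  # fits: 255 * 256 = 65280 <= 65536
print(launch_fits(257, 256))  # fails: 257 * 256 = 65792 > 65536
```

This is why a newer PTX ISA version alone can tip a kernel over the edge: the same source can be register-allocated differently by the driver's compiler, even when the PTX-level register declarations look identical.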

@afmg-aherzog
Author

The kernel parameters are marshaled as a single buffer.

From the v1.5.1 PTX, .param .align 8 .b8 _p_49893[136] says that a parameter of 136 bytes should be aligned to 8 bytes. From the v1.5.2 PTX, .param .align 4 .b8 _p_49897[132] says that a parameter of 132 bytes should be aligned to 4 bytes.

It seems super weird to me that the size of the single buffer changes, as we did not change anything on the C# side of things.

@MoFtZ
Collaborator

MoFtZ commented Apr 2, 2025

The difference in buffer size from 136 bytes to 132 bytes is due to padding in the parameter structure.

The structure is defined as:

public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);

An ArrayView is stored as two 64-bit values (a 64-bit pointer, and a 64-bit length) for 16 bytes in total.

The field offsets would be:
0 - RequiredThreadCount (4 bytes)
8 - ArrayView1 (16 bytes, and int64 is 8 byte aligned)
24 - ArrayView2
40 - ArrayView3
56 - ArrayView4
72 - ArrayView5
88 - ArrayView6
104 - ArrayView7
120 - IntParam1 (4 bytes)
124 - FloatParam1 (4 bytes)
128 - Flag1 (1 byte)
129 - Flag2 (1 byte)

Flag1 just happens to sit at offset 128, which is the start of both an 8-byte and a 4-byte alignment boundary.

In v1.5.1, the alignment is 8 bytes, so the structure is padded to 136 bytes.
In v1.5.2, the alignment is 4 bytes, so the structure is padded to 132 bytes.
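The padding arithmetic can be sketched as follows; the 130-byte unpadded size comes from the field offsets listed above (Flag2 at offset 129, 1 byte wide), and only the trailing padding differs between versions:

```python
# Struct sizes are rounded up to a multiple of the struct's alignment.
def round_up(size: int, alignment: int) -> int:
    return (size + alignment - 1) // alignment * alignment

# Last field (Flag2) occupies offset 129, so the unpadded size is 130.
unpadded_size = 129 + 1

print(round_up(unpadded_size, 8))  # 136 - v1.5.1, 8-byte alignment
print(round_up(unpadded_size, 4))  # 132 - v1.5.2, 4-byte alignment
```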

@m4rs-mt
Owner

m4rs-mt commented Apr 13, 2025

@afmg-aherzog, as discussed with @MoFtZ during our recent community meeting, we've confirmed that the invalid data alignment issue is present in ILGPU v1.5.2. We'll share a patch within the next few days and plan to release v1.5.3 shortly afterward.

@afmg-aherzog
Author

@afmg-aherzog, as discussed with @MoFtZ during our recent community meeting, we've confirmed that the invalid data alignment issue is present in ILGPU v1.5.2. We'll share a patch within the next few days and plan to release v1.5.3 shortly afterward.

Am I correct in thinking that the currently available 1.5.2 still has the issue described here?

@m4rs-mt
Owner

m4rs-mt commented May 19, 2025

@afmg-aherzog yes, version 1.5.2 is affected by this issue. It will be addressed in v1.5.3.
