[BUG]: CudaException: 'too many resources requested for launch' after upgrading to 1.5.2 #1323


Open
afmg-aherzog opened this issue Mar 23, 2025 · 14 comments

@afmg-aherzog

Describe the bug

After upgrading ILGPU from 1.5.1 to 1.5.2, the Invoke call on a kernel fails with CudaException: too many resources requested for launch. The kernel and group configuration is computed by my own logic. The version change was the only change, and it always causes this issue.

Was there any change in how CUDA is configured in the 1.5.2 release?

Constructing a minimal example is not easy, as the software is quite complicated.

Environment

  • ILGPU version: 1.5.2
  • .NET version: .NET 9
  • Operating system: Windows 11
  • Hardware: NVIDIA GeForce RTX 3080

Steps to reproduce

  1. Create a manual KernelConfig. We use 256 threads and calculate the right group size depending on the input.
  2. Invoke the kernel.
  3. The exception is thrown.

Expected behavior

Kernel runs like it does with 1.5.1.

Additional context

No response

@MoFtZ
Collaborator

MoFtZ commented Mar 24, 2025

hi @afmg-aherzog

The changes in v1.5.2 can be seen here:
v1.5.1...v1.5.2

They are all relatively minor updates. The largest change was probably the functionality to use the more performant LibDevice calls instead of the standard XMath functions.

Do any of the changes look related to your program?

@afmg-aherzog
Author

Thanks for the quick response.

We make heavy use of MathF.*-related functions, but I don't see exactly how that could suddenly cause resource/memory issues.

This commit looks like other CUDA SDKs might now be used, but I'm not sure whether that could cause our issues.

Is there a flag to disable the changes related to the LibDevice calls? That would be a fast way to narrow down our problem.

@samy2

samy2 commented Mar 24, 2025

hi @afmg-aherzog,
LibDevice is not used by default for compatibility reasons. You can enable it explicitly, for example:

var context = Context.Create(builder => builder.Optimize(OptimizationLevel.O2).EnableAlgorithms().LibDevice().Cuda());

@MoFtZ
Collaborator

MoFtZ commented Mar 25, 2025

We make heavy use of MathF.*-related functions, but I don't see exactly how that could suddenly cause resource/memory issues.

CUDA implements its math functions using the LibDevice library. For technical reasons, it was not possible to use LibDevice in ILGPU until v1.4. Prior to this, ILGPU provided its own (less performant) implementation of several math functions.

The "too many resources requested" likely comes from the ILGPU implementation of the math functions. You will likely get a better result from enabling LibDevice support when creating your ILGPU context. The downside is that the target device needs a copy of NVVM and LibDevice.

In v1.4, LibDevice support in ILGPU was standalone - you would only get access if you called LibDevice functions directly.

In v1.5.2, when you enable LibDevice support, all supported Math and MathF functions will be switched from the ILGPU implementation to the LibDevice implementation.

In v2.x, we plan to deprecate the ILGPU math implementations and always use LibDevice.

This commit looks like other CUDA SDKs might now be used, but I'm not sure whether that could cause our issues.

ILGPU achieved LibDevice support by using the Cuda NVVM library. This commit allowed older versions of NVVM to be used.

Is there a flag to disable the changes related to the LibDevice calls? That would be a fast way to narrow down our problem.

If you are not enabling LibDevice when creating your ILGPU context, it should not take effect.

@afmg-aherzog
Author

afmg-aherzog commented Mar 25, 2025

The "too many resources requested" likely comes from the ILGPU implementation of the math functions. You will likely get a better result from enabling LibDevice support when creating your ILGPU context. The downside is that the target device needs a copy of NVVM and LibDevice.

Does this mean our user would actually need to install NVVM and LibDevice?

Reading your explanations, I don't quite understand how "just upgrading" to 1.5.2 from 1.5.1 can cause new resource problems.

I will try using the LibDevice implementation, but I suspect that should not be the issue here.

@afmg-aherzog
Author

I looked into the problem again. The first thing I have to solve is:

After upgrading to 1.5.2 from 1.5.1, logic that just worked (same data, same setup, same buffer sizes) is now throwing CUDA exceptions with too many resources requested for launch. I would assume this is due to an implicit change in configuration. Can you give me any pointers?

This problem is not observed when using an OpenCL device with the same input data/buffers.

@MoFtZ
Collaborator

MoFtZ commented Mar 31, 2025

Does this mean our user would actually need to install NVVM and LibDevice?

NVVM and LibDevice are part of the Cuda SDK, and may be distributed with your application.

Reading your explanations, I don't quite understand how "just upgrading" to 1.5.2 from 1.5.1 can cause new resource problems.
I would assume this is due to an implicit change in configuration. Can you give me any pointers?

I am unable to determine a precise cause of your issue, so I would like to ask some questions, and get you to perform some tests.

  1. I'm assuming you are using ILGPU v1.5.2 and a matching ILGPU.Algorithms v1.5.2. Is that correct?

  2. Can you please compare the generated PTX on ILGPU v1.5.1 and v1.5.2?
    I am expecting them to be mostly the same, with minor differences in variable names.
    e.g.

var kernel = accelerator.LoadAutoGroupedStreamKernel(....);
var ptx = (kernel.GetKernel()?.CompiledKernel as PTXCompiledKernel)?.PTXAssembly;

  3. Can you please check the Instruction Set used with the CudaAccelerator.
    ILGPU will try to use the newest Instruction Set, which may cause the Cuda compiler to optimize differently.

var arch = (accelerator as CudaAccelerator)?.Architecture;
var isa = (accelerator as CudaAccelerator)?.InstructionSet;

  4. Are you able to provide sample code that reproduces the issue? Or are you able to compile ILGPU from source code? It only requires the .NET SDKs.

UPDATE: Steps I have already taken to check for issues:

  1. Generated and compared the PTX on a few random samples - no differences found.
  2. Compared the MSIL of the PTXMath class, in case the .NET compiler produced different output - no differences found.
  3. Compared the commit changes in v1.5.2. The most likely culprit is the PTX Intrinsics change, but when I checked the PTX, I did not see any differences. I'm having you check as part of (2).
  4. The next most likely culprit is that ILGPU now supports more instruction sets, and the newer value is causing the Cuda driver to behave differently - I'm having you check this as part of (3).

@afmg-aherzog
Author

Below is the beginning of the compiled kernel in both versions, plus the related C# logic. If this is not enough, I can start reducing this to a minimal sample, but I would like to hold off on that as a last option.
To me it seems like the struct we use as a bundle for the kernel's input parameters changes in size and alignment. Do we need to explicitly configure that?

It might also be interesting that this is not affecting all kernels, only two of them. The different parameter structs are listed below. The PTX shown is for the first parameter struct.

The parameter objects look like this:

public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);
	
public readonly record struct KernelParameters_NotWorking2(
    ArrayView<CustomParameterStruct1> ArrayView1,
    ArrayView<CustomParameterStruct2> ArrayView2,
    CustomParameterStruct3 CustomStruct1,
    float FloatParam1,
    float FloatParam2,
    int IntParam1,
    int IntParam2);

public readonly record struct KernelParameters_Working(
    int NumberOfThreads,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2);

The kernel is loaded like so:

_kernelField = _accelerator.LoadKernel<KernelParameters_NotWorking1>(KernelName);

PTX Assembly 1.5.1

//
// Generated by ILGPU v1.5.1
//

.version 8.2
.target sm_86
.address_size 64


.visible .entry Kernel_NotWorking1(
	.param .align 8 .b8 _p_49893[136]
)
{


	.reg .pred	%p<68>;
	.reg .b16	%rs<3>;
	.reg .b32	%r<139>;
	.reg .b64	%rd<38>;
	.reg .f32	%f<206>;

	ld.param.b32	%r1, [_p_49893];
	ld.param.b64	%rd15, [_p_49893+8];
	cvta.to.global.u64	%rd1, %rd15;
	ld.param.b64	%rd2, [_p_49893+16];
	ld.param.b64	%rd15, [_p_49893+24];
	cvta.to.global.u64	%rd3, %rd15;
	ld.param.b64	%rd4, [_p_49893+32];
	ld.param.b64	%rd15, [_p_49893+40];
	cvta.to.global.u64	%rd5, %rd15;
	ld.param.b64	%rd6, [_p_49893+48];
	ld.param.b64	%rd15, [_p_49893+56];
	cvta.to.global.u64	%rd7, %rd15;
	ld.param.b64	%rd8, [_p_49893+64];
	ld.param.b64	%rd15, [_p_49893+72];
	cvta.to.global.u64	%rd9, %rd15;
	ld.param.b64	%rd10, [_p_49893+80];
	ld.param.b64	%rd15, [_p_49893+88];
	cvta.to.global.u64	%rd11, %rd15;
	ld.param.b64	%rd12, [_p_49893+96];
	ld.param.b64	%rd15, [_p_49893+104];
	cvta.to.global.u64	%rd13, %rd15;
	ld.param.b64	%rd14, [_p_49893+112];
	ld.param.b32	%r2, [_p_49893+120];
	ld.param.f32	%f19, [_p_49893+124];
	ld.param.b8	%rs1, [_p_49893+128];
	ld.param.b8	%rs2, [_p_49893+129];

PTX Assembly 1.5.2

//
// Generated by ILGPU v1.5.2
//

.version 8.7
.target sm_86
.address_size 64


.visible .entry Kernel_NotWorking1(
	.param .align 4 .b8 _p_49897[132]
)
{


	.reg .pred	%p<68>;
	.reg .b16	%rs<3>;
	.reg .b32	%r<139>;
	.reg .b64	%rd<38>;
	.reg .f32	%f<206>;

	ld.param.b32	%r1, [_p_49897];
	ld.param.b64	%rd15, [_p_49897+8];
	cvta.to.global.u64	%rd1, %rd15;
	ld.param.b64	%rd2, [_p_49897+16];
	ld.param.b64	%rd15, [_p_49897+24];
	cvta.to.global.u64	%rd3, %rd15;
	ld.param.b64	%rd4, [_p_49897+32];
	ld.param.b64	%rd15, [_p_49897+40];
	cvta.to.global.u64	%rd5, %rd15;
	ld.param.b64	%rd6, [_p_49897+48];
	ld.param.b64	%rd15, [_p_49897+56];
	cvta.to.global.u64	%rd7, %rd15;
	ld.param.b64	%rd8, [_p_49897+64];
	ld.param.b64	%rd15, [_p_49897+72];
	cvta.to.global.u64	%rd9, %rd15;
	ld.param.b64	%rd10, [_p_49897+80];
	ld.param.b64	%rd15, [_p_49897+88];
	cvta.to.global.u64	%rd11, %rd15;
	ld.param.b64	%rd12, [_p_49897+96];
	ld.param.b64	%rd15, [_p_49897+104];
	cvta.to.global.u64	%rd13, %rd15;
	ld.param.b64	%rd14, [_p_49897+112];
	ld.param.b32	%r2, [_p_49897+120];
	ld.param.f32	%f19, [_p_49897+124];
	ld.param.b8	%rs1, [_p_49897+128];
	ld.param.b8	%rs2, [_p_49897+129];

@MoFtZ
Collaborator

MoFtZ commented Apr 2, 2025

The kernel parameters are marshaled as a single buffer.

From the v1.5.1 PTX, .param .align 8 .b8 _p_49893[136] says that a parameter of 136 bytes should be aligned to 8 bytes.
From the v1.5.2 PTX, .param .align 4 .b8 _p_49897[132] says that a parameter of 132 bytes should be aligned to 4 bytes.

FYI @m4rs-mt
The alignment difference is potentially a regression from this v1.5.2 PR, which should have builder.Alignment = Math.Max(Alignment, builder.Alignment); rather than builder.Alignment = Alignment;.

However, I'm not sure that this would explain the too many resources issue.
Looking at the rest of the PTX, the number of registers pre-allocated is the same:

.reg .pred	%p<68>;
.reg .b16	%rs<3>;
.reg .b32	%r<139>;
.reg .b64	%rd<38>;
.reg .f32	%f<206>;

The only other difference is the ISA version, which went from v8.2 to v8.7.
Not sure if this affects the Cuda compiler's logic - it may use different optimizations.
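For context on where the error itself comes from: the CUDA driver rejects a launch whose combined per-block resources exceed the device limits, most commonly registers per thread times threads per block. A rough sketch of that check, assuming the sm_86 limit of 65536 registers per block (the register counts below are illustrative, not taken from this kernel):

```python
# Back-of-the-envelope check for CUDA's "too many resources requested
# for launch": the launch fails when the block's total register demand
# exceeds the per-block register file (65536 registers on sm_86).
MAX_REGS_PER_BLOCK_SM86 = 65536

def launch_fits(regs_per_thread: int, threads_per_block: int) -> bool:
    # The driver performs (roughly) this multiplication at launch time.
    return regs_per_thread * threads_per_block <= MAX_REGS_PER_BLOCK_SM86

# With 256 threads per block, the kernel may use at most 256 regs/thread.
print(launch_fits(255, 256))  # fits: 255 * 256 = 65280 <= 65536
print(launch_fits(257, 256))  # fails: 257 * 256 = 65792 > 65536
```

This is why a newer PTX ISA version alone can tip a kernel over the edge: the same source can be register-allocated differently by the driver's compiler, even when the PTX-level register declarations look identical.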

@afmg-aherzog
Author

The kernel parameters are marshaled as a single buffer.

From the v1.5.1 PTX, .param .align 8 .b8 _p_49893[136] says that a parameter of 136 bytes should be aligned to 8 bytes. From the v1.5.2 PTX, .param .align 4 .b8 _p_49897[132] says that a parameter of 132 bytes should be aligned to 4 bytes.

It seems super weird to me that the size of the single buffer changes, as we did not change anything on the C# side of things.

@MoFtZ
Collaborator

MoFtZ commented Apr 2, 2025

The difference in buffer size from 136 bytes to 132 bytes is due to padding in the parameter structure.

The structure is defined as:

public readonly record struct KernelParameters_NotWorking1(
    int RequiredThreadCount,
    ArrayView<float> ArrayView1,
    ArrayView<float> ArrayView2,
    ArrayView<float> ArrayView3,
    ArrayView<Complex> ArrayView4,
    ArrayView<float> ArrayView5,
    ArrayView<CustomParameterStruct1> ArrayView6,
    ArrayView<float> ArrayView7,
    int IntParam1,
    float FloatParam1,
    byte Flag1,
    byte Flag2);

An ArrayView is stored as two 64-bit values (a 64-bit pointer, and a 64-bit length) for 16 bytes in total.

The field offsets would be:
0 - RequiredThreadCount (4 bytes)
8 - ArrayView1 (16 bytes, and int64 is 8 byte aligned)
24 - ArrayView2
40 - ArrayView3
56 - ArrayView4
72 - ArrayView5
88 - ArrayView6
104 - ArrayView7
120 - IntParam1 (4 bytes)
124 - FloatParam1 (4 bytes)
128 - Flag1 (1 byte)
129 - Flag2 (1 byte)

Flag1 just happens to sit at offset 128, which is the start of both an 8-byte and a 4-byte alignment boundary.

In v1.5.1, the alignment is 8 bytes, so the structure is padded to 136 bytes.
In v1.5.2, the alignment is 4 bytes, so the structure is padded to 132 bytes.
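The padding arithmetic can be sketched as follows; the 130-byte unpadded size comes from the field offsets listed above (Flag2 at offset 129, 1 byte wide), and only the trailing padding differs between versions:

```python
# Struct sizes are rounded up to a multiple of the struct's alignment.
def round_up(size: int, alignment: int) -> int:
    return (size + alignment - 1) // alignment * alignment

# Last field (Flag2) occupies offset 129, so the unpadded size is 130.
unpadded_size = 129 + 1

print(round_up(unpadded_size, 8))  # 136 - v1.5.1, 8-byte alignment
print(round_up(unpadded_size, 4))  # 132 - v1.5.2, 4-byte alignment
```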

@m4rs-mt
Owner

m4rs-mt commented Apr 13, 2025

@afmg-aherzog, as discussed with @MoFtZ during our recent community meeting, we've confirmed that the invalid data alignment issue is present in ILGPU v1.5.2. We'll share a patch within the next few days and plan to release v1.5.3 shortly afterward.

@afmg-aherzog
Author

@afmg-aherzog, as discussed with @MoFtZ during our recent community meeting, we've confirmed that the invalid data alignment issue is present in ILGPU v1.5.2. We'll share a patch within the next few days and plan to release v1.5.3 shortly afterward.

Am I correct in thinking that the currently available 1.5.2 still has the issue described here?

@m4rs-mt
Owner

m4rs-mt commented May 19, 2025

@afmg-aherzog yes, version 1.5.2 is affected by this issue. It will be addressed in v1.5.3.
