
Unable to load the kernel module 'nvidia.ko' #1390

rajendragosavi opened this issue Apr 9, 2025 · 2 comments

rajendragosavi commented Apr 9, 2025

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

I could not install the NVIDIA GPU Operator on my Kubernetes cluster. The nvidia-driver-daemonset pod is crashing with the following error:

 k logs -f nvidia-driver-daemonset-zpg9q                                    
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-570.124.06
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 570.124.06.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 570.124.06 for Linux kernel version 5.4.0-205-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-205-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
/usr/src/nvidia-570.124.06/kernel-open/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section
Relinking NVIDIA driver kernel modules...
Building NVIDIA driver package nvidia-modules-5.4.0-205...
Installing NVIDIA driver kernel modules...

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.


ERROR: Unable to open 'kernel-open/dkms.conf' for copying (No such file or directory)


Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 48 CPUs online; setting concurrency level to 32.
Unable to locate any tools for listing initramfs contents.
Unable to scan initramfs: no tool found
Installing NVIDIA driver version 570.124.06.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.4.0-205-generic/build'

Kernel output path: '/lib/modules/5.4.0-205-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules: 

  [##############################] 100%

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
[153192.646880] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153192.646919] IPv6: ADDRCONF(NETDEV_CHANGE): cali51406785c39: link becomes ready
[153193.805194] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153193.805230] IPv6: ADDRCONF(NETDEV_CHANGE): caliab4854e8ff6: link becomes ready
[153195.286443] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153195.286479] IPv6: ADDRCONF(NETDEV_CHANGE): cali229c0d72a0f: link becomes ready
[153243.698480] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153243.698514] IPv6: ADDRCONF(NETDEV_CHANGE): cali79281f20159: link becomes ready
[153244.672451] IPv6: ADDRCONF(NETDEV_CHANGE): califcbe515c63b: link becomes ready
[153298.490088] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153298.490128] IPv6: ADDRCONF(NETDEV_CHANGE): cali32eddb18bf0: link becomes ready
[153450.269906] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[153450.270849] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[153450.270852] nvidia: probe of 0000:06:00.0 failed with error -1
[153450.270871] NVRM: The NVIDIA probe routine failed for 1 device(s).
[153450.270871] NVRM: None of the NVIDIA devices were initialized.
[153450.308728] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
[153530.543373] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[153530.544316] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[153530.544319] nvidia: probe of 0000:06:00.0 failed with error -1
[153530.544341] NVRM: The NVIDIA probe routine failed for 1 device(s).
[153530.544341] NVRM: None of the NVIDIA devices were initialized.
[153530.596847] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
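
The kernel messages point at the actual failure: the modules compile fine, but NVRM reports that BAR0 for the GPU at 0000:06:00.0 is 0M @ 0x0, i.e. the PCI memory region was never assigned to the device. A quick way to double-check this from the node is the sketch below (device address taken from the log above; adjust if your GPU sits elsewhere):

 # Show the memory regions the kernel assigned to the GPU.
 # A healthy board lists non-zero "Region 0" / "Memory at ..." entries;
 # an unassigned BAR shows up as zero-sized or missing.
 sudo lspci -vvv -s 06:00.0 | grep -Ei -A1 'region|memory at'

 # Look for PCI BAR assignment errors and NVRM messages logged at boot
 sudo dmesg | grep -iE 'bar|nvrm'

If BAR0 really is unassigned, that would suggest the failure happens below the driver (PCI resource assignment by the BIOS or hypervisor) rather than in the GPU Operator or the module build itself.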

To Reproduce

I was able to reproduce this consistently on my Kubernetes cluster running version 1.31.1.

Expected behavior

The NVIDIA GPU Operator should deploy all of its pods successfully.

Environment (please provide the following information):

  • GPU Operator Version: v25.3.0
  • OS: Ubuntu 20.04.6 LTS
  • Kernel Version: 5.4.0-205-generic
  • Container Runtime Version: containerd 1.7.24
  • Kubernetes Distro and Version: Kubernetes 1.31.1

Information to attach (optional if deemed irrelevant)

Note: I got the NVIDIA GPU Operator working on this node with kernel version 5.4.0-193-generic.
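
Since the same node works on 5.4.0-193-generic, it may help to compare which kernels and matching header packages are actually installed (a sketch, assuming an Ubuntu/dpkg-based node):

 # Compare the running kernel against the installed kernel and header packages
 uname -r
 dpkg -l | grep -E 'linux-(image|headers|modules)-5\.4\.0-(193|205)'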

 k get pods -o wide                         
NAME                                                              READY   STATUS     RESTARTS        AGE   IP             NODE                   NOMINATED NODE   READINESS GATES
gpu-feature-discovery-f5vbn                                       0/1     Init:0/1   0               31m   <none>         pooler-1744029339851   <none>           <none>
gpu-operator-1744182548-node-feature-discovery-gc-65c4c7cfpm6px   1/1     Running    0               33m   10.20.98.211   pooler-1744029339851   <none>           <none>
gpu-operator-1744182548-node-feature-discovery-master-9f98ch7x2   1/1     Running    0               33m   10.20.98.210   pooler-1744029339851   <none>           <none>
gpu-operator-1744182548-node-feature-discovery-worker-tf78s       1/1     Running    0               33m   10.20.98.209   pooler-1744029339851   <none>           <none>
gpu-operator-69ffb4fcb7-88zc7                                     1/1     Running    0               33m   10.20.98.212   pooler-1744029339851   <none>           <none>
nvidia-container-toolkit-daemonset-8r5h7                          0/1     Init:0/1   0               31m   10.20.98.215   pooler-1744029339851   <none>           <none>
nvidia-dcgm-exporter-nklsl                                        0/1     Init:0/1   0               31m   <none>         pooler-1744029339851   <none>           <none>
nvidia-device-plugin-daemonset-b5fx7                              0/1     Init:0/1   0               31m   <none>         pooler-1744029339851   <none>           <none>
nvidia-driver-daemonset-zpg9q                                     0/1     Running    8 (6m32s ago)   32m   10.20.98.214   pooler-1744029339851   <none>           <none>
nvidia-operator-validator-wc7gz                                   0/1     Init:0/4   0               31m   <none>         pooler-1744029339851   <none>           <none>

@Devin-Yue

Could you check whether the nouveau driver is blacklisted or not?
Follow this guide to blacklist it if needed:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html#troubleshooting-the-nvidia-gpu-operator
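
On Ubuntu the blacklist step usually boils down to something like this (a sketch; the file name is just a convention):

 # Keep the in-tree nouveau driver from binding to the GPU
 printf 'blacklist nouveau\noptions nouveau modeset=0\n' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf

 # Rebuild the initramfs so the blacklist takes effect at early boot, then reboot the node
 sudo update-initramfs -u
 sudo reboot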

@rajendragosavi

Hi @Devin-Yue, I checked on my machine and the nouveau module is not installed. I tried the steps mentioned in the doc above but could not resolve the issue.
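
For anyone hitting the same thing, the checks can be done along these lines (a sketch; the pod name is from the listing above, and `k` is an alias for kubectl):

 # Confirm nouveau is neither loaded nor even present on the node
 lsmod | grep -i nouveau
 modinfo nouveau

 # Dump the installer log referenced by the error message, from inside the driver pod
 k exec nvidia-driver-daemonset-zpg9q -- cat /var/log/nvidia-installer.log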
