Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
Could not install the NVIDIA GPU Operator on my Kubernetes cluster. The nvidia-driver-daemonset pod is crashing with the following error:
k logs -f nvidia-driver-daemonset-zpg9q
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-570.124.06
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 570.124.06........
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 570.124.06 for Linux kernel version 5.4.0-205-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-205-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
/usr/src/nvidia-570.124.06/kernel-open/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section
Relinking NVIDIA driver kernel modules...
Building NVIDIA driver package nvidia-modules-5.4.0-205...
Installing NVIDIA driver kernel modules...
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
ERROR: Unable to open 'kernel-open/dkms.conf' for copying (No such file or directory)
Welcome to the NVIDIA Software Installer for Unix/Linux
Detected 48 CPUs online; setting concurrency level to 32.
Unable to locate any tools for listing initramfs contents.
Unable to scan initramfs: no tool found
Installing NVIDIA driver version 570.124.06.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.4.0-205-generic/build'
Kernel output path: '/lib/modules/5.4.0-205-generic/build'
Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules:
[##############################] 100%
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Kernel module compilation complete.
Kernel module load error: No such device
Kernel messages:
[153192.646880] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153192.646919] IPv6: ADDRCONF(NETDEV_CHANGE): cali51406785c39: link becomes ready
[153193.805194] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153193.805230] IPv6: ADDRCONF(NETDEV_CHANGE): caliab4854e8ff6: link becomes ready
[153195.286443] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153195.286479] IPv6: ADDRCONF(NETDEV_CHANGE): cali229c0d72a0f: link becomes ready
[153243.698480] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153243.698514] IPv6: ADDRCONF(NETDEV_CHANGE): cali79281f20159: link becomes ready
[153244.672451] IPv6: ADDRCONF(NETDEV_CHANGE): califcbe515c63b: link becomes ready
[153298.490088] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[153298.490128] IPv6: ADDRCONF(NETDEV_CHANGE): cali32eddb18bf0: link becomes ready
[153450.269906] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[153450.270849] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[153450.270852] nvidia: probe of 0000:06:00.0 failed with error -1
[153450.270871] NVRM: The NVIDIA probe routine failed for 1 device(s).
[153450.270871] NVRM: None of the NVIDIA devices were initialized.
[153450.308728] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
[153530.543373] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[153530.544316] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:06:00.0)
[153530.544319] nvidia: probe of 0000:06:00.0 failed with error -1
[153530.544341] NVRM: The NVIDIA probe routine failed for 1 device(s).
[153530.544341] NVRM: None of the NVIDIA devices were initialized.
[153530.596847] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
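The kernel messages above point at the actual failure: the GPU at PCI address 0000:06:00.0 reports BAR0 as 0M @ 0x0, so the module builds fine but cannot bind to the device. For reference, a rough way to inspect the BAR assignment from the node (the PCI address is taken from the log above; treat these commands as a sketch, not verified output):

lspci -vv -s 06:00.0 | grep -i "Region 0"   # Region 0 corresponds to BAR0; a zero-sized or [virtual] region means the BIOS/host never assigned it
dmesg | grep -i "0000:06:00.0"              # boot-time messages about resource assignment for this device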
To Reproduce
I can reproduce this consistently on my Kubernetes cluster running version 1.31.1.
Expected behavior
The NVIDIA GPU Operator should deploy all of its pods successfully.
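For reference, a quick way to check that everything came up (assuming the operator is installed into the gpu-operator namespace; adjust if yours differs):

kubectl get pods -n gpu-operator                                 # all pods should be Running or Completed
kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset   # driver install log should finish without errors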
Environment (please provide the following information):
GPU Operator Version: v25.3.0
OS: Ubuntu 20.04.6 LTS
Kernel Version: 5.4.0-205-generic
Container Runtime Version: containerd 1.7.24
Kubernetes Distro and Version: Kubernetes 1.31.1
Information to attach (optional if deemed irrelevant)
Note: The NVIDIA GPU Operator works on this same cluster with kernel version 5.4.0-193-generic.
Hi @Devin-Yue, I checked on my machine and the nouveau module is not loaded. I tried the steps mentioned in the doc above but could not resolve the issue.
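For anyone hitting the same thing, a typical way to confirm nouveau is not loaded on the node is roughly (file locations are the usual defaults and may differ on your distro):

lsmod | grep -i nouveau                        # no output means the module is not loaded
grep -r "blacklist nouveau" /etc/modprobe.d/   # shows whether nouveau is blacklisted via modprobe config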