Install On Air-Gapped OKD 4.15.0-0 FCOS #1449

Open

jvincze84 opened this issue May 19, 2025 · 0 comments

jvincze84 commented May 19, 2025

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

We are trying to install the operator on OKD, but we get this error:

{"level":"error","ts":"2025-05-19T13:49:04Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"a535d3ea-ebc7-4c22-9e62-7c372c6814c0","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 39.20240210.3.0: ERROR: failed to get destination directory for custom repo config: distribution not supported"}

We have an air-gapped environment, so we are trying to use the repoConfig option:

    repoConfig:
      configMapName: repo-config

But we noticed that "fedora" is missing from RepoConfigPathMap:

// RepoConfigPathMap indicates standard OS specific paths for repository configuration files
var RepoConfigPathMap = map[string]string{
	"centos": "/etc/yum.repos.d",
	"ubuntu": "/etc/apt/sources.list.d",
	"rhcos":  "/etc/yum.repos.d",
	"rhel":   "/etc/yum.repos.d",
}
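
If Fedora is meant to be supported here, a minimal fix might be one more entry in this map. This is only our suggestion, assuming Fedora CoreOS, like RHCOS, keeps dnf/yum repo definitions under /etc/yum.repos.d:

// Hypothetical extension of the map above; the "fedora" entry is our assumption,
// not the operator's actual code.
var RepoConfigPathMap = map[string]string{
	"centos": "/etc/yum.repos.d",
	"ubuntu": "/etc/apt/sources.list.d",
	"rhcos":  "/etc/yum.repos.d",
	"rhel":   "/etc/yum.repos.d",
	"fedora": "/etc/yum.repos.d",
}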

Details:

[gpu-operator@gpu-operator-6ffdc677f6-92828 /]$ cat /host-etc/os-release
NAME="Fedora Linux"
VERSION="39.20240210.3.0 (CoreOS)"
ID=fedora
VERSION_ID=39
VERSION_CODENAME=""
PLATFORM_ID="platform:f39"
PRETTY_NAME="Fedora CoreOS 39.20240210.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:39"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=39
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=39
SUPPORT_END=2024-11-12
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='39.20240210.3.0'

The os-release ID is fedora.

Is this a bug, or intentional? Is OKD on Fedora CoreOS supported?
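
Our reading is that the operator resolves the destination directory by looking up the os-release ID in this map, and fails when the ID is missing. A rough, purely illustrative Go sketch of that kind of lookup (the function and variable names here are ours, not taken from the operator source):

package main

import (
	"errors"
	"fmt"
)

// RepoConfigPathMap mirrors the map quoted above.
var RepoConfigPathMap = map[string]string{
	"centos": "/etc/yum.repos.d",
	"ubuntu": "/etc/apt/sources.list.d",
	"rhcos":  "/etc/yum.repos.d",
	"rhel":   "/etc/yum.repos.d",
}

// getRepoConfigPath is an illustrative stand-in for the failing lookup.
func getRepoConfigPath(osID string) (string, error) {
	if path, ok := RepoConfigPathMap[osID]; ok {
		return path, nil
	}
	return "", errors.New("distribution not supported")
}

func main() {
	// With ID=fedora from /etc/os-release, the lookup misses and we get the same
	// message that appears in the reconciler error above.
	if _, err := getRepoConfigPath("fedora"); err != nil {
		fmt.Println(err) // distribution not supported
	}
}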

To Reproduce

Install the operator in an air-gapped environment with a custom repo config.

Expected behavior

Successful install on air-gapped OKD.

Environment (please provide the following information):

  • GPU Operator Version: 25.3.0
  • OS: Fedora CoreOS 39.20240210.3.0
  • Kernel Version: 6.7.4-200.fc39.x86_64
  • Container Runtime Version: v1.28.7+6e2789b (crio)
  • Kubernetes Distro and Version: OKD - Cluster version is 4.15.0-0.okd-2024-03-10-010116

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
NAME                                                     READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-4xc4h                              0/1     Init:0/1           0                4m4s
gpu-operator-6ffdc677f6-92828                            1/1     Running            0                53m
nvidia-container-toolkit-daemonset-mn5mj                 0/1     Init:0/1           0                4m5s
nvidia-dcgm-exporter-86njk                               0/1     Init:0/1           0                4m5s
nvidia-dcgm-qbjfc                                        0/1     Init:0/1           0                4m5s
nvidia-device-plugin-daemonset-cv24p                     0/1     Init:0/1           0                4m5s
nvidia-driver-daemonset-39.20240210.3.0-88f4n            1/2     CrashLoopBackOff   24 (4m11s ago)   108m
nvidia-driver-daemonset-392024021030-88f4n-debug-mrq2c   2/2     Running            0                4m37s
nvidia-node-status-exporter-8kk4n                        1/1     Running            0                108m
nvidia-operator-validator-mvr4n                          0/1     Init:0/4           0                4m5s
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                   AGE
gpu-feature-discovery                     1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                3d3h
nvidia-container-toolkit-daemonset        1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                                                                    3d3h
nvidia-dcgm                               1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                                                                                 3d3h
nvidia-dcgm-exporter                      1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                        3d3h
nvidia-device-plugin-daemonset            1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                                                                        3d3h
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                            3d3h
nvidia-driver-daemonset-39.20240210.3.0   1         1         0       1            0           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=39.20240210.3.0,nvidia.com/gpu.deploy.driver=true   3d3h
nvidia-mig-manager                        0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                          3d3h
nvidia-node-status-exporter               1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                 3d3h
nvidia-operator-validator                 1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                                                                   3d3h
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

Thanks a lot.
