8*4090, bash example/24B/run.sh OOM #84


Closed
walt008 opened this issue May 29, 2025 · 2 comments


walt008 commented May 29, 2025

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 1 has a total capacity of 23.64 GiB of which 168.50 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 259.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
All 8 GPUs hit the same error.

[screenshot of the OOM traceback]

The system has 8 RTX 4090 GPUs, all idle. Using the default 24B_distill_quant_config.json configuration file, with NVIDIA driver 550, CUDA 12.4, and Ubuntu 22.04 x64, all 8 GPUs hit an out-of-memory (OOM) error after running the inference script. What could be the possible reasons?
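As a quick sanity check before re-running, it is worth confirming how much memory is actually free on each card (the traceback above reports only 168.50 MiB free on GPU 1). A minimal sketch using plain PyTorch, assuming a version recent enough to have torch.cuda.mem_get_info:

import torch

# Print free vs. total memory for every visible GPU (values in GiB).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # both values are in bytes
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

With nothing else running, each 4090 should report close to its full 23.64 GiB.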

run.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_ALGO=^NVLS
export PAD_HQ=1
export PAD_DURATION=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export OFFLOAD_T5_CACHE=true
export OFFLOAD_VAE_CACHE=true
export TORCH_CUDA_ARCH_LIST="8.9;9.0"
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
DISTRIBUTED_ARGS="
--rdzv-backend=c10d
--rdzv-endpoint=localhost:6009
--nnodes=1
--nproc_per_node=$GPUS_PER_NODE
"
MAGI_ROOT=$(git rev-parse --show-toplevel)
LOG_DIR=log_$(date "+%Y-%m-%d_%H:%M:%S").log
export PYTHONPATH="$MAGI_ROOT:$PYTHONPATH"
# prompt translation: "surrounded by futuristic light effects, 360-degree rotating showcase"
torchrun $DISTRIBUTED_ARGS inference/pipeline/entry.py \
    --config_file example/24B/24B_distill_quant_config.json \
    --mode i2v \
    --prompt "科技感光效环绕,360度旋转展示" \
    --image_path example/assets/11.jpg \
    --output_path example/assets/output_i2v.mp4 \
    2>&1 | tee $LOG_DIR

levi131 (Collaborator) commented May 29, 2025

Thank you for your attention to our work. The default config is for 8 H100 cards. On 8 4090 cards, please modify the following configurations:
[screenshot: recommended changes to 24B_distill_quant_config.json for 8×4090]
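For anyone who cannot view the screenshot: the recommended changes go into example/24B/24B_distill_quant_config.json. A minimal sketch of applying such an edit as a separate copy of the config, where example_memory_option is a placeholder name rather than a real field in the schema (use the exact keys and values from the screenshot):

import json

# Minimal sketch: copy the stock config and override the entries recommended
# for 24 GB cards. "example_memory_option" is a placeholder, not a real field.
src = "example/24B/24B_distill_quant_config.json"
dst = "example/24B/24B_distill_quant_config_4090.json"

with open(src) as f:
    cfg = json.load(f)

cfg["example_memory_option"] = "value_from_screenshot"  # placeholder only

with open(dst, "w") as f:
    json.dump(cfg, f, indent=2)

Then point --config_file in run.sh at the new file.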

walt008 (Author) commented May 30, 2025

Thanks, it works!

walt008 closed this as completed Jun 4, 2025