8*4090, bash example/24B/run.sh OOM #84


Closed
walt008 opened this issue May 29, 2025 · 2 comments


walt008 commented May 29, 2025

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 1 has a total capacity of 23.64 GiB of which 168.50 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 259.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
All 8 GPUs hit the same error.

[screenshot of the OOM traceback]

The system has 8 RTX 4090 GPUs, all idle. Using the default 24B_distill_quant_config.json configuration file, with NVIDIA driver 550, CUDA 12.4, and Ubuntu 22.04 x64, all 8 GPUs hit an out-of-memory (OOM) error after running the inference script. What could be the possible reasons?
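As a quick sanity check before re-running, it is worth confirming how much memory is actually free on each card (the traceback above reports only 168.50 MiB free on GPU 1). A minimal sketch using plain PyTorch, assuming a version recent enough to have torch.cuda.mem_get_info:

import torch

# Print free vs. total memory for every visible GPU (values in GiB).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # both values are in bytes
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

With nothing else running, each 4090 should report close to its full 23.64 GiB.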

run.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_ALGO=^NVLS
export PAD_HQ=1
export PAD_DURATION=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export OFFLOAD_T5_CACHE=true
export OFFLOAD_VAE_CACHE=true
export TORCH_CUDA_ARCH_LIST="8.9;9.0"
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
DISTRIBUTED_ARGS="
--rdzv-backend=c10d
--rdzv-endpoint=localhost:6009
--nnodes=1
--nproc_per_node=$GPUS_PER_NODE
"
MAGI_ROOT=$(git rev-parse --show-toplevel)
LOG_DIR=log_$(date "+%Y-%m-%d_%H:%M:%S").log
export PYTHONPATH="$MAGI_ROOT:$PYTHONPATH"
# prompt translation: "surrounded by futuristic light effects, 360-degree rotating showcase"
torchrun $DISTRIBUTED_ARGS inference/pipeline/entry.py \
    --config_file example/24B/24B_distill_quant_config.json \
    --mode i2v \
    --prompt "科技感光效环绕,360度旋转展示" \
    --image_path example/assets/11.jpg \
    --output_path example/assets/output_i2v.mp4 \
    2>&1 | tee $LOG_DIR

levi131 (Collaborator) commented May 29, 2025

Thank you for your attention to our work. The default config is for 8 H100 cards. On 8 4090 cards, please modify the following configurations:
[screenshot: recommended changes to 24B_distill_quant_config.json for 8×4090]
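For anyone who cannot view the screenshot: the recommended changes go into example/24B/24B_distill_quant_config.json. A minimal sketch of applying such an edit as a separate copy of the config, where example_memory_option is a placeholder name rather than a real field in the schema (use the exact keys and values from the screenshot):

import json

# Minimal sketch: copy the stock config and override the entries recommended
# for 24 GB cards. "example_memory_option" is a placeholder, not a real field.
src = "example/24B/24B_distill_quant_config.json"
dst = "example/24B/24B_distill_quant_config_4090.json"

with open(src) as f:
    cfg = json.load(f)

cfg["example_memory_option"] = "value_from_screenshot"  # placeholder only

with open(dst, "w") as f:
    json.dump(cfg, f, indent=2)

Then point --config_file in run.sh at the new file.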

walt008 (Author) commented May 30, 2025

Thanks, it works!

walt008 closed this as completed Jun 4, 2025