torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 1 has a total capacity of 23.64 GiB of which 168.50 MiB is free. Including non-PyTorch memory, this process has 23.47 GiB memory in use. Of the allocated memory 22.64 GiB is allocated by PyTorch, and 259.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
All 8 GPUs hit this error.
The system has 8 idle RTX 4090 GPUs. Using the default 24B_distill_quant_config.json configuration file, NVIDIA driver 550, CUDA 12.4, and Ubuntu 22.04 x86_64, the inference script hits an out-of-memory (OOM) error on all 8 GPUs. What are the possible causes?
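For completeness, this is how I confirmed the cards were actually idle before launching (a quick local helper using torch.cuda.mem_get_info; not part of the repo):

check_gpu_mem.py
# Sanity check: print free/total memory for every visible GPU before launch.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values are in bytes
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

Running it with the same CUDA_VISIBLE_DEVICES as the launch shows whether anything is already resident on the cards, consistent with what nvidia-smi reports.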
run.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_ALGO=^NVLS
export PAD_HQ=1
export PAD_DURATION=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export OFFLOAD_T5_CACHE=true
export OFFLOAD_VAE_CACHE=true
export TORCH_CUDA_ARCH_LIST="8.9;9.0"
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
DISTRIBUTED_ARGS="
--rdzv-backend=c10d
--rdzv-endpoint=localhost:6009
--nnodes=1
--nproc_per_node=$GPUS_PER_NODE
"
MAGI_ROOT=$(git rev-parse --show-toplevel)
LOG_DIR=log_$(date "+%Y-%m-%d_%H:%M:%S").log
export PYTHONPATH="$MAGI_ROOT:$PYTHONPATH"
# prompt translates roughly to "surrounded by tech-style light effects, 360-degree rotating showcase"
torchrun $DISTRIBUTED_ARGS inference/pipeline/entry.py \
    --config_file example/24B/24B_distill_quant_config.json \
    --mode i2v \
    --prompt "科技感光效环绕,360度旋转展示" \
    --image_path example/assets/11.jpg \
    --output_path example/assets/output_i2v.mp4 \
    2>&1 | tee $LOG_DIR
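If it helps narrow things down, a per-rank memory printout around the failing step would show where the ~22.6 GiB per GPU is going (weights vs. activations vs. cached segments). A minimal sketch; where to call it inside inference/pipeline/entry.py is an assumption on my part, not repo code:

mem_report.py
# Hypothetical per-rank memory report; the call sites inside entry.py are assumed.
import os
import torch

def report_memory(tag: str) -> None:
    rank = int(os.environ.get("RANK", "0"))
    dev = torch.cuda.current_device()
    alloc = torch.cuda.memory_allocated(dev) / 2**30
    reserved = torch.cuda.memory_reserved(dev) / 2**30
    print(f"[rank{rank}] {tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB", flush=True)

# e.g. report_memory("after model load"), report_memory("before first denoise step")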