8 x 4090 run MAGI-1-24B-distill+fp8_quant out of memory #83
I changed 24B_distill_quant_config.json as follows, reducing video_size_h, video_size_w, and num_frames:
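Purely for illustration, a reduced setting might look like the sketch below. The key names video_size_h, video_size_w, and num_frames are the ones mentioned in this thread; the values are placeholders rather than the poster's actual edits, and their position inside 24B_distill_quant_config.json is assumed, so match it against your own copy of the file:

```json
{
  "video_size_h": 480,
  "video_size_w": 480,
  "num_frames": 48
}
```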
Me too.
Thank you for your attention to our work. The default config is for 8 H100 cards. On 8 4090 cards, please modify the following configurations:
This log shows a shape-mismatch error when loading the model weights. The 1536 from the checkpoint is correct, and the 7680 from the model is wrong. My guess is that you modified the model_config incorrectly, or some other code change caused this error.
Wondering if I can run it on 4x RTX 3090; I can also add 2 more GPUs, for 6x RTX 3090. Or are strictly 8 cards needed, with a 4090 as the minimum?
Memory is the major limiting factor. The 3090, like the 4090, has only 24GB of memory per card, so it also requires at least pp_size=2 to run the 24B model. The number of cards is not strictly limited to 8, but if you use 6 cards (cp_size=3, pp_size=2), it may be necessary to reduce the size of some input images.
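For concreteness, the 6-card split described above could be sketched in the config as below. The parameter names cp_size and pp_size are taken from this reply, and the values give 3 × 2 = 6 GPUs in total; where these keys actually live inside the config file is an assumption to be checked against your own copy:

```json
{
  "cp_size": 3,
  "pp_size": 2
}
```

With pp_size=2, the 24B weights are presumably split across two pipeline stages so that each card stays under its 24GB budget.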
Thanks for your work, but when I try to use 8x RTX 4090 to run MAGI-1-24B-distill+fp8_quant, every GPU gets an error like this: "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 7 has a total capacity of 23.64 GiB of which 71.62 MiB is free. Including non-PyTorch memory, this process has 23.57 GiB memory in use. Of the allocated memory 22.97 GiB is allocated by PyTorch, and 15.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
Waiting for your reply.
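As a side note, the error message itself points at one low-cost experiment: launching with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can reduce allocator fragmentation. That may recover a small margin, but it will not help if the per-card footprint genuinely exceeds 24GB, in which case the configuration changes discussed earlier in the thread are still needed.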