8 x 4090 run MAGI-1-24B-distill+fp8_quant out of memory #83

Open · PMPBinZhang opened this issue May 29, 2025 · 6 comments

@PMPBinZhang commented May 29, 2025

Thanks for your work, but when I try to use RTX 4090 × 8 to run MAGI-1-24B-distill+fp8_quant, every GPU gets an error like this:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 7 has a total capacity of 23.64 GiB of which 71.62 MiB is free. Including non-PyTorch memory, this process has 23.57 GiB memory in use. Of the allocated memory 22.97 GiB is allocated by PyTorch, and 15.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Waiting for your reply.
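For reference, the allocator hint in the error message can be applied like this before torch initializes CUDA; note it only mitigates fragmentation and cannot recover the ~24 GB capacity limit (a minimal sketch):

```python
# Minimal sketch: apply the allocator hint from the OOM message.
# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator
# initializes, so set it before importing torch (or export it in
# the shell that launches the inference script).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator config is in place

print(torch.cuda.is_available())
```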

@PMPBinZhang (Author)

I changed 24B_distill_quant_config.json as follows, reducing video_size_h, video_size_w, and num_frames:

```json
"clean_chunk_kvrange": 1,
"clean_t": 0.9999,
"seed": 1234,
"num_frames": 64,
"video_size_h": 480,
"video_size_w": 720,
"num_steps": 16,
"window_size": 4,
"fps": 24,
"chunk_width": 6,
"load": "./downloads/24B_distill_quant",
"t5_pretrained": "./downloads/t5_pretrained",
"t5_device": "cuda",
"vae_pretrained": "./downloads/vae",
"scale_factor": 0.18215,
"temporal_downsample_factor": 4
},
```
I got the error, partially as follows:

```
Traceback (most recent call last):
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank4]: main()
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank4]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/pipeline/pipeline.py", line 39, in run_image_to_video
[rank4]: self._run(prompt, prefix_video, output_path)
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/pipeline/pipeline.py", line 47, in _run
[rank4]: dit = get_dit(self.config)
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank4]: model = load_checkpoint(model)
[rank4]: File "/home/fusion/work/gen_video/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 163, in load_checkpoint
[rank4]: missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False, assign=True)
[rank4]: File "/home/fusion/anaconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank4]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank4]: RuntimeError: Error(s) in loading state_dict for VideoDiTModel:
[rank4]: size mismatch for t_embedder.mlp.0.weight: copying a param with shape torch.Size([1536, 256]) from checkpoint, the shape in current model is torch.Size([7680, 256]).
[rank4]: size mismatch for t_embedder.mlp.0.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([7680]).
[rank4]: size mismatch for t_embedder.mlp.2.weight: copying a param with shape torch.Size([1536, 1536]) from checkpoint, the shape in current model is torch.Size([7680, 7680]).
[rank4]: size mismatch for t_embedder.mlp.2.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([7680]).
[rank4]: size mismatch for y_embedder.y_proj_adaln.0.weight: copying a param with shape torch.Size([1536, 4096]) from checkpoint, the shape in current model is torch.Size([7680, 4096]).
[rank4]: size mismatch for y_embedder.y_proj_adaln.0.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([7680]).
[rank4]: size mismatch for videodit_blocks.layers.0.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.1.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.2.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.3.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.4.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.5.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current model is torch.Size([12288, 7680]).
[rank4]: size mismatch for videodit_blocks.layers.6.ada_modulate_layer.proj.0.weight: copying a param with shape torch.Size([12288, 1536]) from checkpoint, the shape in current mo
```

@walt008 commented May 29, 2025

Me too

@levi131 (Collaborator) commented May 29, 2025

Thank you for your attention to our work. The default config is for 8 × H100 cards. On 8 × 4090 cards, please modify the following configurations:
[Image: screenshot of the suggested configuration changes]


@levi131 (Collaborator) commented May 29, 2025

> I changed 24B_distill_quant_config.json as follows […] RuntimeError: Error(s) in loading state_dict for VideoDiTModel: size mismatch for t_embedder.mlp.0.weight […]

This log shows that a shape mismatch occurred while loading the model weights. The 1536 from the checkpoint is correct; the 7680 in the current model is wrong. I suspect you modified the model_config incorrectly, or some code change caused this error.
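To see exactly which parameters disagree, a minimal diagnostic sketch like the one below can help (assuming the checkpoint loads as a plain state_dict via torch.load; the helper name and loading details are hypothetical, not the repo's API):

```python
import torch
from torch import nn

def report_shape_mismatches(model: nn.Module, ckpt_path: str) -> None:
    """Print every parameter whose checkpoint shape disagrees with the model."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    for name, param in model.state_dict().items():
        if name in state_dict and tuple(state_dict[name].shape) != tuple(param.shape):
            print(f"{name}: checkpoint {tuple(state_dict[name].shape)} "
                  f"!= model {tuple(param.shape)}")
```

Every line it prints should point back at the config field (e.g. a hidden size) that produced 7680 where the checkpoint expects 1536.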

@Issues-maker

Wondering if I can run it on 4 × RTX 3090? I could also add 2 more GPUs, for 6 × RTX 3090. Or are 8 cards strictly needed, with the 4090 as the minimum?

@levi131 (Collaborator) commented May 29, 2025

> Wondering if I can run it on 4 × RTX 3090? I could also add 2 more GPUs, for 6 × RTX 3090. Or are 8 cards strictly needed, with the 4090 as the minimum?

Memory is the major limiting factor. The 3090, like the 4090, has only 24 GB of memory per card, so it also requires at least pp_size=2 to run the 24B model. The number of cards is not strictly limited to 8, but with 6 cards (cp_size=3, pp_size=2) it may be necessary to reduce the size of some input images.
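To make the card-count arithmetic explicit, here is a tiny illustrative sketch (the helper is hypothetical; the pp_size >= 2 floor just encodes the 24 GB rule of thumb above, and actual memory use also depends on resolution and frame count):

```python
# Hypothetical helper: enumerate (cp_size, pp_size) layouts for a given
# GPU count, requiring cp_size * pp_size == num_gpus and, per the rule
# of thumb for 24 GB cards and the 24B model, pp_size >= 2.
def candidate_layouts(num_gpus: int, min_pp_size: int = 2) -> list[tuple[int, int]]:
    """Return (cp_size, pp_size) pairs whose product equals num_gpus."""
    return [
        (num_gpus // pp, pp)
        for pp in range(min_pp_size, num_gpus + 1)
        if num_gpus % pp == 0
    ]

print(candidate_layouts(6))  # [(3, 2), (2, 3), (1, 6)]
print(candidate_layouts(8))  # [(4, 2), (2, 4), (1, 8)]
```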
