How to run distributed training with Omni? #210

Open
zlf0307 opened this issue Apr 8, 2025 · 0 comments
zlf0307 commented Apr 8, 2025

I am running multi-GPU distributed training with the following command:
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run \
    main.py \
    --data_root ./text_spotting_datasets/ \
    --output_folder ./output/pretrain/stage1/ \
    --train_dataset totaltext_train mlt_train ic13_train ic15_train syntext1_train syntext2_train \
    --lr 0.0005 \
    --max_steps 400000 \
    --warmup_steps 5000 \
    --checkpoint_freq 10000 \
    --batch_size 6 \
    --tfm_pre_norm \
    --train_max_size 768 \
    --rec_loss_weight 2 \
    --use_fpn \
    --use_char_window_prompt
However, only GPU 5 is actually training; GPU 6 shows no memory usage at all.
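
A likely cause, though not confirmed in this thread: torch.distributed.run defaults --nproc_per_node to 1, so only a single worker process is launched and it lands on the first visible GPU. Below is a minimal sketch of the same command with the worker count set explicitly, assuming main.py initializes its process group from the environment variables the launcher provides (RANK, LOCAL_RANK, WORLD_SIZE):

# Launch one worker process per visible GPU (2 here); the launcher
# assigns each process its own RANK / LOCAL_RANK / WORLD_SIZE.
CUDA_VISIBLE_DEVICES=5,6 python -m torch.distributed.run --nproc_per_node=2 \
    main.py \
    --data_root ./text_spotting_datasets/ \
    --output_folder ./output/pretrain/stage1/ \
    --train_dataset totaltext_train mlt_train ic13_train ic15_train syntext1_train syntext2_train \
    --lr 0.0005 \
    --max_steps 400000 \
    --warmup_steps 5000 \
    --checkpoint_freq 10000 \
    --batch_size 6 \
    --tfm_pre_norm \
    --train_max_size 768 \
    --rec_loss_weight 2 \
    --use_fpn \
    --use_char_window_prompt

If main.py already wraps the model in DistributedDataParallel and places it on the device given by LOCAL_RANK, both GPUs should then show memory usage; if the script only supports single-process training, the second GPU will stay idle regardless of how the job is launched.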
