Does not run in Pytorch 1.3.1 #14

Open
@michaelklachko

Description

I just cloned your repo, and when I launch the command:

CUDA_VISIBLE_DEVICES=2,3,4,5 python imagenet.py -a mobilenetv2 -d /path/to/dataset/ImageNet2012/ --epochs 150 --lr-decay cos --lr 0.05 --wd 4e-5 -c checkpoints --width-mult 1 --input-size 224 -j 12

It gets stuck at this point:

=> creating model 'mobilenetv2'

Epoch: [1 | 150]
Processing

<Ctrl+C pressed after 10 min of nothing happening:>

^CTraceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 224, in main
    train_loss, train_acc = train(train_loader, train_loader_len, model, criterion, optimizer, epoch)
  File "imagenet.py", line 271, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/michael/mobilenetv2.pytorch/utils/dataloaders.py", line 190, in prefetched_loader
    for next_input, next_target in loader:
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
    success, data = self._try_get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
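
One way to see what every thread in the main process is doing during a stall like this, without having to press Ctrl+C, might be Python's standard `faulthandler` module. The sketch below is not part of the repo: `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, and the 60-second timeout is an arbitrary choice. It only reports threads of the process that arms it, so DataLoader worker processes would not be covered.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=60.0, log_file=None):
    """Dump the stack of every thread in this process each time
    `timeout_s` seconds pass, until the watchdog is disarmed.

    Hypothetical helper, not part of the repo. `log_file` must be an
    open file with a real file descriptor and must stay open while the
    watchdog is armed; it defaults to sys.stderr.
    """
    if log_file is None:
        log_file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training is making progress."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around a training loop:
#   arm_hang_watchdog(60.0)
#   ... run the epoch ...
#   disarm_hang_watchdog()
```

If the dumps keep showing the main thread waiting on the loader's data queue, as in the traceback above, that would suggest the stall is on the data-loading side rather than in the model or the GPUs.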

Nothing is happening at this point. nvidia-smi shows that a single GPU consumes ~500M of memory, and CPU cores are ~60% busy, but it's not clear what they are doing. I waited for 10 minutes before aborting. I also tried it on a single GPU, with the same result.
If I switch to --data-backend dali-cpu (using nvidia-dali version 0.16) it fails with the following error:

=> creating model 'mobilenetv2'
Traceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 194, in main
    train_loader, train_loader_len = get_train_loader(args.data, args.batch_size, workers=args.workers, input_size=args.input_size)
TypeError: gdtl() got an unexpected keyword argument 'input_size'
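
The message says that the function actually being called, `gdtl`, has no `input_size` parameter. The sketch below reproduces that kind of mismatch with a made-up stand-in and shows one way to check a function's accepted keywords before calling it. `make_loader_factory`, the body of `gdtl`, and `accepts_keyword` are all hypothetical; the repo's real DALI loader code is not shown in this report and may look different.

```python
import inspect


def make_loader_factory():
    """Hypothetical stand-in for a factory that returns a loader function."""
    def gdtl(data_path, batch_size, workers=5):
        # Note: there is no `input_size` parameter here.
        return ("loader", data_path, batch_size, workers)
    return gdtl


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


get_train_loader = make_loader_factory()

kwargs = {"workers": 12, "input_size": 224}
# Keep only the keywords the target function declares.
supported = {k: v for k, v in kwargs.items() if accepts_keyword(get_train_loader, k)}
loader = get_train_loader("/path/to/dataset", 256, **supported)
```

Silently dropping `input_size` like this only hides the mismatch; it is shown here to locate the problem, not as a fix.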

I'm using Pytorch 1.3.1 with 4x Titan Xp cards. The only thing I had to change in your code is to replace cuda(async=True) with cuda(non_blocking=True). Changing to non_blocking=False does not help.
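
For context on why that replacement was needed: `async` is a reserved keyword from Python 3.7 on (the version shown in the traceback paths), so `cuda(async=True)` no longer parses at all. The sketch below checks this using only the standard library; it compiles the two call forms as source text without running them, so PyTorch does not need to be installed. The helper name `compiles` and the variable names are made up for illustration.

```python
import keyword


def compiles(source):
    """Return True if `source` parses as valid Python on this interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# The two call forms from the report, as source text only (never executed).
old_call = "x = tensor.cuda(async=True)"
new_call = "x = tensor.cuda(non_blocking=True)"

async_is_reserved = keyword.iskeyword("async")
old_call_parses = compiles(old_call)
new_call_parses = compiles(new_call)
```

On Python 3.7 or later, `async_is_reserved` is true and only the `non_blocking` form parses.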

Can you please try cloning your repo to a clean Pytorch 1.3.1 environment and see if you can run it? Any idea what's going on?
