Model is not completely moved to a different device when using .cuda() or .cpu() #599
Labels: bug
Hi Wenjie,

1. System Info

I am using PyPOTS version 0.8.1.

2. Information

I was debugging an issue with StemGNN allocating too much VRAM on the `backward()` step, causing an OOM exception. To handle this I wanted to finish the training on CPU, as the problem only happens with some model configurations. When doing so, I couldn't run the training because something was left on the CUDA device. Debugging the training function, I discovered that the optimizer is not moved, as it is not part of the model's modules.

I solved the issue with a function I called `_move_optimizer_state_to_device`. I wrote it looking at the Adam wrapper I found in PyPOTS and its `state_dict`, so I'm not sure it works in every case.
3. Reproduction

Create a `StemGNN` model on `cuda:0`; move it to CPU; call `fit()`.

4. Expected behavior

The model correctly moves all of its parts to CPU and runs the `.fit()` function without raising exceptions.

I wanted to know what you think about this, and to ask whether I could test it on the latest PyPOTS version and create a PR, if you like this approach of course. In that case, I also wanted to ask you where to include this function. I noticed that `.cuda()` and `.cpu()` are methods of the PyTorch `nn.Module` class, so I think it would be cool to overload them with a PyPOTS version that calls PyTorch's implementation and then `_move_optimizer_state_to_device`. Also, I have tested this only on `StemGNN` so far, so it would be better to test it with other optimizers or models if we want to include it.

Looking forward to your kind response.
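To make the overloading idea concrete, here is a rough sketch against plain PyTorch (the class and attribute names here are hypothetical stand-ins, not actual PyPOTS code):

```python
import torch

class ModelWithOptimizer(torch.nn.Module):
    """Hypothetical stand-in for a PyPOTS model wrapper that owns its optimizer."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)
        self.optimizer = torch.optim.Adam(self.parameters())

    def forward(self, x):
        return self.layer(x)

    def _move_optimizer_state_to_device(self, device):
        # Move every tensor in the optimizer state (e.g. Adam's
        # exp_avg / exp_avg_sq buffers) to the target device.
        for state in self.optimizer.state.values():
            for key, value in state.items():
                if torch.is_tensor(value):
                    state[key] = value.to(device)

    def cpu(self):
        super().cpu()  # PyTorch moves parameters and buffers
        self._move_optimizer_state_to_device(torch.device("cpu"))
        return self

    def cuda(self, device=None):
        super().cuda(device)  # PyTorch moves parameters and buffers
        self._move_optimizer_state_to_device(
            torch.device("cuda") if device is None else torch.device(device)
        )
        return self
```

With this, user code keeps calling the familiar `.cpu()`/`.cuda()` and the optimizer state follows the parameters automatically.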
Best Regards,
Giacomo Guiduzzi