AssertionError: Loss is NaN or infinite at epoch 0, batch 712. Stopping training. #5

Open
@agoransson

I get it at different batches in the first epoch, not always the same one, but it seems to happen around 70-80% of the way through the first epoch.

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[48], line 1
----> 1 train_loop(model=model, 
      2            train_dataloader=train_dataloader,
      3            valid_dataloader=valid_dataloader,
      4            optimizer=optimizer, 
      5            loss_func=yolox_loss, 
      6            lr_scheduler=lr_scheduler, 
      7            device=torch.device(device), 
      8            epochs=epochs, 
      9            checkpoint_path=checkpoint_path,
     10            use_scaler=True)

Cell In[42], line 120, in train_loop(model, train_dataloader, valid_dataloader, optimizer, loss_func, lr_scheduler, device, epochs, checkpoint_path, use_scaler)
    117 # Loop over the epochs
    118 for epoch in tqdm(range(epochs), desc="Epochs"):
    119     # Run a training epoch and get the training loss
--> 120     train_loss = run_epoch(model, train_dataloader, optimizer, lr_scheduler, loss_func, device, scaler, epoch, is_training=True)
    121     # Run an evaluation epoch and get the validation loss
    122     with torch.no_grad():

Cell In[42], line 77, in run_epoch(model, dataloader, optimizer, lr_scheduler, loss_func, device, scaler, epoch_id, is_training)
     75         if math.isfinite(loss_item):
     76             print(finate_training_message)
---> 77         assert not math.isnan(loss_item) and math.isfinite(loss_item), stop_training_message
     79 # Cleanup and close the progress bar 
     80 progress_bar.close()

AssertionError: Loss is NaN or infinite at epoch 0, batch 712. Stopping training.
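For reference, this is roughly the kind of guard I'm thinking of adding to my copy of the training step to narrow the problem down (just a sketch with hypothetical names, not the actual run_epoch from the tutorial): it skips batches whose loss is already non-finite instead of asserting, and clips gradients in case they are exploding.

import math
import torch

def train_step(model, batch, optimizer, loss_func, device, max_grad_norm=1.0):
    # Move the batch to the target device
    images, targets = batch
    images, targets = images.to(device), targets.to(device)

    # Forward pass and loss computation
    outputs = model(images)
    loss = loss_func(outputs, targets)

    # Skip the batch entirely if the loss is already NaN/inf
    if not math.isfinite(loss.item()):
        print(f"Skipping batch with non-finite loss: {loss.item()}")
        optimizer.zero_grad()
        return None

    # Backward pass with gradient clipping to keep updates bounded
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

I'm also not sure whether use_scaler=True does anything sensible on MPS (the gradient scaler is really meant for CUDA mixed precision, as far as I understand), so another thing I plan to try is simply passing use_scaler=False to train_loop.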

I continue past this failure anyway, hoping to still get somewhere with the model, and eventually run the following code block:

# Ensure the model and input data are on the same device
print(device)
wrapped_model.to(device)
input_tensor = transforms.Compose([transforms.ToImage(), 
                                   transforms.ToDtype(torch.float32, scale=True)])(input_img)[None].to(device)

# Make a prediction with the model
with torch.no_grad():
    model_output = wrapped_model(input_tensor)

model_output.shape

I get the error below. It seems related to the model and the input not both being on the MPS device, but as far as I can see in the code, both are already moved to the device.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[74], line 9
      7 # Make a prediction with the model
      8 with torch.no_grad():
----> 9     model_output = wrapped_model(input_tensor)
     11 model_output.shape

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/cjm_yolox_pytorch/inference.py:151, in YOLOXInferenceWrapper.forward(self, x)
    147 x = self.process_output(x)
    149 if self.run_box_and_prob_calculation:
    150     # Generate output grids
--> 151     output_grids = generate_output_grids(*input_dims, self.strides).to(x.device)
    152     # Calculate the bounding boxes and their probabilities
    153     x = self.calculate_boxes_and_probs(x, output_grids)

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/cjm_yolox_pytorch/utils.py:56, in generate_output_grids(height, width, strides)
     52 # We will use a loop but it won't affect the exportability of the model to ONNX 
     53 # as the loop is not dependent on the input data (height, width) but on the 'strides' which is model parameter.
     54 for i, stride in enumerate(strides):
     55     # Calculate the grid height and width
---> 56     grid_height = height // stride
     57     grid_width = width // stride
     59     # Generate grid coordinates

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/torch/_tensor.py:41, in _handle_torch_function_and_wrap_type_error_to_not_implemented.<locals>.wrapped(*args, **kwargs)
     39     if has_torch_function(args):
     40         return handle_torch_function(wrapped, args, *args, **kwargs)
---> 41     return f(*args, **kwargs)
     42 except TypeError:
     43     return NotImplemented

File ~/miniforge3/envs/pytorch-env/lib/python3.10/site-packages/torch/_tensor.py:999, in Tensor.__rfloordiv__(self, other)
    997 @_handle_torch_function_and_wrap_type_error_to_not_implemented
    998 def __rfloordiv__(self, other):
--> 999     return torch.floor_divide(other, self)

RuntimeError: Placeholder storage has not been allocated on MPS device!

Would you know how to handle such an issue? Or is it again a problem with MPS support?
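
In case it helps narrow this down, here is the small device check I can run on my end (assuming the wrapped_model and input_tensor from the cell above, with device set to "mps"); I would expect everything to report mps:

import torch

# Verify that the input and every parameter/buffer actually ended up on MPS,
# since "Placeholder storage has not been allocated on MPS device!" usually
# points at some tensor in the forward pass still living on the CPU.
print("input tensor:", input_tensor.device)
print("parameter devices:", {p.device for p in wrapped_model.parameters()})
print("buffer devices:", {b.device for b in wrapped_model.buffers()})

# List anything that did not follow the model to the target device
target_type = torch.device(device).type
for name, tensor in list(wrapped_model.named_parameters()) + list(wrapped_model.named_buffers()):
    if tensor.device.type != target_type:
        print(f"Still on {tensor.device}: {name}")

If everything reports mps there, then I guess the problem is somewhere inside generate_output_grids itself, e.g. the stored input dimensions or the strides not ending up on the same device as the model output, which would bring it back to MPS support.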
