linxiulei

When a host has system errors (e.g., hardware errors), the system is given only a limited timeout to shut down, so checkpoint saving is unlikely to complete. It is also unnecessary if other slices without system errors can save the full checkpoint.

This change skips checkpoint saving on hosts with system errors to reduce termination latency, which improves the overall recovery time.
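The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `make_termination_handler`, `has_system_error`, and `save_checkpoint` are hypothetical names, and how a host actually detects a system error is left as an assumption.

```python
from typing import Callable


def make_termination_handler(
    has_system_error: Callable[[], bool],
    save_checkpoint: Callable[[], None],
) -> Callable[[], str]:
    """Builds a termination handler that skips checkpointing on faulty hosts.

    Both callables are hypothetical hooks supplied by the caller; they stand in
    for whatever fault detection and checkpoint logic the trainer really uses.
    """

    def on_terminate() -> str:
        if has_system_error():
            # Faulty host: exit right away so the node can shut down for repair.
            return "skipped"
        # Healthy host: save the checkpoint as usual before terminating.
        save_checkpoint()
        return "saved"

    return on_terminate
```

A healthy host would still call `save_checkpoint`, so the full checkpoint remains available from slices that have no system errors.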

@samos123
Contributor

samos123 commented Aug 3, 2025

Checkpoint saving can be extremely fast with emergency checkpointing. I'm afraid not saving checkpoints may be worse, especially since we have a custom method deployed to delete lingering pods.

For example, for a 70B model, I'm seeing an in-memory checkpoint save time of 7 seconds.

@linxiulei
Author

> Checkpoint saving can be extremely fast with emergency checkpointing. I'm afraid not saving checkpoints may be worse.

This doesn't stop checkpoint saving on other nodes without hardware faults. The intent of this PR is to make the faulty nodes terminate and shut down as soon as possible while the other nodes save the checkpoint, so that the overall recovery time is reduced. To illustrate, before this PR:

node0 (without fault): running -> JobSet restart -> checkpoint saving -> pod terminate -> job creation -> pod start
node1 (with fault): running -> JobSet restart -> checkpoint saving -> pod terminate -> node shutdown for repair -> job creation -> pod start

After this PR:

node1 (with fault): running -> JobSet restart -> pod terminate -> node shutdown for repair -> job creation -> pod start

Since node1 is likely to have the longest recovery time before the whole workload is back up and running, skipping its checkpoint saving step reduces the overall recovery time. Besides, if there is a hardware fault such as a link error, checkpoint saving may fail anyway.
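To make the reasoning concrete, here is a toy model of the timelines above. It assumes each node walks through its steps independently and that the workload is back only when the slowest node is ready. All step durations are made-up placeholders, except the 7 seconds for checkpoint saving, which is the in-memory figure quoted earlier in this thread.

```python
from typing import Dict, List

# Hypothetical durations in seconds; only the step order comes from the timelines.
STEP_SECONDS: Dict[str, float] = {
    "checkpoint saving": 7.0,  # in-memory save time quoted in this thread
    "pod terminate": 5.0,  # assumed
    "node shutdown for repair": 120.0,  # assumed
    "job creation": 10.0,  # assumed
    "pod start": 30.0,  # assumed
}


def recovery_seconds(steps: List[str]) -> float:
    """Total time for one node to get through its recovery steps."""
    return sum(STEP_SECONDS[step] for step in steps)


def overall_recovery_seconds(timelines: Dict[str, List[str]]) -> float:
    """The workload resumes only when the slowest node has recovered."""
    return max(recovery_seconds(steps) for steps in timelines.values())


healthy = ["checkpoint saving", "pod terminate", "job creation", "pod start"]
faulty_before = [
    "checkpoint saving",
    "pod terminate",
    "node shutdown for repair",
    "job creation",
    "pod start",
]
# After this PR the faulty node skips the checkpoint saving step.
faulty_after = [step for step in faulty_before if step != "checkpoint saving"]

before = overall_recovery_seconds({"node0": healthy, "node1": faulty_before})
after = overall_recovery_seconds({"node0": healthy, "node1": faulty_after})
print(f"before: {before}s, after: {after}s, saved: {before - after}s")
```

In this model the saving equals whatever time the faulty node would have spent on checkpoint saving, so the benefit grows when saving is slow or stalls because of the fault itself.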
