
Fix test_elastic_training_agent.py for torch version 2.4 and above. #1531


Open: wants to merge 2 commits into master


zhengchenyu

What changes were proposed in this pull request?

There are two changes:
(1) After pytorch/pytorch@67d3e4f, the parameters redirects and tee were removed. Although #1130 addressed this, the unit tests do not appear to have been updated to match.
(2) The unit test hangs. #1279 added compatibility with torch-2.4, but since that change the unit test has to wait for all workers to finish writing data to torchelastic/role_info/{i}. The unit test does not mock this step, so it blocks until the timeout expires.
In fact, we don't need to process the data the way pytorch/pytorch@dc4c75b does, because we already have the rank list.
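
For illustration only, here is a minimal sketch of one way a test can stay compatible when constructor parameters such as redirects and tee exist in some torch versions but not in others. It is not the code in this PR. `OldSpec`, `NewSpec`, `filter_supported_kwargs` and `build_spec` are hypothetical names standing in for the real spec class and test helpers; the only library call used is the standard `inspect.signature`.

```python
import inspect


def filter_supported_kwargs(factory, **kwargs):
    """Return only the keyword arguments that `factory` accepts by name."""
    params = inspect.signature(factory).parameters
    # A factory that declares **kwargs accepts every keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


# Hypothetical stand-in for a spec class that still takes redirects/tee.
class OldSpec:
    def __init__(self, role, local_world_size, redirects=None, tee=None):
        self.role = role
        self.local_world_size = local_world_size
        self.redirects = redirects
        self.tee = tee


# Hypothetical stand-in for a spec class after the parameters were removed.
class NewSpec:
    def __init__(self, role, local_world_size):
        self.role = role
        self.local_world_size = local_world_size


def build_spec(spec_cls):
    """Build a spec, dropping arguments the installed class does not accept."""
    kwargs = filter_supported_kwargs(
        spec_cls,
        role="worker",
        local_world_size=2,
        redirects="none",  # placeholder value, an assumption for this sketch
        tee="none",        # placeholder value, an assumption for this sketch
    )
    return spec_cls(**kwargs)


old_spec = build_spec(OldSpec)
new_spec = build_spec(NewSpec)
```

Checking the signature avoids hard-coding a torch version boundary in the test, at the cost of silently dropping arguments, so it only suits parameters that are safe to omit.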

Why are the changes needed?

Fix test_elastic_training_agent.py for torch version 2.4 and above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested with pytorch-2.6, pytorch-2.3, and pytorch-2.2.
