Adding LWS Integration #1174
Conversation
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
What's the retry policy of LWS?
Could you help me understand what happens when:
- the leader fails or is preempted?
- a worker fails or is preempted?
The default behavior is that if any pod in the group fails, regardless of whether it is a leader or a worker, the whole group fails and is restarted. LWS also supports not restarting the whole group by setting RestartPolicy: None.
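For reference, a minimal sketch of where that setting lives in an LWS spec, written as a nested dict in the style of the builders in this PR (field names follow the leaderworkerset.x-k8s.io/v1 API as I understand it; please double-check against the installed LWS version):

```python
# Sketch only: an LWS spec fragment showing the group restart behavior knob.
lws = dict(
    apiVersion="leaderworkerset.x-k8s.io/v1",
    kind="LeaderWorkerSet",
    metadata=dict(name="example"),
    spec=dict(
        replicas=2,  # number of independent leader/worker groups
        leaderWorkerTemplate=dict(
            size=4,  # pods per group (leader + workers)
            # Default behavior: any pod failure recreates the whole group.
            # "None" keeps the remaining pods in the group running instead.
            restartPolicy="None",
        ),
    ),
)
```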
How about failures between groups? Could you help me compare the failure handling at all levels between LWS and jobset?
There are no failure policies between groups. Each group is independent, so if one group fails, the others will continue running.
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
        size=system.vms_per_slice,
What's the use case for a LeaderWorkerSet without a leader?
All use cases that LWS supports still apply to an LWS without a leader; the only difference is that the dual-template feature is not used.
I made the generic TPULeaderWorkerTemplate a single template to mirror TPUReplicatedJob.
Could you also cover the PathwaysMultiheadReplicatedJob case, where we create multiple Pathways cluster replicas at a time?
if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS:
    raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}")

def _build_container(self) -> dict:
@markblee _build_container and _build_pod should be able to be shared with the jobset version. Do you have a preference between extracting them to a parent class and using the modifier pattern?
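To make the parent-class option concrete, a rough sketch (class names here are illustrative, not the PR's actual code):

```python
# Hypothetical sketch of the parent-class option: hoist the shared spec builders
# into a base class that both the jobset and LWS builders inherit from.
class _BaseTPUSpecBuilder:
    def _build_container(self) -> dict:
        # Container spec shared by the jobset and LWS variants.
        return dict(name="tpu-container", image="...")  # placeholder fields

    def _build_pod(self) -> dict:
        # Pod spec shared by the jobset and LWS variants.
        return dict(spec=dict(containers=[self._build_container()]))


class _JobSetTPUBuilder(_BaseTPUSpecBuilder):
    """Jobset version: reuses _build_container/_build_pod from the base."""


class _LWSTPUBuilder(_BaseTPUSpecBuilder):
    """LWS version: reuses the same pod/container builders."""
```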
@@ -378,3 +378,28 @@ class GCPAPI(str, enum.Enum):
    """GCP API to submit resource requests to."""

    GKE = "GKE"


def delete_k8s_leaderworkerset(name: str, *, namespace: str):
Can you also define list_k8s_leaderworkerset? It will be used by some tooling.
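For example, something along these lines (a sketch only; the group/version/plural strings are assumptions and should match whatever delete_k8s_leaderworkerset uses):

```python
from kubernetes import client


def list_k8s_leaderworkerset(*, namespace: str) -> dict:
    """Lists LeaderWorkerSet custom resources in the given namespace.

    Sketch only: assumes the LWS CRD is served as
    leaderworkerset.x-k8s.io/v1 with plural "leaderworkersets".
    """
    api = client.CustomObjectsApi()
    return api.list_namespaced_custom_object(
        group="leaderworkerset.x-k8s.io",
        version="v1",
        namespace=namespace,
        plural="leaderworkersets",
    )
```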
so create something like ...
By this you mean creating multiple multi-host inference deployments?
I want to make sure N replicas of the Pathways cluster will be created. Unlike JobSet, which uses a replicated job to control replication, LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created? If this is the case, could you confirm it?
That is correct: if --num_replicas=N, it will create N replicas of the Pathways cluster.
The number of workers is not set by --num_replicas, however; it is determined by the machine type. So for a TPU 4x4 multi-slice, it will create 4 workers (axlearn/cloud/gcp/lws_utils.py, line 194 in cd3ffe1).
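To restate the semantics being confirmed here as a tiny sketch (names are illustrative; the values mirror the 4x4 multi-slice example above):

```python
# Sketch only: which knob controls what.
# --num_replicas sets how many leader/worker groups (Pathways clusters) exist;
# the per-group worker count comes from the machine type, not from --num_replicas.
def lws_group_counts(num_replicas: int, vms_per_slice: int) -> dict:
    return dict(
        pathways_clusters=num_replicas,   # N groups, each with its own head node
        workers_per_group=vms_per_slice,  # e.g. 4 for the TPU 4x4 multi-slice above
    )


print(lws_group_counts(num_replicas=2, vms_per_slice=4))
# {'pathways_clusters': 2, 'workers_per_group': 4}
```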
What else is needed to merge this PR?
@muyangyuapple or @Ethanlm could you please provide another review? It takes effort to keep a large PR open because main changes frequently; right now this branch has conflicts with main. After your approval, we'll also need approval from Mark.
    builder=TPULeaderWorkerTemplate.default_config(),
    annotations=config_for_function(exclusive_topology_annotations_leaderworkerset),
)
),
Please add
pre_provisioner=TPUNodePoolProvisioner.default_config(),
for these two runners.
Haven't finished my review yet. Left some initial minor comments.
Can you please provide some concrete test examples in the PR summary, and demonstrate what an LWS TPU job and an LWS Pathways job would look like on k8s?
For example: what services and pods are created on k8s, what the naming convention looks like, and what env variables or annotations are added automatically by the LWS controller?
Co-authored-by: Meng (Ethan) Li <[email protected]>
        inner: The wrapped TPULeaderWorkerTemplate configuration.
    """

    inner: Required[TPULeaderWorkerTemplate.Config] = REQUIRED
We should probably get rid of the inner builder and flatten its logic to this level.
The reason inner was used in the jobset version was to reuse some code from the non-Pathways training job, but there is no such reuse here.
Having inner also makes it hard to reuse components like the pre-provisioner.
@@ -556,3 +565,147 @@ def __call__(self) -> Sequence[Nested[Any]]:
        )

        return replicated_jobs


class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
We also need to set these env vars in this builder: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L240-L253
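Something like the following inside the container build (a sketch only; the real variable names and values are in the linked pathways_utils.py lines, the one below is a placeholder, and the override point is an assumption about how this builder composes with its base):

```python
# Sketch only: append the Pathways env vars when building the container.
class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
    def _build_container(self) -> dict:
        container = super()._build_container()
        container.setdefault("env", []).extend(
            [
                dict(name="EXAMPLE_PATHWAYS_ENV_VAR", value="example"),  # placeholder
            ]
        )
        return container
```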
Added integration with https://github.com/kubernetes-sigs/lws for TPUs, as well as integration of LWS + Pathways.
To run basic LWS+TPU:
To run LWS+Pathways: