Adding LWS Integration #1174

Open · wants to merge 23 commits into main
Conversation

@Edwinhr716 commented May 12, 2025

Adds integration with https://github.com/kubernetes-sigs/lws (LeaderWorkerSet) for TPUs, as well as integration of LWS + Pathways.

To run a basic LWS + TPU job:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;

To run LWS + Pathways:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws_pathways \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;

@Edwinhr716 Edwinhr716 requested review from ruomingp, markblee and a team as code owners May 12, 2025 16:10

def __call__(self) -> Nested[Any]:
system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
return dict(
Contributor

What's the retry policy of LWS?

Could you help me understand what happens when:

  1. the leader fails/is preempted?
  2. a worker fails/is preempted?

Author

The default behavior is that if any pod in the group fails, regardless of whether it is a leader or a worker, the whole group fails. LWS also supports not restarting the whole group by setting RestartPolicy: None
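For reference, the knob in question lives at spec.leaderWorkerTemplate.restartPolicy in the LWS custom resource. A minimal sketch of a manifest builder, assuming the LWS v1 API field names; the helper function itself is hypothetical and only illustrates where the policy sits:

```python
# Hypothetical helper showing where the LWS restart policy is configured.
# Per the LWS v1 API, spec.leaderWorkerTemplate.restartPolicy accepts
# "RecreateGroupOnPodRestart" (default: recreate the whole group when any
# pod in it fails) or "None" (do not restart the whole group).
def lws_manifest(name: str, *, replicas: int, group_size: int,
                 restart_policy: str = "RecreateGroupOnPodRestart") -> dict:
    assert restart_policy in ("RecreateGroupOnPodRestart", "None")
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": group_size,
                "restartPolicy": restart_policy,
                # leaderTemplate / workerTemplate (pod specs) omitted here.
            },
        },
    }
```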

Contributor

How about failures between groups? Could you help me compare the failure handling at all levels between LWS and JobSet?

Author

There are no failure policies between groups. Each group is independent of the others, so if one group fails, the others will continue running.

def __call__(self) -> Nested[Any]:
system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
return dict(
size=system.vms_per_slice,
Contributor

What's the use case for a LeaderWorkerSet without a leader?

@Edwinhr716 (Author), May 13, 2025

All use cases that LWS has still apply to an LWS without a leader. The only difference is that the dual-template feature is not used.

I made the generic TPULeaderWorkerTemplate single-template to mirror TPUReplicatedJob.

@muyangyuapple (Contributor) left a comment

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS:
raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}")

def _build_container(self) -> dict:
Contributor

@markblee _build_container and _build_pod should be shareable with the jobset version. Do you have a preference between extracting them to a parent class or using the modifier pattern?

@@ -378,3 +378,28 @@ class GCPAPI(str, enum.Enum):
"""GCP API to submit resource requests to."""

GKE = "GKE"


def delete_k8s_leaderworkerset(name: str, *, namespace: str):
Contributor

Can you also define list_k8s_leaderworkerset? It will be used by some tooling.

@lkolluru05 commented May 20, 2025

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

So, create something like LWSMultiheadReplicatedJob?

@ruomingp ruomingp requested a review from Ethanlm May 21, 2025 13:35
@ruomingp (Contributor) left a comment

Will defer to @Ethanlm and @markblee for approval.

@Edwinhr716 (Author)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

@muyangyuapple (Contributor)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

I want to make sure N replicas of the Pathways cluster will be created.

Different from JobSet, which uses a replicated job to control replication, LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created?

If this is the case, then I think gke_tpu_lws_pathways already covers this case.

Could you confirm it?

@Edwinhr716 (Author)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

I want to make sure N replicas of the Pathways cluster will be created.

Different from JobSet, which uses a replicated job to control replication, LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created?

If this is the case, then I think gke_tpu_lws_pathways already covers this case.

Could you confirm it?

That is correct: if --num_replicas=N, it will create N replicas of the Pathways cluster.

@Edwinhr716 (Author)

The number of workers is not set by --num_replicas, however; it is determined by the machine type. So for a TPU 4x4 multi-slice, it will create 4 workers.

size=system.vms_per_slice,
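To make the replica/worker arithmetic above concrete, a small illustrative sketch; the function is hypothetical, and the group size stands in for system.vms_per_slice from the quoted code:

```python
def lws_pod_counts(num_replicas: int, vms_per_slice: int) -> dict:
    # Each LWS replica is one group whose size equals the number of VMs in
    # a slice (system.vms_per_slice); --num_replicas controls the number of
    # groups, not the number of workers inside each group.
    return {
        "groups": num_replicas,
        "pods_per_group": vms_per_slice,
        "total_pods": num_replicas * vms_per_slice,
    }
```

For example, a 4x4 slice with 4 VMs per slice and --num_replicas=2 would yield 2 groups of 4 pods each.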

@Edwinhr716 (Author)

What else is needed to merge this PR?

@samos123 (Contributor) commented Jun 9, 2025

@muyangyuapple or @Ethanlm could you please provide another review? It takes effort to keep a large PR open because main frequently changes. Right now this branch has conflicts with main. After your approval, we'll also have to get approval from Mark.

builder=TPULeaderWorkerTemplate.default_config(),
annotations=config_for_function(exclusive_topology_annotations_leaderworkerset),
)
),
Contributor

Please add

            pre_provisioner=TPUNodePoolProvisioner.default_config(),

for these two runners.

@Ethanlm (Contributor) left a comment

Haven't finished my review yet. Left some initial minor comments.

Can you please provide some concrete test examples in the PR summary, and demonstrate what an LWS TPU job and an LWS Pathways job would look like on k8s?

Like what services and pods are created on k8s, what the naming convention looks like, and what env variables or annotations are added by the LWS controller automatically?
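As a partial answer to the naming question, a sketch of the pod names the LWS controller is expected to create. This assumes the naming convention described in the LWS docs (leader of group i is "<name>-<i>", its workers are "<name>-<i>-<j>" for j = 1..size-1), which is worth confirming against a live cluster:

```python
def expected_lws_pod_names(name: str, *, replicas: int, size: int) -> list:
    """Pod names the LWS controller is expected to create.

    Assumption (per LWS docs): the leader of group i is "<name>-<i>" and its
    workers are "<name>-<i>-<j>" for j = 1..size-1. The controller also
    injects env vars such as LWS_LEADER_ADDRESS into workers; confirm both
    on a real cluster.
    """
    names = []
    for i in range(replicas):
        names.append(f"{name}-{i}")  # leader of group i
        names.extend(f"{name}-{i}-{j}" for j in range(1, size))  # workers
    return names
```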

@lkolluru05 lkolluru05 requested a review from a team as a code owner June 17, 2025 18:47
Edwinhr716 and others added 4 commits June 17, 2025 16:20
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
inner: The wrapped TPULeaderWorkerTemplate configuration
"""

inner: Required[TPULeaderWorkerTemplate.Config] = REQUIRED
Contributor

We should probably get rid of the inner builder and flatten the logic in inner to this level.

The reason inner was used in the jobset version was to reuse some code from the non-Pathways training job. But there is no reuse of inner there.

Having inner makes it hard to reuse components like the pre-provisioner.

@@ -556,3 +565,147 @@ def __call__(self) -> Sequence[Nested[Any]]:
)

return replicated_jobs


class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
6 participants