Adding LWS Integration #1174

Open · wants to merge 23 commits into main
Conversation

@Edwinhr716 commented May 12, 2025

Adds integration with https://github.com/kubernetes-sigs/lws (LeaderWorkerSet) for TPUs, as well as integration of LWS + Pathways.

To run a basic LWS + TPU job:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;

To run LWS + Pathways:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws_pathways \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;

@Edwinhr716 Edwinhr716 requested review from ruomingp, markblee and a team as code owners May 12, 2025 16:10

def __call__(self) -> Nested[Any]:
system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
return dict(
Contributor

What's the retry policy of LWS?

Could you help me understand what happens when:

  1. the leader fails/is preempted?
  2. a worker fails/is preempted?

Author

The default behavior is that if any pod in the group fails, regardless of whether it is a leader or a worker, the whole group fails. LWS also supports not restarting the whole group by setting RestartPolicy: None
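For reference, the knob in question lives at spec.leaderWorkerTemplate.restartPolicy in the LWS custom resource. A minimal sketch of a manifest builder, assuming the LWS v1 API field names; the helper function itself is hypothetical and only illustrates where the policy sits:

```python
# Hypothetical helper showing where the LWS restart policy is configured.
# Per the LWS v1 API, spec.leaderWorkerTemplate.restartPolicy accepts
# "RecreateGroupOnPodRestart" (default: recreate the whole group when any
# pod in it fails) or "None" (do not restart the whole group).
def lws_manifest(name: str, *, replicas: int, group_size: int,
                 restart_policy: str = "RecreateGroupOnPodRestart") -> dict:
    assert restart_policy in ("RecreateGroupOnPodRestart", "None")
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": group_size,
                "restartPolicy": restart_policy,
                # leaderTemplate / workerTemplate (pod specs) omitted here.
            },
        },
    }
```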

Contributor

How about failures between groups? Could you help me compare the failure handling at all levels between LWS and JobSet?

Author

There are no failure policies between groups. Each group is independent of the others, so if one group fails, the others will continue running.

def __call__(self) -> Nested[Any]:
system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
return dict(
size=system.vms_per_slice,
Contributor

What's the use case for a LeaderWorkerSet without a leader?

@Edwinhr716 (Author), May 13, 2025

All use cases that LWS has still apply to an LWS without a leader. The only difference is that the dual-template feature is not used.

I made the generic TPULeaderWorkerTemplate single-template to mirror TPUReplicatedJob.

@muyangyuapple (Contributor) left a comment

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS:
raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}")

def _build_container(self) -> dict:
Contributor

@markblee _build_container and _build_pod should be shareable with the jobset version. Do you have a preference between extracting them to a parent class or using the modifier pattern?

@@ -378,3 +378,28 @@ class GCPAPI(str, enum.Enum):
"""GCP API to submit resource requests to."""

GKE = "GKE"


def delete_k8s_leaderworkerset(name: str, *, namespace: str):
Contributor

Can you also define list_k8s_leaderworkerset? It will be used by some tooling.

@lkolluru05 commented May 20, 2025

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

So, create something like LWSMultiheadReplicatedJob?

@ruomingp ruomingp requested a review from Ethanlm May 21, 2025 13:35
@ruomingp (Contributor) left a comment

Will defer to @Ethanlm and @markblee for approval.

@Edwinhr716 (Author)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

@muyangyuapple (Contributor)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

I want to make sure N replicas of the Pathways cluster will be created.

Different from JobSet, which uses a replicated job to control replication, LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created?

If this is the case, then I think gke_tpu_lws_pathways already covers this case.

Could you confirm it?

@Edwinhr716 (Author)

Could you also cover the PathwaysMultiheadReplicatedJob, where we create multiple Pathways cluster replicas at a time?

By this you mean creating multiple multi-host inference deployments?

I want to make sure N replicas of the Pathways cluster will be created.

Different from JobSet, which uses a replicated job to control replication, LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created?

If this is the case, then I think gke_tpu_lws_pathways already covers this case.

Could you confirm it?

That is correct: if --num_replicas=N, it will create N replicas of the Pathways cluster.

@Edwinhr716 (Author)

The number of workers is not set by --num_replicas, however; it is determined by the machine type. So for a TPU 4x4 multi-slice, it will create 4 workers.

size=system.vms_per_slice,
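To make the replica/worker arithmetic above concrete, a small illustrative sketch; the function is hypothetical, and the group size stands in for system.vms_per_slice from the quoted code:

```python
def lws_pod_counts(num_replicas: int, vms_per_slice: int) -> dict:
    # Each LWS replica is one group whose size equals the number of VMs in
    # a slice (system.vms_per_slice); --num_replicas controls the number of
    # groups, not the number of workers inside each group.
    return {
        "groups": num_replicas,
        "pods_per_group": vms_per_slice,
        "total_pods": num_replicas * vms_per_slice,
    }
```

For example, a 4x4 slice with 4 VMs per slice and --num_replicas=2 would yield 2 groups of 4 pods each.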

@Edwinhr716 (Author)

What else is needed to merge this PR?

@samos123 (Contributor) commented Jun 9, 2025

@muyangyuapple or @Ethanlm could you please provide another review? It takes effort to keep a large PR open because main frequently changes. Right now this branch has conflicts with main. After your approval, we'll also have to get approval from Mark.

builder=TPULeaderWorkerTemplate.default_config(),
annotations=config_for_function(exclusive_topology_annotations_leaderworkerset),
)
),
Contributor

Please add

            pre_provisioner=TPUNodePoolProvisioner.default_config(),

for these two runners.

@Ethanlm (Contributor) left a comment

Haven't finished my review yet. Left some initial minor comments.

Can you please provide some concrete test examples in the PR summary, and demonstrate what an LWS TPU job and an LWS Pathways job would look like on k8s?

Like what services and pods are created on k8s, what the naming convention looks like, and what env variables or annotations are added by the LWS controller automatically?
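As a partial answer to the naming question, a sketch of the pod names the LWS controller is expected to create. This assumes the naming convention described in the LWS docs (leader of group i is "<name>-<i>", its workers are "<name>-<i>-<j>" for j = 1..size-1), which is worth confirming against a live cluster:

```python
def expected_lws_pod_names(name: str, *, replicas: int, size: int) -> list:
    """Pod names the LWS controller is expected to create.

    Assumption (per LWS docs): the leader of group i is "<name>-<i>" and its
    workers are "<name>-<i>-<j>" for j = 1..size-1. The controller also
    injects env vars such as LWS_LEADER_ADDRESS into workers; confirm both
    on a real cluster.
    """
    names = []
    for i in range(replicas):
        names.append(f"{name}-{i}")  # leader of group i
        names.extend(f"{name}-{i}-{j}" for j in range(1, size))  # workers
    return names
```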

@lkolluru05 lkolluru05 requested a review from a team as a code owner June 17, 2025 18:47
Edwinhr716 and others added 4 commits June 17, 2025 16:20
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
Co-authored-by: Meng (Ethan) Li <[email protected]>
inner: The wrapped TPULeaderWorkerTemplate configuration
"""

inner: Required[TPULeaderWorkerTemplate.Config] = REQUIRED
Contributor

We should probably get rid of the inner builder and flatten the logic in inner to this level.

The reason inner was used in the jobset version was to reuse some code from the non-Pathways training job. But there is no reuse of inner there.

Having inner makes it hard to reuse components like the pre-provisioner.

@@ -556,3 +565,147 @@ def __call__(self) -> Sequence[Nested[Any]]:
)

return replicated_jobs


class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
6 participants