
Conversation

@chiayi
Contributor

@chiayi chiayi commented Sep 18, 2025

Why are these changes needed?

This PR adds to the utility library for TPU slice placement group scheduling. We generalize the two-phase approach that the JaxTrainer uses to reserve and schedule workers on a TPU slice.
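
For context, here is a minimal sketch of the two-phase flow this change generalizes. reserve_tpu_slice and SlicePlacementGroup come from this PR, but the import path, the ray.io/tpu-slice-name label key, and the bundle_label_selector argument shown below are illustrative assumptions, not the exact implementation:

import ray
# Import path is an assumption; the PR discusses both _private/accelerators/tpu.py
# and util/tpu.py as possible homes for this helper.
from ray._private.accelerators.tpu import reserve_tpu_slice

ray.init()

# Phase 1: reserve an available TPU slice and get back its unique slice name.
slice_name = reserve_tpu_slice("4x4", "TPU-V6E")

# Phase 2: build one bundle per TPU host, pin each bundle to the reserved slice
# with a label selector, and create the placement group from those bundles.
# The label key and the bundle_label_selector argument are assumptions for illustration.
num_workers = 4  # a v6e 4x4 topology spans 4 hosts
bundles = [{"TPU": 4} for _ in range(num_workers)]
selectors = [{"ray.io/tpu-slice-name": slice_name} for _ in range(num_workers)]
pg = ray.util.placement_group(bundles=bundles, bundle_label_selector=selectors)
ray.get(pg.ready())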

Related issue number

#55162

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@nrghosh nrghosh left a comment


thanks for the contribution @chiayi - just a reminder to run pre-commit linting to unblock CI checks. ./ci/lint/lint.sh code_format and ./ci/lint/lint.sh pre_commit should do it.

@chiayi chiayi force-pushed the slice-placement-group branch 2 times, most recently from ea13d03 to 1ac7a80 on September 18, 2025 23:03
@chiayi chiayi force-pushed the slice-placement-group branch from 1ac7a80 to a00cb87 on September 29, 2025 20:35
@chiayi chiayi force-pushed the slice-placement-group branch from a00cb87 to e1cd5bb on October 7, 2025 23:39
@chiayi
Contributor Author

chiayi commented Oct 14, 2025

After building the Ray image, I manually ran the slice placement group against v6e 4x4 multi-host TPUs using the following test code:

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.accelerators.tpu import SlicePlacementGroup

ray.init()

# Reserve a v6e 4x4 slice and wait for its placement group to become ready.
slice_handle = SlicePlacementGroup(topology="4x4", accelerator_version="v6e")
slice_pg = slice_handle.placement_group
ray.get(slice_pg.ready(), timeout=10)

@ray.remote(num_cpus=0, resources={"TPU": 4})
def spmd_task(world, rank):
    print(f"Current TPU is rank {rank} of {world}")

# Schedule one task per TPU host in the reserved slice and wait for them.
tasks = [
    spmd_task.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=slice_pg,
        )
    ).remote(world=4, rank=i)
    for i in range(slice_handle.num_workers)
]
ray.get(tasks)

@chiayi
Contributor Author

chiayi commented Oct 14, 2025

@ryanaoleary Please take a look when you get the chance. Thank you!

@ryanaoleary ryanaoleary changed the title Add slice placement groups [Core] Add TPU utility functions to support slice placement groups Oct 16, 2025
@ryanaoleary ryanaoleary marked this pull request as ready for review October 16, 2025 10:46
@ryanaoleary ryanaoleary requested a review from a team as a code owner October 16, 2025 10:46

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Oct 16, 2025
@ryanaoleary
Contributor

ryanaoleary commented Oct 20, 2025

cc: @MengjinYan @edoakes

@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from de1ba7c to a8486d2 on October 21, 2025 09:16

@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 94c013a to 4cae047 on October 21, 2025 09:27
chiayi and others added 4 commits October 21, 2025 09:29
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 4cae047 to 0ce9389 on October 21, 2025 09:29
@ryanaoleary ryanaoleary requested a review from nrghosh October 21, 2025 09:30
@ryanaoleary
Contributor

cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

@liulehui
Contributor

> cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

I feel that for the Ray Train side, the ScalingConfig params list is getting a bit long; e.g., topology and num_slices are TPU-specific parameters that GPU users shouldn't have to worry about.

Wondering if we should create something like an AcceleratorConfig and start aggregating the terminology, like:

GPUAcceleratorConfig(accelerator_type="T4")
GPUAcceleratorConfig(accelerator_type="A100")
TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 1)
TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 2)

Then maybe we could get rid of use_gpu and use_tpu.

cc @matthewdeng
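
To make this concrete, here is a rough sketch of what the proposed config classes could look like; neither class exists in Ray today, and the names and fields are purely illustrative:

from dataclasses import dataclass

# Hypothetical sketch only: neither class exists in Ray today.
@dataclass
class GPUAcceleratorConfig:
    accelerator_type: str  # e.g. "T4", "A100"

@dataclass
class TPUAcceleratorConfig:
    accelerator_type: str  # e.g. "TPU-V6E"
    topology: str          # e.g. "4x4"
    num_slices: int = 1

# ScalingConfig could then accept a single accelerator_config instead of
# use_gpu/use_tpu plus TPU-specific fields, for example:
# ScalingConfig(num_workers=4, accelerator_config=TPUAcceleratorConfig(
#     accelerator_type="TPU-V6E", topology="4x4", num_slices=2))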

@ryanaoleary
Contributor

> cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

> I feel that for the Ray Train side, the ScalingConfig params list is getting a bit long; e.g., topology and num_slices are TPU-specific parameters that GPU users shouldn't have to worry about.
>
> Wondering if we should create something like an AcceleratorConfig and start aggregating the terminology, like:
>
> GPUAcceleratorConfig(accelerator_type="T4")
> GPUAcceleratorConfig(accelerator_type="A100")
> TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 1)
> TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 2)
>
> Then maybe we could get rid of use_gpu and use_tpu.
>
> cc @matthewdeng

That sounds good to me. In the follow-up change to this PR that adds the multi-slice API to the Ray Train side, I could add it in a new TPUAcceleratorConfig, and then if we need to add additional fields in the future, the TPU-specific logic can be contained there.

A handle to a placement group reservation for a TPU slice.
The following definitions are added for clarity:
- Accelerator type: A string describing the accelerator type and version (e.g. TPU-V2, TPU-V6E).
Collaborator


do these match the existing accelerator type resource & label names? @ryanaoleary

Contributor


All do except for "Accelerator version" which is just the TPU generation:

- Accelerator type -> ray.io/accelerator-type
- Accelerator version -> No exact equivalent label is set, but we can derive it from the other info
- Pod type -> ray.io/tpu-pod-type
- Accelerator topology -> ray.io/tpu-topology

Contributor


Oh and yeah in the example the accelerator type names match:

GOOGLE_TPU_V6E = "TPU-V6E"



@PublicAPI
class SlicePlacementGroup:
Member


I suggest moving the SlicePlacementGroup API (and any future scheduling API) to python/ray/util/tpu and keeping python/ray/util/accelerators/tpu.py for small helper functions for getting specific information about TPUs.

Contributor


Slightly confused about what the structure should be; there are currently two locations with a tpu.py:

  • _private/accelerators/tpu.py: This is where the helper functions and the functions called internally by other libraries live (e.g. TPUAcceleratorManager, reserve_tpu_slice, etc.).
  • util/accelerators/tpu.py: This is where the public APIs are currently implemented.

Should the structure be:

  • _private/accelerators/tpu.py: internal functions and helpers
  • util/accelerators/tpu.py: public APIs but only those like get_current_pod_name which are smaller helpers for getting info about the accelerator
  • util/tpu.py: This would be a new file where we implement the APIs related to TPUs and scheduling. I should probably move some functions like reserve_tpu_slice here.

I can change it to the above; my only concern would be that the logic is getting kind of spread out/complicated.

Member


If possible, we should combine everything in util/accelerators/tpu.py into util/tpu.py, since those are all public alpha APIs. If we want to avoid breaking changes, we could have the same functions in util/accelerators/tpu.py call into util/tpu.py.

_private/accelerators/tpu.py seems fine as it's for internal calls for TPUs, as you mentioned.
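
For illustration, a minimal sketch of that backwards-compatible pattern, assuming SlicePlacementGroup ends up in ray/util/tpu.py; the module contents here are an assumption, not the PR's actual code:

# python/ray/util/accelerators/tpu.py (hypothetical compatibility shim)
# Keep the old import path working by re-exporting from the new location.
from ray.util.tpu import SlicePlacementGroup  # noqa: F401

__all__ = ["SlicePlacementGroup"]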

Contributor


Done in a23aed9


Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary requested a review from a team as a code owner October 23, 2025 23:13

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 7a5ff6b to a447c66 on October 24, 2025 02:50
Collaborator

@edoakes edoakes left a comment


LGTM. As a next step, should we add a user guide to the Ray docs about how to use Ray w/ TPUs?

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 27, 2025
for _ in range(self.num_slices):
    # Reserving a slice is done through constructing num_workers bundles, each with a label selector for
    # the unique name of an available TPU slice.
    slice_name = reserve_tpu_slice(self._topology, accelerator_type)


Bug: TPU Reservation Logic Inconsistency

The reserve_tpu_slice function creates a placement group to reserve the TPU slice head and returns only the slice name, but the placement group reference is lost. This orphaned placement group will hold resources but won't be tracked. The subsequent placement group created at lines 183-189 doesn't include the head resource reservation that reserve_tpu_slice was supposed to provide, potentially causing scheduling issues. The function should either return both the slice name and placement group reference, or the reservation logic should be integrated directly into _reserve_slice.


Contributor


This is covered in a TODO comment; I plan to update it in the PR that adds multi-slice support to the JaxTrainer, since it will require editing the call there.

@ryanaoleary
Contributor

ryanaoleary commented Oct 27, 2025

> LGTM. As a next step, should we add a user guide to the Ray docs about how to use Ray w/ TPUs?

Yeah, that sounds good to me; I can write up a guide on the new API. There's this existing overview of TPUs with KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tpu.html, and I could add something similar for the util library / Ray Core support under something like Ray Core > User Guides > Advanced Topics.

@edoakes edoakes enabled auto-merge (squash) October 27, 2025 20:49
@edoakes edoakes merged commit ae0e8e4 into ray-project:master Oct 27, 2025
6 of 7 checks passed
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
edoakes pushed a commit that referenced this pull request Nov 30, 2025
## Description

This PR adds an API overview and example usage for the TPU utility
library added in this PR: #56723.
I added this section to the existing "Using TPUs with KubeRay guide",
because the utility library would be primarily used with KubeRay on GKE
(the values used for default labels are set on GKE with a mutating
webhook).

## Related issues
#55162

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>