
Conversation

@chiayi
Contributor

@chiayi chiayi commented Sep 18, 2025

Why are these changes needed?

This PR adds to the utility library for TPU slice placement group scheduling. We generalize the two-phase approach that the JaxTrainer uses to reserve and schedule workers on a TPU slice.
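
For context, here is a minimal sketch of the two-phase flow this change generalizes. reserve_tpu_slice and SlicePlacementGroup come from this PR, but the import path, the ray.io/tpu-slice-name label key, and the bundle_label_selector argument shown below are illustrative assumptions, not the exact implementation:

import ray
# Import path is an assumption; the PR discusses both _private/accelerators/tpu.py
# and util/tpu.py as possible homes for this helper.
from ray._private.accelerators.tpu import reserve_tpu_slice

ray.init()

# Phase 1: reserve an available TPU slice and get back its unique slice name.
slice_name = reserve_tpu_slice("4x4", "TPU-V6E")

# Phase 2: build one bundle per TPU host, pin each bundle to the reserved slice
# with a label selector, and create the placement group from those bundles.
# The label key and the bundle_label_selector argument are assumptions for illustration.
num_workers = 4  # a v6e 4x4 topology spans 4 hosts
bundles = [{"TPU": 4} for _ in range(num_workers)]
selectors = [{"ray.io/tpu-slice-name": slice_name} for _ in range(num_workers)]
pg = ray.util.placement_group(bundles=bundles, bundle_label_selector=selectors)
ray.get(pg.ready())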

Related issue number

#55162

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@nrghosh nrghosh left a comment


thanks for the contribution @chiayi - just a reminder to run pre-commit linting to unblock CI checks. ./ci/lint/lint.sh code_format and ./ci/lint/lint.sh pre_commit should do it.

@chiayi chiayi force-pushed the slice-placement-group branch 2 times, most recently from ea13d03 to 1ac7a80 on September 18, 2025 23:03
@chiayi chiayi force-pushed the slice-placement-group branch from 1ac7a80 to a00cb87 on September 29, 2025 20:35
@chiayi chiayi force-pushed the slice-placement-group branch from a00cb87 to e1cd5bb on October 7, 2025 23:39
@chiayi
Contributor Author

chiayi commented Oct 14, 2025

After building the Ray image, I manually ran the slice placement group against v6e 4x4 multi-host TPUs using the following test code:

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.accelerators.tpu import SlicePlacementGroup

ray.init()

# Reserve a v6e 4x4 slice and wait for its placement group to become ready.
slice_handle = SlicePlacementGroup(topology="4x4", accelerator_version="v6e")
slice_pg = slice_handle.placement_group
ray.get(slice_pg.ready(), timeout=10)

@ray.remote(num_cpus=0, resources={"TPU": 4})
def spmd_task(world, rank):
    print(f"Current TPU is rank {rank} of {world}")

# Schedule one task per TPU host in the reserved slice and wait for them.
tasks = [
    spmd_task.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=slice_pg,
        )
    ).remote(world=4, rank=i)
    for i in range(slice_handle.num_workers)
]
ray.get(tasks)

@chiayi
Contributor Author

chiayi commented Oct 14, 2025

@ryanaoleary Please take a look when you get the chance. Thank you!

@ryanaoleary ryanaoleary changed the title Add slice placement groups [Core] Add TPU utility functions to support slice placement groups Oct 16, 2025
@ryanaoleary ryanaoleary marked this pull request as ready for review October 16, 2025 10:46
@ryanaoleary ryanaoleary requested a review from a team as a code owner October 16, 2025 10:46

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Oct 16, 2025
@ryanaoleary
Contributor

ryanaoleary commented Oct 20, 2025

cc: @MengjinYan @edoakes

@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from de1ba7c to a8486d2 on October 21, 2025 09:16

@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 94c013a to 4cae047 on October 21, 2025 09:27
chiayi and others added 4 commits October 21, 2025 09:29
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 4cae047 to 0ce9389 on October 21, 2025 09:29
@ryanaoleary ryanaoleary requested a review from nrghosh October 21, 2025 09:30
@ryanaoleary
Contributor

cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

@liulehui
Contributor

> cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

I feel that for the Ray Train side, the ScalingConfig params list is getting a bit long; e.g., topology and num_slices are TPU-specific parameters that GPU users shouldn't have to worry about.

Wondering if we should create something like an AcceleratorConfig and start aggregating the terminology, like:

GPUAcceleratorConfig(accelerator_type="T4")
GPUAcceleratorConfig(accelerator_type="A100")
TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 1)
TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 2)

Then maybe we could get rid of use_gpu and use_tpu.

cc @matthewdeng
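
To make this concrete, here is a rough sketch of what the proposed config classes could look like; neither class exists in Ray today, and the names and fields are purely illustrative:

from dataclasses import dataclass

# Hypothetical sketch only: neither class exists in Ray today.
@dataclass
class GPUAcceleratorConfig:
    accelerator_type: str  # e.g. "T4", "A100"

@dataclass
class TPUAcceleratorConfig:
    accelerator_type: str  # e.g. "TPU-V6E"
    topology: str          # e.g. "4x4"
    num_slices: int = 1

# ScalingConfig could then accept a single accelerator_config instead of
# use_gpu/use_tpu plus TPU-specific fields, for example:
# ScalingConfig(num_workers=4, accelerator_config=TPUAcceleratorConfig(
#     accelerator_type="TPU-V6E", topology="4x4", num_slices=2))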

@ryanaoleary
Contributor

> cc: @andrewsykim @liulehui @matthewdeng for review. For the JaxTrainer, we'd call slice_placement_group or multi_slice_placement_group when creating the placement group for the training function. We'll need to add a new argument to the ScalingConfig for num_slices, which we'd pass to these util helpers.

> I feel that for the Ray Train side, the ScalingConfig params list is getting a bit long; e.g., topology and num_slices are TPU-specific parameters that GPU users shouldn't have to worry about.
>
> Wondering if we should create something like an AcceleratorConfig and start aggregating the terminology, like:
>
> GPUAcceleratorConfig(accelerator_type="T4")
> GPUAcceleratorConfig(accelerator_type="A100")
> TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 1)
> TPUAcceleratorConfig(accelerator_type="TPU-V6E", topology="4x4", num_slices = 2)
>
> Then maybe we could get rid of use_gpu and use_tpu.
>
> cc @matthewdeng

That sounds good to me. In the follow-up change to this PR that adds the multi-slice API to the Ray Train side, I could add it in a new TPUAcceleratorConfig, and then if we need to add additional fields in the future, the TPU-specific logic can be contained there.

A handle to a placement group reservation for a TPU slice.
The following definitions are added for clarity:
- Accelerator type: A string describing the accelerator type and version (e.g. TPU-V2, TPU-V6E).
Collaborator


do these match the existing accelerator type resource & label names? @ryanaoleary

Contributor


All do except for "Accelerator version" which is just the TPU generation:

- Accelerator type -> ray.io/accelerator-type
- Accelerator version -> No exact equivalent label is set, but we can derive it from the other info
- Pod type -> ray.io/tpu-pod-type
- Accelerator topology -> ray.io/tpu-topology

Contributor


Oh and yeah in the example the accelerator type names match:

GOOGLE_TPU_V6E = "TPU-V6E"



@PublicAPI
class SlicePlacementGroup:
Member


I suggest moving the SlicePlacementGroup API (and any future scheduling API) to python/ray/util/tpu and keeping python/ray/util/accelerators/tpu.py for small helper functions for getting specific information about TPUs.

Contributor


Slightly confused about what the structure should be; there are currently two locations with a tpu.py:

  • _private/accelerators/tpu.py: This is where the helper functions and the functions called internally by other libraries live (e.g. TPUAcceleratorManager, reserve_tpu_slice, etc.).
  • util/accelerators/tpu.py: This is where the public APIs are currently implemented.

Should the structure be:

  • _private/accelerators/tpu.py: internal functions and helpers
  • util/accelerators/tpu.py: public APIs but only those like get_current_pod_name which are smaller helpers for getting info about the accelerator
  • util/tpu.py: This would be a new file where we implement the APIs related to TPUs and scheduling. I should probably move some functions like reserve_tpu_slice here.

I can change it to the above; my only concern would be that the logic is getting kind of spread out/complicated.

Member


If possible, we should combine everything in util/accelerators/tpu.py into util/tpu.py, since those are all public alpha APIs. If we want to avoid breaking changes, we could have the same functions in util/accelerators/tpu.py call into util/tpu.py.

_private/accelerators/tpu.py seems fine as it's for internal calls for TPUs, as you mentioned.
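
For illustration, a minimal sketch of that backwards-compatible pattern, assuming SlicePlacementGroup ends up in ray/util/tpu.py; the module contents here are an assumption, not the PR's actual code:

# python/ray/util/accelerators/tpu.py (hypothetical compatibility shim)
# Keep the old import path working by re-exporting from the new location.
from ray.util.tpu import SlicePlacementGroup  # noqa: F401

__all__ = ["SlicePlacementGroup"]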

Contributor


Done in a23aed9


Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary requested a review from a team as a code owner October 23, 2025 23:13

Signed-off-by: Ryan O'Leary <[email protected]>
@ryanaoleary ryanaoleary force-pushed the slice-placement-group branch from 7a5ff6b to a447c66 on October 24, 2025 02:50
Collaborator

@edoakes edoakes left a comment


LGTM. As a next step, should we add a user guide to the Ray docs about how to use Ray w/ TPUs?

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Oct 27, 2025
for _ in range(self.num_slices):
    # Reserving a slice is done through constructing num_workers bundles, each with a label selector for
    # the unique name of an available TPU slice.
    slice_name = reserve_tpu_slice(self._topology, accelerator_type)


Bug: TPU Reservation Logic Inconsistency

The reserve_tpu_slice function creates a placement group to reserve the TPU slice head and returns only the slice name, but the placement group reference is lost. This orphaned placement group will hold resources but won't be tracked. The subsequent placement group created at lines 183-189 doesn't include the head resource reservation that reserve_tpu_slice was supposed to provide, potentially causing scheduling issues. The function should either return both the slice name and placement group reference, or the reservation logic should be integrated directly into _reserve_slice.


Contributor


This is covered in a TODO comment; I plan to update it in the PR that adds multi-slice support to the JaxTrainer, since it will require editing the call there.

@ryanaoleary
Contributor

ryanaoleary commented Oct 27, 2025

> LGTM. As a next step, should we add a user guide to the Ray docs about how to use Ray w/ TPUs?

Yeah, that sounds good to me; I can write up a guide on the new API. There's this existing overview of TPUs with KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tpu.html, and I could add something similar for the util library / Ray Core support under something like Ray Core > User Guides > Advanced Topics.

@edoakes edoakes enabled auto-merge (squash) October 27, 2025 20:49
@edoakes edoakes merged commit ae0e8e4 into ray-project:master Oct 27, 2025
6 of 7 checks passed
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
edoakes pushed a commit that referenced this pull request Nov 30, 2025
## Description

This PR adds an API overview and example usage for the TPU utility
library added in this PR: #56723.
I added this section to the existing "Using TPUs with KubeRay guide",
because the utility library would be primarily used with KubeRay on GKE
(the values used for default labels are set on GKE with a mutating
webhook).

## Related issues
#55162

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#56723)

This PR adds to the utility library for TPU slice placement group
scheduling. We generalize the 2 phase approach that the JaxTrainer uses
to reserve and schedule the workers on the TPU slice.

ray-project#55162

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Aaron Liang <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Co-authored-by: Ryan O'Leary <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>