
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2025-11-24
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

abrarsheikh and others added 30 commits November 6, 2025 13:33
…58345)

## Summary
Adds a new method to expose all downstream deployments that a replica
calls into, enabling dependency graph construction.

## Motivation
Deployments call downstream deployments via handles in two ways:
1. **Stored handles**: Passed to `__init__()` and stored as attributes β†’
`self.model.func.remote()`
2. **Dynamic handles**: Obtained at runtime via
`serve.get_deployment_handle()` β†’ `model.func.remote()`

Previously, there was no way to programmatically discover these
dependencies from a running replica.

## Implementation

### Core Changes
- **`ReplicaActor.list_outbound_deployments()`**: Returns
`List[DeploymentID]` of all downstream deployments
- Recursively inspects user callable attributes to find stored handles
(including nested in dicts/lists)
- Tracks dynamic handles created via `get_deployment_handle()` at
runtime using a callback mechanism

- **Runtime tracking**: Modified `get_deployment_handle()` to register
handles when called from within a replica via
`ReplicaContext._handle_registration_callback`
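
A rough sketch of the recursive discovery idea described above. The function and the `is_handle` predicate are illustrative stand-ins, not Serve's actual internals:

```python
# Hypothetical sketch: walk an object's attributes (including nested dicts/lists)
# and collect anything that the caller-supplied predicate identifies as a handle.
def find_handles(obj, is_handle, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against cycles
        return []
    seen.add(id(obj))
    if is_handle(obj):
        return [obj]
    if isinstance(obj, dict):
        children = list(obj.values())
    elif isinstance(obj, (list, tuple, set)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = list(vars(obj).values())
    else:
        children = []
    handles = []
    for child in children:
        handles.extend(find_handles(child, is_handle, seen))
    return handles
```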


Next PR: ray-project#58350

---------

Signed-off-by: abrar <[email protected]>
This PR:
- Adds a new page to the Ray Train docs called "Monitor your Application" that lists and describes the Prometheus metrics emitted by Ray Train.
- Updates the Ray Core system metrics docs to include some missing metrics.

Link to example build:
https://anyscale-ray--58235.com.readthedocs.build/en/58235/train/user-guides/monitor-your-application.html

Preview Screenshot:

<img width="1630" height="662" alt="Screenshot 2025-10-29 at 2 46 07β€―PM"
src="https://github.com/user-attachments/assets/9ca7ea6d-522b-4033-909a-2ee626960e8a"
/>

---------

Signed-off-by: JasonLi1909 <[email protected]>
Currently, users who import ray.tune can run into an ImportError if they do not have pydantic installed. This is because ray.tune imports ray.train, which requires pydantic. This PR prevents the error by adding pydantic as a Ray Tune dependency.

Relevant user issue: ray-project#58280

---------

Signed-off-by: JasonLi1909 <[email protected]>
The timeout is due to `moto-server`, which mocks S3. Remove the remote storage for now.

---------

Signed-off-by: xgui <[email protected]>
…project#58229)

Ray Train's framework-agnostic collective utilities (`ray.train.collective.barrier`, `ray.train.collective.broadcast_from_rank_zero`) currently time out after 30 minutes if not all ranks join the operation. `ray.train.report` uses these collective utilities internally, so users who don't call report on every rank can run into deadlocks. For example, the report barrier can deadlock with another worker waiting on others to join a backward pass collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.
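
For example, a user who still wants a hard timeout can set one explicitly (the variable name comes from this PR; the 1800-second value is only illustrative):

```python
import os

# Restore a 30-minute collective timeout instead of the new default of -1 (no timeout).
os.environ["RAY_TRAIN_COLLECTIVE_TIMEOUT_S"] = "1800"
```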

---------

Signed-off-by: Justin Yu <[email protected]>
which depends on datalbuild

Signed-off-by: Lonnie Liu <[email protected]>
Removing unnecessary --strip-extras flag from raydepsets (only updates
depset lock file headers):

[--no-strip-extras](https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--no-strip-extras)
Include extras in the output file.

By default, uv strips extras, as any packages pulled in by the extras
are already included as dependencies in the output file directly.
Further, output files generated with --no-strip-extras cannot be used as
constraints files in install and sync invocations.

---------

Signed-off-by: elliot-barn <[email protected]>
…#58444)

fix a typo (missing semicolon) in authentication_token_loader_test and use the cross-env-compatible `ray::UnsetEnv()` and `ray::setEnv()` in tests

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…rough ray.remote (ray-project#58439)

Fixes static type hints for ActorClass when setting options through ray.remote
Closes ray-project#58401 and ray-project#58402

cc @richardliaw @edoakes @pcmoritz 

---------

Signed-off-by: will.lin <[email protected]>
upgrading long running tests to run on py3.10
Successful release test build:
https://buildkite.com/ray-project/release/builds/66618

---------

Signed-off-by: elliot-barn <[email protected]>
…t stats (ray-project#58422)

## Why These Changes Are Needed

This PR adds a new metric to track the time spent retrieving `RefBundle`
objects during dataset iteration. This metric provides better visibility
into the performance breakdown of batch iteration, specifically
capturing the time spent in `get_next_ref_bundle()` calls within the
`prefetch_batches_locally` function.
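
A minimal sketch of the timing pattern being measured, with illustrative names rather than the actual Ray Data internals:

```python
import time

# Accumulate time spent fetching ref bundles, mirroring the get_ref_bundles-* stats below.
class IterStats:
    def __init__(self):
        self.get_ref_bundles_total = 0.0
        self.get_ref_bundles_max = 0.0

def timed_get_next_ref_bundle(get_next_ref_bundle, stats):
    start = time.perf_counter()
    bundle = get_next_ref_bundle()
    elapsed = time.perf_counter() - start
    stats.get_ref_bundles_total += elapsed
    stats.get_ref_bundles_max = max(stats.get_ref_bundles_max, elapsed)
    return bundle
```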

## Related Issue Number

N/A

## Example

```
  dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}}
```

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <[email protected]>
Signed-off-by: Xinyuan <[email protected]>
## Description:
When token auth is enabled, the dashboard prompts the user to enter a valid auth token and caches it (as a browser cookie). When token-based auth is disabled, the existing behavior is retained.

All dashboard-UI RPCs to the Ray cluster set the authorization header in their requests.

## Screenshots

token popup
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893"
/>


on entering an invalid token
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e"
/>

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910)

1. **Remove direct environment variable access patterns**
- Replace all instances of `os.getenv("RAY_enable_open_telemetry") ==
"1"`
- Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY`
consistently throughout the codebase

2. **Unify default value format for RAY_enable_open_telemetry**
   - Standardize the default value to `"true"` | `"false"` 
- Previously, the codebase had mixed usage of `"1"` and `"true"`, which
is now unified

3. **Backward compatibility maintained**
- Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY`
constant properly handles both `"1"` and `"true"` values
   - This change will not introduce any breaking behavior
   - The `env_bool` helper function already supports both formats:
```python
RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False)
def env_bool(key, default):
    if key in os.environ:
        return (
            True
            if os.environ[key].lower() == "true" or os.environ[key] == "1"
            else False
        )
    return default
```

---
Most of the current code uses: `RAY_enable_open_telemetry: "1"`

A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"`

https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497

My personal preference is "true"β€”it’s concise and unambiguous. If it’s
"1", I have to think/guess whether it means "true" or "false".

---------

Signed-off-by: justwph <[email protected]>
…y-project#58217)

Change the unit of `scheduler_placement_time` from seconds to milliseconds. The current bucket range of 0.1 s to 2.5 hours doesn't make sense; based on a sample of data, the range we are interested in spans from microseconds to seconds. Thanks @ZacAttack for pointing this out.

```
Note: This is an internal (non–public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```

<img width="1609" height="421"
alt="505491038-c5d81017-b86c-406f-acf4-614560752062"
src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207"
/>

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342)

Found it very hard to parse what was happening here, so helping future
me (or you!).

Also:

- Deleted vestigial `next_resource_seq_no_`.
- Converted the version number for commands from a non-monotonic clock to a monotonically incremented `uint64_t`.
- Added logs when we drop messages with stale versions.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description
There was a typo

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name`

Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456)

## Description

Currently, finalization is scheduled in batches sequentially, i.e., a batch of N adjacent partitions is finalized at once (in a sliding window).

This creates a lensing effect because:

1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators).
2. Adjacent aggregators have a high likelihood of being scheduled on the same node (since they are scheduled at about the same time in sequence).

To address this, this change applies random sampling when choosing the next partitions to finalize, so that partitions are chosen uniformly and concurrent finalization of adjacent partitions is reduced (see the sketch below).
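
A minimal sketch of the sampling change, assuming an illustrative list of pending partition IDs (not the actual Ray Data scheduling code):

```python
import random

# Pick the next batch to finalize uniformly at random instead of a sliding window of
# adjacent partitions, so adjacent partitions rarely finalize concurrently.
def pick_partitions_to_finalize(pending_partitions, batch_size):
    k = min(batch_size, len(pending_partitions))
    return random.sample(pending_partitions, k)
```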


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description
Makes the NotifyGCSRestart RPC fault tolerant and idempotent. There were multiple places where the gcs_subscriber always returned Status::OK(), which made idempotency harder to reason about, and there was dead code for one of the resubscribes, so this also includes a minor cleanup. Added a Python integration test to verify retry behavior; the C++ test was left out since on the raylet side there is nothing to test beyond making a gcs_client RPC call.

---------

Signed-off-by: joshlee <[email protected]>
…ct#58445)

## Summary
Creates a dedicated `tests/unit/` directory for unit tests that don't
require Ray runtime or external dependencies.

## Changes
- Created `tests/unit/` directory structure
- Moved 13 pure unit tests to `tests/unit/`
- Added `conftest.py` with fixtures to prevent `ray.init()` and
`time.sleep()`
- Added `README.md` documenting unit test requirements
- Updated `BUILD.bazel` to run unit tests with "small" size tag

## Test Files Moved
1. test_arrow_type_conversion.py
2. test_block.py
3. test_block_boundaries.py
4. test_data_batch_conversion.py
5. test_datatype.py
6. test_deduping_schema.py
7. test_expression_evaluator.py
8. test_expressions.py
9. test_filename_provider.py
10. test_logical_plan.py
11. test_object_extension.py
12. test_path_util.py
13. test_ruleset.py

These tests are fast (<1s each), isolated (no Ray runtime), and
deterministic (no time.sleep or randomness).

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Description


### [Data] Concurrency Cap Backpressure tuning
- Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: `level`.
- Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: `dev = EWMA(|q - level_prev|)`.
- Define a deadband: `[lower, upper] = [level - K_DEV * dev, level + K_DEV * dev]`.
  - If `q > upper` -> target cap = running - BACKOFF_FACTOR (back off)
  - If `q < lower` -> target cap = running + RAMPUP_FACTOR (ramp up)
  - Else -> target cap = running (hold)
- Clamp to `[1, configured_cap]`; admit iff `running < target cap`. A sketch of this logic follows below.
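
A minimal sketch of this logic, with illustrative smoothing constants (the real operator keeps its own state and tuning):

```python
def update_concurrency_cap(q, level, dev, running, configured_cap,
                           alpha_up=0.3, alpha_down=0.05,
                           K_DEV=2.0, BACKOFF_FACTOR=1, RAMPUP_FACTOR=1):
    # Asymmetric EWMA: react quickly when queued bytes rise, slowly when they fall.
    alpha = alpha_up if q > level else alpha_down
    new_level = level + alpha * (q - level)
    new_dev = dev + alpha * (abs(q - level) - dev)
    lower, upper = new_level - K_DEV * new_dev, new_level + K_DEV * new_dev
    if q > upper:
        target = running - BACKOFF_FACTOR  # back off
    elif q < lower:
        target = running + RAMPUP_FACTOR   # ramp up
    else:
        target = running                   # hold
    target = max(1, min(target, configured_cap))
    return new_level, new_dev, target
```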


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… in read-only mode (ray-project#58460)

This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).

## Description

Autoscaler v2 fails to report Prometheus metrics when operating in read-only mode on KubeRay, with the following `KeyError`:

```
2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types. That is correct for local Ray (which does not set `RAY_NODE_TYPE_NAME`), but incorrect for KubeRay, where `RAY_NODE_TYPE_NAME` is set and `ray_node_type_name` is present and expected.

As a result, in read-only mode the scheduler sees a node type name (e.g., small-group) that does not exist in the populated configs.

This PR fixes the issue by using `ray_node_type_name` when it exists,
and only falling back to node ID when it does not.
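
A minimal sketch of that fallback (field names are illustrative):

```python
def resolve_node_type(ray_node_type_name, node_id):
    # Prefer the KubeRay-provided type name; fall back to the node ID (local Ray).
    return ray_node_type_name or node_id
```
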
## Related issues
Fixes ray-project#58227

Signed-off-by: Rueian <[email protected]>
aslonnie and others added 22 commits November 20, 2025 15:14
rather than ray-project/ray

Signed-off-by: Lonnie Liu <[email protected]>
…r pool scaling (ray-project#58726)

## Summary

Add support for configurable upscaling step size in the actor pool
autoscaler. This enables rapid scale-up and efficient resource
utilization by allowing the autoscaler to scale up multiple actors at
once, instead of scaling up one actor at a time.

## Description

### Background

Currently, the actor pool autoscaler scales up actors one at a time,
which can be slow in certain scenarios:

1. **Slow actor startup**: When actor initialization logic is complex,
actors may remain in pending state for extended periods. The autoscaler
skips scaling when it encounters pending actors, preventing further
scaling.

2. **Elastic cluster with unstable resources**: In environments where
available resources are uncertain, users often configure large
concurrency ranges (e.g., (10,1000)) for `map_batches`. In these cases,
rapid startup and scaling are critical to utilize available resources
efficiently.

### Solution

This PR adds support for a configurable upscaling step size in the actor pool autoscaler. Instead of always scaling up one actor at a time, the autoscaler can now scale up multiple actors based on utilization metrics, while respecting resource constraints.
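
A rough sketch of the multi-actor upscaling idea, with an illustrative utilization formula (not the actual Ray Data autoscaler code):

```python
import math

# Decide how many actors to add in one scaling decision, bounded by the configured
# step size and the pool's maximum size.
def num_actors_to_add(util, target_util, running, step_size, max_pool_size):
    if util <= target_util or running >= max_pool_size:
        return 0
    desired = math.ceil(running * util / target_util) - running
    return max(1, min(desired, step_size, max_pool_size - running))
```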

## Related issues

<!-- Add related issue numbers if applicable -->

Signed-off-by: dragongu <[email protected]>
```
REGRESSION 40.86%: single_client_put_gigabytes (THROUGHPUT) regresses from 18.324991353469613 to 10.83745250237838 in microbenchmark.json
REGRESSION 3.21%: tasks_per_second (THROUGHPUT) regresses from 571.2270630108624 to 552.9028002075252 in benchmarks/many_tasks.json
REGRESSION 1.49%: pgs_per_second (THROUGHPUT) regresses from 17.897951502183457 to 17.630807526575012 in benchmarks/many_pgs.json
REGRESSION 72.70%: dashboard_p99_latency_ms (LATENCY) regresses from 4098.851 to 7078.797 in benchmarks/many_actors.json
REGRESSION 24.76%: stage_3_creation_time (LATENCY) regresses from 1.4687559604644775 to 1.8323874473571777 in stress_tests/stress_test_many_tasks.json
REGRESSION 9.00%: avg_pg_create_time_ms (LATENCY) regresses from 1.503374489489326 to 1.6386530375375787 in stress_tests/stress_test_placement_group.json
REGRESSION 1.36%: stage_3_time (LATENCY) regresses from 1885.3751878738403 to 1911.0371930599213 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.40%: stage_1_avg_iteration_time (LATENCY) regresses from 13.989246034622193 to 14.045441269874573 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.35%: dashboard_p95_latency_ms (LATENCY) regresses from 3829.61 to 3843.186 in benchmarks/many_actors.json
```

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ration (ray-project#58888)

## Description
> Fixed "dictionary changed size during iteration" error that occurs
when shutdown() iterates over task_status_dict while background threads
modify it concurrently.

## Additional information
> Why not use a thread lock? The bug is in the shutdown path, and no other code iterates over the dict.
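
A minimal sketch of the fix pattern (names follow the description above; the cleanup body is a placeholder):

```python
def shutdown_cleanup(task_status_dict):
    # Iterate over a snapshot so background threads can keep mutating the dict safely.
    for task_id, status in list(task_status_dict.items()):
        print(f"finalizing task {task_id}: {status}")  # placeholder for real cleanup
```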

Signed-off-by: Cai Zhanqi <[email protected]>
Co-authored-by: Cai Zhanqi <[email protected]>
## Description
Removed an unused script, which also had leaked tokens.


Signed-off-by: Cindy Zhang <[email protected]>
## Description

### Status Quo
Previously, `.gitignore` files controlled both what gets uploaded to the cluster _and_ what gets uploaded to GitHub. This PR breaks those two responsibilities apart by introducing a `.rayignore` file that handles uploading to the cluster.

### Purpose
Any path or file specified in `.rayignore` is ignored when uploading to the cluster. This is useful for local development when you don't want random files being uploaded and taking up space.

### How it works
By default, directories containing both `.gitignore` and `.rayignore` have both files considered (so existing behavior is preserved). To make `.gitignore` only apply to GitHub and `.rayignore` only apply to cluster uploads (making them independent of each other), you can set the existing `RAY_RUNTIME_ENV_IGNORE_GITIGNORE` environment variable to `1`.
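
For instance, under the defaults described above (a sketch; the working directory is a placeholder):

```python
import ray

# Files matching patterns in .rayignore (and .gitignore, by default) are skipped
# when this working_dir is uploaded to the cluster.
ray.init(runtime_env={"working_dir": "."})
```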

## Related issues
ray-project#53648

## Additional information
Since `.rayignore` is part of the Ray ecosystem, I did not create an env var to disable ignoring altogether. If users do not want to ignore files, they can leave `.rayignore` empty or not create the file at all.

---------

Signed-off-by: iamjustinhsu <[email protected]>
…t#58603)

## Description
`HashShuffleAggregator` currently doesn't break big blocks into smaller blocks (or combine smaller blocks into bigger ones). For large blocks, this can be very problematic. This PR addresses it by using `OutputBlockBuffer` to reshape the blocks back to `data_context.target_max_block_size`.
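
A minimal sketch of the reshaping idea, using row counts as a stand-in for the byte-based sizing done by `OutputBlockBuffer`:

```python
# Re-chunk an iterable of blocks (lists of rows here) into blocks of a target size.
def reshape_blocks(blocks, target_rows):
    pending = []
    for block in blocks:
        pending.extend(block)
        while len(pending) >= target_rows:
            yield pending[:target_rows]
            pending = pending[target_rows:]
    if pending:
        yield pending
```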

## Related issues
None
## Additional information
Encountered this personally with a 180 GiB block, which would OOD.

---------

Signed-off-by: iamjustinhsu <[email protected]>
…ay-project#58355)

### Summary

This PR exposes deployment topology information in Ray Serve instance
details, allowing users to visualize and understand the dependency graph
of deployments within their applications.

### What's Changed

#### New Data Structures

Added two new schema classes to represent deployment topology:

- **`DeploymentNode`** - Represents a node in the deployment DAG

- **`DeploymentTopology`** - Represents the full dependency graph

#### Implementation

**Controller Integration**
- Updated `ServeController` to include `deployment_topology` in
`ApplicationDetails` when serving instance details
- Topology is now accessible via the `get_serve_details()` API

---

**Example Output:**

```python
{
    "app_name": "my_app",
    "ingress_deployment": "Ingress",
    "nodes": {
        "Ingress": {
            "name": "Ingress",
            "is_ingress": True,
            "outbound_deployments": [
                {"name": "ServiceA", "app_name": "my_app"}
            ]
        },
        "ServiceA": {
            "name": "ServiceA",
            "is_ingress": False,
            "outbound_deployments": [
                {"name": "Database", "app_name": "my_app"}
            ]
        },
        "Database": {
            "name": "Database",
            "is_ingress": False,
            "outbound_deployments": []
        }
    }
}
```

---------

Signed-off-by: abrar <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ckpoint` (ray-project#58537)

RLlib uses a nested metric structure (like `"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}"`), which `Result.get_best_checkpoint` doesn't support. Following `ResultGrid.get_best_result()`'s use of `unflattened_lookup`, I've added the same to `get_best_checkpoint`, along with tests for nested structures (and backward compatibility).
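
A hypothetical usage sketch (the metric key follows RLlib's nested result layout; `result` is assumed to be a `ray.train.Result` from an RLlib run):

```python
best_ckpt = result.get_best_checkpoint(
    metric="env_runners/episode_return_mean",
    mode="max",
)
```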

---------

Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
…ownscaling (ray-project#52929)


## Why are these changes needed?

This PR improves the downscaling behavior in Ray Serve by modifying the logic in `_get_replicas_to_stop()` within the default `DeploymentScheduler`.

Previously, the scheduler selected replicas to stop by traversing the
least loaded nodes in ascending order. This often resulted in stopping
replicas that had been scheduled earlier and placed optimally using the
`_best_fit_node()` strategy.

This led to several drawbacks:
- Long-lived replicas, which were scheduled on best-fit nodes, were
removed first β€” leading to inefficient reuse of resources.
- Recently scaled-up replicas, which were placed on less utilized nodes,
were kept longer despite being suboptimal.
- Cold-start overhead increased, as newer replicas were removed before
fully warming up.

This PR reverses the node traversal order during downscaling so that
**more recently added replicas are prioritized for termination**, *in
cases where other conditions (e.g., running state and number of replicas
per node) are equal*. These newer replicas are typically less optimal in
placement and not yet fully warmed up.

Preserving long-lived replicas improves performance stability and
reduces unnecessary resource fragmentation.
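
A tiny sketch of the ordering change, assuming candidates carry a creation timestamp (illustrative names, not the actual scheduler code):

```python
# Among otherwise-equal candidates, stop the most recently added replicas first.
def pick_replicas_to_stop(candidates, num_to_stop):
    newest_first = sorted(candidates, key=lambda r: r["start_time"], reverse=True)
    return newest_first[:num_to_stop]
```
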
## Related issue number

N/A
## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kitae <[email protected]>
no python 3.9 wheels any more

Signed-off-by: Lonnie Liu <[email protected]>
also uses `curl -fsSL` as much as possible

otherwise the release blocker checker is not working.

also removes unnecessary sudos.

Signed-off-by: Lonnie Liu <[email protected]>
## Description
- Expose a `version` parameter on ray.data.read_lance to read historical Lance dataset versions.
- Add a unit test, python/ray/data/tests/test_lance.py::test_lance_read_with_version, that writes an initial dataset, records the initial version, merges new data, and asserts that the default read returns the latest version while read_lance(path, version=initial_version) returns the original columns and rows.

## Related issues
> Closes ray-project#58226 

## Additional information
As mentioned in the original issue, this exposes the `version` parameter in the `read_lance` function. The parameter is passed down to `LanceDatasource`, which is updated as well. Ultimately, `lance.dataset` takes this version param to read the specific version.
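
A hypothetical usage sketch (the path and recorded version are placeholders):

```python
import ray

# Default read returns the latest version of the Lance dataset.
ds_latest = ray.data.read_lance("/tmp/my_dataset.lance")

# Passing a recorded version reads that historical snapshot instead.
ds_v1 = ray.data.read_lance("/tmp/my_dataset.lance", version=1)
```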

---------

Signed-off-by: Simeet Nayan <[email protected]>
Signed-off-by: Simeet Nayan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…oject#58906)

Created by release automation bot.

Update with commit ae94ff4

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ject#58821)

## Description

### [Data] Parallelize DefaultCollateFn - arrow_batch_to_tensors

In `arrow_batch_to_tensors`, use `make_async_gen` to set up multiple workers that speed up the per-column tensor conversion in `convert_ndarray_to_torch_tensor`, so that `DefaultCollateFn` can be sped up.
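
A rough sketch of the idea, using a thread pool as a stand-in for Ray Data's internal `make_async_gen` (the column dict and conversion helper are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch

# Convert each column of an in-memory batch to a torch tensor using multiple workers.
def columns_to_tensors(batch, num_workers=4):
    def convert(item):
        name, values = item
        return name, torch.as_tensor(np.asarray(values))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(pool.map(convert, batch.items()))
```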



---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
ray-project#58911)

This reverts commit 3663299.

````

[2025-11-22T01:29:32Z]   File "/rayci/python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py", line 1451, in <module>
--
[2025-11-22T01:29:32Z]     assert len(all_panel_ids) == len(
[2025-11-22T01:29:32Z] AssertionError: Duplicated id found. Use unique id for each panel. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 43, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 70, 71, 72, 73, 74, 75, 76, 77, 78, 78, 79, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108]


````

Co-authored-by: Lonnie Liu <[email protected]>
…#58872)

removing flag from raydepsets: `--index-url https://pypi.org/simple` (included by default: https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--default-index)
adding flag to raydepsets: `--no-header`
updating unit tests

This prevents all lock files from being updated when default or config-level flags are changed.

---------

Signed-off-by: elliot-barn <[email protected]>
…er (ray-project#58739)

## Description
The algorithm config isn't updating `rl_module_spec.model_config` when a custom one is specified, which means the learner and the env-runner can end up with different model configs. As a result, the env-runner model wasn't being updated.
The reason this problem wasn't detected previously is that the model state-dict was loaded with `strict=False`.
Therefore, I've added an error check that the missing keys must always be empty, which will detect when the env-runner is missing components from the learner's updated model.

```python
from ray.rllib.algorithms import PPOConfig
from ray.rllib.core.rl_module import RLModuleSpec
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID


config = (
    PPOConfig()
    .environment('CartPole-v1')
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            model_config={
                "head_fcnet_hiddens": (32,), # This used to cause encoder.config.shared mismatch
            }
        )
    )
)

algo = config.build_algo()

learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]
env_runner_modules = algo.env_runner_group.foreach_env_runner(lambda runner: runner.module)

print(f'{learner_module.encoder.config.shared=}')
print(f'{[mod.encoder.config.shared for mod in env_runner_modules]=}')

algo.train()
```

## Related issues
Closes ray-project#58715

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Hassam Ullah Sheikh <[email protected]>
resolves logical merge conflicts and fixes a CI test

Signed-off-by: Lonnie Liu <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #687 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5453.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a daily merge from the master branch into the main branch, primarily focusing on updating and refactoring the project's build system and CI/CD configurations. The changes streamline Docker image creation, standardize Python environments using uv and miniforge, and refine Bazel build rules across various components and platforms.

Highlights

  • Bazel Configuration Updates: Enabled strict action environment by default, added a workspace status command for Linux builds, included /utf-8 cxxopt for Windows, and configured Bazel to ignore warnings from src/ray/thirdparty/.
  • Buildkite CI Pipeline Refactoring: Introduced a new _images.rayci.yml file to centralize Docker image building steps. Python 3.10 was added to several CI jobs, while Python 3.9 was removed from others. Instance types and dependencies for various CI groups (core, data, llm, ml, serve, kuberay, lint, linux_aarch64, macos, others) were updated.
  • Python Environment Tooling Migration: Dockerfiles for forge_arm64 and forge_x86_64 were updated to use ubuntu:22.04 and transitioned from Miniconda to Miniforge and uv for Python environment management, aiming for faster and more reliable dependency resolution.
  • Core C++ API and Runtime Changes: Numerous updates were made across the C++ codebase, including changes to metric recording, IP address retrieval, object store initialization, and task execution logic, reflecting ongoing development and refinement of Ray's core components.
  • Documentation and Project Metadata Updates: The .github/CODEOWNERS and PULL_REQUEST_TEMPLATE.md files were updated. The .rayciversion was incremented from 0.12.0 to 0.21.0, and .readthedocs.yaml was adjusted to use Python 3.10 and a new requirements lock file.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main, containing a large number of changes and refactorings across the codebase. The most significant changes include a major overhaul of the CI/CD pipeline and build system, with a move towards more modular and centralized configurations. This includes the introduction of a new dependency management tool raydepsets, the adoption of uv for Python package management, and refactoring of Bazel build files. There are also substantial improvements to code quality checks through pre-commit hooks. I've identified a couple of issues in the test suite that seem to have been missed during this large refactoring. Please see my comments for details.

Comment on lines 8 to +11

def test_get_docker_image() -> None:
    assert (
        LinuxContainer("test")._get_docker_image()
-       == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test"
+       == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test"


high

The expected docker image name is incorrect. The _get_docker_image method now prepends the RAYCI_BUILD_ID to the docker tag. In the test environment, RAYCI_BUILD_ID is set to a1b2c3d4, so the expected image tag should be ...:a1b2c3d4-test.

    assert (
        LinuxContainer("test")._get_docker_image()
        == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:a1b2c3d4-test"
    )

install_ray_cmds.append(inputs)

with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
    LinuxTesterContainer("team", build_type="debug")


high

The docker_image variable is incorrect. The LinuxTesterContainer will build an image with a tag that includes the RAYCI_BUILD_ID prefix, which is a1b2c3d4 in the test environment. The assertion should check for ...:a1b2c3d4-team.

Suggested change
LinuxTesterContainer("team", build_type="debug")
docker_image = f"{_DOCKER_ECR_REPO}:a1b2c3d4-team"
