
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

πŸ“… Created: 2025-11-24
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

abrarsheikh and others added 30 commits November 6, 2025 13:33
…58345)

## Summary
Adds a new method to expose all downstream deployments that a replica
calls into, enabling dependency graph construction.

## Motivation
Deployments call downstream deployments via handles in two ways:
1. **Stored handles**: Passed to `__init__()` and stored as attributes β†’
`self.model.func.remote()`
2. **Dynamic handles**: Obtained at runtime via
`serve.get_deployment_handle()` β†’ `model.func.remote()`

Previously, there was no way to programmatically discover these
dependencies from a running replica.

## Implementation

### Core Changes
- **`ReplicaActor.list_outbound_deployments()`**: Returns
`List[DeploymentID]` of all downstream deployments
- Recursively inspects user callable attributes to find stored handles
(including nested in dicts/lists)
- Tracks dynamic handles created via `get_deployment_handle()` at
runtime using a callback mechanism

- **Runtime tracking**: Modified `get_deployment_handle()` to register
handles when called from within a replica via
`ReplicaContext._handle_registration_callback`
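
A rough sketch of the recursive discovery idea described above. The function and the `is_handle` predicate are illustrative stand-ins, not Serve's actual internals:

```python
# Hypothetical sketch: walk an object's attributes (including nested dicts/lists)
# and collect anything that the caller-supplied predicate identifies as a handle.
def find_handles(obj, is_handle, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against cycles
        return []
    seen.add(id(obj))
    if is_handle(obj):
        return [obj]
    if isinstance(obj, dict):
        children = list(obj.values())
    elif isinstance(obj, (list, tuple, set)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = list(vars(obj).values())
    else:
        children = []
    handles = []
    for child in children:
        handles.extend(find_handles(child, is_handle, seen))
    return handles
```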


Next PR: ray-project#58350

---------

Signed-off-by: abrar <[email protected]>
This PR:
- Adds a new page to the Ray Train docs called "Monitor your Application" that lists and describes the Prometheus metrics emitted by Ray Train.
- Updates the Ray Core system metrics docs to include some missing metrics.

Link to example build:
https://anyscale-ray--58235.com.readthedocs.build/en/58235/train/user-guides/monitor-your-application.html

Preview Screenshot:

<img width="1630" height="662" alt="Screenshot 2025-10-29 at 2 46 07β€―PM"
src="https://github.com/user-attachments/assets/9ca7ea6d-522b-4033-909a-2ee626960e8a"
/>

---------

Signed-off-by: JasonLi1909 <[email protected]>
Currently, users who import ray.tune can run into an ImportError if they do not have pydantic installed. This is because ray.tune imports ray.train, which requires pydantic. This PR prevents the error by adding pydantic as a Ray Tune dependency.

Relevant user issue: ray-project#58280

---------

Signed-off-by: JasonLi1909 <[email protected]>
The timeout is due to `moto-server`, which mocks S3. Remove the remote storage for now.

---------

Signed-off-by: xgui <[email protected]>
…project#58229)

Ray Train's framework-agnostic collective utilities (`ray.train.collective.barrier`, `ray.train.collective.broadcast_from_rank_zero`) currently time out after 30 minutes if not all ranks join the operation. `ray.train.report` uses these collective utilities internally, so users who don't call report on every rank can run into deadlocks. For example, the report barrier can deadlock with another worker waiting on others to join a backward pass collective.

This PR changes the default Ray Train collective behavior to never
timeout and to only log warning messages about the missing ranks. User
code typically already has timeouts such as NCCL timeouts (also 30
minutes by default), so the extra timeout here doesn't really help and
increases the user burden of keeping track of environment variables to
set when debugging hanging jobs.

New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1`

This PR also generalizes the environment variable name:
`RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`.
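
For example, a user who still wants a hard timeout can set one explicitly (the variable name comes from this PR; the 1800-second value is only illustrative):

```python
import os

# Restore a 30-minute collective timeout instead of the new default of -1 (no timeout).
os.environ["RAY_TRAIN_COLLECTIVE_TIMEOUT_S"] = "1800"
```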

---------

Signed-off-by: Justin Yu <[email protected]>
which depends on datalbuild

Signed-off-by: Lonnie Liu <[email protected]>
Removing unnecessary --strip-extras flag from raydepsets (only updates
depset lock file headers):

[--no-strip-extras](https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--no-strip-extras)
Include extras in the output file.

By default, uv strips extras, as any packages pulled in by the extras
are already included as dependencies in the output file directly.
Further, output files generated with --no-strip-extras cannot be used as
constraints files in install and sync invocations.

---------

Signed-off-by: elliot-barn <[email protected]>
…#58444)

fix a typo (missing semicolon) in authentication_token_loader_test and use the cross-env-compatible `ray::UnsetEnv()` and `ray::setEnv()` in tests

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…rough ray.remote (ray-project#58439)

Fixes static type hints for ActorClass when setting options through ray.remote
Closes ray-project#58401 and ray-project#58402

cc @richardliaw @edoakes @pcmoritz 

---------

Signed-off-by: will.lin <[email protected]>
upgrading long running tests to run on py3.10
Successful release test build:
https://buildkite.com/ray-project/release/builds/66618

---------

Signed-off-by: elliot-barn <[email protected]>
…t stats (ray-project#58422)

## Why These Changes Are Needed

This PR adds a new metric to track the time spent retrieving `RefBundle`
objects during dataset iteration. This metric provides better visibility
into the performance breakdown of batch iteration, specifically
capturing the time spent in `get_next_ref_bundle()` calls within the
`prefetch_batches_locally` function.
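
A minimal sketch of the timing pattern being measured, with illustrative names rather than the actual Ray Data internals:

```python
import time

# Accumulate time spent fetching ref bundles, mirroring the get_ref_bundles-* stats below.
class IterStats:
    def __init__(self):
        self.get_ref_bundles_total = 0.0
        self.get_ref_bundles_max = 0.0

def timed_get_next_ref_bundle(get_next_ref_bundle, stats):
    start = time.perf_counter()
    bundle = get_next_ref_bundle()
    elapsed = time.perf_counter() - start
    stats.get_ref_bundles_total += elapsed
    stats.get_ref_bundles_max = max(stats.get_ref_bundles_max, elapsed)
    return bundle
```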

## Related Issue Number

N/A

## Example

```
  dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}}
```

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <[email protected]>
Signed-off-by: Xinyuan <[email protected]>
## Description:
When token auth is enabled, the dashboard prompts the user to enter a valid auth token and caches it (as a browser cookie). When token-based auth is disabled, the existing behavior is retained.

All dashboard-UI RPCs to the Ray cluster set the authorization header in their requests.

## Screenshots

token popup
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893"
/>


on entering an invalid token
<img width="3440" height="2146" alt="image"
src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e"
/>

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910)

1. **Remove direct environment variable access patterns**
- Replace all instances of `os.getenv("RAY_enable_open_telemetry") ==
"1"`
- Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY`
consistently throughout the codebase

2. **Unify default value format for RAY_enable_open_telemetry**
   - Standardize the default value to `"true"` | `"false"` 
- Previously, the codebase had mixed usage of `"1"` and `"true"`, which
is now unified

3. **Backward compatibility maintained**
- Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY`
constant properly handles both `"1"` and `"true"` values
   - This change will not introduce any breaking behavior
   - The `env_bool` helper function already supports both formats:
```python
RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False)
def env_bool(key, default):
    if key in os.environ:
        return (
            True
            if os.environ[key].lower() == "true" or os.environ[key] == "1"
            else False
        )
    return default
```

---
Most of the current code uses: `RAY_enable_open_telemetry: "1"`

A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"`

https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497

My personal preference is "true"β€”it’s concise and unambiguous. If it’s
"1", I have to think/guess whether it means "true" or "false".

---------

Signed-off-by: justwph <[email protected]>
…y-project#58217)

Change the unit of `scheduler_placement_time` from seconds to milliseconds. The current bucket range of 0.1 s to 2.5 hours doesn't make sense; based on a sample of data, the range we are interested in spans from microseconds to seconds. Thanks @ZacAttack for pointing this out.

```
Note: This is an internal (non–public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```

<img width="1609" height="421"
alt="505491038-c5d81017-b86c-406f-acf4-614560752062"
src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207"
/>

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342)

Found it very hard to parse what was happening here, so helping future
me (or you!).

Also:

- Deleted vestigial `next_resource_seq_no_`.
- Converted the version number for commands from a non-monotonic clock to a monotonically incremented `uint64_t`.
- Added logs when we drop messages with stale versions.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description
There was a typo

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name`

Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456)

## Description

Currently, finalization is scheduled in batches sequentially, i.e., a batch of N adjacent partitions is finalized at once (in a sliding window).

This creates a lensing effect because:

1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators).
2. Adjacent aggregators have a high likelihood of being scheduled on the same node (since they are scheduled at about the same time in sequence).

To address this, this change applies random sampling when choosing the next partitions to finalize, so that partitions are chosen uniformly and concurrent finalization of adjacent partitions is reduced (see the sketch below).
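
A minimal sketch of the sampling change, assuming an illustrative list of pending partition IDs (not the actual Ray Data scheduling code):

```python
import random

# Pick the next batch to finalize uniformly at random instead of a sliding window of
# adjacent partitions, so adjacent partitions rarely finalize concurrently.
def pick_partitions_to_finalize(pending_partitions, batch_size):
    k = min(batch_size, len(pending_partitions))
    return random.sample(pending_partitions, k)
```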


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description
Makes the NotifyGCSRestart RPC fault tolerant and idempotent. There were multiple places where the gcs_subscriber always returned Status::OK(), which made idempotency harder to reason about, and there was dead code for one of the resubscribes, so this also includes a minor cleanup. Added a Python integration test to verify retry behavior; the C++ test was left out since on the raylet side there is nothing to test beyond making a gcs_client RPC call.

---------

Signed-off-by: joshlee <[email protected]>
…ct#58445)

## Summary
Creates a dedicated `tests/unit/` directory for unit tests that don't
require Ray runtime or external dependencies.

## Changes
- Created `tests/unit/` directory structure
- Moved 13 pure unit tests to `tests/unit/`
- Added `conftest.py` with fixtures to prevent `ray.init()` and
`time.sleep()`
- Added `README.md` documenting unit test requirements
- Updated `BUILD.bazel` to run unit tests with "small" size tag

## Test Files Moved
1. test_arrow_type_conversion.py
2. test_block.py
3. test_block_boundaries.py
4. test_data_batch_conversion.py
5. test_datatype.py
6. test_deduping_schema.py
7. test_expression_evaluator.py
8. test_expressions.py
9. test_filename_provider.py
10. test_logical_plan.py
11. test_object_extension.py
12. test_path_util.py
13. test_ruleset.py

These tests are fast (<1s each), isolated (no Ray runtime), and
deterministic (no time.sleep or randomness).

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Description


### [Data] Concurrency Cap Backpressure tuning
- Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: `level`.
- Maintain an asymmetric EWMA of the absolute residual vs. the previous level as a scale proxy: `dev = EWMA(|q - level_prev|)`.
- Define a deadband: `[lower, upper] = [level - K_DEV * dev, level + K_DEV * dev]`.
  - If `q > upper` -> target cap = running - BACKOFF_FACTOR (back off)
  - If `q < lower` -> target cap = running + RAMPUP_FACTOR (ramp up)
  - Else -> target cap = running (hold)
- Clamp to `[1, configured_cap]`; admit iff `running < target cap`. A sketch of this logic follows below.
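
A minimal sketch of this logic, with illustrative smoothing constants (the real operator keeps its own state and tuning):

```python
def update_concurrency_cap(q, level, dev, running, configured_cap,
                           alpha_up=0.3, alpha_down=0.05,
                           K_DEV=2.0, BACKOFF_FACTOR=1, RAMPUP_FACTOR=1):
    # Asymmetric EWMA: react quickly when queued bytes rise, slowly when they fall.
    alpha = alpha_up if q > level else alpha_down
    new_level = level + alpha * (q - level)
    new_dev = dev + alpha * (abs(q - level) - dev)
    lower, upper = new_level - K_DEV * new_dev, new_level + K_DEV * new_dev
    if q > upper:
        target = running - BACKOFF_FACTOR  # back off
    elif q < lower:
        target = running + RAMPUP_FACTOR   # ramp up
    else:
        target = running                   # hold
    target = max(1, min(target, configured_cap))
    return new_level, new_dev, target
```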


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… in read-only mode (ray-project#58460)

This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).

## Description

Autoscaler v2 fails to report Prometheus metrics when operating in read-only mode on KubeRay, with the following `KeyError`:

```
2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types. That is correct for local Ray (which does not set `RAY_NODE_TYPE_NAME`), but incorrect for KubeRay, where `RAY_NODE_TYPE_NAME` is set and `ray_node_type_name` is present and expected.

As a result, in read-only mode the scheduler sees a node type name (e.g., small-group) that does not exist in the populated configs.

This PR fixes the issue by using `ray_node_type_name` when it exists,
and only falling back to node ID when it does not.
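
A minimal sketch of that fallback (field names are illustrative):

```python
def resolve_node_type(ray_node_type_name, node_id):
    # Prefer the KubeRay-provided type name; fall back to the node ID (local Ray).
    return ray_node_type_name or node_id
```
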
## Related issues
Fixes ray-project#58227

Signed-off-by: Rueian <[email protected]>
aslonnie and others added 22 commits November 20, 2025 15:14
rather than ray-project/ray

Signed-off-by: Lonnie Liu <[email protected]>
…r pool scaling (ray-project#58726)

## Summary

Add support for configurable upscaling step size in the actor pool
autoscaler. This enables rapid scale-up and efficient resource
utilization by allowing the autoscaler to scale up multiple actors at
once, instead of scaling up one actor at a time.

## Description

### Background

Currently, the actor pool autoscaler scales up actors one at a time,
which can be slow in certain scenarios:

1. **Slow actor startup**: When actor initialization logic is complex,
actors may remain in pending state for extended periods. The autoscaler
skips scaling when it encounters pending actors, preventing further
scaling.

2. **Elastic cluster with unstable resources**: In environments where
available resources are uncertain, users often configure large
concurrency ranges (e.g., (10,1000)) for `map_batches`. In these cases,
rapid startup and scaling are critical to utilize available resources
efficiently.

### Solution

This PR adds support for a configurable upscaling step size in the actor pool autoscaler. Instead of always scaling up one actor at a time, the autoscaler can now scale up multiple actors based on utilization metrics, while respecting resource constraints.
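
A rough sketch of the multi-actor upscaling idea, with an illustrative utilization formula (not the actual Ray Data autoscaler code):

```python
import math

# Decide how many actors to add in one scaling decision, bounded by the configured
# step size and the pool's maximum size.
def num_actors_to_add(util, target_util, running, step_size, max_pool_size):
    if util <= target_util or running >= max_pool_size:
        return 0
    desired = math.ceil(running * util / target_util) - running
    return max(1, min(desired, step_size, max_pool_size - running))
```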

## Related issues

<!-- Add related issue numbers if applicable -->

Signed-off-by: dragongu <[email protected]>
```
REGRESSION 40.86%: single_client_put_gigabytes (THROUGHPUT) regresses from 18.324991353469613 to 10.83745250237838 in microbenchmark.json
REGRESSION 3.21%: tasks_per_second (THROUGHPUT) regresses from 571.2270630108624 to 552.9028002075252 in benchmarks/many_tasks.json
REGRESSION 1.49%: pgs_per_second (THROUGHPUT) regresses from 17.897951502183457 to 17.630807526575012 in benchmarks/many_pgs.json
REGRESSION 72.70%: dashboard_p99_latency_ms (LATENCY) regresses from 4098.851 to 7078.797 in benchmarks/many_actors.json
REGRESSION 24.76%: stage_3_creation_time (LATENCY) regresses from 1.4687559604644775 to 1.8323874473571777 in stress_tests/stress_test_many_tasks.json
REGRESSION 9.00%: avg_pg_create_time_ms (LATENCY) regresses from 1.503374489489326 to 1.6386530375375787 in stress_tests/stress_test_placement_group.json
REGRESSION 1.36%: stage_3_time (LATENCY) regresses from 1885.3751878738403 to 1911.0371930599213 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.40%: stage_1_avg_iteration_time (LATENCY) regresses from 13.989246034622193 to 14.045441269874573 in stress_tests/stress_test_many_tasks.json
REGRESSION 0.35%: dashboard_p95_latency_ms (LATENCY) regresses from 3829.61 to 3843.186 in benchmarks/many_actors.json
```

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ration (ray-project#58888)

## Description
> Fixed "dictionary changed size during iteration" error that occurs
when shutdown() iterates over task_status_dict while background threads
modify it concurrently.

## Additional information
> Why not use a thread lock? The bug is in the shutdown path, and no other code iterates over the dict.
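
A minimal sketch of the fix pattern (names follow the description above; the cleanup body is a placeholder):

```python
def shutdown_cleanup(task_status_dict):
    # Iterate over a snapshot so background threads can keep mutating the dict safely.
    for task_id, status in list(task_status_dict.items()):
        print(f"finalizing task {task_id}: {status}")  # placeholder for real cleanup
```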

Signed-off-by: Cai Zhanqi <[email protected]>
Co-authored-by: Cai Zhanqi <[email protected]>
## Description
Removed an unused script, which also had leaked tokens.


Signed-off-by: Cindy Zhang <[email protected]>
## Description

### Status Quo
Previously, `.gitignore` files controlled both what gets uploaded to the cluster _and_ what gets uploaded to GitHub. This PR breaks those two responsibilities apart by introducing a `.rayignore` file that handles uploading to the cluster.

### Purpose
Any path or file specified in `.rayignore` is ignored when uploading to the cluster. This is useful for local development when you don't want random files being uploaded and taking up space.

### How it works
By default, directories containing both `.gitignore` and `.rayignore` have both files considered (so existing behavior is preserved). To make `.gitignore` only apply to GitHub and `.rayignore` only apply to cluster uploads (making them independent of each other), you can set the existing `RAY_RUNTIME_ENV_IGNORE_GITIGNORE` environment variable to `1`.
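
For instance, under the defaults described above (a sketch; the working directory is a placeholder):

```python
import ray

# Files matching patterns in .rayignore (and .gitignore, by default) are skipped
# when this working_dir is uploaded to the cluster.
ray.init(runtime_env={"working_dir": "."})
```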

## Related issues
ray-project#53648

## Additional information
Since `.rayignore` is part of the Ray ecosystem, I did not create an env var to disable ignoring altogether. If users do not want to ignore files, they can leave `.rayignore` empty or not create the file at all.

---------

Signed-off-by: iamjustinhsu <[email protected]>
…t#58603)

## Description
`HashShuffleAggregator` currently doesn't break big blocks into smaller blocks (or combine smaller blocks into bigger ones). For large blocks, this can be very problematic. This PR addresses it by using `OutputBlockBuffer` to reshape the blocks back to `data_context.target_max_block_size`.
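
A minimal sketch of the reshaping idea, using row counts as a stand-in for the byte-based sizing done by `OutputBlockBuffer`:

```python
# Re-chunk an iterable of blocks (lists of rows here) into blocks of a target size.
def reshape_blocks(blocks, target_rows):
    pending = []
    for block in blocks:
        pending.extend(block)
        while len(pending) >= target_rows:
            yield pending[:target_rows]
            pending = pending[target_rows:]
    if pending:
        yield pending
```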

## Related issues
None
## Additional information
Encountered this personally with a 180 GiB block, which would OOD.

---------

Signed-off-by: iamjustinhsu <[email protected]>
…ay-project#58355)

### Summary

This PR exposes deployment topology information in Ray Serve instance
details, allowing users to visualize and understand the dependency graph
of deployments within their applications.

### What's Changed

#### New Data Structures

Added two new schema classes to represent deployment topology:

- **`DeploymentNode`** - Represents a node in the deployment DAG

- **`DeploymentTopology`** - Represents the full dependency graph

#### Implementation

**Controller Integration**
- Updated `ServeController` to include `deployment_topology` in
`ApplicationDetails` when serving instance details
- Topology is now accessible via the `get_serve_details()` API

---

**Example Output:**

```python
{
    "app_name": "my_app",
    "ingress_deployment": "Ingress",
    "nodes": {
        "Ingress": {
            "name": "Ingress",
            "is_ingress": True,
            "outbound_deployments": [
                {"name": "ServiceA", "app_name": "my_app"}
            ]
        },
        "ServiceA": {
            "name": "ServiceA",
            "is_ingress": False,
            "outbound_deployments": [
                {"name": "Database", "app_name": "my_app"}
            ]
        },
        "Database": {
            "name": "Database",
            "is_ingress": False,
            "outbound_deployments": []
        }
    }
}
```

---------

Signed-off-by: abrar <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ckpoint` (ray-project#58537)

RLlib uses a nested metric structure (like `"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}"`), which `Result.get_best_checkpoint` doesn't support. Following `ResultGrid.get_best_result()`'s use of `unflattened_lookup`, I've added the same to `get_best_checkpoint`, along with tests for nested structures (and backward compatibility).
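
A hypothetical usage sketch (the metric key follows RLlib's nested result layout; `result` is assumed to be a `ray.train.Result` from an RLlib run):

```python
best_ckpt = result.get_best_checkpoint(
    metric="env_runners/episode_return_mean",
    mode="max",
)
```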

---------

Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
…ownscaling (ray-project#52929)


## Why are these changes needed?

This PR improves the downscaling behavior in Ray Serve by modifying the logic in `_get_replicas_to_stop()` within the default `DeploymentScheduler`.

Previously, the scheduler selected replicas to stop by traversing the
least loaded nodes in ascending order. This often resulted in stopping
replicas that had been scheduled earlier and placed optimally using the
`_best_fit_node()` strategy.

This led to several drawbacks:
- Long-lived replicas, which were scheduled on best-fit nodes, were
removed first β€” leading to inefficient reuse of resources.
- Recently scaled-up replicas, which were placed on less utilized nodes,
were kept longer despite being suboptimal.
- Cold-start overhead increased, as newer replicas were removed before
fully warming up.

This PR reverses the node traversal order during downscaling so that
**more recently added replicas are prioritized for termination**, *in
cases where other conditions (e.g., running state and number of replicas
per node) are equal*. These newer replicas are typically less optimal in
placement and not yet fully warmed up.

Preserving long-lived replicas improves performance stability and
reduces unnecessary resource fragmentation.
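
A tiny sketch of the ordering change, assuming candidates carry a creation timestamp (illustrative names, not the actual scheduler code):

```python
# Among otherwise-equal candidates, stop the most recently added replicas first.
def pick_replicas_to_stop(candidates, num_to_stop):
    newest_first = sorted(candidates, key=lambda r: r["start_time"], reverse=True)
    return newest_first[:num_to_stop]
```
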
## Related issue number

N/A
## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kitae <[email protected]>
no python 3.9 wheels any more

Signed-off-by: Lonnie Liu <[email protected]>
also uses `curl -fsSL` as much as possible

otherwise the release blocker checker is not working.

also removes unnecessary sudos.

Signed-off-by: Lonnie Liu <[email protected]>
## Description
- Expose a `version` parameter on ray.data.read_lance to read historical Lance dataset versions.
- Add a unit test, python/ray/data/tests/test_lance.py::test_lance_read_with_version, that writes an initial dataset, records the initial version, merges new data, and asserts that the default read returns the latest version while read_lance(path, version=initial_version) returns the original columns and rows.

## Related issues
> Closes ray-project#58226 

## Additional information
As mentioned in the original issue, this exposes the `version` parameter in the `read_lance` function. The parameter is passed down to `LanceDatasource`, which is updated as well. Ultimately, `lance.dataset` takes this version param to read the specific version.
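
A hypothetical usage sketch (the path and recorded version are placeholders):

```python
import ray

# Default read returns the latest version of the Lance dataset.
ds_latest = ray.data.read_lance("/tmp/my_dataset.lance")

# Passing a recorded version reads that historical snapshot instead.
ds_v1 = ray.data.read_lance("/tmp/my_dataset.lance", version=1)
```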

---------

Signed-off-by: Simeet Nayan <[email protected]>
Signed-off-by: Simeet Nayan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…oject#58906)

Created by release automation bot.

Update with commit ae94ff4

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…ject#58821)

## Description

### [Data] Parallelize DefaultCollateFn - arrow_batch_to_tensors

In `arrow_batch_to_tensors`, use `make_async_gen` to set up multiple workers that speed up the per-column tensor conversion in `convert_ndarray_to_torch_tensor`, so that `DefaultCollateFn` can be sped up.
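
A rough sketch of the idea, using a thread pool as a stand-in for Ray Data's internal `make_async_gen` (the column dict and conversion helper are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch

# Convert each column of an in-memory batch to a torch tensor using multiple workers.
def columns_to_tensors(batch, num_workers=4):
    def convert(item):
        name, values = item
        return name, torch.as_tensor(np.asarray(values))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(pool.map(convert, batch.items()))
```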



---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
ray-project#58911)

This reverts commit 3663299.

````

[2025-11-22T01:29:32Z]   File "/rayci/python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py", line 1451, in <module>
--
[2025-11-22T01:29:32Z]     assert len(all_panel_ids) == len(
[2025-11-22T01:29:32Z] AssertionError: Duplicated id found. Use unique id for each panel. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 43, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 70, 71, 72, 73, 74, 75, 76, 77, 78, 78, 79, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108]


````

Co-authored-by: Lonnie Liu <[email protected]>
…#58872)

removing flag from raydepsets: `--index-url https://pypi.org/simple` (included by default: https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--default-index)
adding flag to raydepsets: `--no-header`
updating unit tests

This prevents all lock files from being updated when default or config-level flags are changed.

---------

Signed-off-by: elliot-barn <[email protected]>
…er (ray-project#58739)

## Description
The algorithm config isn't updating `rl_module_spec.model_config` when a custom one is specified, which means the learner and the env-runner can end up with different model configs. As a result, the env-runner model wasn't being updated.
The reason this problem wasn't detected previously is that the model state-dict was loaded with `strict=False`.
Therefore, I've added an error check that the missing keys must always be empty, which will detect when the env-runner is missing components from the learner's updated model.

```python
from ray.rllib.algorithms import PPOConfig
from ray.rllib.core.rl_module import RLModuleSpec
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID


config = (
    PPOConfig()
    .environment('CartPole-v1')
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            model_config={
                "head_fcnet_hiddens": (32,), # This used to cause encoder.config.shared mismatch
            }
        )
    )
)

algo = config.build_algo()

learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]
env_runner_modules = algo.env_runner_group.foreach_env_runner(lambda runner: runner.module)

print(f'{learner_module.encoder.config.shared=}')
print(f'{[mod.encoder.config.shared for mod in env_runner_modules]=}')

algo.train()
```

## Related issues
Closes ray-project#58715

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Hassam Ullah Sheikh <[email protected]>
resolves logical merge conflicts and fixes a CI test

Signed-off-by: Lonnie Liu <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #687 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5453.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a daily merge from the master branch into the main branch, primarily focusing on updating and refactoring the project's build system and CI/CD configurations. The changes streamline Docker image creation, standardize Python environments using uv and miniforge, and refine Bazel build rules across various components and platforms.

Highlights

  • Bazel Configuration Updates: Enabled strict action environment by default, added a workspace status command for Linux builds, included /utf-8 cxxopt for Windows, and configured Bazel to ignore warnings from src/ray/thirdparty/.
  • Buildkite CI Pipeline Refactoring: Introduced a new _images.rayci.yml file to centralize Docker image building steps. Python 3.10 was added to several CI jobs, while Python 3.9 was removed from others. Instance types and dependencies for various CI groups (core, data, llm, ml, serve, kuberay, lint, linux_aarch64, macos, others) were updated.
  • Python Environment Tooling Migration: Dockerfiles for forge_arm64 and forge_x86_64 were updated to use ubuntu:22.04 and transitioned from Miniconda to Miniforge and uv for Python environment management, aiming for faster and more reliable dependency resolution.
  • Core C++ API and Runtime Changes: Numerous updates were made across the C++ codebase, including changes to metric recording, IP address retrieval, object store initialization, and task execution logic, reflecting ongoing development and refinement of Ray's core components.
  • Documentation and Project Metadata Updates: The .github/CODEOWNERS and PULL_REQUEST_TEMPLATE.md files were updated. The .rayciversion was incremented from 0.12.0 to 0.21.0, and .readthedocs.yaml was adjusted to use Python 3.10 and a new requirements lock file.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main, containing a large number of changes and refactorings across the codebase. The most significant changes include a major overhaul of the CI/CD pipeline and build system, with a move towards more modular and centralized configurations. This includes the introduction of a new dependency management tool raydepsets, the adoption of uv for Python package management, and refactoring of Bazel build files. There are also substantial improvements to code quality checks through pre-commit hooks. I've identified a couple of issues in the test suite that seem to have been missed during this large refactoring. Please see my comments for details.

Comment on lines 8 to +11

def test_get_docker_image() -> None:
    assert (
        LinuxContainer("test")._get_docker_image()
-       == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test"
+       == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test"


high

The expected docker image name is incorrect. The _get_docker_image method now prepends the RAYCI_BUILD_ID to the docker tag. In the test environment, RAYCI_BUILD_ID is set to a1b2c3d4, so the expected image tag should be ...:a1b2c3d4-test.

    assert (
        LinuxContainer("test")._get_docker_image()
        == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:a1b2c3d4-test"
    )

install_ray_cmds.append(inputs)

with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
    LinuxTesterContainer("team", build_type="debug")


high

The docker_image variable is incorrect. The LinuxTesterContainer will build an image with a tag that includes the RAYCI_BUILD_ID prefix, which is a1b2c3d4 in the test environment. The assertion should check for ...:a1b2c3d4-team.

Suggested change
LinuxTesterContainer("team", build_type="debug")
docker_image = f"{_DOCKER_ECR_REPO}:a1b2c3d4-team"
