daily merge: master → main 2025-11-24 #687
Conversation
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <[email protected]>
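A minimal sketch of the recursive handle-discovery idea described above. `DeploymentHandle` here is a local stand-in class, not Ray Serve's actual handle type, and the real `ReplicaActor.list_outbound_deployments()` additionally merges in handles registered dynamically at runtime.

```python
from typing import Any, List, Optional, Set


class DeploymentHandle:
    """Stand-in for a Ray Serve handle; only used to illustrate the idea."""

    def __init__(self, deployment_id: str):
        self.deployment_id = deployment_id


def find_handles(obj: Any, seen: Optional[Set[int]] = None) -> List[str]:
    """Recursively walk attributes and containers, collecting deployment IDs."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return []
    seen.add(id(obj))

    if isinstance(obj, DeploymentHandle):
        return [obj.deployment_id]

    found: List[str] = []
    if isinstance(obj, dict):
        values = obj.values()
    elif isinstance(obj, (list, tuple, set)):
        values = obj
    elif hasattr(obj, "__dict__"):
        values = vars(obj).values()
    else:
        return found
    for value in values:
        found.extend(find_handles(value, seen))
    return found


class MyDeployment:
    def __init__(self):
        self.model = DeploymentHandle("Model")            # stored handle
        self.helpers = {"db": DeploymentHandle("Database")}  # nested in a dict


print(find_handles(MyDeployment()))  # ['Model', 'Database']
```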
This PR: - adds a new page to the Ray Train docs called "Monitor your Application" that lists and describes the Prometheus metrics emitted by Ray Train - updates the Ray Core system metrics docs to include some missing metrics Link to example build: https://anyscale-ray--58235.com.readthedocs.build/en/58235/train/user-guides/monitor-your-application.html Preview Screenshot: <img width="1630" height="662" alt="Screenshot 2025-10-29 at 2 46 07 PM" src="https://github.com/user-attachments/assets/9ca7ea6d-522b-4033-909a-2ee626960e8a" /> --------- Signed-off-by: JasonLi1909 <[email protected]>
Currently, users that import ray.tune can run into an ImportError if they do not have pydantic installed. This is because ray.tune imports ray.train, which requires pydantic. This PR prevents this error by adding pydantic as a ray tune dependency. Relevant user issue: ray-project#58280 --------- Signed-off-by: JasonLi1909 <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]>
…-project#58330) Signed-off-by: Nikhil Ghosh <[email protected]>
…project#58229) Ray Train's framework-agnostic collective utilities (`ray.train.collective.barrier`, `ray.train.collective.broadcast_from_rank_zero`) currently time out after 30 minutes if not all ranks join the operation. `ray.train.report` uses these collective utilities internally, so users who don't call report on every rank can run into deadlocks. For example, the report barrier can deadlock with another worker waiting on others to join a backward pass collective. This PR changes the default Ray Train collective behavior to never timeout and to only log warning messages about the missing ranks. User code typically already has timeouts such as NCCL timeouts (also 30 minutes by default), so the extra timeout here doesn't really help and increases the user burden of keeping track of environment variables to set when debugging hanging jobs. New default: `RAY_TRAIN_COLLECTIVE_TIMEOUT_S=-1` This PR also generalizes the environment variable name: `RAY_TRAIN_REPORT_BARRIER` -> `RAY_TRAIN_COLLECTIVE`. --------- Signed-off-by: Justin Yu <[email protected]>
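A hedged usage sketch of the new behavior. The module path and env var name come from the description above; the exact function signatures and the worker setup are assumptions, so treat this as an illustration rather than the canonical API.

```python
import os

# New default per this PR: -1 means the collective never times out and only
# logs warnings about missing ranks. Set a positive value (seconds) to restore
# a hard timeout; this must be set before the training workers start.
os.environ["RAY_TRAIN_COLLECTIVE_TIMEOUT_S"] = "1800"

from ray.train.collective import barrier, broadcast_from_rank_zero  # noqa: E402


def train_fn(config):
    # Every rank must reach the collective; otherwise the warning (or the
    # timeout, if one is configured) kicks in.
    barrier()
    # Assumed argument shape: broadcast a small payload from rank 0.
    hyperparams = broadcast_from_rank_zero({"lr": 1e-3})
    return hyperparams
```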
which depends on datalbuild Signed-off-by: Lonnie Liu <[email protected]>
Removing unnecessary --strip-extras flag from raydepsets (only updates depset lock file headers): [--strip-extras](https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--no-strip-extras) Include extras in the output file. By default, uv strips extras, as any packages pulled in by the extras are already included as dependencies in the output file directly. Further, output files generated with --no-strip-extras cannot be used as constraints files in install and sync invocations. --------- Signed-off-by: elliot-barn <[email protected]>
…#58444) fix a typo (missing semicolon) in authentication_token_loader_test and use cross env compatible `ray::UnsetEnv()` and `ray::setEnv()` in tests --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…rough ray.remote (ray-project#58439) Fixes static type hints for ActorClass when setting options through ray.remote Closes ray-project#58401 and ray-project#58402 cc @richardliaw @edoakes @pcmoritz --------- Signed-off-by: will.lin <[email protected]>
upgrading long running tests to run on py3.10 Successful release test build: https://buildkite.com/ray-project/release/builds/66618 --------- Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
…t stats (ray-project#58422) ## Why These Changes Are Needed This PR adds a new metric to track the time spent retrieving `RefBundle` objects during dataset iteration. This metric provides better visibility into the performance breakdown of batch iteration, specifically capturing the time spent in `get_next_ref_bundle()` calls within the `prefetch_batches_locally` function. ## Related Issue Number N/A ## Example ``` dataloader/train = {'producer_throughput': 8361.841782656593, 'iter_stats': {'prefetch_block-avg': inf, 'prefetch_block-min': inf, 'prefetch_block-max': 0, 'prefetch_block-total': 0, 'get_ref_bundles-avg': 0.05172277254545271, 'get_ref_bundles-min': 1.1991999997462699e-05, 'get_ref_bundles-max': 11.057470971999976, 'get_ref_bundles-total': 15.361663445999454, 'fetch_block-avg': 0.31572694455743233, 'fetch_block-min': 0.0006362799999806157, 'fetch_block-max': 2.1665870369999993, 'fetch_block-total': 93.45517558899996, 'block_to_batch-avg': 0.001048687573988573, 'block_to_batch-min': 2.10620000302697e-05, 'block_to_batch-max': 0.049948245999985375, 'block_to_batch-total': 2.048086831999683, 'format_batch-avg': 0.0001013781433686053, 'format_batch-min': 1.415700000961806e-05, 'format_batch-max': 0.009682661999988795, 'format_batch-total': 0.19799151399888615, 'collate-avg': 0.01303446213312943, 'collate-min': 0.00025646699998560507, 'collate-max': 0.9855495820000328, 'collate-total': 25.456304546001775, 'finalize-avg': 0.012211385266257683, 'finalize-min': 0.004209667999987232, 'finalize-max': 0.3785081949999949, 'finalize-total': 23.848835425001255, 'time_spent_blocked-avg': 0.04783407008137157, 'time_spent_blocked-min': 1.2316999971062614e-05, 'time_spent_blocked-max': 12.46102861700001, 'time_spent_blocked-total': 93.46777293900004, 'time_spent_training-avg': 0.015053571562211652, 'time_spent_training-min': 1.3704999958008557e-05, 'time_spent_training-max': 1.079616685000019, 'time_spent_training-total': 29.399625260999358}} ``` ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: xgui <[email protected]> Signed-off-by: Xinyuan <[email protected]>
## Description: When token auth is enabled, the dashboard prompts the user to enter the valid auth token and caches it (as a browser cookie). When token-based auth is disabled, existing behaviour is retained. All dashboard-UI-based RPCs to the Ray cluster set the authorization header in their requests. ## Screenshots token popup <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/004c23a3-991e-4a2c-a2ad-5a0ce2e60893" /> on entering an invalid token <img width="3440" height="2146" alt="image" src="https://github.com/user-attachments/assets/7183a798-ceb7-4657-8706-39ce5fe8e61e" /> --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…ants (ray-project#57910) 1. **Remove direct environment variable access patterns** - Replace all instances of `os.getenv("RAY_enable_open_telemetry") == "1"` - Standardize to use `ray_constants.RAY_ENABLE_OPEN_TELEMETRY` consistently throughout the codebase 2. **Unify default value format for RAY_enable_open_telemetry** - Standardize the default value to `"true"` | `"false"` - Previously, the codebase had mixed usage of `"1"` and `"true"`, which is now unified 3. **Backward compatibility maintained** - Carefully verified that the existing `RAY_ENABLE_OPEN_TELEMETRY` constant properly handles both `"1"` and `"true"` values - This change will not introduce any breaking behavior - The `env_bool` helper function already supports both formats: ```python RAY_ENABLE_OPEN_TELEMETRY = env_bool("RAY_enable_open_telemetry", False) def env_bool(key, default): if key in os.environ: return ( True if os.environ[key].lower() == "true" or os.environ[key] == "1" else False ) return default ``` --- Most of the current code uses: `RAY_enable_open_telemetry: "1"` A smaller portion (not zero) uses: `RAY_enable_open_telemetry: "true"` https://github.com/ray-project/ray/blob/fe7ad00f9720a722fde5fecba5bb681234bcdb63/python/ray/tests/test_metrics_agent.py#L497 My personal preference is "true": it's concise and unambiguous. If it's "1", I have to think/guess whether it means "true" or "false". --------- Signed-off-by: justwph <[email protected]>
…y-project#58217) Change the unit of `scheduler_placement_time` from seconds to milliseconds. The current bucket is in the range of 0.1s to 2.5 hours, which doesn't make sense. According to a sample of data, the range we are interested in would be from microseconds to seconds. Thanks @ZacAttack for pointing this out. ``` Note: This is an internal (non-public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice. ``` <img width="1609" height="421" alt="505491038-c5d81017-b86c-406f-acf4-614560752062" src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207" /> Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342) Found it very hard to parse what was happening here, so helping future me (or you!). Also: - Deleted vestigial `next_resource_seq_no_`. - Converted from non-monotonic clock to a monotonically incremented `uint64_t` for the version number for commands. - Added logs when we drop messages with stale versions. --------- Signed-off-by: Edward Oakes <[email protected]>
## Description There was a typo ## Related issues N/A ## Additional information N/A Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name` Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10 Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456) ## Description Currently, finalization is scheduled in batches sequentially -- i.e., a batch of N adjacent partitions is finalized at once (in a sliding window). This creates a lensing effect since: 1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators j and j+1 (since membership is determined as j = i % num_aggregators) 2. Adjacent aggregators have a high likelihood of getting scheduled on the same node (due to similarly being scheduled at about the same time in sequence) To address that, this change applies random sampling when choosing the next partitions to finalize, making sure partitions are chosen uniformly and reducing concurrent finalization of adjacent partitions (see the sketch below). --------- Signed-off-by: Alexey Kudinkin <[email protected]>
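A minimal sketch of the sampling idea; function and parameter names are made up and the actual scheduler tracks far more state.

```python
import random
from typing import List, Set


def pick_partitions_to_finalize(
    pending: Set[int], max_concurrent: int, in_flight: int
) -> List[int]:
    """Uniformly sample which pending partitions to finalize next, instead of
    taking the next adjacent window, so adjacent partitions (and hence adjacent
    aggregators) are less likely to be finalized on the same node at once."""
    budget = max(0, max_concurrent - in_flight)
    k = min(budget, len(pending))
    return random.sample(sorted(pending), k)


# Example: 16 pending partitions, room for 4 more concurrent finalizations.
print(pick_partitions_to_finalize(set(range(16)), max_concurrent=6, in_flight=2))
```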
## Description Making NotifyGCSRestart RPC Fault Tolerant and Idempotent. There were multiple places where we were always returning Status::OK() in the gcs_subscriber, making idempotency harder to understand, and there was dead code for one of the resubscribes, so did a minor clean up. Added a python integration test to verify retry behavior; left out the cpp test since on the raylet side there's nothing to test, as it's just making a gcs_client rpc call. --------- Signed-off-by: joshlee <[email protected]>
…ct#58445) ## Summary Creates a dedicated `tests/unit/` directory for unit tests that don't require Ray runtime or external dependencies. ## Changes - Created `tests/unit/` directory structure - Moved 13 pure unit tests to `tests/unit/` - Added `conftest.py` with fixtures to prevent `ray.init()` and `time.sleep()` - Added `README.md` documenting unit test requirements - Updated `BUILD.bazel` to run unit tests with "small" size tag ## Test Files Moved 1. test_arrow_type_conversion.py 2. test_block.py 3. test_block_boundaries.py 4. test_data_batch_conversion.py 5. test_datatype.py 6. test_deduping_schema.py 7. test_expression_evaluator.py 8. test_expressions.py 9. test_filename_provider.py 10. test_logical_plan.py 11. test_object_extension.py 12. test_path_util.py 13. test_ruleset.py These tests are fast (<1s each), isolated (no Ray runtime), and deterministic (no time.sleep or randomness). --------- Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description ### [Data] Concurrency Cap Backpressure tuning - Maintain an asymmetric EWMA of total queued bytes (this op + downstream) as the typical level: `level`. - Maintain an asymmetric EWMA of the absolute residual vs the previous level as a scale proxy: `dev = EWMA(|q - level_prev|)`. - Define a deadband: `[lower, upper] = [level - K_DEV * dev, level + K_DEV * dev]`. If `q > upper` -> target cap = running - BACKOFF_FACTOR (back off). If `q < lower` -> target cap = running + RAMPUP_FACTOR (ramp up). Else -> target cap = running (hold). - Clamp to `[1, configured_cap]`, admit iff running < target cap. A standalone sketch of this policy follows below. --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
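A self-contained sketch of the policy described above, under assumed constants (`K_DEV`, `BACKOFF_FACTOR`, `RAMPUP_FACTOR` become `k_dev`, `backoff`, `rampup`) and a simple asymmetric EWMA; this is not the actual Ray Data implementation.

```python
class DeadbandConcurrencyController:
    """EWMA + deadband admission control for an operator's concurrency cap."""

    def __init__(self, configured_cap, alpha_up=0.3, alpha_down=0.05,
                 k_dev=2.0, backoff=1, rampup=1):
        self.configured_cap = configured_cap
        self.alpha_up = alpha_up      # fast to follow increases
        self.alpha_down = alpha_down  # slow to follow decreases (asymmetric)
        self.k_dev = k_dev
        self.backoff = backoff
        self.rampup = rampup
        self.level = 0.0              # EWMA of queued bytes
        self.dev = 0.0                # EWMA of |q - level_prev|

    def _ewma(self, prev, sample):
        alpha = self.alpha_up if sample > prev else self.alpha_down
        return prev + alpha * (sample - prev)

    def target_cap(self, queued_bytes, running):
        prev_level = self.level
        self.level = self._ewma(self.level, queued_bytes)
        self.dev = self._ewma(self.dev, abs(queued_bytes - prev_level))
        lower = self.level - self.k_dev * self.dev
        upper = self.level + self.k_dev * self.dev
        if queued_bytes > upper:
            target = running - self.backoff   # back off
        elif queued_bytes < lower:
            target = running + self.rampup    # ramp up
        else:
            target = running                  # hold
        return max(1, min(self.configured_cap, target))

    def can_admit(self, queued_bytes, running):
        return running < self.target_cap(queued_bytes, running)


ctrl = DeadbandConcurrencyController(configured_cap=32)
print(ctrl.can_admit(queued_bytes=1_000_000, running=4))
```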
Signed-off-by: Nikhil Ghosh <[email protected]>
… in read-only mode (ray-project#58460) This ensures node type names are correctly reported even when the autoscaler is disabled (read-only mode). ## Description Autoscaler v2 fails to report prometheus metrics when operating in read-only mode on KubeRay with the following KeyError: ``` 2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group' Traceback (most recent call last): File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state return Reconciler.reconcile( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile Reconciler._step_next( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next Reconciler._scale_cluster( File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster reply = scheduler.schedule(sched_request) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule ResourceDemandScheduler._enforce_max_workers_per_type(ctx) File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type node_config = ctx.get_node_type_configs()[node_type] KeyError: 'small-group' ``` This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types, which is correct for local Ray (where local ray does not have `RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where `ray_node_type_name` is present and expected with `RAY_NODE_TYPE_NAME` set. As a result, in read-only mode the scheduler sees a node type name (ex. small-group) that never exists in the populated configs. This PR fixes the issue by using `ray_node_type_name` when it exists, and only falling back to node ID when it does not. ## Related issues Fixes ray-project#58227 Signed-off-by: Rueian <[email protected]>
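A minimal sketch of the fallback described above; the dict-based node representation is an assumption, and the actual fix lives inside `ReadOnlyProviderConfigReader`.

```python
def resolve_node_type(node: dict) -> str:
    """Prefer the node's ray_node_type_name (set on KubeRay via
    RAY_NODE_TYPE_NAME) and fall back to the node ID for local Ray,
    where that field is absent or empty."""
    node_type = node.get("ray_node_type_name")
    return node_type if node_type else node["node_id"]


# KubeRay-style node vs. local node.
print(resolve_node_type({"node_id": "abc123", "ray_node_type_name": "small-group"}))
print(resolve_node_type({"node_id": "abc123", "ray_node_type_name": ""}))
```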
rather than ray-project/ray Signed-off-by: Lonnie Liu <[email protected]>
…r pool scaling (ray-project#58726) ## Summary Add support for configurable upscaling step size in the actor pool autoscaler. This enables rapid scale-up and efficient resource utilization by allowing the autoscaler to scale up multiple actors at once, instead of scaling up one actor at a time. ## Description ### Background Currently, the actor pool autoscaler scales up actors one at a time, which can be slow in certain scenarios: 1. **Slow actor startup**: When actor initialization logic is complex, actors may remain in pending state for extended periods. The autoscaler skips scaling when it encounters pending actors, preventing further scaling. 2. **Elastic cluster with unstable resources**: In environments where available resources are uncertain, users often configure large concurrency ranges (e.g., (10,1000)) for `map_batches`. In these cases, rapid startup and scaling are critical to utilize available resources efficiently. ### Solution This PR adds support for configurable upscaling step size in the actor pool autoscaler. Instead of always scaling up by 1 actor at a time, the autoscaler can now scale up multiple actors based on utilization metrics, while respecting resource constraints. Signed-off-by: dragongu <[email protected]>
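A hedged sketch of what a configurable upscaling step could look like; the function name, parameters, and the utilization-based sizing heuristic are illustrative, not the actual autoscaler code.

```python
def actors_to_add(util: float, target_util: float, running: int,
                  pending: int, max_size: int, max_step: int) -> int:
    """Scale the step with how far utilization exceeds the target, then clamp
    to a configurable max step and the pool's max size."""
    if util <= target_util:
        return 0
    desired = int(running * util / target_util) - running - pending
    step = max(1, min(max_step, desired))
    return min(step, max_size - running - pending)


# E.g. 10 running actors at 2x the target utilization with a step cap of 4.
print(actors_to_add(util=2.0, target_util=1.0, running=10, pending=0,
                    max_size=1000, max_step=4))  # -> 4
```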
``` REGRESSION 40.86%: single_client_put_gigabytes (THROUGHPUT) regresses from 18.324991353469613 to 10.83745250237838 in microbenchmark.json REGRESSION 3.21%: tasks_per_second (THROUGHPUT) regresses from 571.2270630108624 to 552.9028002075252 in benchmarks/many_tasks.json REGRESSION 1.49%: pgs_per_second (THROUGHPUT) regresses from 17.897951502183457 to 17.630807526575012 in benchmarks/many_pgs.json REGRESSION 72.70%: dashboard_p99_latency_ms (LATENCY) regresses from 4098.851 to 7078.797 in benchmarks/many_actors.json REGRESSION 24.76%: stage_3_creation_time (LATENCY) regresses from 1.4687559604644775 to 1.8323874473571777 in stress_tests/stress_test_many_tasks.json REGRESSION 9.00%: avg_pg_create_time_ms (LATENCY) regresses from 1.503374489489326 to 1.6386530375375787 in stress_tests/stress_test_placement_group.json REGRESSION 1.36%: stage_3_time (LATENCY) regresses from 1885.3751878738403 to 1911.0371930599213 in stress_tests/stress_test_many_tasks.json REGRESSION 0.40%: stage_1_avg_iteration_time (LATENCY) regresses from 13.989246034622193 to 14.045441269874573 in stress_tests/stress_test_many_tasks.json REGRESSION 0.35%: dashboard_p95_latency_ms (LATENCY) regresses from 3829.61 to 3843.186 in benchmarks/many_actors.json ``` Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…ration (ray-project#58888) ## Description > Fixed "dictionary changed size during iteration" error that occurs when shutdown() iterates over task_status_dict while background threads modify it concurrently. ## Additional information > Why not use thread lock? The bug is in the shutdown part, and no other parts iterate it. Signed-off-by: Cai Zhanqi <[email protected]> Co-authored-by: Cai Zhanqi <[email protected]>
## Description Removed unused script, which also has leaked tokens. Signed-off-by: Cindy Zhang <[email protected]>
## Description ### Status Quo Previously, `.gitignore` files handled both uploading to the cluster _and_ uploading to GitHub. This PR allows those two functionalities to be broken apart by creating a `.rayignore` file which handles uploading to the cluster. ### Purpose Any path or file specified in `.rayignore` will be ignored when uploading to the cluster. This is useful for local development when you don't want random files being uploaded and taking up space. ### How it works By default, both `.gitignore` and `.rayignore` are considered (so existing behavior is preserved). To make `.gitignore` only ignore files uploaded to GitHub, and `.rayignore` only ignore files uploaded to the cluster (essentially making them independent of each other), you can use the existing `RAY_RUNTIME_ENV_IGNORE_GITIGNORE` and set it to `1` (see the sketch below). ## Related issues ray-project#53648 ## Additional information Since `.rayignore` is part of the ray ecosystem, I did not create an env var to disable ignoring all-together. If users do not want to ignore files, they can leave `.rayignore` empty, or not create the file at all. --------- Signed-off-by: iamjustinhsu <[email protected]>
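A hedged usage sketch: the file patterns and project layout are hypothetical, `.rayignore` is assumed to follow `.gitignore`-style patterns, and `RAY_RUNTIME_ENV_IGNORE_GITIGNORE` is the existing env var mentioned above.

```python
import os
import pathlib

import ray

# Hypothetical patterns: keep large local artifacts out of the cluster upload
# while .gitignore stays dedicated to version control.
pathlib.Path(".rayignore").write_text("datasets/\n*.ckpt\nscratch/\n")

# Optional, per the description above: make .gitignore apply only to GitHub
# and .rayignore only to cluster uploads.
os.environ["RAY_RUNTIME_ENV_IGNORE_GITIGNORE"] = "1"

# Paths matched by .rayignore are skipped when the working directory is
# uploaded to the cluster.
ray.init(runtime_env={"working_dir": "."})
```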
…t#58603) ## Description `HashShuffleAggregator` currently doesn't break big blocks into smaller blocks (or combine smaller blocks into bigger ones). For large blocks, this can be very problematic. This PR addresses this by using `OutputBlockBuffer` to reshape the blocks back to `data_context.target_max_block_size` ## Related issues None ## Additional information Encountered this personally with a 180GiB block, which would OOD --------- Signed-off-by: iamjustinhsu <[email protected]>
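As a rough illustration of the buffering idea only: the names and the row-based interface below are made up, while the real implementation works on Arrow blocks via `OutputBlockBuffer`.

```python
from typing import Callable, Iterator, List


def reshape_rows(rows: Iterator[dict], target_block_bytes: int,
                 row_nbytes: Callable[[dict], int]) -> Iterator[List[dict]]:
    """Accumulate rows and emit blocks close to the target size instead of
    one giant block per aggregator."""
    buffer: List[dict] = []
    size = 0
    for row in rows:
        buffer.append(row)
        size += row_nbytes(row)
        if size >= target_block_bytes:
            yield buffer
            buffer, size = [], 0
    if buffer:
        yield buffer


blocks = list(reshape_rows(({"x": i} for i in range(10)),
                           target_block_bytes=32,
                           row_nbytes=lambda r: 8))
print([len(b) for b in blocks])  # [4, 4, 2]
```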
…ay-project#58355) ### Summary This PR exposes deployment topology information in Ray Serve instance details, allowing users to visualize and understand the dependency graph of deployments within their applications. ### What's Changed #### New Data Structures Added two new schema classes to represent deployment topology: - **`DeploymentNode`** - Represents a node in the deployment DAG - **`DeploymentTopology`** - Represents the full dependency graph #### Implementation **Controller Integration** - Updated `ServeController` to include `deployment_topology` in `ApplicationDetails` when serving instance details - Topology is now accessible via the `get_serve_details()` API --- **Example Output:** ```python { "app_name": "my_app", "ingress_deployment": "Ingress", "nodes": { "Ingress": { "name": "Ingress", "is_ingress": True, "outbound_deployments": [ {"name": "ServiceA", "app_name": "my_app"} ] }, "ServiceA": { "name": "ServiceA", "is_ingress": False, "outbound_deployments": [ {"name": "Database", "app_name": "my_app"} ] }, "Database": { "name": "Database", "is_ingress": False, "outbound_deployments": [] } } } ``` --------- Signed-off-by: abrar <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
…ckpoint` (ray-project#58537) RLlib uses a nested metric structure (like `"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}"`) which `Result.get_best_checkpoint` doesn't support. Following `ResultGrid.get_best_result()`, which uses `unflattened_lookup`, I've added the same to `get_best_checkpoint`, along with tests for nested structures (and their backward compatibility). --------- Signed-off-by: Mark Towers <[email protected]> Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Justin Yu <[email protected]>
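A hedged usage sketch: `result` is assumed to be a Ray Train/Tune `Result` object from a finished RLlib run, and the nested metric path follows the constants quoted above; exact availability of that key depends on what the run reported.

```python
def best_rllib_checkpoint(result):
    """Pick the checkpoint with the highest nested episode-return metric."""
    return result.get_best_checkpoint(
        metric="env_runners/episode_return_mean",  # nested key, resolved via unflattened_lookup
        mode="max",
    )
```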
…ownscaling (ray-project#52929) ## Why are these changes needed? This PR improves the downscaling behavior in Ray Serve by modifying the logic in `_get_replicas_to_stop()` within the default `DeploymentScheduler`. Previously, the scheduler selected replicas to stop by traversing the least loaded nodes in ascending order. This often resulted in stopping replicas that had been scheduled earlier and placed optimally using the `_best_fit_node()` strategy. This led to several drawbacks: - Long-lived replicas, which were scheduled on best-fit nodes, were removed first, leading to inefficient reuse of resources. - Recently scaled-up replicas, which were placed on less utilized nodes, were kept longer despite being suboptimal. - Cold-start overhead increased, as newer replicas were removed before fully warming up. This PR reverses the node traversal order during downscaling so that **more recently added replicas are prioritized for termination**, *in cases where other conditions (e.g., running state and number of replicas per node) are equal*. These newer replicas are typically less optimal in placement and not yet fully warmed up. Preserving long-lived replicas improves performance stability and reduces unnecessary resource fragmentation. ## Related issue number N/A ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: kitae <[email protected]>
no python 3.9 wheels any more Signed-off-by: Lonnie Liu <[email protected]>
also uses `curl -fsSL` as much as possible, otherwise the release blocker checker is not working. also removes unnecessary sudos. Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: dayshah <[email protected]>
## Description - Expose a version parameter on ray.data.read_lance to read historical Lance dataset versions. - Add unit test python/ray/data/tests/test_lance.py::test_lance_read_with_version that writes an initial dataset, records the initial version, merges new data, and asserts the default read returns the latest while read_lance(path, version=initial_version) returns the original columns and rows. ## Related issues > Closes ray-project#58226 ## Additional information As mentioned in the original issue, exposed the version parameter in ```read_lance```. The parameter is passed down to ```LanceDatasource```, which is updated as well. Ultimately, ```lance.dataset``` takes this version param to read the specific version. --------- Signed-off-by: Simeet Nayan <[email protected]> Signed-off-by: Simeet Nayan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
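A hedged usage sketch of the new parameter; the dataset path and version number below are placeholders.

```python
import ray

# Default read returns the latest version of the Lance dataset.
latest_ds = ray.data.read_lance("/tmp/example.lance")

# Pinning `version` reads a historical snapshot, mirroring the new unit test:
# the original columns and rows are returned even after later merges.
pinned_ds = ray.data.read_lance("/tmp/example.lance", version=1)
```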
…oject#58906) Created by release automation bot. Update with commit ae94ff4 Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
… dashboard (ray-project#58525) # Summary Add time to first batch (https://github.com/ray-project/ray/pull/55758/files) to Ray Data dashboard. # Testing Ran release test which produced reasonable dashboard: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_xntirvctlif8qtdz3wmpz8a2m3/data?job-logs-section-tabs=application_logs&job-tab=metrics&metrics-tab=data <img width="1696" height="609" alt="Screenshot 2025-11-12 at 7 50 17 PM" src="https://github.com/user-attachments/assets/e4195184-9d1d-4489-984a-fa152f479fb9" /> --------- Signed-off-by: Timothy Seah <[email protected]>
…ject#58821) ## Description ### [Data] Parallelize DefaultCollateFn - arrow_batch_to_tensors In `arrow_batch_to_tensors`, use `make_async_gen` to set up multiple workers to speed up processing of columns for tensor conversion in `convert_ndarray_to_torch_tensor`, so `DefaultCollateFn` can be sped up. --------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Srinath Krishnamachari <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
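The change above is about per-column parallelism during collation; the following is a generic thread-pool sketch of that idea. It does not use Ray Data's internal `make_async_gen`, and the helper name is made up.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import torch


def columns_to_tensors(batch: dict, num_workers: int = 4) -> dict:
    """Convert each ndarray column to a torch tensor on a worker thread."""
    def convert(item):
        name, array = item
        return name, torch.from_numpy(np.ascontiguousarray(array))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(pool.map(convert, batch.items()))


batch = {"a": np.arange(8, dtype=np.float32), "b": np.ones((8, 3))}
print({k: v.shape for k, v in columns_to_tensors(batch).items()})
```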
updating llm batch release test depset successful release test run: https://buildkite.com/ray-project/release/builds/68802#019aa374-b0c3-4da1-a003-9296ac07f4e0 --------- Signed-off-by: elliot-barn <[email protected]>
ray-project#58911) This reverts commit 3663299. ```` [2025-11-22T01:29:32Z] File "/rayci/python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py", line 1451, in <module> -- [2025-11-22T01:29:32Z] assert len(all_panel_ids) == len( [2025-11-22T01:29:32Z] AssertionError: Duplicated id found. Use unique id for each panel. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 43, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 70, 71, 72, 73, 74, 75, 76, 77, 78, 78, 79, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108] ```` Co-authored-by: Lonnie Liu <[email protected]>
…#58872) removing flag from raydepsets: --index-url https://pypi.org/simple (included by default https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--default-index) adding flag to raydepsets: --no-header updating unit tests This prevents all lock files from being updated when default or config-level flags change --------- Signed-off-by: elliot-barn <[email protected]>
…er (ray-project#58739) ## Description The algorithm config isn't updating `rl_module_spec.model_config` when a custom one is specified, which means the learner and env-runner end up with different module configs. As a result, the env-runner model wasn't being updated. The reason this problem wasn't detected previously is that we used `strict=False` when updating the model state-dict. Therefore, I've added an error check that the missing keys should always be empty, which will detect when the env-runner is missing components from the learner's updated model. ```python from ray.rllib.algorithms import PPOConfig from ray.rllib.core.rl_module import RLModuleSpec from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID config = ( PPOConfig() .environment('CartPole-v1') .env_runners( num_env_runners=0, num_envs_per_env_runner=1, ) .rl_module( rl_module_spec=RLModuleSpec( model_config={ "head_fcnet_hiddens": (32,), # This used to cause encoder.config.shared mismatch } ) ) ) algo = config.build_algo() learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID] env_runner_modules = algo.env_runner_group.foreach_env_runner(lambda runner: runner.module) print(f'{learner_module.encoder.config.shared=}') print(f'{[mod.encoder.config.shared for mod in env_runner_modules]=}') algo.train() ``` ## Related issues Closes ray-project#58715 --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Hassam Ullah Sheikh <[email protected]>
resolves logical merge conflicts and fix ci test Signed-off-by: Lonnie Liu <[email protected]>
The pull request #687 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5453.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request performs a daily merge from the `master` branch into the `main` branch.
Highlights
Ignored Files
Code Review
This pull request is an automated daily merge from master to main, containing a large number of changes and refactorings across the codebase. The most significant changes include a major overhaul of the CI/CD pipeline and build system, with a move towards more modular and centralized configurations. This includes the introduction of a new dependency management tool raydepsets, the adoption of uv for Python package management, and refactoring of Bazel build files. There are also substantial improvements to code quality checks through pre-commit hooks. I've identified a couple of issues in the test suite that seem to have been missed during this large refactoring. Please see my comments for details.
```diff
 def test_get_docker_image() -> None:
     assert (
         LinuxContainer("test")._get_docker_image()
-        == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test"
+        == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test"
```
The expected docker image name is incorrect. The `_get_docker_image` method now prepends the `RAYCI_BUILD_ID` to the docker tag. In the test environment, `RAYCI_BUILD_ID` is set to `a1b2c3d4`, so the expected image tag should be `...:a1b2c3d4-test`.
```python
    assert (
        LinuxContainer("test")._get_docker_image()
        == "029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:a1b2c3d4-test"
    )
```

```python
        install_ray_cmds.append(inputs)

    with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
        LinuxTesterContainer("team", build_type="debug")
```
The `docker_image` variable is incorrect. The `LinuxTesterContainer` will build an image with a tag that includes the `RAYCI_BUILD_ID` prefix, which is `a1b2c3d4` in the test environment. The assertion should check for `...:a1b2c3d4-team`.
```python
    LinuxTesterContainer("team", build_type="debug")
    docker_image = f"{_DOCKER_ECR_REPO}:a1b2c3d4-team"
```
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
Created: 2025-11-24
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.