[core] Allow matching worker_process_setup_hook on re-entry #61473
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This pull request addresses an idempotency issue in upload_worker_process_setup_hook_if_needed to prevent AssertionError when a runtime_env with a worker_process_setup_hook is processed multiple times, which is necessary for deployment scenarios like KubeRay. However, the current implementation of the idempotency guard allows users to bypass security assertions intended to protect internal environment variables, potentially leading to the bypass of mandatory security hooks. A more secure approach to idempotency should be implemented.
```python
# Ensure idempotency: Already processed (e.g. inherited from job supervisor) — skip.
env_vars = runtime_env.get("env_vars", {})
if ray_constants.WORKER_PROCESS_SETUP_HOOK_ENV_VAR in env_vars:
    return runtime_env
```
The idempotency check introduced here allows a user to bypass existing security assertions that prevent manual setting of the reserved internal environment variable __RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR.
By providing this internal variable in the env_vars of a runtime_env, a user can trigger an early return from this function, effectively skipping the processing of the intended worker_process_setup_hook and bypassing the assertions at lines 51 and 73. This allows an attacker to spoof the internal state of the runtime environment and potentially override mandatory setup hooks (e.g., those enforced by a cluster administrator via RAY_RUNTIME_ENV_HOOK) with arbitrary, malicious hooks.
To remediate this while maintaining idempotency, consider a more robust way to distinguish between a runtime environment that was legitimately processed by Ray and one where the internal variable was maliciously injected by a user. For example, you could verify that the internal variable matches the expected output of the provided hook, or use a separate, non-user-controllable flag to track processing state.
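That remediation, recomputing the expected env-var value instead of trusting its mere presence, might look like the following sketch. The helper names (`HOOK_ENV_VAR`, `expected_env_value`, `upload_hook_if_needed`) are illustrative rather than Ray's actual internals, and only the module-path hook form is modeled:

```python
# Hypothetical sketch of the suggested remediation: instead of returning
# early whenever the internal env var is present, recompute what the env
# var *should* contain from the declared hook, and only skip when the two
# agree. Only the module-path ("pkg.mod.func") hook form is modeled here.
HOOK_ENV_VAR = "__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR"

def expected_env_value(hook):
    # For a module-path hook, the stored value is the path string itself.
    return hook

def upload_hook_if_needed(runtime_env):
    hook = runtime_env.get("worker_process_setup_hook")
    if hook is None:
        return runtime_env
    env_vars = runtime_env.setdefault("env_vars", {})
    existing = env_vars.get(HOOK_ENV_VAR)
    if existing is not None:
        if existing != expected_env_value(hook):
            # A spoofed or stale value cannot silently win.
            raise RuntimeError(
                f"setup hook mismatch: env var {existing!r} "
                f"vs declared hook {hook!r}"
            )
        return runtime_env  # legitimately re-entered: idempotent no-op
    env_vars[HOOK_ENV_VAR] = expected_env_value(hook)
    return runtime_env
```

With this shape, a genuinely inherited runtime_env passes through unchanged, while a user who injects the internal variable with a value that does not match the declared hook gets a loud error instead of a silent bypass.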
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
`upload_worker_process_setup_hook_if_needed`
Looks good! Thanks for fixing this, Jeffrey!

Spoke on Slack, but my main feedback is that this looks a little sketchy to me. We can probably check whether a setup hook has been populated by looking at its entry in the GCS. If they're the same, then carry on, but if they diverge, complain loudly.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
`upload_worker_process_setup_hook_if_needed`: allow matching `worker_process_setup_hook` on re-entry
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
except Exception:
    existing_is_callable_ref = False
if existing_is_callable_ref or existing_hook_value != setup_func:
    _raise_setup_hook_conflict(existing_hook_value, f"'{setup_func}'")
```
Callable hook re-entry raises false conflict on re-processing
Medium Severity
When a callable hook is first processed, export_setup_func_callable replaces runtime_env["worker_process_setup_hook"] with setup_func.__name__ (a plain string like "my_hook"), while setting the env var to a callable reference ("ray_runtime_env_func::..."). On re-entry, _check_setup_hook_consistency sees setup_func as a string and enters the elif isinstance(setup_func, str) branch. Since the existing env var is a callable reference, _encode_function_key succeeds, setting existing_is_callable_ref = True, which unconditionally triggers _raise_setup_hook_conflict. This makes legitimate callable hook re-entry impossible — the exact class of bug this PR aims to fix.
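One way to accept legitimate callable re-entry is to treat the combination of a callable-reference env var and a bare `__name__` string as a match rather than an unconditional conflict. The PR's actual fix recomputes the expected GCS key via `build_setup_hook_export_entry`; the sketch below substitutes a simplified name-based check, and all helper names are illustrative:

```python
# Illustrative sketch (not the PR's actual code) of a consistency check
# that accepts legitimate callable re-entry. After the first pass,
# runtime_env holds only the hook's __name__ (a bare identifier) while the
# env var holds a callable reference, so that combination is treated as a
# match instead of an unconditional conflict.
CALLABLE_PREFIX = "ray_runtime_env_func::"

def check_setup_hook_consistency(existing_hook_value, setup_func):
    """Return True when the existing env var plausibly matches setup_func."""
    if callable(setup_func):
        # A fresh callable has no exported key to compare against here.
        return True
    if isinstance(setup_func, str):
        if existing_hook_value.startswith(CALLABLE_PREFIX):
            # Already-exported callable: runtime_env keeps only __name__,
            # which, unlike a module path, contains no dots.
            return "." not in setup_func
        return existing_hook_value == setup_func
    return False
```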
## Summary

- When `upload_worker_process_setup_hook_if_needed` is called and the setup hook env var is already populated (e.g. inherited from a job supervisor), validate that it is consistent with the declared `worker_process_setup_hook` in the runtime env. If they diverge, raise a `RuntimeError`.
- Extract `build_setup_hook_export_entry` from `FunctionActorManager.export_setup_func` so that the consistency check can recompute the expected GCS key for callable hooks without requiring a full export.
- Cover module-path match, callable already-processed, module divergence, callable divergence, and type mismatch (callable vs. module path) scenarios in unit tests.

## Why are these changes needed?

When a KubeRay job specifies `worker_process_setup_hook` in its runtime_env:

1. The job supervisor processes the hook, converting it into `__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR` in env_vars, but the original `worker_process_setup_hook` key is not removed from the `runtime_env` dict.
2. The driver subprocess inherits this already-processed runtime_env.
3. When the driver calls `ray.init()`, `upload_worker_process_setup_hook_if_needed` runs again, sees `worker_process_setup_hook` is present, enters `export_setup_func_module`, and hits the [assertion](https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/setup_hook.py#L51) that the env var must not already exist.

This manifests in 2.54 because `vllm_engine_stage.py` changed the default `distributed_executor_backend` from `mp` to `ray` for multi-GPU, which causes vLLM to create Ray actors that **re-enter the runtime_env processing** pipeline.

## Related issues

Closes #61350.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
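The failure sequence described above can be condensed into a minimal reproduction. This is a simplification (the real logic lives in `ray/_private/runtime_env/setup_hook.py`), with all names here standing in for the actual internals:

```python
# Minimal reproduction of the pre-fix failure (simplified). The first pass
# sets the internal env var but leaves the original hook key in place, so
# an inherited runtime_env re-enters processing and trips the assertion.
HOOK_ENV_VAR = "__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR"

def process_pre_fix(runtime_env):
    env_vars = runtime_env.setdefault("env_vars", {})
    # The assertion the second pass trips (setup_hook.py#L51):
    assert HOOK_ENV_VAR not in env_vars, "env var must not already exist"
    env_vars[HOOK_ENV_VAR] = runtime_env["worker_process_setup_hook"]
    # Note: the original hook key is *not* removed here.
    return runtime_env

env = {"worker_process_setup_hook": "pkg.my_setup"}
process_pre_fix(env)       # job supervisor: first pass succeeds
try:
    process_pre_fix(env)   # driver's ray.init(): second pass
except AssertionError as exc:
    print("re-entry failed:", exc)
```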
…rt method to `spawn`" (#1344)

Reverts #1333. Fixes #1342 and #1343.

It looks like we hit the same issue as ray-project/ray#61350 when dealing with the worker process setup hook and vLLM with the ray backend. The long-term fix is actually in the ray repo: the bug has been fixed in ray-project/ray#61473, and we should be able to make use of the setup hook after upgrading to the next Ray release. Until then, I've just reverted the changes and added `spawn` for the mp context for our dataloader.

I did a quick smoke test by running the gsm8k example, and the script enters the first step successfully.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
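The interim `spawn` workaround mentioned above can be scoped to a single multiprocessing context rather than changing the process-wide start method. This is a generic Python sketch, not SkyRL's actual dataloader code:

```python
# Generic sketch: force the "spawn" start method for one multiprocessing
# context only, leaving the process-wide default untouched.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # each worker starts a fresh interpreter
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Using a local context this way avoids interfering with other libraries in the same process that rely on the default (`fork` on Linux) start method.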

