Skip to content

[core] Allow matching worker_process_setup_hook on re-entry#61473

Merged
edoakes merged 4 commits intomasterfrom
worker-process-setup-hook
Mar 10, 2026
Merged

[core] Allow matching worker_process_setup_hook on re-entry#61473
edoakes merged 4 commits intomasterfrom
worker-process-setup-hook

Conversation

@jeffreywang-anyscale
Copy link
Copy Markdown
Contributor

@jeffreywang-anyscale jeffreywang-anyscale commented Mar 4, 2026

Summary

  • When upload_worker_process_setup_hook_if_needed is called and the setup hook env var is already populated (e.g. inherited from a job supervisor), validate that it is consistent with the declared worker_process_setup_hook in the runtime env. If they diverge, raise a RuntimeError.
  • Extract build_setup_hook_export_entry from FunctionActorManager.export_setup_func so that the consistency check can recompute the expected GCS key for callable hooks without requiring a full export.
  • Cover module-path match, callable already-processed, module divergence, callable divergence, and type mismatch (callable vs module path) scenarios in unit tests.

Why are these changes needed?

When a KubeRay job specifies worker_process_setup_hook in its runtime_env:

  1. The job supervisor processes the hook, converting it into __RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR in env_vars, but the original worker_process_setup_hook key is not removed from the runtime_env dict.
  2. The driver subprocess inherits this already-processed runtime_env.
  3. When the driver calls ray.init(), upload_worker_process_setup_hook_if_needed runs again, sees worker_process_setup_hook is present, enters export_setup_func_module, and hits the assertion that the env var must not already exist.

This manifests in 2.54 because vllm_engine_stage.py changed the default distributed_executor_backend from mp to ray for multi-GPU, which causes vLLM to create Ray actors that re-enter the runtime_env processing pipeline.

Related issues

Closes #61350.

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang-anyscale jeffreywang-anyscale requested a review from a team as a code owner March 4, 2026 03:17
@jeffreywang-anyscale jeffreywang-anyscale added the go add ONLY when ready to merge, run all tests label Mar 4, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses an idempotency issue in upload_worker_process_setup_hook_if_needed to prevent AssertionError when a runtime_env with a worker_process_setup_hook is processed multiple times, which is necessary for deployment scenarios like KubeRay. However, the current implementation of the idempotency guard allows users to bypass security assertions intended to protect internal environment variables, potentially leading to the bypass of mandatory security hooks. A more secure approach to idempotency should be implemented.

Comment on lines +98 to +101
# Ensure idempotency: Already processed (e.g. inherited from job supervisor) — skip.
env_vars = runtime_env.get("env_vars", {})
if ray_constants.WORKER_PROCESS_SETUP_HOOK_ENV_VAR in env_vars:
return runtime_env
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The idempotency check introduced here allows a user to bypass existing security assertions that prevent manual setting of the reserved internal environment variable __RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR.

By providing this internal variable in the env_vars of a runtime_env, a user can trigger an early return from this function, effectively skipping the processing of the intended worker_process_setup_hook and bypassing the assertions at lines 51 and 73. This allows an attacker to spoof the internal state of the runtime environment and potentially override mandatory setup hooks (e.g., those enforced by a cluster administrator via RAY_RUNTIME_ENV_HOOK) with arbitrary, malicious hooks.

To remediate this while maintaining idempotency, consider a more robust way to distinguish between a runtime environment that was legitimately processed by Ray and one where the internal variable was maliciously injected by a user. For example, you could verify that the internal variable matches the expected output of the provided hook, or use a separate, non-user-controllable flag to track processing state.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang-anyscale jeffreywang-anyscale changed the title [core] Idempotent upload_worker_process_setup_hook_if_needed [core] Idempotent upload_worker_process_setup_hook_if_needed Mar 4, 2026
@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Mar 4, 2026
@tgowda1996
Copy link
Copy Markdown

Looks good! Thanks for fixing this jeffrey!

@ZacAttack
Copy link
Copy Markdown
Contributor

Spoke on slack, but main feedback is that this looks a little sketchy to me. We can probably check if a setup hook has been populated by looking at the entry for it in the gcs. If they're the same, then carry on, but if they diverge, complain loudly.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang-anyscale jeffreywang-anyscale changed the title [core] Idempotent upload_worker_process_setup_hook_if_needed [core] Allow matching worker_process_setup_hook on re-entry Mar 5, 2026
Comment thread python/ray/_private/runtime_env/setup_hook.py
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

except Exception:
existing_is_callable_ref = False
if existing_is_callable_ref or existing_hook_value != setup_func:
_raise_setup_hook_conflict(existing_hook_value, f"'{setup_func}'")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Callable hook re-entry raises false conflict on re-processing

Medium Severity

When a callable hook is first processed, export_setup_func_callable replaces runtime_env["worker_process_setup_hook"] with setup_func.__name__ (a plain string like "my_hook"), while setting the env var to a callable reference ("ray_runtime_env_func::..."). On re-entry, _check_setup_hook_consistency sees setup_func as a string and enters the elif isinstance(setup_func, str) branch. Since the existing env var is a callable reference, _encode_function_key succeeds, setting existing_is_callable_ref = True, which unconditionally triggers _raise_setup_hook_conflict. This makes legitimate callable hook re-entry impossible — the exact class of bug this PR aims to fix.

Additional Locations (1)

Fix in Cursor Fix in Web

@edoakes edoakes merged commit 2208e78 into master Mar 10, 2026
6 checks passed
@edoakes edoakes deleted the worker-process-setup-hook branch March 10, 2026 20:57
abrarsheikh pushed a commit that referenced this pull request Mar 11, 2026
## Summary
- When `upload_worker_process_setup_hook_if_needed` is called and the
setup hook env var is already populated (e.g. inherited from a job
supervisor), validate that it is consistent with the declared
`worker_process_setup_hook` in the runtime env. If they diverge, raise a
RuntimeError.
- Extract `build_setup_hook_export_entry` from
`FunctionActorManager.export_setup_func` so that the consistency check
can recompute the expected GCS key for callable hooks without requiring
a full export.
- Cover module-path match, callable already-processed, module
divergence, callable divergence, and type mismatch (callable vs module
path) scenarios in unit tests.

## Why are these changes needed?
When a KubeRay job specifies `worker_process_setup_hook` in its
runtime_env:
1. The job supervisor processes the hook, converting it into
`__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR` in env_vars, but the original
`worker_process_setup_hook` key is not removed from the `runtime_env`
dict.
2. The driver subprocess inherits this already-processed runtime_env.
3. When the driver calls `ray.init()`,
`upload_worker_process_setup_hook_if_needed` runs again, sees
`worker_process_setup_hook` is present, enters
`export_setup_func_module`, and hits the
[assertion](https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/setup_hook.py#L51)
that the env var must not already exist.

This manifests in 2.54 because `vllm_engine_stage.py` changed the
default distributed_executor_backend from `mp` to `ray` for multi-GPU,
which causes vLLM to create Ray actors that **re-enter the runtime_env
processing** pipeline.

## Related issues
Closes #61350.

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
SumanthRH added a commit to NovaSky-AI/SkyRL that referenced this pull request Mar 19, 2026
…rt method to `spawn`" (#1344)

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as
ray-project/ray#61350 when dealing with worker
process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed
in ray-project/ray#61473 and we should be able
to make use of the setup hook after upgrading to the next ray release.
Until then, I've just reverted the changes and added `spawn` for the mp
context for our dataloader

I did a quick smoke test by running the gsm8k example and the script
enters the first step successfully
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devpatelio pushed a commit to NovaSky-AI/SkyRL that referenced this pull request Mar 20, 2026
…rt method to `spawn`" (#1344)

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as
ray-project/ray#61350 when dealing with worker
process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed
in ray-project/ray#61473 and we should be able
to make use of the setup hook after upgrading to the next ray release.
Until then, I've just reverted the changes and added `spawn` for the mp
context for our dataloader

I did a quick smoke test by running the gsm8k example and the script
enters the first step successfully
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RayDataLLM] worker_process_setup_hook doesn't work in ray data llm jobs launched on a kuberay cluster

4 participants