[core] Allow matching worker_process_setup_hook on re-entry #61473
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Code Review
This pull request addresses an idempotency issue in upload_worker_process_setup_hook_if_needed to prevent AssertionError when a runtime_env with a worker_process_setup_hook is processed multiple times, which is necessary for deployment scenarios like KubeRay. However, the current implementation of the idempotency guard allows users to bypass security assertions intended to protect internal environment variables, potentially leading to the bypass of mandatory security hooks. A more secure approach to idempotency should be implemented.
```python
# Ensure idempotency: Already processed (e.g. inherited from job supervisor) — skip.
env_vars = runtime_env.get("env_vars", {})
if ray_constants.WORKER_PROCESS_SETUP_HOOK_ENV_VAR in env_vars:
    return runtime_env
```
The idempotency check introduced here allows a user to bypass existing security assertions that prevent manual setting of the reserved internal environment variable __RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR.
By providing this internal variable in the env_vars of a runtime_env, a user can trigger an early return from this function, effectively skipping the processing of the intended worker_process_setup_hook and bypassing the assertions at lines 51 and 73. This allows an attacker to spoof the internal state of the runtime environment and potentially override mandatory setup hooks (e.g., those enforced by a cluster administrator via RAY_RUNTIME_ENV_HOOK) with arbitrary, malicious hooks.
To remediate this while maintaining idempotency, consider a more robust way to distinguish between a runtime environment that was legitimately processed by Ray and one where the internal variable was maliciously injected by a user. For example, you could verify that the internal variable matches the expected output of the provided hook, or use a separate, non-user-controllable flag to track processing state.
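That remediation, recomputing the expected env-var value instead of trusting its mere presence, might look like the following sketch. The helper names (`HOOK_ENV_VAR`, `expected_env_value`, `upload_hook_if_needed`) are illustrative rather than Ray's actual internals, and only the module-path hook form is modeled:

```python
# Hypothetical sketch of the suggested remediation: instead of returning
# early whenever the internal env var is present, recompute what the env
# var *should* contain from the declared hook, and only skip when the two
# agree. Only the module-path ("pkg.mod.func") hook form is modeled here.
HOOK_ENV_VAR = "__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR"

def expected_env_value(hook):
    # For a module-path hook, the stored value is the path string itself.
    return hook

def upload_hook_if_needed(runtime_env):
    hook = runtime_env.get("worker_process_setup_hook")
    if hook is None:
        return runtime_env
    env_vars = runtime_env.setdefault("env_vars", {})
    existing = env_vars.get(HOOK_ENV_VAR)
    if existing is not None:
        if existing != expected_env_value(hook):
            # A spoofed or stale value cannot silently win.
            raise RuntimeError(
                f"setup hook mismatch: env var {existing!r} "
                f"vs declared hook {hook!r}"
            )
        return runtime_env  # legitimately re-entered: idempotent no-op
    env_vars[HOOK_ENV_VAR] = expected_env_value(hook)
    return runtime_env
```

With this shape, a genuinely inherited runtime_env passes through unchanged, while a user who injects the internal variable with a value that does not match the declared hook gets a loud error instead of a silent bypass.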
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
`upload_worker_process_setup_hook_if_needed`
Looks good! Thanks for fixing this, Jeffrey!

Spoke on Slack, but my main feedback is that this looks a little sketchy to me. We can probably check whether a setup hook has been populated by looking at its entry in the GCS. If they're the same, then carry on, but if they diverge, complain loudly.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
`upload_worker_process_setup_hook_if_needed`: allow matching `worker_process_setup_hook` on re-entry
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
except Exception:
    existing_is_callable_ref = False
if existing_is_callable_ref or existing_hook_value != setup_func:
    _raise_setup_hook_conflict(existing_hook_value, f"'{setup_func}'")
```
Callable hook re-entry raises false conflict on re-processing
Medium Severity
When a callable hook is first processed, export_setup_func_callable replaces runtime_env["worker_process_setup_hook"] with setup_func.__name__ (a plain string like "my_hook"), while setting the env var to a callable reference ("ray_runtime_env_func::..."). On re-entry, _check_setup_hook_consistency sees setup_func as a string and enters the elif isinstance(setup_func, str) branch. Since the existing env var is a callable reference, _encode_function_key succeeds, setting existing_is_callable_ref = True, which unconditionally triggers _raise_setup_hook_conflict. This makes legitimate callable hook re-entry impossible — the exact class of bug this PR aims to fix.
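One way to accept legitimate callable re-entry is to treat the combination of a callable-reference env var and a bare `__name__` string as a match rather than an unconditional conflict. The PR's actual fix recomputes the expected GCS key via `build_setup_hook_export_entry`; the sketch below substitutes a simplified name-based check, and all helper names are illustrative:

```python
# Illustrative sketch (not the PR's actual code) of a consistency check
# that accepts legitimate callable re-entry. After the first pass,
# runtime_env holds only the hook's __name__ (a bare identifier) while the
# env var holds a callable reference, so that combination is treated as a
# match instead of an unconditional conflict.
CALLABLE_PREFIX = "ray_runtime_env_func::"

def check_setup_hook_consistency(existing_hook_value, setup_func):
    """Return True when the existing env var plausibly matches setup_func."""
    if callable(setup_func):
        # A fresh callable has no exported key to compare against here.
        return True
    if isinstance(setup_func, str):
        if existing_hook_value.startswith(CALLABLE_PREFIX):
            # Already-exported callable: runtime_env keeps only __name__,
            # which, unlike a module path, contains no dots.
            return "." not in setup_func
        return existing_hook_value == setup_func
    return False
```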
## Summary

- When `upload_worker_process_setup_hook_if_needed` is called and the setup hook env var is already populated (e.g. inherited from a job supervisor), validate that it is consistent with the declared `worker_process_setup_hook` in the runtime env. If they diverge, raise a `RuntimeError`.
- Extract `build_setup_hook_export_entry` from `FunctionActorManager.export_setup_func` so that the consistency check can recompute the expected GCS key for callable hooks without requiring a full export.
- Cover module-path match, callable already-processed, module divergence, callable divergence, and type mismatch (callable vs. module path) scenarios in unit tests.

## Why are these changes needed?

When a KubeRay job specifies `worker_process_setup_hook` in its runtime_env:

1. The job supervisor processes the hook, converting it into `__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR` in env_vars, but the original `worker_process_setup_hook` key is not removed from the `runtime_env` dict.
2. The driver subprocess inherits this already-processed runtime_env.
3. When the driver calls `ray.init()`, `upload_worker_process_setup_hook_if_needed` runs again, sees `worker_process_setup_hook` is present, enters `export_setup_func_module`, and hits the [assertion](https://github.com/ray-project/ray/blob/master/python/ray/_private/runtime_env/setup_hook.py#L51) that the env var must not already exist.

This manifests in 2.54 because `vllm_engine_stage.py` changed the default `distributed_executor_backend` from `mp` to `ray` for multi-GPU, which causes vLLM to create Ray actors that **re-enter the runtime_env processing** pipeline.

## Related issues

Closes #61350.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
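The failure sequence described above can be condensed into a minimal reproduction. This is a simplification (the real logic lives in `ray/_private/runtime_env/setup_hook.py`), with all names here standing in for the actual internals:

```python
# Minimal reproduction of the pre-fix failure (simplified). The first pass
# sets the internal env var but leaves the original hook key in place, so
# an inherited runtime_env re-enters processing and trips the assertion.
HOOK_ENV_VAR = "__RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR"

def process_pre_fix(runtime_env):
    env_vars = runtime_env.setdefault("env_vars", {})
    # The assertion the second pass trips (setup_hook.py#L51):
    assert HOOK_ENV_VAR not in env_vars, "env var must not already exist"
    env_vars[HOOK_ENV_VAR] = runtime_env["worker_process_setup_hook"]
    # Note: the original hook key is *not* removed here.
    return runtime_env

env = {"worker_process_setup_hook": "pkg.my_setup"}
process_pre_fix(env)       # job supervisor: first pass succeeds
try:
    process_pre_fix(env)   # driver's ray.init(): second pass
except AssertionError as exc:
    print("re-entry failed:", exc)
```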
…rt method to `spawn`" (#1344)

Reverts #1333. Fixes #1342 and #1343.

It looks like we hit the same issue as ray-project/ray#61350 when dealing with the worker process setup hook and vLLM with the ray backend. The long-term fix is actually in the ray repo: the bug has been fixed in ray-project/ray#61473, and we should be able to make use of the setup hook after upgrading to the next Ray release. Until then, I've just reverted the changes and added `spawn` for the mp context for our dataloader.

I did a quick smoke test by running the gsm8k example, and the script enters the first step successfully.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
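The interim `spawn` workaround mentioned above can be scoped to a single multiprocessing context rather than changing the process-wide start method. This is a generic Python sketch, not SkyRL's actual dataloader code:

```python
# Generic sketch: force the "spawn" start method for one multiprocessing
# context only, leaving the process-wide default untouched.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # each worker starts a fresh interpreter
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Using a local context this way avoids interfering with other libraries in the same process that rely on the default (`fork` on Linux) start method.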

