[rollout] fix: Add "non_block" argument compatibility to collective_rpc()#3934
[rollout] fix: Add "non_block" argument compatibility to collective_rpc()#3934wuxibin89 merged 3 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request aims to fix a TypeError in collective_rpc when used with newer versions of vLLM by adding **extra_kwargs to the method signature. While this correctly resolves the signature mismatch, the implementation has a critical flaw: the arguments captured by **extra_kwargs are not passed on to the remote worker. This will cause any new keyword arguments to be silently ignored, potentially leading to incorrect behavior. I've left a critical comment with a suggested fix to properly merge these arguments before they are sent.
| timeout: Optional[float] = None, | ||
| args: tuple = (), | ||
| kwargs: Optional[dict[str, Any]] = None, | ||
| **extra_kwargs |
There was a problem hiding this comment.
While adding **extra_kwargs fixes the TypeError, the collected arguments are then discarded, as they are not used in the function body. This means any new keyword arguments from vllm (like non_block, but also any others) will be silently ignored instead of being passed to the remote worker via the pickled message. This is a critical issue as it can lead to unexpected behavior if the remote method relies on these arguments.
The pull request description mentions not wanting to affect the existing non-blocking logic. You can achieve this while still correctly handling other arguments by merging extra_kwargs and then explicitly handling or removing non_block.
I recommend updating the function body to merge the arguments:
all_kwargs = kwargs or {}
all_kwargs.update(extra_kwargs)
# Now `all_kwargs` contains all keyword arguments.
# You can handle `non_block` if needed, for example, by ignoring it:
all_kwargs.pop("non_block", None)
# ...
message = pickle.dumps((sent_method, args, all_kwargs))
# ...This ensures forward compatibility with any other keyword arguments that might be added in the future, while still addressing the immediate TypeError. Since the function body is not in this diff, I cannot provide a direct code suggestion.
|
Thanks, it works! |
|
verl is TOO DEEPLY coupled with inference and training backend using its hybrid engine. I think we need a better tradeoff between compatibility and efficiency. |
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
…pc() (verl-project#3934) ### What does this PR do? This PR fixes a `TypeError` that occurs when newer versions of vLLM (v0.11+) attempt to call `ExternalZeroMQDistributedExecutor.collective_rpc`. The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument `non_block` to the `Executor.collective_rpc` interface. Since the `verl` implementation of `collective_rpc` did not define this parameter, calling it with `non_block=True` resulted in the error: `TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'`. By using `**extra_kwargs` in the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) --------- Co-authored-by: weikaiwen <weikaiwen1998@126.com>
What does this PR do?
This PR fixes a
TypeErrorthat occurs when newer versions of vLLM (v0.11+) attempt to callExternalZeroMQDistributedExecutor.collective_rpc.The issue stems from a recent vLLM update (vllm-project/vllm#24219) that added the keyword argument
non_blockto theExecutor.collective_rpcinterface. Since theverlimplementation ofcollective_rpcdid not define this parameter, calling it withnon_block=Trueresulted in the error:TypeError: ExternalZeroMQDistributedExecutor.collective_rpc() got an unexpected keyword argument 'non_block'.By using
**extra_kwargsin the function signature, we ensure compatibility with both legacy and modern vLLM interfaces without affecting the existing ZeroMQ non-blocking logic.Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)