[2/N] Elastic EP Milestone 2: Integrating NIXL-EP #35627
tlrmchlsmth merged 6 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @itayalroy, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. Once installed, pre-commit will run automatically on future commits.
Code Review
This pull request integrates NIXL-EP kernels for elastic expert parallelism, which is a significant enhancement. The changes are mostly about adding the new nixl_ep backend and its related logic. The implementation also includes some important fixes, such as properly destroying NCCL communicators to prevent resource leaks.
My review focuses on ensuring thread safety in the new NixlEPAll2AllManager. I've identified a potential race condition when accessing the shared buffer and suggested a fix using a lock to ensure robustness in a multi-threaded environment.
```python
def get_handle(self, kwargs):
    if (
        NixlEPAll2AllManager._buffer is not None
        and NixlEPAll2AllManager._buffer[1] == self.cpu_group.size()
    ):
        return NixlEPAll2AllManager._buffer[0]

    num_experts_per_rank = kwargs["num_global_experts"] // kwargs["num_ep_ranks"]
    nixl_kwargs = dict(
        max_num_tokens_per_dp_rank=kwargs["max_num_tokens_per_dp_rank"],
        token_hidden_size=kwargs["token_hidden_size"],
        num_experts_per_rank=num_experts_per_rank,
    )
    if NixlEPAll2AllManager._buffer is None:
        self._init_buffer(**nixl_kwargs)
    else:
        self._update_buffer()

    assert NixlEPAll2AllManager._buffer is not None
    handle = NixlEPAll2AllManager._buffer[0]
    return handle
```
The _buffer class attribute is a shared mutable state. The get_handle method reads and writes to this shared state without any synchronization, which can lead to a race condition if called from multiple threads concurrently. This could happen, for example, during dynamic LoRA loading, leading to incorrect behavior or crashes.
To prevent this, a lock should be used to protect access to _buffer.
First, please add a lock to the NixlEPAll2AllManager class:
```python
class NixlEPAll2AllManager(All2AllManagerBase):
    ...
    _lock = threading.Lock()
    ...
```

Then, wrap the `get_handle` method's logic with this lock as suggested below.
```python
def get_handle(self, kwargs):
    with NixlEPAll2AllManager._lock:
        if (
            NixlEPAll2AllManager._buffer is not None
            and NixlEPAll2AllManager._buffer[1] == self.cpu_group.size()
        ):
            return NixlEPAll2AllManager._buffer[0]
        num_experts_per_rank = kwargs["num_global_experts"] // kwargs["num_ep_ranks"]
        nixl_kwargs = dict(
            max_num_tokens_per_dp_rank=kwargs["max_num_tokens_per_dp_rank"],
            token_hidden_size=kwargs["token_hidden_size"],
            num_experts_per_rank=num_experts_per_rank,
        )
        if NixlEPAll2AllManager._buffer is None:
            self._init_buffer(**nixl_kwargs)
        else:
            self._update_buffer()
        assert NixlEPAll2AllManager._buffer is not None
        handle = NixlEPAll2AllManager._buffer[0]
        return handle
```
I don't think this can actually happen, since get_handle() only appears to be called from a single thread during initial setup or elastic EP reconfiguration. In any case, this isn't on the data path, and the cost of adding a lock here is negligible, so I added it to be safe.
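For illustration, the check-then-act hazard the reviewer describes can be reproduced with a generic cached-singleton sketch. This is hypothetical stand-in code, not the vLLM implementation: `BufferCache` and its members mirror the shape of `NixlEPAll2AllManager._buffer` but nothing else.

```python
import threading

class BufferCache:
    """Shared class-level buffer guarded by a lock, mirroring the
    reviewer's suggested pattern for NixlEPAll2AllManager._buffer."""

    _buffer = None           # shared mutable state: (handle, world_size)
    _lock = threading.Lock()
    init_count = 0           # counts how many times the buffer was built

    @classmethod
    def get_handle(cls, world_size):
        # Without the lock, two threads could both observe _buffer is None
        # and initialize it twice (a check-then-act race). The lock makes
        # the read-check-write sequence atomic.
        with cls._lock:
            if cls._buffer is not None and cls._buffer[1] == world_size:
                return cls._buffer[0]
            cls.init_count += 1
            cls._buffer = (object(), world_size)
            return cls._buffer[0]

threads = [threading.Thread(target=BufferCache.get_handle, args=(8,))
           for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(BufferCache.init_count)  # with the lock held, the buffer is built once
```

Since this path runs only during setup or reconfiguration, the lock's overhead is irrelevant, which matches the author's reasoning for adding it defensively.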
tlrmchlsmth left a comment
One question: does NIXL-EP use NVLink at all for intranode traffic? Is it suitable for MNNVL systems? And are there any de-duplication optimizations?
vllm/model_executor/layers/fused_moe/nixl_ep_prepare_finalize.py
This file is extremely similar to the DeepEP LL prepare_finalize implementation. Should we consolidate these?
@bnellnm WDYT?
It might be good to factor out some of the common utilities, e.g. dequant_fp8, maybe_roundup_layer_hidden_size (and maybe _do_quant?), but I think it might be good to keep the main implementations separate in case either backend changes its API.
We expect nixl_ep_prepare_finalize.py and deepep_ll_prepare_finalize.py to diverge fairly quickly as NIXL-EP progresses, and possibly on the DeepEP side too, so we preferred to keep them separate.
NVLink is used for intranode traffic, support for MNNVL is in review, and there are currently no de-duplication optimizations (though this might change; we are still evaluating the tradeoffs of that approach).
rebase fix
Signed-off-by: Yongji Wu <wuyongji317@gmail.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Scale-up after scale-down would hang indefinitely in the ZMQ poll loop waiting for engine identity messages. Without ROUTER_HANDOVER enabled on the ZMQ ROUTER socket, engines reconnecting with previously-used identities had their messages silently dropped, because the ROUTER still held stale routing entries from the dead connections. Signed-off-by: Itay Alroy <ialroy@nvidia.com>
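The fix described above amounts to enabling the `ZMQ_ROUTER_HANDOVER` socket option on the ROUTER socket. A minimal pyzmq sketch (not the vLLM code itself; socket names and the "engine" framing are illustrative):

```python
import zmq

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
# With ROUTER_HANDOVER enabled, a peer reconnecting with a previously-used
# identity takes over the stale routing entry instead of having its
# messages silently dropped.
router.setsockopt(zmq.ROUTER_HANDOVER, 1)
router.setsockopt(zmq.RCVTIMEO, 5000)  # fail fast instead of hanging forever
port = router.bind_to_random_port("tcp://127.0.0.1")

def connect_engine(identity: bytes) -> zmq.Socket:
    s = ctx.socket(zmq.DEALER)
    s.setsockopt(zmq.IDENTITY, identity)
    s.connect(f"tcp://127.0.0.1:{port}")
    return s

# First "engine" registers its identity with the ROUTER.
engine = connect_engine(b"engine-0")
engine.send(b"hello")
ident, msg = router.recv_multipart()

# Simulate a crash: the engine dies and a replacement reconnects with the
# same identity (as happens after elastic EP scale-down then scale-up).
engine.close(linger=0)
engine = connect_engine(b"engine-0")
engine.send(b"hello again")
# With ROUTER_HANDOVER=1 this message is delivered; without it, the
# ROUTER's stale routing entry would cause it to be dropped and the
# poll loop to hang waiting for an identity message that never arrives.
ident, msg = router.recv_multipart()
print(ident, msg)
engine.close(linger=0)
router.close(linger=0)
```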
The last thing missing is better integration testing, which is tricky given the heavy dependency. Per @itayalroy, we should be able to pip install nixl in either the next NIXL release or the one after, which will make this easier to manage in the test image.
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Yongji Wu <wuyongji317@gmail.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
Signed-off-by: whycoming <120623296@qq.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
This PR is a rebase of #29630, originally authored by @libertyeagle, that integrates NIXL-EP kernels into vLLM.
NIXL-EP is an implementation of expert-parallel communication kernels over NIXL's device API. It provides elastic scaling capabilities, enabling dynamic addition and removal of processes (ranks) during runtime, without the need to destroy and recreate communicators during scaling up/down.
This PR also includes a few small fixes to vLLM Elastic EP (#34861) that we found while thoroughly testing vLLM with NIXL-EP.