[bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP#37449
youkaichao merged 1 commit into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
self.worker.load_model()
...
scheduler_config = vllm_config.scheduler_config
Move this part of the code here, after self.worker.init_device(), so that self.worker.device is initialized properly.
Code Review
This pull request addresses a critical issue where asynchronous scheduling threads could implicitly create an extra CUDA context on device 0, leading to unnecessary memory consumption, especially in Expert Parallel (EP) and Data Parallel (DP) setups. The changes correctly relocate the asynchronous output copy thread initialization to ensure the worker is fully loaded before the thread starts. Additionally, the async_output_busy_loop method now explicitly sets the CUDA device for the thread to match the worker's assigned device, preventing the creation of unintended contexts. This is a well-targeted fix that directly resolves the described bug and improves resource management.
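The fix the review describes can be sketched as follows. This is an illustrative, simplified version, not vLLM's exact code: the function and argument names are assumptions, and the sketch falls back to CPU so it also runs on machines without a GPU. The key point is that the async thread calls torch.cuda.set_device before doing any other CUDA work.

```python
import threading
import torch

def async_output_busy_loop(device: torch.device, started: threading.Event):
    # The fix: explicitly bind this thread to the worker's device before
    # its first CUDA runtime call. A fresh thread has no CUDA context,
    # and the first runtime call would otherwise lazily create a primary
    # context on device 0.
    if device.type == "cuda":
        torch.cuda.set_device(device)
    started.set()
    # ... loop: copy finished model outputs from device to host ...

# Fall back to CPU so the sketch is runnable without a GPU.
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
started = threading.Event()
t = threading.Thread(target=async_output_busy_loop, args=(device, started), daemon=True)
t.start()
started.wait(timeout=5)
t.join()
```

Starting this thread only after self.worker.init_device() also guarantees that self.worker.device holds the correct device index by the time the thread binds to it.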
njhill left a comment
Nice find, thanks @youkaichao!
…/DP (vllm-project#37449) Signed-off-by: youkaichao <youkaichao@gmail.com>
Purpose
See https://forums.developer.nvidia.com/t/when-a-thread-has-a-primary-cuda-context-does-the-child-thread-it-creates-automatically-inherit-the-cuda-context/362810 : a new thread does not have any CUDA context, and a later CUDA runtime call may implicitly create a context on device 0.
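The root cause can be modeled with the standard library alone (the names here are illustrative, not the real CUDA API): the CUDA runtime tracks a "current device" per thread, defaulting to device 0, so a brand-new thread that makes a runtime call without first calling set_device ends up creating a context on device 0.

```python
import threading

# Toy model of CUDA's per-thread "current device" state.
_state = threading.local()
contexts_created = set()  # which devices ended up with a context

def set_device(index: int):
    _state.index = index

def runtime_call():
    # Lazily create a context on this thread's current device,
    # which defaults to 0 for a thread that never set one.
    contexts_created.add(getattr(_state, "index", 0))

def worker_thread(worker_device: int, fixed: bool):
    if fixed:
        set_device(worker_device)  # the fix: bind the thread first
    runtime_call()

# Without the fix, a worker pinned to device 1 still creates a
# context on device 0 from its async thread:
t = threading.Thread(target=worker_thread, args=(1, False))
t.start(); t.join()
print(contexts_created)  # {0}

contexts_created.clear()
t = threading.Thread(target=worker_thread, args=(1, True))
t.start(); t.join()
print(contexts_created)  # {1}
```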
Test Plan
Run vLLM serve with EP/DP:
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 -dp 2 -ep --port 8899
then test with multiple requests:
Test Result
Before the fix, after running the benchmark script, worker 1 takes around 800 MiB of memory on device 0.
After the fix, each worker resides on only one GPU.