[DP] Fix dp padding logic in dummy_run #4705
Conversation
Code Review
This pull request correctly applies padding logic to dummy_run to align with changes in CudagraphDispatcher. The use of num_tokens_padded for tensor slicing and subsequent function calls is consistent. However, I've found a few issues: a critical bug where num_reqs_padded is used instead of num_tokens_padded when updating num_tokens_across_dp, a potential issue with MoE communication method selection using unpadded token counts, and a leftover debug print statement. Addressing these will ensure correctness and clean up the code.
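For illustration only, here is a minimal sketch of the `num_tokens_across_dp` fix described above. The names `num_tokens_padded`, `num_reqs_padded`, and `num_tokens_across_dp` come from the review itself; the function signature and the per-rank indexing are assumptions, not the actual vLLM Ascend code:

```python
import torch

def update_num_tokens_across_dp(num_tokens_across_dp: torch.Tensor,
                                dp_rank: int,
                                num_tokens_padded: int) -> None:
    """Record this rank's padded token count for data-parallel coordination.

    Hypothetical sketch of the fix; the real dummy_run carries much more
    context. The buggy version flagged by the review wrote the padded
    *request* count here instead:
        num_tokens_across_dp[dp_rank] = num_reqs_padded  # wrong
    """
    num_tokens_across_dp[dp_rank] = num_tokens_padded

# Toy usage: 4 DP ranks, this rank padded up to the 96-token capture size.
counts = torch.zeros(4, dtype=torch.int32)
update_num_tokens_across_dp(counts, dp_rank=0, num_tokens_padded=96)
assert counts[0].item() == 96
```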
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This bugfix is really imperative for DP scenarios. Can you fix the UT and merge this ASAP? @MengqingCao
Signed-off-by: MengqingCao <cmq0113@163.com>
@GDzhu01 I've tested this PR on the deepseek-r1-w8a8 model with the latest code, could you help test again? Thx!
Yes, I think an approval from @GDzhu01 is needed after his test.
After offline discussion with @GDzhu01, I confirmed that there is no problem with this PR, let's merge this!
### What this PR does / why we need it?

Fix DP padding logic in `dummy_run`. After vllm-project/vllm#28579, `num_tokens` will be padded in `CudagraphDispatcher`, thus we also need to do the pad in `dummy_run`.

### How was this patch tested?

Tested locally with the following scripts:

```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
    --model wemaster/deepseek_mtp_main_random_bf16 \
    --trust-remote-code \
    --data-parallel-size 4 \
    --tensor-parallel-size 1 \
    --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --enable-expert-parallel
```

```bash
vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 \
    --endpoint /v1/completions \
    --dataset-name random \
    --random-input 512 \
    --random-output 100 \
    --num-prompts 48 \
    --request-rate 1 \
    --ready-check-timeout-sec 0
```

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
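As a rough, hypothetical illustration of the padding the description refers to (the real logic lives in vLLM's `CudagraphDispatcher` and in this PR's `dummy_run`), a token count is rounded up to the smallest configured capture size that can hold it. The helper name and the eager fallback for oversized batches are assumptions:

```python
import bisect

def pad_to_capture_size(num_tokens: int, capture_sizes: list[int]) -> int:
    """Round num_tokens up to the smallest cudagraph capture size that fits.

    Hypothetical sketch, not the actual dispatcher code. When num_tokens
    exceeds every captured size, the raw count is returned (eager path).
    """
    sizes = sorted(capture_sizes)
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else num_tokens

# With the capture sizes from the test command above:
assert pad_to_capture_size(48, [96]) == 96   # padded up to the captured size
assert pad_to_capture_size(96, [96]) == 96   # exact fit
assert pad_to_capture_size(128, [96]) == 128 # larger than any captured size
```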