[Bugfix] rename enable_flash_comm_v1 back to enable_sp#6883
[Bugfix] rename enable_flash_comm_v1 back to enable_sp#6883zzzzwwjj merged 1 commit intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on improving code consistency by renaming a key helper function related to sequence parallelism. The change standardizes the function's name across the codebase to match its configuration counterpart, without altering its functional behavior. This refactoring simplifies future maintenance and enhances readability by using a more descriptive and consistent naming scheme. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
48e3efb to
dd2d9f9
Compare
There was a problem hiding this comment.
Code Review
This pull request refactors the sequence parallelism helper function by renaming enable_flash_comm_v1 to enable_sp and updating all its call sites. The behavior is intended to be unchanged. My review confirms that the logic is preserved. However, the refactoring has left an unused vllm_config parameter in the new enable_sp function, which I've commented on. Addressing this would improve code clarity.
| _ENABLE_SP = ( | ||
| envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM1 | ||
| # Flash comm 1 should be enabled by env VLLM_ASCEND_ENABLE_FLASHCOMM1 | ||
| # We retain the env VLLM_ASCEND_ENABLE_FLASHCOMM here for backward compatibility. | ||
| or bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", "0"))) | ||
| ) |
There was a problem hiding this comment.
After this refactoring, the vllm_config parameter in the enable_sp function signature is no longer used within the function body. This makes the code less clear. Consider removing the vllm_config parameter and the associated logic that handles it when it's None. This would require updating all call sites to no longer pass this argument.
Restore the SP helper naming to enable_sp across runtime call sites to keep naming consistent with pass_config.enable_sp and existing SP semantics. Made-with: Cursor Signed-off-by: realliujiaxu <realliujiaxu@163.com>
dd2d9f9 to
11a3627
Compare
…6883) ### What this PR does / why we need it? PR vllm-project#5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@15d76f7 Signed-off-by: realliujiaxu <realliujiaxu@163.com>
…6883) ### What this PR does / why we need it? PR vllm-project#5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? #### 1. start server ``` bash vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}' ``` #### 2. curl ```bash curl -s http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "auto", "messages": [ {"role": "user", "content": "Hello. I have a question. Who are you?"} ], "max_tokens": 10, "temperature": 0.0, "ignore_eos_token": true }' ``` - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@15d76f7 Signed-off-by: realliujiaxu <realliujiaxu@163.com>
What this PR does / why we need it?
PR #5632 introduced a bug by replacing some branches gated by enable_sp with enable_flash_comm_v1. As a result, when enable_shared_expert_dp is enabled alone (i.e., VLLM_ASCEND_ENABLE_FLASHCOMM1=0 and VLLM_ASCEND_ENABLE_FLASHCOMM=0), the behavior becomes inconsistent with the previous logic and leads to accuracy issues. This PR restores the original enable_sp-based branching to recover expected behavior and accuracy.
Does this PR introduce any user-facing change?
No
How was this patch tested?
1. start server
vllm serve /home/weights/DeepSeek-V2-Lite-W8A8/ \ --port 8001 \ --served-model-name auto \ --max-model-len 1024 \ --enforce-eager \ --tensor-parallel-size 2 \ --data-parallel-size 2 \ --gpu-memory-utilization 0.9 \ --enable-expert-parallel \ --additional-config '{"enable_shared_expert_dp": true}'2. curl