[Doc] Add the user_guide doc file regarding fine-grained TP. #5084

jianzs merged 8 commits into vllm-project:main
Conversation
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Code Review
This pull request adds user documentation for the Fine-Grained Tensor Parallelism feature. The documentation is comprehensive, covering the overview, benefits, usage, and deployment recommendations. However, I've identified a few critical issues. The documentation for mlp_tensor_parallel_size is misleading due to an underlying bug where its value is incorrectly tied to embedding_tensor_parallel_size. Additionally, the documentation for o_proj is missing a key restriction regarding its use in prefill-decode disaggregation scenarios. Finally, there is a syntax error in one of the configuration examples. These issues should be addressed to ensure the documentation is accurate and prevents user errors.
### Component & Execution Mode Support

- **`embedding`, `lm_head`, and `mlp`**: Can be configured with fine-grained TP in any execution context—prefill, decode, or mixed deployment.
- **`o_proj`**: Currently, fine-grained TP for the attention output projection is **only supported in graph-capture mode** (e.g., CUDA Graphs). It cannot be enabled in eager execution.
The documentation for o_proj fine-grained TP is incomplete. The implementation in vllm_ascend/ascend_config.py (lines 195-198) reveals an additional critical restriction: oproj_tensor_parallel_size is only supported in prefill-decode (PD) disaggregation scenarios and can only be used on the decode (consumer) nodes. This should be explicitly mentioned to prevent users from encountering runtime errors in unsupported configurations.
I suggest updating the line to be more explicit about all restrictions:
- **`o_proj`**: Currently, fine-grained TP for the attention output projection has two key restrictions:
- It is **only supported in graph-capture mode** (e.g., CUDA Graphs) and cannot be enabled in eager execution.
  - It is **only supported in prefill-decode (PD) disaggregation scenarios and must be used on a decode (consumer) node**.
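The two restrictions above can be sketched as a small validation helper. This is a simplified, hypothetical re-implementation for illustration only; the actual check lives in `vllm_ascend/ascend_config.py`, and the function name, signature, and role string used here are assumptions.

```python
def validate_oproj_tp(oproj_tensor_parallel_size,
                      enforce_eager,
                      kv_transfer_role):
    """Reject configurations the review flags as unsupported:
    o_proj fine-grained TP needs graph-capture mode and a decode
    (kv consumer) node in a PD-disaggregated deployment.

    Hypothetical sketch -- not the real vllm-ascend check.
    """
    if oproj_tensor_parallel_size is None:
        return  # feature disabled, nothing to validate

    if enforce_eager:
        raise ValueError(
            "oproj_tensor_parallel_size requires graph-capture mode; "
            "it cannot be used with enforce_eager=True")

    if kv_transfer_role != "kv_consumer":
        raise ValueError(
            "oproj_tensor_parallel_size is only supported on decode "
            "(kv_consumer) nodes in prefill-decode disaggregation")
```

With this shape, enabling `oproj_tensor_parallel_size` in eager mode or on a prefill (producer) node fails fast at config time rather than at runtime.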
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed ff932a5 to d700184
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
What this PR does / why we need it?
Add user guide for Fine-Grained Tensor Parallelism feature.
Documents usage, supported components (`embedding`, `lm_head`, `o_proj`, `mlp`/`dense_ffn`), model compatibility, and deployment guidelines.

Functionality implemented in:
- add mlp tp optimze #2120 (MLP TP)
- [feat]: oproj tensor parallelism in pure DP and graph-mode scenarios. #2167 (OProj TP)
- [Feat]: Add custom lmhead tensor model parallel #2309 (LM Head TP)
- [Feat] Add custom Embedding tensor model parallel #2616 (Embedding TP)
- [Feat] Support MLP_TP feature, exclude MOE layer #4999 (Dense FFN TP)
vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e
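As a rough sketch of how a user might enable these options: the key names below are taken from the review comments in this thread (`oproj_tensor_parallel_size`, `embedding_tensor_parallel_size`, `mlp_tensor_parallel_size`), while the model path, values, and overall command shape are assumptions — consult the merged user guide for the authoritative form.

```shell
# Sketch only: option names follow the review comments above; the model
# path, sizes, and command shape are illustrative assumptions.
vllm serve /path/to/model \
  --tensor-parallel-size 8 \
  --additional-config '{
    "oproj_tensor_parallel_size": 2,
    "embedding_tensor_parallel_size": 2,
    "mlp_tensor_parallel_size": 2
  }'
```

Note that, per the review above, `oproj_tensor_parallel_size` would additionally require graph-capture mode and a decode (kv consumer) node in a PD-disaggregated deployment.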