[DOC]Add Memcache Usage Guide #6476
Conversation
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Summary of Changes

Hello @DreamerLeader, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, detailed documentation section that guides users through integrating and using Memcache as a KV Pool backend. It covers installation, configuration, environment setup, and practical examples for several deployment scenarios, so users can effectively leverage Memcache for improved KV caching.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds a comprehensive guide on using Memcache as a KV Pool backend to the documentation. The guide covers installation and configuration and provides detailed examples for both disaggregated and colocation scenarios. While this is a great addition, I found a couple of critical issues in the example scripts where the JSON configurations are invalid. These syntax errors will cause the example commands to fail and need to be fixed; a quick way to check the configs is sketched after the snippets below.
| "tp_size": 4 | ||
| } | ||
| } | ||
| } , |
| "tp_size": 8 | ||
| } | ||
| } | ||
| }, |
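One quick way to catch these syntax errors before running the examples is to pipe the embedded JSON through a validator. This is only a convenience sketch; the config string below is a stand-in, not the guide's actual config:

```bash
# Paste the JSON that the example passes to --kv-transfer-config or --additional_config
# between the single quotes; json.tool exits non-zero and reports the position of any syntax error.
echo '{"kv_connector": "MultiConnector", "kv_role": "kv_consumer"}' | python3 -m json.tool
```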
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
```bash
--gpu-memory-utilization 0.9 \
--max-num_seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
```
"chunked_prefill_for_mla":true is deprecated, we can remove this.
| "kv_role": "kv_consumer", | ||
| "kv_connector_extra_config":{ | ||
| "backend": "memcache", | ||
| "mooncake_rpc_port":"1" |
Try using `"lookup_rpc_port"`, since `"mooncake_rpc_port"` will be removed in the future.
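If the rename is applied, this hunk would presumably read as follows (same backend and port value; only the key changes):

```json
"kv_connector_extra_config": {
    "backend": "memcache",
    "lookup_rpc_port": "1"
}
```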
```bash
--max-num_seqs 28 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "torchair_graph_config":{"enabled":false},"chunked_prefill_for_mla":true}' \
```
Consider removing `"torchair_graph_config": {"enabled": false}` and `"chunked_prefill_for_mla": true`, since both are deprecated.
```bash
--served-model-name dsv3 \
--trust-remote-code \
--enforce-eager \
-dp 2 \
```
Consider using `--data-parallel-size` and `--tensor-parallel-size` for better clarity.
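Spelled out, the line in question might look like the sketch below; the tensor-parallel value of 4 is only a placeholder, since the original snippet does not show it:

```bash
--data-parallel-size 2 \
--tensor-parallel-size 4 \
```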
```bash
--model=xxxxxxxxxxxxxxxx/DeepSeek \
--served-model-name dsv3 \
--trust-remote-code \
-dp 2 \
```
Same here: consider changing to `--data-parallel-size` and `--tensor-parallel-size`.
```bash
--gpu-memory-utilization 0.9 \
--max-num_seqs 20 \
--no-enable-prefix-caching \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "chunked_prefill_for_mla":true}' \
```
"chunked_prefill_for_mla":true is no longer needed
| "kv_connector": "MultiConnector", | ||
| "kv_role": "kv_consumer", | ||
| "kv_connector_extra_config": { | ||
| "use_layerwise": false, |
move "use_layerwise": false into Ascend store connector config
| "kv_role": "kv_producer", | ||
| "kv_connector_extra_config":{ | ||
| "backend": "memcache", | ||
| "mooncake_rpc_port":"0" |
same here, "lookup_rpc_port"
```bash
--max-num_seqs 28 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enable_expert_parallel \
--additional_config='{"ascend_scheduler_config":{"enabled":false}, "enable_shared_expert_dp":false, "torchair_graph_config":{"enabled": false},"chunked_prefill_for_mla":true}' \
```
remove "torchair_graph_config":{"enabled": false},"chunked_prefill_for_mla":true}'
| "kv_connector": "MultiConnector", | ||
| "kv_role": "kv_consumer", | ||
| "kv_connector_extra_config": { | ||
| "use_layerwise": false, |
move "use_layerwise" in kv pool config
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
LGTM
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend:
- [Feat] 310p support MoE W8A8 quantizaition (vllm-project#6641)
- [TEST]add a qwen3-30b acc case with mooncake mempool (vllm-project#6244)
- [MOE Refactor] Remove QuantType in prepare_finalize.py (vllm-project#6534)
- [EPLB] Avoiding eplb's dependency on a specified model (vllm-project#6528)
- [Doc][Misc] Restructure tutorial documentation (vllm-project#6501)
- implement batch invariant with ascendc (vllm-project#6590)
- [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (vllm-project#6629)
- [Misc] upgrade to vllm main (vllm-project#6646)
- [main][Docs] Fix spelling errors across documentation (vllm-project#6649)
- [bugfix]Fix no attribute 'data' when MLAPO is enable (vllm-project#6601)
- [DOC]Add Memcache Usage Guide (vllm-project#6476)
- [main][bugfix] Fix spec acceptance rate problem in vllm_0.15.0 (vllm-project#6606)
- [Test][LoRA] Add e2e test for base model inference (vllm-project#6624)
- [refactor]Optimized the kvcache usage of Deepseek v3.2 (vllm-project#6610)
- [Feat](sfa,dcp) support dcp for sfa (vllm-project#6563)
- [BugFix] Add support for rotary_dim parameter when using partial rope in rotary_embedding (vllm-project#6581)
- [fix bug] fix tensor mismatch bug in sigmoid operate test case (vllm-project#6619)
- [Kernel]: Optimize DispatchFFNCombine performance (vllm-project#6468)
- [MISC] Clean up useless env USE_OPTIMIZED_MODEL (vllm-project#6618)
### What this PR does / why we need it?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc
---------
Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>