[Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method #5143
wangxiyuan merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Force-pushed 5915e23 to 63de002
Any progress? If this PR is still alive, please rebase to main and make CI happy; otherwise you can close it. Thanks
The relevant colleagues are on holiday right now. I will push this PR to be merged next week after they come back.
Force-pushed 5c75162 to c2c9f39
whx-sjtu
left a comment
I remember there is an eagle3-related bug that decreases the acceptance rate to 0 for this quant method. Does this PR include the related fix? I can't find it.
Force-pushed c2c9f39 to 8338edf
Force-pushed 0c5a587 to 79aea0d
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed 79aea0d to f13883c
Force-pushed e570420 to 1d871ac
add e2e ci

Signed-off-by: maxmgrdv <gordeev.maxim@huawei.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Force-pushed 1d871ac to 6190cd3
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (51 commits)
  - [Bugfix] Remove `use_aclgraph` in mtp_proposer and use `use_cuda_graph` (vllm-project#6032)
  - [BugFix] fix 3vl dense model load quant weight (vllm-project#6100)
  - [CP&SP] Integrate FIA operator in mla_cp._forward_decode (vllm-project#5641)
  - [CI][Doc] Upgrade wheel building's CANN to 8.5.0 and update the Docs (vllm-project#6145)
  - [CI] Install clang in dokerfile for triton ascend (vllm-project#4409)
  - [Main] Upgrade PTA to 2.9.0 (vllm-project#6112)
  - [Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (vllm-project#5721)
  - [P/D][PCP] bugfix pcp force free twice caused logger error (vllm-project#6124)
  - [BugFix] converting pa get_workspace back to capturing (vllm-project#5833)
  - [CI] optimize lint term (vllm-project#5986)
  - [Bugfix] Fix Triton operator usage for multimodal models based on the `mrope_interleaved` parameter (vllm-project#6042)
  - [bugfix][npugraph_ex] fix the model output type issue caused by manually modify FX graph (vllm-project#6015)
  - [BugFix] Support setting tp=1 for the Eagle draft model to take effect (vllm-project#6097)
  - [Misc] Bump mooncake version to v0.3.8.post1 (vllm-project#6110)
  - [Feature] Enable DispatchGmmCombineDecode when eagle is moe with w8a8 or not moe [RFC: issue 5476] (vllm-project#5758)
  - [bugfix] adapt_remote_request_id (vllm-project#6051)
  - [Feature] Add support of new W4A4_LAOS_DYNAMIC quantization method (vllm-project#5143)
  - [Feature] Support DSA-CP for Hybrid scenario (vllm-project#5702)
  - [CI] Upgrade CANN to 8.5.0 (vllm-project#6070)
  - Default enable MLAPO (vllm-project#5952)
  - ...
…llm-project#5143)

Introduce W4A4 LAOS Quantization for better model compression and inference efficiency on Ascend devices.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
Introduce W4A4 LAOS Quantization for better model compression and inference efficiency on Ascend devices.
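For readers unfamiliar with the naming: "W4A4" means both weights and activations are quantized to 4-bit integers, and "dynamic" means activation scales are computed at runtime from each batch's actual value range rather than being pre-calibrated. The sketch below is NOT the PR's implementation (which targets Ascend NPU kernels and the LAOS scheme); it is only a minimal, hedged illustration of the symmetric 4-bit quantize/dequantize arithmetic that such methods build on. All function names here are hypothetical.

```python
# Minimal sketch of symmetric 4-bit (int4) quantization, for illustration only.
# Signed int4 values span [-8, 7]; a per-tensor scale maps floats onto that range.

def quantize_sym_int4(values):
    """Map floats to integers in [-8, 7] with a single dynamic scale.

    "Dynamic" here means the scale is derived from the live max-abs of
    the input, as done for activations in dynamic quantization schemes.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0  # 7 = largest positive int4 value
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int4 codes and the scale."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.07]
q, s = quantize_sym_int4(weights)
deq = dequantize(q, s)
```

In a real W4A4 kernel the matmul is performed directly on the int4 codes with the scales applied to the accumulator, which is where the memory and throughput savings on Ascend hardware come from; this sketch only shows the number mapping.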