[NVIDIA] Fix Llama4 Scout FP4 functionality issues #21499
vllm-bot merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request addresses weight loading and accuracy issues in the NVIDIA ModelOpt Llama4 Scout FP4 model. The changes include updates to the FlashInfer attention backend, a workaround in the CUTLASS MoE kernel, and corrections to weight/scale loading logic for quantized Llama4 models. A potential inconsistency in MoE scale loading in vllm/model_executor/models/llama4.py has been identified and flagged as high severity.
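As background for the scale-loading fixes mentioned above, here is a hedged sketch (illustrative values, not vLLM's actual loader): NVFP4 quantization applies two levels of scaling, a per-16-element block scale stored in FP8 (E4M3) and a single global FP32 scale, so loading "weight scales" correctly means wiring up both.

```python
# Hedged sketch (illustrative, not vLLM code): NVFP4 weights use two
# levels of scaling -- an FP8 (E4M3) scale per 16-element block plus one
# global FP32 scale -- so "scale loading" means handling both tensors.

BLOCK = 16  # NVFP4 block size

def dequantize(q_vals, block_scales, global_scale):
    """Apply per-block and global scales to decoded quantized magnitudes."""
    assert len(q_vals) == BLOCK * len(block_scales)
    return [q * block_scales[i // BLOCK] * global_scale
            for i, q in enumerate(q_vals)]

q = [1.0] * 32                      # pretend decoded FP4 values, 2 blocks
deq = dequantize(q, block_scales=[0.5, 2.0], global_scale=0.25)
assert deq[0] == 0.125              # 1.0 * 0.5 * 0.25
assert deq[16] == 0.5               # 1.0 * 2.0 * 0.25
```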
Force-pushed 97acab5 to 78aa123
Force-pushed 765aaff to afaf28d
This PR is ready for review. Thanks @jingyu-ml for helping.
The fastcheck failure doesn't seem to be caused by my change?
Force-pushed f75578d to 91ec86d
mgoin left a comment:
LGTM. It would be nicer if we had an attribute registered on the parameter to indicate whether it is FP4; currently the uint8-based logic could affect future formats.
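A hedged sketch of that suggestion (attribute and class names here, like `is_packed_nvfp4`, are hypothetical, not existing vLLM attributes): tag the parameter explicitly at creation time instead of inferring the format from its dtype, so a future uint8-packed format does not trip the FP4 path.

```python
# Hypothetical sketch: mark packed-FP4 parameters with an explicit flag
# rather than inferring "FP4" from dtype == uint8, which any other
# byte-packed format would also match. All names are illustrative.

class Param:
    def __init__(self, dtype, **attrs):
        self.dtype = dtype
        for k, v in attrs.items():
            setattr(self, k, v)

def needs_fp4_unpack(param):
    # Explicit attribute: unambiguous even when two formats share uint8.
    return getattr(param, "is_packed_nvfp4", False)

w_fp4 = Param("uint8", is_packed_nvfp4=True)   # NVFP4 packed weight
w_other = Param("uint8")                        # e.g. a future INT4 format
assert needs_fp4_unpack(w_fp4)
assert not needs_fp4_unpack(w_other)
```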
@nvpohanh please merge with main and fix the pre-commit errors to resolve the test failures.
Agreed. @jingyu-ml for visibility.
The pre-commit failure doesn't seem to be caused by my change... let me try again.
Force-pushed 91ec86d to 6ecb2bc
The buildkite/ci/pr/distributed-tests-2-gpus failures do not seem to be caused by my change...
Force-pushed 6ecb2bc to 118cc65
Okay, I see that the test failures are indeed caused by my change. I will debug this.
I found that my previous accuracy check was actually run with FP8... this time it is FP4 for real:
Force-pushed 23bd139 to c2f113a
I found this breaks Llama4 NVFP4 with compressed-tensors:
lm_eval --model vllm --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 475, in load_weights
moe_loaded = self.load_moe_expert_weights(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 402, in load_moe_expert_weights
weight_loader(param,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1202, in weight_loader
self._load_model_weight_or_group_weight_scale(
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 904, in _load_model_weight_or_group_weight_scale
self._load_w2(shard_dim=shard_dim,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 971, in _load_w2
expert_data.copy_(loaded_weight)
RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 0
On main I'm able to run the eval correctly
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9090|± |0.0079|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
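For context on why packed-FP4 checkpoints are easy to mis-load (this is general background, not necessarily the exact root cause of the 5120-vs-4096 mismatch above, which the author debugged separately): NVFP4 stores two 4-bit values per uint8 byte, so the packed tensor is half the length of the logical one along the packed dimension. A loader that compares a logical shape against a packed one fails `copy_` with exactly this kind of size mismatch. A minimal sketch of the packing arithmetic:

```python
def pack_fp4(nibbles):
    """Pack 4-bit values (0..15) into bytes, two per byte (low nibble
    first). The packed buffer is half the length of the logical values."""
    assert len(nibbles) % 2 == 0
    return [(hi << 4) | lo for lo, hi in zip(nibbles[0::2], nibbles[1::2])]

logical = [1, 2, 3, 4]            # four 4-bit values
packed = pack_fp4(logical)        # two uint8-sized bytes
assert len(packed) == len(logical) // 2
assert packed == [0x21, 0x43]     # 0x21 = (2<<4)|1, 0x43 = (4<<4)|3
```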
@mgoin I will debug this today.
Force-pushed a46f5df to bb1e7e0
Pushed a new fix and added a bunch of comments to explain what's going on. Accuracy tests:
- ModelOpt Scout FP8:
- ModelOpt Scout FP4:
- RedHat Scout NVFP4:

I also verified with the reduced Maverick model (used in the pipeline) and it worked. I only ran TP1 and didn't have the chance to run TP2. However, my latest change is not related to the sharding logic, so it should be okay.
Fix the weight loading and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model. Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Force-pushed bb1e7e0 to edfd4f9
Need further debugging...
I see the same tests also failed in #21921, so they are probably not caused by my change...
I saw errors like this in the pipeline logs: but is that caused by my change?
mgoin left a comment:
Looks to be in a good state to me now, thanks for the hard work.
Validated that existing FP8, INT4, and FP4 models are unaffected:
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9037|± |0.0081|
| | |strict-match | 5|exact_match|↑ |0.8901|± |0.0086|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9151|± |0.0077|
| | |strict-match | 5|exact_match|↑ |0.8961|± |0.0084|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9075|± |0.0080|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Fix the weight loading and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.
Test Plan
Run Scout FP4/FP8 accuracy tests on TP2.
Test Result
Scout FP4 TP2:
Scout FP8 TP2:
(Optional) Documentation Update