[NVIDIA] Fix Llama4 Scout FP4 functionality issues #21499
vllm-bot merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request addresses weight loading and accuracy issues in the NVIDIA ModelOpt Llama4 Scout FP4 model. The changes include updates to the FlashInfer attention backend, a workaround in the CUTLASS MoE kernel, and corrections to weight/scale loading logic for quantized Llama4 models. A potential inconsistency in MoE scale loading in vllm/model_executor/models/llama4.py has been identified and flagged as high severity.
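As background for the scale-loading fixes mentioned above, here is a hedged sketch (illustrative values, not vLLM's actual loader): NVFP4 quantization applies two levels of scaling, a per-16-element block scale stored in FP8 (E4M3) and a single global FP32 scale, so loading "weight scales" correctly means wiring up both.

```python
# Hedged sketch (illustrative, not vLLM code): NVFP4 weights use two
# levels of scaling -- an FP8 (E4M3) scale per 16-element block plus one
# global FP32 scale -- so "scale loading" means handling both tensors.

BLOCK = 16  # NVFP4 block size

def dequantize(q_vals, block_scales, global_scale):
    """Apply per-block and global scales to decoded quantized magnitudes."""
    assert len(q_vals) == BLOCK * len(block_scales)
    return [q * block_scales[i // BLOCK] * global_scale
            for i, q in enumerate(q_vals)]

q = [1.0] * 32                      # pretend decoded FP4 values, 2 blocks
deq = dequantize(q, block_scales=[0.5, 2.0], global_scale=0.25)
assert deq[0] == 0.125              # 1.0 * 0.5 * 0.25
assert deq[16] == 0.5               # 1.0 * 2.0 * 0.25
```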
Force-pushed 97acab5 to 78aa123
Force-pushed 765aaff to afaf28d
This PR is ready for review. Thanks @jingyu-ml for helping.
The fastcheck failure doesn't seem to be caused by my change?
Force-pushed f75578d to 91ec86d
mgoin left a comment:
LGTM. It would be nicer if we had an attribute registered on the parameter to indicate whether it is FP4; currently the uint8-based logic could affect future formats.
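A hedged sketch of that suggestion (attribute and class names here, like `is_packed_nvfp4`, are hypothetical, not existing vLLM attributes): tag the parameter explicitly at creation time instead of inferring the format from its dtype, so a future uint8-packed format does not trip the FP4 path.

```python
# Hypothetical sketch: mark packed-FP4 parameters with an explicit flag
# rather than inferring "FP4" from dtype == uint8, which any other
# byte-packed format would also match. All names are illustrative.

class Param:
    def __init__(self, dtype, **attrs):
        self.dtype = dtype
        for k, v in attrs.items():
            setattr(self, k, v)

def needs_fp4_unpack(param):
    # Explicit attribute: unambiguous even when two formats share uint8.
    return getattr(param, "is_packed_nvfp4", False)

w_fp4 = Param("uint8", is_packed_nvfp4=True)   # NVFP4 packed weight
w_other = Param("uint8")                        # e.g. a future INT4 format
assert needs_fp4_unpack(w_fp4)
assert not needs_fp4_unpack(w_other)
```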
@nvpohanh please merge with main and fix the pre-commit errors to resolve the test failures.
Agreed. @jingyu-ml for visibility.
The pre-commit failure doesn't seem to be caused by my change... let me try again.
Force-pushed 91ec86d to 6ecb2bc
The buildkite/ci/pr/distributed-tests-2-gpus failures do not seem to be caused by my change...
Force-pushed 6ecb2bc to 118cc65
Okay, I see that the test failures are indeed caused by my change. I will debug this.
I found that my previous accuracy check was actually run with FP8... this time it is FP4 for real:
Force-pushed 23bd139 to c2f113a
I found this breaks Llama4 NVFP4 with compressed-tensors:
lm_eval --model vllm --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 475, in load_weights
moe_loaded = self.load_moe_expert_weights(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgoin/code/vllm/vllm/model_executor/models/llama4.py", line 402, in load_moe_expert_weights
weight_loader(param,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1202, in weight_loader
self._load_model_weight_or_group_weight_scale(
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 904, in _load_model_weight_or_group_weight_scale
self._load_w2(shard_dim=shard_dim,
File "/home/mgoin/code/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 971, in _load_w2
expert_data.copy_(loaded_weight)
RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 0
On main I'm able to run the eval correctly
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9090|± |0.0079|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
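For context on why packed-FP4 checkpoints are easy to mis-load (this is general background, not necessarily the exact root cause of the 5120-vs-4096 mismatch above, which the author debugged separately): NVFP4 stores two 4-bit values per uint8 byte, so the packed tensor is half the length of the logical one along the packed dimension. A loader that compares a logical shape against a packed one fails `copy_` with exactly this kind of size mismatch. A minimal sketch of the packing arithmetic:

```python
def pack_fp4(nibbles):
    """Pack 4-bit values (0..15) into bytes, two per byte (low nibble
    first). The packed buffer is half the length of the logical values."""
    assert len(nibbles) % 2 == 0
    return [(hi << 4) | lo for lo, hi in zip(nibbles[0::2], nibbles[1::2])]

logical = [1, 2, 3, 4]            # four 4-bit values
packed = pack_fp4(logical)        # two uint8-sized bytes
assert len(packed) == len(logical) // 2
assert packed == [0x21, 0x43]     # 0x21 = (2<<4)|1, 0x43 = (4<<4)|3
```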
@mgoin I will debug this today.
Force-pushed a46f5df to bb1e7e0
Pushed a new fix and added a bunch of comments to explain what's going on. Accuracy tests:
- ModelOpt Scout FP8:
- ModelOpt Scout FP4:
- RedHat Scout NVFP4:

I also verified with the reduced Maverick model (used in the pipeline) and it worked. I only ran TP1 and didn't have the chance to run TP2. However, my latest change is not related to the sharding logic, so it should be okay.
Fix the weight loading and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model. Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Force-pushed bb1e7e0 to edfd4f9
Need further debugging...
I see the same tests also failed in #21921, so they are probably not caused by my change...
I saw errors like this in the pipeline logs: but is that caused by my change?
mgoin left a comment:
Looks to be in a good state to me now, thanks for the hard work.
Validated that existing FP8, INT4, and FP4 models are unaffected:
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9037|± |0.0081|
| | |strict-match | 5|exact_match|↑ |0.8901|± |0.0086|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16,max_model_len=10000,tensor_parallel_size=2,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9151|± |0.0077|
| | |strict-match | 5|exact_match|↑ |0.8961|± |0.0084|
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,max_model_len=10000,enforce_eager=True,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9075|± |0.0080|
| | |strict-match | 5|exact_match|↑ |0.8992|± |0.0083|
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Fix the weight loading and accuracy issues when using the NVIDIA ModelOpt Llama4 Scout FP4 model.
Test Plan
Run Scout FP4/FP8 accuracy tests on TP2.
Test Result
Scout FP4 TP2:
Scout FP8 TP2:
(Optional) Documentation Update