
[CI] Fix online FP8 quantization materializing tensors on CPU#38456

Closed
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix-online-fp8-cpu-materialization

Conversation

@haosdent
Contributor

@haosdent haosdent commented Mar 29, 2026

Purpose

Address CI failures:

After #38426 moved load_weights() outside the with target_device: context (to fix OOM), online FP8 quantization broke in two ways:

  1. materialize_meta_tensor() uses torch.empty_strided() without a device= arg, relying on the ambient device context. Without it, tensors are created on CPU and process_weights_after_loading fails with NotImplementedError: Could not run '_C::dynamic_scaled_fp8_quant' with arguments from the 'CPU' backend.

  2. Fp8OnlineMoEMethod.process_weights_after_loading() creates scale tensors with torch.ones(...) without device=, causing them to land on CPU and crash the Triton fused MoE kernel with ValueError: Pointer argument (at 5) cannot be accessed from Triton (cpu tensor?).
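Both failure modes come from the same pattern: tensor-creation calls that default to the ambient device context. A minimal sketch of the first fix, assuming a helper shaped like `materialize_meta_tensor` (the exact vLLM signature may differ):

```python
import torch

def materialize_meta_tensor(meta: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Pass device= explicitly instead of relying on an enclosing
    # `with target_device:` context. Once #38426 moved load_weights()
    # outside that context, a bare torch.empty_strided() would default
    # to CPU and later break the fp8 quant op, which has no CPU backend.
    return torch.empty_strided(
        meta.shape, meta.stride(), dtype=meta.dtype, device=device
    )

# A meta tensor carries shape/stride/dtype but no storage.
meta = torch.empty((4, 8), device="meta")
out = materialize_meta_tensor(meta, torch.device("cpu"))
```

In the actual fix the caller would pass the platform device (e.g. `current_platform.device_type`) rather than CPU; CPU is used here only so the sketch runs anywhere.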

Test Plan

pytest tests/quantization/test_fp8.py::test_online_quantization \
  tests/quantization/test_fp8.py::test_online_quant_peak_mem \
  tests/quantization/test_fp8.py::test_online_quant_load_format_dummy \
  tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-0] \
  tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-3] -v

Test Result

All 8 tests pass:

tests/quantization/test_fp8.py::test_online_quantization[False-False-auto] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-False-fp8] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-True-auto] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-True-fp8] PASSED
tests/quantization/test_fp8.py::test_online_quant_peak_mem PASSED
tests/quantization/test_fp8.py::test_online_quant_load_format_dummy PASSED
tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-0] PASSED
tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-3] PASSED

@haosdent haosdent requested a review from 22quinn as a code owner March 29, 2026 07:33

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added the bug Something isn't working label Mar 29, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request modifies the model loading logic to support explicit device targeting during tensor materialization. By updating materialize_meta_tensor and materialize_layer to accept a device parameter, the code now ensures tensors are created on the correct platform device. The review feedback identifies the use of a private PyTorch API for type hinting and recommends using public types to ensure long-term stability.

@haosdent haosdent changed the title [Bugfix] Fix online FP8 quantization materializing tensors on CPU [CI] Fix online FP8 quantization materializing tensors on CPU Mar 29, 2026
@haosdent
Contributor Author

@jikunshang Could you take a look when you're available? Many thanks!

@haosdent
Contributor Author

Ah, I didn't notice that #38442 could address the CI failure mentioned in this PR. I think #38442 is the better fix since it addresses the issue systematically.
While #38442 may take longer to review since it changes more files, this PR is still useful to unblock CI until #38442 is merged.

After vllm-project#38426 narrowed the `with target_device:` context to only wrap
`initialize_model()`, code that relied on the ambient device context
during `load_weights()` started creating tensors on CPU instead of GPU.

This fixes three locations:

1. `materialize_meta_tensor()` / `materialize_layer()` — accept an
   explicit `device` parameter instead of relying on the ambient
   `torch.device` context.

2. `DummyModelLoader.load_weights()` — passes
   `device=current_platform.device_type` when materializing meta
   tensors.

3. `Fp8OnlineMoEMethod.process_weights_after_loading()` — the
   `torch.ones` calls for `w13_scale` / `w2_scale` now specify
   `device=layer.w13_weight.device` so the scale tensors land on
   the same GPU as the already-materialized weights.
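The third fix above can be sketched as follows. The `Layer` class and shapes here are hypothetical stand-ins for the MoE layer state; the point is only that the scale tensors inherit their device from the already-materialized weights:

```python
import torch

class Layer:
    """Hypothetical stand-in for a fused-MoE layer with materialized weights."""
    def __init__(self, device: str):
        # (num_experts, out, in) expert weight tensors, already on-device.
        self.w13_weight = torch.randn(2, 16, 8, device=device)
        self.w2_weight = torch.randn(2, 8, 16, device=device)

def create_scales(layer: Layer) -> tuple[torch.Tensor, torch.Tensor]:
    # Create per-expert scales on the same device as the weights.
    # A bare torch.ones(num_experts) would land on CPU and crash the
    # Triton fused-MoE kernel with a "cpu tensor?" pointer error.
    num_experts = layer.w13_weight.shape[0]
    w13_scale = torch.ones(num_experts, dtype=torch.float32,
                           device=layer.w13_weight.device)
    w2_scale = torch.ones(num_experts, dtype=torch.float32,
                          device=layer.w2_weight.device)
    return w13_scale, w2_scale

layer = Layer("cpu")  # would be "cuda" in practice; CPU keeps the sketch portable
w13_scale, w2_scale = create_scales(layer)
```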

Signed-off-by: haosdent <haosdent@gmail.com>
@jikunshang
Collaborator

Thanks for the quick fix! #38442 is approved now. Let's wait for the CI result; hopefully it can be merged soon.

@mergify

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
@haosdent haosdent closed this Mar 30, 2026