
[CI] Fix online FP8 quantization materializing tensors on CPU#38456

Closed
haosdent wants to merge 1 commit into vllm-project:main from haosdent:fix-online-fp8-cpu-materialization

Conversation

@haosdent
Contributor

@haosdent haosdent commented Mar 29, 2026

Purpose

Address CI failures:

After #38426 moved load_weights() outside the with target_device: context (to fix OOM), online FP8 quantization broke in two ways:

  1. materialize_meta_tensor() uses torch.empty_strided() without a device= arg, relying on the ambient device context. Without it, tensors are created on CPU and process_weights_after_loading fails with NotImplementedError: Could not run '_C::dynamic_scaled_fp8_quant' with arguments from the 'CPU' backend.

  2. Fp8OnlineMoEMethod.process_weights_after_loading() creates scale tensors with torch.ones(...) without device=, causing them to land on CPU and crash the Triton fused MoE kernel with ValueError: Pointer argument (at 5) cannot be accessed from Triton (cpu tensor?).
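Both failure modes come from the same pattern: tensor-creation calls that default to the ambient device context. A minimal sketch of the first fix, assuming a helper shaped like `materialize_meta_tensor` (the exact vLLM signature may differ):

```python
import torch

def materialize_meta_tensor(meta: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Pass device= explicitly instead of relying on an enclosing
    # `with target_device:` context. Once #38426 moved load_weights()
    # outside that context, a bare torch.empty_strided() would default
    # to CPU and later break the fp8 quant op, which has no CPU backend.
    return torch.empty_strided(
        meta.shape, meta.stride(), dtype=meta.dtype, device=device
    )

# A meta tensor carries shape/stride/dtype but no storage.
meta = torch.empty((4, 8), device="meta")
out = materialize_meta_tensor(meta, torch.device("cpu"))
```

In the actual fix the caller would pass the platform device (e.g. `current_platform.device_type`) rather than CPU; CPU is used here only so the sketch runs anywhere.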

Test Plan

pytest tests/quantization/test_fp8.py::test_online_quantization \
  tests/quantization/test_fp8.py::test_online_quant_peak_mem \
  tests/quantization/test_fp8.py::test_online_quant_load_format_dummy \
  tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-0] \
  tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-3] -v

Test Result

All 8 tests pass:

tests/quantization/test_fp8.py::test_online_quantization[False-False-auto] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-False-fp8] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-True-auto] PASSED
tests/quantization/test_fp8.py::test_online_quantization[False-True-fp8] PASSED
tests/quantization/test_fp8.py::test_online_quant_peak_mem PASSED
tests/quantization/test_fp8.py::test_online_quant_load_format_dummy PASSED
tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-0] PASSED
tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile[Qwen/Qwen2-0.5B-None-3] PASSED

@haosdent haosdent requested a review from 22quinn as a code owner March 29, 2026 07:33

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify bot added the bug Something isn't working label Mar 29, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request modifies the model loading logic to support explicit device targeting during tensor materialization. By updating materialize_meta_tensor and materialize_layer to accept a device parameter, the code now ensures tensors are created on the correct platform device. The review feedback identifies the use of a private PyTorch API for type hinting and recommends using public types to ensure long-term stability.

@haosdent haosdent changed the title [Bugfix] Fix online FP8 quantization materializing tensors on CPU [CI] Fix online FP8 quantization materializing tensors on CPU Mar 29, 2026
@haosdent
Contributor Author

@jikunshang Could you take a look when you're available? Many thanks!

@haosdent
Contributor Author

Ah, I didn't notice that #38442 could address the CI failure mentioned in this PR. I think #38442 is the better fix since it addresses the issue systematically.
While #38442 may take longer to review since it changes more files, this PR is still useful to unblock CI until #38442 is merged.

After vllm-project#38426 narrowed the `with target_device:` context to only wrap
`initialize_model()`, code that relied on the ambient device context
during `load_weights()` started creating tensors on CPU instead of GPU.

This fixes three locations:

1. `materialize_meta_tensor()` / `materialize_layer()` — accept an
   explicit `device` parameter instead of relying on the ambient
   `torch.device` context.

2. `DummyModelLoader.load_weights()` — passes
   `device=current_platform.device_type` when materializing meta
   tensors.

3. `Fp8OnlineMoEMethod.process_weights_after_loading()` — the
   `torch.ones` calls for `w13_scale` / `w2_scale` now specify
   `device=layer.w13_weight.device` so the scale tensors land on
   the same GPU as the already-materialized weights.
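The third fix above can be sketched as follows. The `Layer` class and shapes here are hypothetical stand-ins for the MoE layer state; the point is only that the scale tensors inherit their device from the already-materialized weights:

```python
import torch

class Layer:
    """Hypothetical stand-in for a fused-MoE layer with materialized weights."""
    def __init__(self, device: str):
        # (num_experts, out, in) expert weight tensors, already on-device.
        self.w13_weight = torch.randn(2, 16, 8, device=device)
        self.w2_weight = torch.randn(2, 8, 16, device=device)

def create_scales(layer: Layer) -> tuple[torch.Tensor, torch.Tensor]:
    # Create per-expert scales on the same device as the weights.
    # A bare torch.ones(num_experts) would land on CPU and crash the
    # Triton fused-MoE kernel with a "cpu tensor?" pointer error.
    num_experts = layer.w13_weight.shape[0]
    w13_scale = torch.ones(num_experts, dtype=torch.float32,
                           device=layer.w13_weight.device)
    w2_scale = torch.ones(num_experts, dtype=torch.float32,
                          device=layer.w2_weight.device)
    return w13_scale, w2_scale

layer = Layer("cpu")  # would be "cuda" in practice; CPU keeps the sketch portable
w13_scale, w2_scale = create_scales(layer)
```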

Signed-off-by: haosdent <haosdent@gmail.com>
@jikunshang
Collaborator

Thanks for the quick fix! #38442 is approved now. Let's wait for the CI result; hopefully it can be merged soon.

@mergify

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @haosdent.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
@haosdent haosdent closed this Mar 30, 2026