[Core] Unified quantization framework #1764
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b7cde8058
return None

# Wrap in compat shim
wrapper = DiffusionQuantizationConfig(config)
Return concrete legacy wrappers from compat factory
get_diffusion_quant_config() now always returns the generic DiffusionQuantizationConfig wrapper, so legacy fields/methods on concrete wrappers are lost (for example, FP8 callers can no longer access activation_scheme/ignored_layers on the returned object). This breaks existing compatibility usage that previously received DiffusionFp8Config/DiffusionGgufConfig instances, despite the shim claiming old API preservation.
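A rough sketch of the dispatch Codex is describing, using stub classes in place of the real wrappers; class and field names are taken from the comment above rather than the actual vllm_omni code:

from dataclasses import dataclass, field


@dataclass
class DiffusionQuantizationConfig:  # generic compat shim
    config: dict


@dataclass
class DiffusionFp8Config(DiffusionQuantizationConfig):  # legacy FP8 wrapper
    activation_scheme: str = "dynamic"
    ignored_layers: list[str] = field(default_factory=list)


_LEGACY_WRAPPERS = {"fp8": DiffusionFp8Config}


def get_diffusion_quant_config(method: str, **kwargs):
    """Return the concrete legacy wrapper when one exists, else the generic shim."""
    wrapper_cls = _LEGACY_WRAPPERS.get(method)
    if wrapper_cls is not None:
        # Concrete wrappers keep legacy fields such as activation_scheme
        # and ignored_layers available to existing callers.
        return wrapper_cls(config={"method": method, **kwargs}, **kwargs)
    return DiffusionQuantizationConfig(config={"method": method, **kwargs})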
| if "method" not in config_dict and self.quantization is not None: | ||
| config_dict["method"] = self.quantization |
Preserve per-component configs when quantization is set
When quantization_config is a per-component mapping and quantization is also provided, injecting a top-level method key forces build_quant_config() down the single-method path, so component overrides (e.g. {"vae": None}) are silently ignored. In this scenario, users expecting mixed per-component behavior will get global quantization instead, which can change accuracy/memory characteristics unexpectedly.
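A sketch of the guard this implies; the helper name is hypothetical and the per-component detection mirrors the heuristic discussed later in this thread:

def merge_cli_quantization(config_dict: dict, quantization: str | None) -> dict:
    """Only inject the top-level "method" for flat configs, leaving
    per-component mappings like {"vae": None} untouched."""
    is_per_component = any(
        value is None or (isinstance(value, dict) and "method" in value)
        for value in config_dict.values()
    )
    if quantization is not None and "method" not in config_dict and not is_per_component:
        config_dict = {**config_dict, "method": quantization}
    return config_dict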
elif isinstance(value, str):
    config = _build_single(value)
elif isinstance(value, dict):
    method = value.pop("method", None)
Stop mutating nested per-component config dictionaries
_build_component_config() calls value.pop("method", None) on each component dict, but build_quant_config() only shallow-copies the top-level mapping, so nested dicts from the caller are mutated in place. Reusing the same config object (for retries or multiple model inits) can then fail because method has been removed after the first call.
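A minimal sketch of the fix, assuming the _build_single helper named in the diff above (its exact signature is assumed); the key point is copying each nested dict before popping:

def _build_component_config(value, _build_single):
    if isinstance(value, str):
        return _build_single(value)
    if isinstance(value, dict):
        local = dict(value)  # copy so the caller's nested dict is not mutated
        method = local.pop("method", None)
        return _build_single(method, **local) if method else None
    return None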
Force-pushed e9b4bc9 to c94fe49
from vllm_omni.diffusion.quantization import get_diffusion_quant_config

config = get_diffusion_quant_config("fp8")
def test_build_quant_config_fp8():
Please add a pytest mark; you can refer to the mark description: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/tests_markers.md
Will do, thanks.
Done — added pytestmark = [pytest.mark.core_model, pytest.mark.diffusion] at module level.
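For reference, the module-level marker line mentioned in the reply looks like this (marker names are taken from the reply; see the linked tests_markers.md for the full list):

import pytest

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion]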
# ---------------------------------------------------------------------------


def build_quant_config(
Hi @lishunyang12, nice work unifying the quantization framework and reusing vLLM's existing infrastructure, the config factory is much cleaner.
One thing we'll need is support for loading quantization configs from disk, i.e., auto-detecting the config embedded in the model config by quantization tools like AutoRound. For example, FLUX.1-dev-AutoRound-w4a16 stores the quantization config in transformer/config.json:
"quantization_config": {
"quant_method": "auto-round",
"bits": 4,
"group_size": 128,
"packing_format": "auto_round:auto_gptq",
...
}Could you take a look at the resolve_quantization part in #1777? (The rest of that PR will be refactored once yours lands.)
It would be great if the unified framework could:
- read tf_model_config.quantization_config from disk, and
- route it through build_quant_config() automatically,

so users don't need to specify quantization configs manually.
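A hedged sketch of that flow: build_quant_config comes from this PR, while the surrounding helper and the way the HF config is passed in are illustrative only.

def resolve_quantization_from_hf_config(tf_model_config: dict, build_quant_config):
    """Route an on-disk HF-style quantization_config through the unified factory."""
    hf_quant = tf_model_config.get("quantization_config")
    if hf_quant is None:
        return None
    cfg = dict(hf_quant)                     # copy so the HF config is not mutated
    cfg["method"] = cfg.pop("quant_method")  # e.g. "auto-round"
    return build_quant_config(cfg)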
Thanks for the pointer! I went through resolve_quantization in #1777 — will incorporate disk-based auto-detection into this PR. Plan is to read tf_model_config.quantization_config and route it through build_quant_config() so users don't need to specify configs manually.
A few things I'll improve on vs #1777's current approach:
- Copy the dict before popping quant_method (avoid mutating tf_model_config)
- Wire up maybe_update_config properly instead of leaving it commented out
- Integrate prefix propagation for checkpoint-driven methods (AutoRound/GPTQ/AWQ)
I'm also extending coverage to qwen3-omni (per-component quantization) and bagel (quant_config threading through the transformer). Will push updates soon.
Great! Thanks a lot!
Integrate prefix propagation for checkpoint-driven methods
I'd suggest we do it in a separate PR so this one can move forward quickly. But it’s totally up to you.
Sounds good, let's split prefix propagation into a follow-up.
david6666666
left a comment
Review Summary (Second Pass)
Great progress on addressing the meta tensor crash and threading quant_config through Bagel and Qwen3-Omni! The per-component routing is working well. However, several critical gaps remain before this can be merged.
🚫 Blocking Issues
1. Still Missing Integration Tests for Per-Component Quantization
The test file was expanded from 88 to 204 lines, but all tests are still unit tests for config building. We need at least one integration test that actually loads a model with per-component quantization:
def test_bagel_per_component_quantization():
    """Verify Bagel loads with transformer at FP8 and VAE unquantized."""
    config = OmniDiffusionConfig(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        quantization_config={
            "language_model": {"method": "fp8"},
            "vae": None,
        }
    )
    # Load model and verify quantization is applied correctly
    model = ...  # actual model loading
    # Check that language_model layers have quantization
    # Check that VAE layers don't have quantization

Without this, we can't verify the per-component routing actually works end-to-end. This is the core feature for multi-stage models (Bagel, Qwen-Omni, etc.), so it needs validation.
⚠️ High Priority
2. New validate_quant_config() is Unused
The validation utility in quantization/validation.py is a great addition, but it's not called anywhere. At minimum, it should be invoked in OmniDiffusionConfig.__post_init__():
# diffusion/data.py
if self.quantization_config is not None:
    warnings = validate_quant_config(
        self.quantization_config,
        dtype=self.torch_dtype,
    )
    for warning in warnings:
        logger.warning(warning)

Otherwise, this code will drift and become stale.
3. Error Messages Still Lack Parameter Signatures
When build_quant_config() fails to instantiate a method, users still get unhelpful errors:
# factory.py:78-92
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Check the constructor or from_config() signature."
)

This doesn't tell users what parameters are expected. Please add:
import inspect

sig = inspect.signature(config_cls.__init__)
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Expected signature: {sig}. "
    f"Supported methods: {SUPPORTED_QUANTIZATION_METHODS}"
)

📝 Medium Priority
4. Documentation Still Missing
The __init__.py docstring is a good start, but we need:
- Migration guide for users of the old diffusion/quantization/ API
- Example configurations for common models (Flux, Bagel, Qwen-Omni)
- List of supported methods with platform compatibility notes
Example migration guide:
## Migration Guide

### Before (v0.14.0)

```python
from vllm_omni.diffusion.quantization import get_diffusion_quant_config
config = get_diffusion_quant_config("fp8", activation_scheme="static")
```

### After (v0.16.0+)

```python
from vllm_omni.quantization import build_quant_config
config = build_quant_config({"method": "fp8", "activation_scheme": "static"})
```
---
### 📊 Summary
| Issue | First Review | Second Review | Status |
|-------|-------------|---------------|--------|
| Meta tensor crash | ❌ Not fixed | ✅ Fixed | ✅ |
| Per-component routing | ❌ Not implemented | ✅ Implemented | ✅ |
| Integration tests | ⚠️ Missing | ⚠️ Still missing | ❌ |
| Validation usage | ❌ Not exists | ✅ Created but unused | ⚠️ |
| Error messages | ⚠️ Unclear | ⚠️ Still unclear | ❌ |
| Documentation | ❌ Missing | ❌ Still missing | ❌ |
---
### ✅ What to Do Next
1. **Add at least 1 integration test** for per-component quantization (blocking)
2. **Call `validate_quant_config()`** in `OmniDiffusionConfig.__post_init__()`
3. **Improve error messages** with parameter signatures
4. **Add migration guide** to docs or README
Once these are addressed, I'm happy to approve. Thanks for the great work on unifying the quantization framework! 🙏
Force-pushed 900553a to 59436ee
PTAL @david6666666 @yiliu30
alex-jw-brooks
left a comment
Thanks for the PR! Some thoughts
if self.quantization_config is not None:
    warnings = validate_quant_config(
        self.quantization_config,
        dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
We should pass self.dtype here instead, because it's already normalized to a torch dtype (with a bfloat16 fallback) right above this.
Done — using self.dtype now since it's already normalized above.
        dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
    )
    for warning in warnings:
        logger.warning(warning)
I think it would be better to just warn in the validation function instead of returning the warnings here
Done — validate_quant_config now logs warnings directly via logger.warning() instead of returning a list.
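Roughly what that shape looks like; the check shown is illustrative only, the real validation in validate_quant_config does more.

import logging

logger = logging.getLogger(__name__)


def validate_quant_config(quant_config, dtype) -> None:
    """Warn in place instead of returning a list of warning strings."""
    if quant_config is None:
        return
    if not hasattr(quant_config, "get_quant_method"):
        logger.warning(
            "quantization_config %r does not look like a QuantizationConfig; "
            "it will be ignored.", quant_config,
        )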
    architectures=["Qwen3MoeForCausalLM"],
)
if language_quant_config is not quant_config:
    from dataclasses import replace
Can you move inline imports to the top of files, unless they explicitly need to be inline to avoid things like optional and circular dependency issues?
I think adding a debug log here would also be helpful
Done — moved both imports to top-level and added a debug log when per-component quant resolves.
talker_config.text_config.rope_parameters["rope_theta"] = talker_config.text_config.rope_theta
self.quant_config = vllm_config.quant_config
quant_config = vllm_config.quant_config
from vllm_omni.quantization.component_config import ComponentQuantizationConfig
Same comment about inline imports
if kwargs:
    try:
        return config_cls(**kwargs)
    except TypeError:
        pass
    try:
        return config_cls.from_config(kwargs)
    except (TypeError, KeyError, ValueError):
        sig = inspect.signature(config_cls.__init__)
        raise TypeError(
            f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. Expected signature: {sig}"
        ) from None

try:
    return config_cls()
except TypeError:
    pass
try:
    return config_cls.from_config({})
except (TypeError, KeyError, ValueError):
    sig = inspect.signature(config_cls.__init__)
    raise TypeError(
        f"Cannot instantiate {config_cls.__name__} without arguments. "
        f"Expected signature: {sig}. "
        f"Provide constructor kwargs via dict config."
    ) from None
This is pretty redundant
Suggested change (collapse the two paths into one):

config_kwargs = kwargs if kwargs else {}
try:
    return config_cls(**config_kwargs)
except TypeError:
    pass
try:
    return config_cls.from_config(config_kwargs)
except (TypeError, KeyError, ValueError):
    sig = inspect.signature(config_cls.__init__)
    raise TypeError(
        f"Cannot instantiate {config_cls.__name__} with kwargs {config_kwargs}. Expected signature: {sig}"
    ) from None
Done — merged the two paths into a single try/except chain as suggested.
if isinstance(spec, str):
    if spec.lower() == "none":
        return None
I think we should remove this and just have None be passed as a type instead of a string
Done — removed "none" string support. Callers should pass None directly.
| "get_vllm_quant_config_for_layers", | ||
| "SUPPORTED_QUANTIZATION_METHODS", | ||
| ] | ||
| raise ImportError("vllm_omni.diffusion.quantization has been removed. Use vllm_omni.quantization instead.") |
I think we should also add a timeline for when this will be deleted, to avoid keeping an empty package like this around for too long.
Done — added "This stub will be removed in v0.3.0" to the docstring.
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
quantization_config: "DiffusionQuantizationConfig | dict[str, Any] | None" = None
quantization_config: QuantizationConfig | dict[str, Any] | None = None
I think it would be cleaner to only support one kwarg for the actual config; maybe just have one attr that can be either a str or a QuantConfig etc., since we can just resolve the str to {"method": <str>}, right?
This will also avoid situations where quantization / quantization_config are in conflict.
Done — consolidated into a single quantization_config field that accepts str | QuantizationConfig | dict | None. The str case is resolved to {"method": <str>} internally. Removed the separate quantization field.
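A sketch of the normalization described here; the helper name is illustrative, the real logic lives in the PR's config handling.

from typing import Any


def normalize_quantization_config(value: Any):
    if value is None:
        return None
    if isinstance(value, str):
        return {"method": value}  # "fp8" -> {"method": "fp8"}
    return value                  # dict or QuantizationConfig passes through unchanged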
# Supported methods: "fp8", "gguf" (more via vllm_omni.quantization)
# Can be a string ("fp8"), dict ({"method": "fp8", ...}), or per-component
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
Also, I wonder if there should be a better heuristic for indicating whether something is a component dict or not. One option would be to allow either quantization_config or a per_component_quantization_config (but not both together). The current way of checking whether something is a flat config seems opaque, and knowing directly would be clearer.
Or we could have a flag in the quant config dict that explicitly specifies is_component. Checking for method feels strange, since some people may assume setting the method at the top level for a dict with multiple components would just use the same method for all of them
The current heuristic in _is_per_component_dict requires at least one value to be None or a dict with a "method" key — so a flat dict like {"activation_scheme": "static"} won't be misdetected as per-component. This avoids needing a separate field while keeping the detection reliable. If you think a separate per_component_quantization_config field would be clearer, happy to discuss — but I think the single-field approach is simpler for users.
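Concretely, these are the shapes the heuristic distinguishes (examples only):

per_component = {"transformer": {"method": "fp8"}, "vae": None}   # detected as per-component
flat_config = {"method": "fp8", "activation_scheme": "static"}    # flat single-method config
flat_no_method = {"activation_scheme": "static"}                  # no None / nested "method" -> not per-component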
self.language_model = Qwen2MoTForCausalLM(llm_config)
quant_config = od_config.quantization_config
self.language_model = Qwen2MoTForCausalLM(llm_config, quant_config=quant_config, prefix="bagel.language_model")
Can you add a comment explaining why the bagel pipeline explicitly sets the prefixes here, while other pipelines don't?
Done — added a comment explaining that Bagel uses explicit prefixes because its HF config nests the language model under bagel.language_model rather than a top-level transformer key.
This handles vLLM's quantization methods that need to process weights
after loading (e.g., FP8 online quantization from BF16/FP16 weights).
"""
for _, module in model.named_modules():
It may be helpful to validate somewhere and at least warn if a component quant config is provided but invalid; I think currently on the diffusion path, since the model isn't passed in earlier, prefixes won't be validated and an invalid config will be silently unused here?
Good catch — added a validate_quant_config(config, model=model) call in diffusers_loader.py after model loading. This validates component prefixes against actual model modules and warns about mismatches.
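A sketch of what that prefix check can look like; the helper name matches later replies in this thread, but the matching logic here is simplified.

import logging

import torch.nn as nn

logger = logging.getLogger(__name__)


def _validate_component_prefixes(component_configs: dict, model: nn.Module) -> None:
    module_names = {name for name, _ in model.named_modules()}
    for prefix in component_configs:
        matched = any(
            name == prefix or name.startswith(prefix + ".") for name in module_names
        )
        if not matched:
            logger.warning(
                "Quantization prefix %r does not match any module in the model; "
                "its component config will be ignored.", prefix,
            )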
config = self._resolve(prefix)
if config is None:
    return None
return config.get_quant_method(layer, prefix)
vLLM has the concept of remapping from quantization prefixes to module prefixes. This system isn't perfect, but it's something to consider in this design (i.e., quantization config prefixes do not perfectly match vLLM model definition prefixes due to renaming, etc.).
Good point — added a note in the _resolve docstring about WeightsMapper prefix remapping. The validation in _validate_component_prefixes also mentions this caveat now. For now the prefix matching is top-level only, which should work for the common cases (Bagel, Qwen3-Omni). We can refine this if we hit edge cases with deeper remapping.
@classmethod
def get_min_capability(cls) -> int:
    return 0
Do you mean to take a min across all component configs?
Yes — changed get_min_capability from a @classmethod returning 0 to an instance method that computes min() across all component configs (including the default).
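A small sketch of that aggregation in free-function form; the attribute layout of the real ComponentQuantizationConfig may differ.

def component_min_capability(component_configs: dict, default_config=None) -> int:
    """Minimum GPU capability required across all per-component quant configs."""
    configs = [cfg for cfg in component_configs.values() if cfg is not None]
    if default_config is not None:
        configs.append(default_config)
    return min((cfg.get_min_capability() for cfg in configs), default=0)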
def _resolve(self, prefix: str) -> QuantizationConfig | None:
    for comp_prefix in self._sorted_prefixes:
        if prefix.startswith(comp_prefix):
Similarly here. If the quant config has a prefix that does not match a component prefix, this check will fail
Right — if the quant config prefix doesn't match any model prefix, _resolve returns None (falls through to default or unquantized). The _validate_component_prefixes function now warns about unmatched prefixes after model loading, so users get a heads-up. The docstring also notes the WeightsMapper caveat.
Synced offline with @lishunyang12, and the offline model support will be submitted in a separate PR.
Addressed all review feedback — latest is 82c95ef. Changes since last round:
@alex-jw-brooks @kylesayrs @yenuo26 ready for another look when you get a chance.
Quick note on merge order with related PRs:
Will rebase once #1470 lands. cc @alex-jw-brooks @kylesayrs
try:
    return config_cls(**config_kwargs)
except TypeError:
    pass
try:
    return config_cls.from_config(config_kwargs)
Suggested change:

try:
    return config_cls.from_config(config_kwargs)
I think we always expect that config_cls is initialized through config_cls.from_config, because it's an abstractmethod.
Done — removed the init fallback, now always goes through from_config().
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    min_cap = config.get_min_capability()
    device_cap = capability[0] * 10 + capability[1]
    if device_cap < min_cap:
We can use current_platform and has_device_capability here.
Done. Also removed validation.py entirely — it was duplicating checks that vLLM already does during model loading.
COMPONENT_SKIP_DEFAULTS: dict[str, list[str]] = {
    "diffusion": [
        "norm",
        "layer_norm",
        "group_norm",
        "time_embed",
        "label_emb",
        "pos_embed",
    ],
    "audio": [
        "norm",
        "embed",
        "codec",
    ],
    "generic": [
        "norm",
    ],
}


def get_default_skip_patterns(family: str = "generic") -> list[str]:
    """Get default skip patterns for a model family."""
    return list(COMPONENT_SKIP_DEFAULTS.get(family, COMPONENT_SKIP_DEFAULTS["generic"]))
Is this used to skip quantization for specific layers during online quantization?
It wasn't used anywhere. Removed.
@amy-why-3459 I am not familiar with the current benchmarks for Qwen3-Omni; can you help point out useful ones that we can use for benchmarking quantization? Thanks
Of course! Please contact me when you're free, and I'll send you some baseline performance results.
The int8 test passed.
Integrate the unified quantization framework from main (vllm-project#1764) while preserving our AutoRound W4A16 feature additions:
- Add INC/AutoRound to unified factory (_build_inc with bits→weight_bits mapping and checkpoint metadata filtering)
- Port auto-detection from TransformerConfig into OmniDiffusionConfig.__post_init__
- Port weight validation (_is_expected_quantized_weight, _check_unloaded_weights) to use self.od_config.quantization_config
- Remove old vllm_omni/diffusion/quantization/ wrapper hierarchy
- Add comprehensive INC/AutoRound unit tests using build_quant_config pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Add FP8 quantization support to LongCat-Image and LongCat-Image-Edit pipelines, following the unified quantization framework introduced in vllm-project#1764.

Changes:
- Replace plain `nn.Linear` layers in `LongCatImageTransformer2DModel` with quantization-aware vLLM linear layers (`ReplicatedLinear`, `QKVParallelLinear`, `RowParallelLinear`, `ColumnParallelLinear`) and propagate `quant_config` through `FeedForward`, `LongCatImageAttention`, `LongCatImageTransformerBlock`, and `LongCatImageSingleTransformerBlock`
- Pass `quant_config=od_config.quantization_config` to the transformer in both `LongCatImagePipeline` and `LongCatImageEditPipeline`
- Fix `load_weights` in both pipelines to include VAE and text encoder parameters in the returned loaded-weights set, preventing spurious missing-weight warnings
- Fix `TypeError`: `LongCatImageSingleTransformerBlock.__init__` was receiving an unsupported `prefix` keyword argument, causing a crash on startup whenever any quantization config was set
- Document LongCat-Image in the FP8 quantization user guide

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>
…arkers

The unified quantization framework (vllm-project#1764) consolidated source code at vllm_omni/quantization/, but tests were still under tests/diffusion/quantization/, and they had no Buildkite CI coverage.

This PR:
- Moves tests/diffusion/quantization/ to tests/quantization/ to mirror the source layout.
- Aligns pytest markers with the actual test type:
  * test_int8_config.py: core_model + cuda + L4 (GPU smoke test)
  * test_inc_config.py: core_model + cpu (pure config builder)
  * test_fp8_config.py: core_model + cpu (drop redundant diffusion marker)
  * test_gguf_config.py: core_model + cpu (drop redundant diffusion marker)
- Updates the test docstring and contributing doc to reference the new path.

After this change, the existing CUDA Unit Test with single card step (pytest -m 'core_model and cuda and L4 and not distributed_cuda') will automatically pick up the GPU quantization tests, and the Simple Unit Test step will pick up the CPU ones — so no dedicated Buildkite step is needed.

Fixes vllm-project#2614
Signed-off-by: pjh4993 <pjh4993@naver.com>
Purpose
Close #1763.
Test Plan
Unit Tests
E2E FP8 Quantization Results (1×A100 80GB)
All 8 generation tests passed. LPIPS quality comparison below.
Z-Image —
Tongyi-MAI/Z-Image-Turbo (1024×1024, 50 steps, seed=42)
Visual comparison
Qwen-Image —
Qwen/Qwen-Image (1024×1024, 50 steps, seed=142)
Visual comparison
Flux.1-dev —
black-forest-labs/FLUX.1-dev (1024×1024, 20 steps, seed=42)
Visual comparison
Bagel-7B-MoT —
ByteDance-Seed/BAGEL-7B-MoT (50 steps, seed=52, cfg_text=4.0)
FP8 applied to diffusion stage (Stage-1) only. LLM stage (Stage-0) remains BF16 — verified by quantization=None in engine config and finish_reason=stop.
Visual comparison
Qwen3-Omni — Pre-quantized ModelOpt FP8 (1×H200 141GB)
Tested loading pre-quantized FP8 checkpoint (asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8) via the unified quantization framework. FP8 is auto-detected from thinker_config.text_config.quantization_config and scoped to language_model only — audio encoder, vision encoder, talker, and code2wav remain BF16.
Thinker-only benchmark (max_model_len=8192, enforce_eager=True, Triton FP8 MoE backend)
Key findings: quantization=modelopt for thinker, quantization=None for talker/code2wav
Full pipeline (thinker + talker + code2wav) on single GPU within 64GB budget
All 3 stages running on a single GPU with memory constrained to ~54 GiB total:
Full pipeline produces text + audio output end-to-end. BF16 full pipeline requires ~66+ GiB (thinker alone is 59.26 GiB) — impossible on 64GB without FP8.
Sample output (text + audio)
Prompt: "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
Text output: "Distributed microservices, real-time queuing, GPU-accelerated models, containerized inference, dynamic scaling, and low-latency APIs ensure high-throughput, flexible audio generation."
Audio output: output_0_78a82c12-09dc-4da9-bf61-d76b859df275.wav
Note: Default FLASHINFER_CUTLASS FP8 MoE backend requires long first-run kernel compilation on H200. Use VLLM_USE_FLASHINFER_MOE_FP8=0 to select Triton backend for immediate startup.
NVFP4 Trial — Qwen3-Omni on NVIDIA B200 (Blackwell)
We attempted to create and load a pre-quantized NVFP4 (W4A8) checkpoint of Qwen3-Omni-30B-A3B-Instruct on an NVIDIA B200 (183 GiB, SM 100+). The goal was to validate the unified quantization framework with NVFP4 and demonstrate single-GPU deployment on RTX 5090 (32 GiB).
What was done
Quantized checkpoint created using NVIDIA ModelOpt v0.42.0 (
NVFP4QTensor.quantize)gate_up_proj/down_projinto per-expert format, packed to NVFP4mtq.quantizewith 256 promptsFramework changes for pre-quantized checkpoint loading (kept in this PR — benefits FP8 too):
Qwen3OmniMoeAudioEncoderto acceptquant_configfor audio tower linear layersmodelopt,modelopt_fp4,modelopt_mxfp8) to routequant_configto all thinker subcomponents (audio_tower, visual, language_model)--quantization fp8) still scopes tolanguage_modelonly — no regressionrope_scaling→rope_parametersfallback for newer transformerscode2wavstagehf_config_name:thinker_config→code2wav_configinit_timeoutnot passed from CLI toOmni()omni.stage_list→omni.num_stagesin end2end exampleFull 3-stage pipeline loaded and ran on B200:
What failed
Text generation produced garbage output (all
!tokens, 1024 repetitions). Audio was also unintelligible. The full pipeline completed without runtime errors — the issue is purely numerical accuracy.Root cause analysis
weight_scale_2mismatch (fixed but didn't resolve quality): Initially,gate_projandup_projper-expert had independent global scales (up to 4.35x ratio across 11,728 of 12,288 expert pairs). vLLM's FusedMoE kernel requiresw1_weight_scale_2 == w3_weight_scale_2. Fixed by usingshared_ws2 = max(gate.abs().max(), up.abs().max()) / (6.0 * 448.0). Verified 0 mismatches post-fix — output still garbled.Likely remaining issues:
FLASHINFER_TRTLLM) may have accuracy issues with Qwen3-Omni's small expert dimensions (intermediate_size=768,hidden_size=2048) — these kernels were primarily validated on larger experts (e.g. Qwen3.5-397B withintermediate_size=2048)mtq.quantize) — insufficient for the multimodal thinker's audio/vision encoder pathwaysquantize_linear_weights_nvfp4function uses simpleabs().max()scaling without calibration — proper calibrated quantization (SmoothQuant-style or ModelOpt's full calibration pipeline) may be requiredConclusion
NVFP4 for Qwen3-Omni MoE requires either:
abs().max() scaling)
The framework changes from this trial are retained as they are needed for FP8 and future quantization methods.
How to Reproduce
Running Qwen3-Omni FP8 on a Single 64GB GPU
The FP8 quantized checkpoint enables running the full Qwen3-Omni pipeline (thinker + talker + code2wav with audio output) on a single 64GB GPU. This is impossible with BF16 (thinker alone requires 59.26 GiB).
Step 1: Download the FP8 model
pip install modelscope
python -c "from modelscope import snapshot_download; print(snapshot_download('asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8'))"
Step 2: Use the stage config
Use the included fp8_full_pipeline_64gb.yaml file, updating the model path in all 3 stages to point to your downloaded model.
The gpu_memory_utilization values in the file are tuned for H200 (141GB) simulating a 64GB budget. On an actual 64GB card, use these values:
- gpu_memory_utilization: 0.65
- gpu_memory_utilization: 0.20
- gpu_memory_utilization: 0.05
Step 3: Run
cc @Isotr0py @hsliuustc0106 @alex-jw-brooks @yiliu30 @kylesayrs