
[Core] Unified quantization framework #1764

Merged
Gaohan123 merged 56 commits into vllm-project:main from lishunyang12:unified-quantization-framework
Mar 24, 2026

Conversation

@lishunyang12
Collaborator

@lishunyang12 lishunyang12 commented Mar 9, 2026

Purpose

Close #1763.

Test Plan

Unit Tests

pytest tests/diffusion/quantization/test_fp8_config.py -v

E2E FP8 Quantization Results (1×A100 80GB)

bash tests/e2e/offline_inference/run_quantization_e2e.sh

All 8 generation tests passed. LPIPS quality comparison below.

Z-Image — Tongyi-MAI/Z-Image-Turbo (1024×1024, 50 steps, seed=42)

| Config | Memory (GiB) | Mem Reduction | Avg Time | Speedup | LPIPS |
|---|---|---|---|---|---|
| BF16 baseline | 19.15 | - | 4.87s | 1.00x | (ref) |
| FP8 | 13.40 | 30% | 3.70s | 1.32x | 0.0381 |
Visual comparison: BF16 (zimage_bf16) vs FP8 (zimage_fp8) output images.

Qwen-Image — Qwen/Qwen-Image (1024×1024, 50 steps, seed=142)

| Config | Memory (GiB) | Mem Reduction | Avg Time | Speedup | LPIPS |
|---|---|---|---|---|---|
| BF16 baseline | 53.75 | - | 6.69s | 1.00x | (ref) |
| FP8 | 41.20 | 23% | 5.16s | 1.30x | 0.2976 |
Visual comparison: BF16 (qwen_bf16) vs FP8 (qwen_fp8) output images.

Flux.1-dev — black-forest-labs/FLUX.1-dev (1024×1024, 20 steps, seed=42)

| Config | Memory (GiB) | Mem Reduction | Avg Time | Speedup | LPIPS |
|---|---|---|---|---|---|
| BF16 baseline | 31.43 | - | 3.30s | 1.00x | (ref) |
| FP8 | 23.47 | 25% | 2.79s | 1.18x | 0.1515 |
Visual comparison: BF16 (flux_bf16) vs FP8 (flux_fp8) output images.

Bagel-7B-MoT — ByteDance-Seed/BAGEL-7B-MoT (50 steps, seed=52, cfg_text=4.0)

FP8 applied to diffusion stage (Stage-1) only. LLM stage (Stage-0) remains BF16 — verified by quantization=None in engine config and finish_reason=stop.

| Component | Config | Memory (GiB) | Mem Reduction |
|---|---|---|---|
| Stage-0 (LLM) | BF16 | 27.37 | - |
| Stage-0 (LLM) | BF16 (with fp8 flag) | 27.37 | 0% (correct) |
| Stage-1 (DiT) | BF16 | 26.47 | - |
| Stage-1 (DiT) | FP8 | 14.20 | 46% |

| Config | Total Time | Speedup | LPIPS |
|---|---|---|---|
| BF16 baseline | 24.76s | 1.00x | (ref) |
| FP8 (diffusion only) | 21.97s | 1.13x | 0.0566 |
Visual comparison: BF16 (bagel_bf16) vs FP8 (bagel_fp8) output images.

Qwen3-Omni — Pre-quantized ModelOpt FP8 (1×H200 141GB)

Tested loading pre-quantized FP8 checkpoint (asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8) via the unified quantization framework. FP8 is auto-detected from thinker_config.text_config.quantization_config and scoped to language_model only — audio encoder, vision encoder, talker, and code2wav remain BF16.

Thinker-only benchmark (max_model_len=8192, enforce_eager=True, Triton FP8 MoE backend)

| Config | Model Memory (GiB) | Mem Reduction | Available KV Cache (GiB) | Weight Load Time | Decode (tok/s) |
|---|---|---|---|---|---|
| BF16 baseline | 59.26 | - | 72.53 | 133.3s | 41.6 |
| FP8 (ModelOpt) | 31.41 | 47% | 100.34 | 8.8s | 39.9 |

Key findings:

  • 47% model memory reduction (59.26 → 31.41 GiB) — enables full pipeline on single 64GB GPU
  • 38% more KV cache (72.53 → 100.34 GiB) — supports 133.79x vs 96.71x max concurrency at 8K context
  • 15x faster weight loading (133.3s → 8.8s) — single FP8 shard vs 15 BF16 shards
  • Comparable throughput — 39.9 vs 41.6 tok/s (~4% difference, Triton MoE backend)
  • Output quality preserved — coherent text generation on identical prompts with greedy decoding
  • Per-component routing verified — quantization=modelopt for thinker, quantization=None for talker/code2wav

Full pipeline (thinker + talker + code2wav) on single GPU within 64GB budget

All 3 stages running on a single GPU with memory constrained to ~54 GiB total:

| Stage | Component | Weights (GiB) | KV Cache (GiB) | Total (GiB) |
|---|---|---|---|---|
| Stage-0 | Thinker (FP8) | 31.37 | 9.95 | ~41.3 |
| Stage-1 | Talker (BF16) | 8.50 | 3.71 | ~12.2 |
| Stage-2 | Code2Wav | 0.41 | ~0 | ~0.4 |
| Total | | 40.28 | 13.66 | ~54 |

Full pipeline produces text + audio output end-to-end. BF16 full pipeline requires ~66+ GiB (thinker alone is 59.26 GiB) — impossible on 64GB without FP8.

Sample output (text + audio)

Prompt: "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."

Text output: "Distributed microservices, real-time queuing, GPU-accelerated models, containerized inference, dynamic scaling, and low-latency APIs ensure high-throughput, flexible audio generation."

Audio output: output_0_78a82c12-09dc-4da9-bf61-d76b859df275.wav

Note: Default FLASHINFER_CUTLASS FP8 MoE backend requires long first-run kernel compilation on H200. Use VLLM_USE_FLASHINFER_MOE_FP8=0 to select Triton backend for immediate startup.


NVFP4 Trial — Qwen3-Omni on NVIDIA B200 (Blackwell)

Status: Loading and inference pipeline works end-to-end, but output quality is unacceptable. NVFP4 support for MoE omni models requires further investigation.

We attempted to create and load a pre-quantized NVFP4 (W4A8) checkpoint of Qwen3-Omni-30B-A3B-Instruct on an NVIDIA B200 (183 GiB, SM 100+). The goal was to validate the unified quantization framework with NVFP4 and demonstrate single-GPU deployment on RTX 5090 (32 GiB).

What was done

  1. Quantized checkpoint created using NVIDIA ModelOpt v0.42.0 (NVFP4QTensor.quantize)

    • Thinker MoE experts: manually split fused gate_up_proj/down_proj into per-expert format, packed to NVFP4
    • Thinker attention (q/k/v/o_proj): calibrated via mtq.quantize with 256 prompts
    • Audio tower + visual encoder: all linear layers packed to NVFP4
    • Talker + code2wav: kept BF16 (MoE fused expert format incompatible with NVFP4 FusedMoE kernel)
    • Checkpoint: shunyang90/Qwen3-Omni-30B-A3B-Instruct-NVFP4
  2. Framework changes for pre-quantized checkpoint loading (kept in this PR — benefits FP8 too):

    • Subclassed upstream Qwen3OmniMoeAudioEncoder to accept quant_config for audio tower linear layers
    • Added pre-quantized method detection (modelopt, modelopt_fp4, modelopt_mxfp8) to route quant_config to all thinker subcomponents (audio_tower, visual, language_model)
    • Dynamic quantization (e.g. --quantization fp8) still scopes to language_model only — no regression
    • Fixed talker rope_scaling → rope_parameters fallback for newer transformers
    • Fixed code2wav stage hf_config_name: thinker_config → code2wav_config
    • Fixed init_timeout not passed from CLI to Omni()
    • Fixed omni.stage_list → omni.num_stages in end2end example
  3. Full 3-stage pipeline loaded and ran on B200:

| Stage | Component | Quantization | Memory (GiB) | Init Time |
|---|---|---|---|---|
| Stage-0 | Thinker | NVFP4 | 17.66 | 39s (cached cubins) |
| Stage-1 | Talker | BF16 | 8.50 | 49s |
| Stage-2 | Code2Wav | BF16 | 0.41 | 15s |
| Total | | | 26.57 | |

What failed

Text generation produced garbage output (all ! tokens, 1024 repetitions). Audio was also unintelligible. The full pipeline completed without runtime errors — the issue is purely numerical accuracy.

Root cause analysis

  1. weight_scale_2 mismatch (fixed but didn't resolve quality): Initially, gate_proj and up_proj per-expert had independent global scales (up to 4.35x ratio across 11,728 of 12,288 expert pairs). vLLM's FusedMoE kernel requires w1_weight_scale_2 == w3_weight_scale_2. Fixed by using shared_ws2 = max(gate.abs().max(), up.abs().max()) / (6.0 * 448.0); see the sketch after this list. Verified 0 mismatches post-fix — output still garbled.

  2. Likely remaining issues:

    • NVFP4 MoE kernel (FLASHINFER_TRTLLM) may have accuracy issues with Qwen3-Omni's small expert dimensions (intermediate_size=768, hidden_size=2048) — these kernels were primarily validated on larger experts (e.g. Qwen3.5-397B with intermediate_size=2048)
    • Calibration was text-only (256 prompts via mtq.quantize) — insufficient for the multimodal thinker's audio/vision encoder pathways
    • The quantize_linear_weights_nvfp4 function uses simple abs().max() scaling without calibration — proper calibrated quantization (SmoothQuant-style or ModelOpt's full calibration pipeline) may be required
    • Potential interaction between NVFP4 attention layers and BF16 MoE router gates during the mixed-precision forward pass
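A minimal sketch of the shared global-scale fix from point 1 above (variable names are illustrative; the constants follow the NVFP4 scheme referenced in the discussion):

```python
import torch

FP4_MAX = 6.0    # max representable magnitude of FP4 E2M1
FP8_MAX = 448.0  # max of FP8 E4M3, used to encode the per-block scales

def shared_weight_scale_2(gate_w: torch.Tensor, up_w: torch.Tensor) -> torch.Tensor:
    """One global scale shared by an expert's gate_proj and up_proj, so that
    w1_weight_scale_2 == w3_weight_scale_2 as the FusedMoE kernel requires."""
    amax = torch.maximum(gate_w.abs().max(), up_w.abs().max())
    return amax / (FP4_MAX * FP8_MAX)
```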

Conclusion

NVFP4 for Qwen3-Omni MoE requires either:

  • Upstream NVFP4 MoE kernel validation for small-expert MoE architectures
  • Properly calibrated per-expert quantization (not just abs().max() scaling)
  • Or a different quantization approach (e.g. GPTQ/AWQ-style weight-only quantization)

The framework changes from this trial are retained as they are needed for FP8 and future quantization methods.


How to Reproduce

# Unit tests
pytest tests/diffusion/quantization/test_fp8_config.py -v

# Full E2E test suite (all models + LPIPS)
bash tests/e2e/offline_inference/run_quantization_e2e.sh

# Skip heavy models
bash tests/e2e/offline_inference/run_quantization_e2e.sh --skip-flux --skip-bagel

# Z-Image FP8
python examples/offline_inference/text_to_image/text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo --quantization fp8 \
  --prompt "a cup of coffee on the table" --seed 42

# Flux FP8
python examples/offline_inference/text_to_image/text_to_image.py \
  --model black-forest-labs/FLUX.1-dev --quantization fp8 \
  --prompt "a cup of coffee on the table" --seed 42

# Qwen-Image FP8
python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --quantization fp8 \
  --prompt "a cup of coffee on the table" --seed 142

# Bagel FP8 (diffusion-only)
python examples/offline_inference/bagel/end2end.py \
  --model ByteDance-Seed/BAGEL-7B-MoT --quantization fp8 \
  --modality text2img --prompts "A cute cat" --steps 50

# Qwen3-Omni FP8 (pre-quantized modelopt checkpoint, thinker-only)
pip install modelscope
python -c "from modelscope import snapshot_download; snapshot_download('asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8')"
VLLM_USE_FLASHINFER_MOE_FP8=0 python examples/offline_inference/qwen3_omni/end2end.py \
  --model /path/to/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8 \
  --stage-configs-path fp8_stage_config.yaml \
  --query-type text --modalities text

# Qwen3-Omni FP8 full pipeline on single 64GB GPU
VLLM_USE_FLASHINFER_MOE_FP8=0 python examples/offline_inference/qwen3_omni/end2end.py \
  --model /path/to/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8 \
  --stage-configs-path fp8_full_pipeline_64gb.yaml \
  --query-type text

# LPIPS benchmark (standalone)
python benchmarks/diffusion/quantization_quality.py \
  --model Tongyi-MAI/Z-Image-Turbo --quantization fp8 \
  --prompts "a cup of coffee on the table" --seed 42

Running Qwen3-Omni FP8 on a Single 64GB GPU

The FP8 quantized checkpoint enables running the full Qwen3-Omni pipeline (thinker + talker + code2wav with audio output) on a single 64GB GPU. This is impossible with BF16 (thinker alone requires 59.26 GiB).

Step 1: Download the FP8 model

pip install modelscope
python -c "from modelscope import snapshot_download; print(snapshot_download('asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8'))"

Step 2: Use the stage config

Use the included fp8_full_pipeline_64gb.yaml file, updating the model path in all 3 stages to point to your downloaded model.

The gpu_memory_utilization values in the file are tuned for H200 (141GB) simulating a 64GB budget. On an actual 64GB card, use these values:

  • Stage-0 (Thinker): gpu_memory_utilization: 0.65
  • Stage-1 (Talker): gpu_memory_utilization: 0.20
  • Stage-2 (Code2Wav): gpu_memory_utilization: 0.05

Step 3: Run

VLLM_USE_FLASHINFER_MOE_FP8=0 python examples/offline_inference/qwen3_omni/end2end.py \
  --model /path/to/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8 \
  --stage-configs-path fp8_full_pipeline_64gb.yaml \
  --query-type text

cc @Isotr0py @hsliuustc0106 @alex-jw-brooks @yiliu30 @kylesayrs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b7cde8058

Comment thread vllm_omni/quantization/compat.py Outdated
return None

# Wrap in compat shim
wrapper = DiffusionQuantizationConfig(config)

P1: Return concrete legacy wrappers from compat factory

get_diffusion_quant_config() now always returns the generic DiffusionQuantizationConfig wrapper, so legacy fields/methods on concrete wrappers are lost (for example, FP8 callers can no longer access activation_scheme/ignored_layers on the returned object). This breaks existing compatibility usage that previously received DiffusionFp8Config/DiffusionGgufConfig instances, despite the shim claiming old API preservation.

Comment thread vllm_omni/diffusion/data.py Outdated
Comment on lines +572 to +573
if "method" not in config_dict and self.quantization is not None:
config_dict["method"] = self.quantization

P2: Preserve per-component configs when quantization is set

When quantization_config is a per-component mapping and quantization is also provided, injecting a top-level method key forces build_quant_config() down the single-method path, so component overrides (e.g. {"vae": None}) are silently ignored. In this scenario, users expecting mixed per-component behavior will get global quantization instead, which can change accuracy/memory characteristics unexpectedly.

Comment thread vllm_omni/quantization/factory.py Outdated
elif isinstance(value, str):
config = _build_single(value)
elif isinstance(value, dict):
method = value.pop("method", None)

P2: Stop mutating nested per-component config dictionaries

_build_component_config() calls value.pop("method", None) on each component dict, but build_quant_config() only shallow-copies the top-level mapping, so nested dicts from the caller are mutated in place. Reusing the same config object (for retries or multiple model inits) can then fail because method has been removed after the first call.

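For reference, a minimal copy-before-pop sketch of the fix this suggests (function and variable names are illustrative, not the PR's actual code):

```python
def _split_component_spec(value):
    """Sketch: copy a per-component dict before popping keys, so the caller's
    nested config survives retries or repeated model inits unchanged."""
    if isinstance(value, dict):
        value = dict(value)              # shallow copy; the original dict keeps its "method" key
        method = value.pop("method", None)
        return method, value             # remaining entries become constructor kwargs
    return value, {}                     # plain string method, no extra kwargs
```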

from vllm_omni.diffusion.quantization import get_diffusion_quant_config

config = get_diffusion_quant_config("fp8")
def test_build_quant_config_fp8():

Collaborator Author

Will do, thanks.

Collaborator Author

Done — added pytestmark = [pytest.mark.core_model, pytest.mark.diffusion] at module level.

# ---------------------------------------------------------------------------


def build_quant_config(
Contributor

@yiliu30 yiliu30 Mar 10, 2026

Hi @lishunyang12, nice work unifying the quantization framework and reusing vLLM's existing infrastructure, the config factory is much cleaner.

One thing we'll need is support for loading quantization configs from disk, i.e., auto-detecting the config embedded in the model config by quantization tools like AutoRound. For example, FLUX.1-dev-AutoRound-w4a16 stores the quantization config in transformer/config.json:

  "quantization_config": {
      "quant_method": "auto-round",
      "bits": 4,
      "group_size": 128,
      "packing_format": "auto_round:auto_gptq",
      ...
  }

Could you take a look at the resolve_quantization part in #1777? (The rest of that PR will be refactored once yours lands.)
It would be great if the unified framework could:

  • read tf_model_config.quantization_config from disk, and
  • route it through build_quant_config() automatically,

So users don’t need to specify quantization configs manually.

Collaborator Author

Thanks for the pointer! I went through resolve_quantization in #1777 — will incorporate disk-based auto-detection into this PR. Plan is to read tf_model_config.quantization_config and route it through build_quant_config() so users don't need to specify configs manually.

A few things I'll improve on vs #1777's current approach:

  • Copy the dict before popping quant_method (avoid mutating tf_model_config)
  • Wire up maybe_update_config properly instead of leaving it commented out
  • Integrate prefix propagation for checkpoint-driven methods (AutoRound/GPTQ/AWQ)

I'm also extending coverage to qwen3-omni (per-component quantization) and bagel (quant_config threading through the transformer). Will push updates soon.
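A rough sketch of the planned auto-detection flow described above (the helper name and exact wiring are illustrative; only build_quant_config comes from this PR's API):

```python
from vllm_omni.quantization import build_quant_config

def resolve_quantization_from_checkpoint(tf_model_config):
    """Read an embedded quantization_config from the checkpoint and route it
    through the unified factory, without mutating tf_model_config."""
    raw = getattr(tf_model_config, "quantization_config", None)
    if raw is None:
        return None
    cfg = dict(raw)                          # copy before popping quant_method
    method = cfg.pop("quant_method", None)
    if method is None:
        return None
    return build_quant_config({"method": method, **cfg})
```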

Contributor

Great! Thanks a lot!

Integrate prefix propagation for checkpoint-driven methods

I'd suggest we do it in a separate PR so this one can move forward quickly. But it’s totally up to you.

Collaborator Author

Sounds good, let's split prefix propagation into a follow-up.

Collaborator

@david6666666 david6666666 left a comment

Review Summary (Second Pass)

Great progress on addressing the meta tensor crash and threading quant_config through Bagel and Qwen3-Omni! The per-component routing is working well. However, several critical gaps remain before this can be merged.


🚫 Blocking Issues

1. Still Missing Integration Tests for Per-Component Quantization

The test file was expanded from 88 to 204 lines, but all tests are still unit tests for config building. We need at least one integration test that actually loads a model with per-component quantization:

def test_bagel_per_component_quantization():
    """Verify Bagel loads with transformer at FP8 and VAE unquantized."""
    config = OmniDiffusionConfig(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        quantization_config={
            "language_model": {"method": "fp8"},
            "vae": None,
        }
    )
    # Load model and verify quantization is applied correctly
    model = ...  # actual model loading
    # Check that language_model layers have quantization
    # Check that VAE layers don't have quantization

Without this, we can't verify the per-component routing actually works end-to-end. This is the core feature for multi-stage models (Bagel, Qwen-Omni, etc.), so it needs validation.


⚠️ High Priority

2. New validate_quant_config() is Unused

The validation utility in quantization/validation.py is a great addition, but it's not called anywhere. At minimum, it should be invoked in OmniDiffusionConfig.__post_init__():

# diffusion/data.py
if self.quantization_config is not None:
    warnings = validate_quant_config(
        self.quantization_config,
        dtype=self.torch_dtype,
    )
    for warning in warnings:
        logger.warning(warning)

Otherwise, this code will drift and become stale.

3. Error Messages Still Lack Parameter Signatures

When build_quant_config() fails to instantiate a method, users still get unhelpful errors:

# factory.py:78-92
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Check the constructor or from_config() signature."
)

This doesn't tell users what parameters are expected. Please add:

import inspect
sig = inspect.signature(config_cls.__init__)
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Expected signature: {sig}. "
    f"Supported methods: {SUPPORTED_QUANTIZATION_METHODS}"
)

📝 Medium Priority

4. Documentation Still Missing

The __init__.py docstring is a good start, but we need:

  • Migration guide for users of the old diffusion/quantization/ API
  • Example configurations for common models (Flux, Bagel, Qwen-Omni)
  • List of supported methods with platform compatibility notes

Example migration guide:

## Migration Guide

### Before (v0.14.0)
```python
from vllm_omni.diffusion.quantization import get_diffusion_quant_config
config = get_diffusion_quant_config("fp8", activation_scheme="static")
```

### After (v0.16.0+)
```python
from vllm_omni.quantization import build_quant_config
config = build_quant_config({"method": "fp8", "activation_scheme": "static"})
```

---

### 📊 Summary

| Issue | First Review | Second Review | Status |
|-------|-------------|---------------|--------|
| Meta tensor crash | ❌ Not fixed | ✅ Fixed | ✅ |
| Per-component routing | ❌ Not implemented | ✅ Implemented | ✅ |
| Integration tests | ⚠️ Missing | ⚠️ Still missing | ❌ |
| Validation usage | ❌ Not exists | ✅ Created but unused | ⚠️ |
| Error messages | ⚠️ Unclear | ⚠️ Still unclear | ❌ |
| Documentation | ❌ Missing | ❌ Still missing | ❌ |

---

### ✅ What to Do Next

1. **Add at least 1 integration test** for per-component quantization (blocking)
2. **Call `validate_quant_config()`** in `OmniDiffusionConfig.__post_init__()`
3. **Improve error messages** with parameter signatures
4. **Add migration guide** to docs or README

Once these are addressed, I'm happy to approve. Thanks for the great work on unifying the quantization framework! 🙏

@lishunyang12 lishunyang12 force-pushed the unified-quantization-framework branch from 900553a to 59436ee Compare March 10, 2026 16:33
@lishunyang12
Collaborator Author

PTAL @david6666666 @yiliu30

Contributor

@alex-jw-brooks alex-jw-brooks left a comment

Thanks for the PR! Some thoughts

Comment thread vllm_omni/diffusion/data.py Outdated
if self.quantization_config is not None:
warnings = validate_quant_config(
self.quantization_config,
dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
Contributor

We should pass self.dtype here instead, because it's already normalized to a torch dtype (with a bfloat16 fallback) right above this

Collaborator Author

Done — using self.dtype now since it's already normalized above.

Comment thread vllm_omni/diffusion/data.py Outdated
dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
)
for warning in warnings:
logger.warning(warning)
Contributor

I think it would be better to just warn in the validation function instead of returning the warnings here

Collaborator Author

Done — validate_quant_config now logs warnings directly via logger.warning() instead of returning a list.

architectures=["Qwen3MoeForCausalLM"],
)
if language_quant_config is not quant_config:
from dataclasses import replace
Contributor

Can you move these inline imports to the top of the file, unless they explicitly need to stay inline to avoid things like optional and circular dependency issues?

I think adding a debug log here would also be helpful

Collaborator Author

Done — moved both imports to top-level and added a debug log when per-component quant resolves.

talker_config.text_config.rope_parameters["rope_theta"] = talker_config.text_config.rope_theta
self.quant_config = vllm_config.quant_config
quant_config = vllm_config.quant_config
from vllm_omni.quantization.component_config import ComponentQuantizationConfig
Contributor

Same comment about inline imports

Collaborator Author

Done.

Comment thread vllm_omni/quantization/factory.py Outdated
Comment on lines +56 to +81
if kwargs:
try:
return config_cls(**kwargs)
except TypeError:
pass
try:
return config_cls.from_config(kwargs)
except (TypeError, KeyError, ValueError):
sig = inspect.signature(config_cls.__init__)
raise TypeError(
f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. Expected signature: {sig}"
) from None

try:
return config_cls()
except TypeError:
pass
try:
return config_cls.from_config({})
except (TypeError, KeyError, ValueError):
sig = inspect.signature(config_cls.__init__)
raise TypeError(
f"Cannot instantiate {config_cls.__name__} without arguments. "
f"Expected signature: {sig}. "
f"Provide constructor kwargs via dict config."
) from None
Contributor

This is pretty redundant

Suggested change
if kwargs:
try:
return config_cls(**kwargs)
except TypeError:
pass
try:
return config_cls.from_config(kwargs)
except (TypeError, KeyError, ValueError):
sig = inspect.signature(config_cls.__init__)
raise TypeError(
f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. Expected signature: {sig}"
) from None
try:
return config_cls()
except TypeError:
pass
try:
return config_cls.from_config({})
except (TypeError, KeyError, ValueError):
sig = inspect.signature(config_cls.__init__)
raise TypeError(
f"Cannot instantiate {config_cls.__name__} without arguments. "
f"Expected signature: {sig}. "
f"Provide constructor kwargs via dict config."
) from None
config_kwargs = kwargs if kwargs else {}
try:
return config_cls(**config_kwargs)
except TypeError:
pass
try:
return config_cls.from_config(config_kwargs)
except (TypeError, KeyError, ValueError):
sig = inspect.signature(config_cls.__init__)
raise TypeError(
f"Cannot instantiate {config_cls.__name__} with kwargs {config_kwargs}. Expected signature: {sig}"
) from None

Collaborator Author

Done — merged the two paths into a single try/except chain as suggested.


if isinstance(spec, str):
if spec.lower() == "none":
return None
Contributor

I think we should remove this and just have None be passed as a type instead of a string

Collaborator Author

Done — removed "none" string support. Callers should pass None directly.

"get_vllm_quant_config_for_layers",
"SUPPORTED_QUANTIZATION_METHODS",
]
raise ImportError("vllm_omni.diffusion.quantization has been removed. Use vllm_omni.quantization instead.")
Contributor

I think we should also note a target release for when this will be deleted, to avoid keeping an empty package like this around for too long

Collaborator Author

Done — added "This stub will be removed in v0.3.0" to the docstring.

Comment thread vllm_omni/diffusion/data.py Outdated
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
quantization_config: "DiffusionQuantizationConfig | dict[str, Any] | None" = None
quantization_config: QuantizationConfig | dict[str, Any] | None = None
Contributor

I think it would be cleaner to only support one kwarg for the actual config; maybe just have one attr that can either be a str or QuantConfig etc, since we can just resolve the str to {"method": <str>}, right?

This will also avoid situations where config / quant config are in conflict

Collaborator Author

Done — consolidated into a single quantization_config field that accepts str | QuantizationConfig | dict | None. The str case is resolved to {"method": <str>} internally. Removed the separate quantization field.
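Roughly what that normalization amounts to, as a sketch (the actual handling in the config's __post_init__ may differ):

```python
def _normalize_quantization_config(quantization_config):
    """Sketch: a bare method string becomes {"method": <str>}; dicts,
    QuantizationConfig instances, and None pass through unchanged."""
    if isinstance(quantization_config, str):
        return {"method": quantization_config}
    return quantization_config
```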

Comment thread vllm_omni/diffusion/data.py Outdated
# Supported methods: "fp8", "gguf" (more via vllm_omni.quantization)
# Can be a string ("fp8"), dict ({"method": "fp8", ...}), or per-component
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
Contributor

Also, I wonder if it would be better to have a better heuristic for indicating if things are component dicts or not - I'm not sure if it would be better to allow either quantization_config or a per_component_quantization_config (but not both together). The current way of checking if something is a flat config or not seems opaque, and knowing directly would be more clear.

Or we could have a flag in the quant config dict that explicitly specifies is_component. Checking for method feels strange, since some people may assume setting the method at the top level for a dict with multiple components would just use the same method for all of them

Collaborator Author

The current heuristic in _is_per_component_dict requires at least one value to be None or a dict with a "method" key — so a flat dict like {"activation_scheme": "static"} won't be misdetected as per-component. This avoids needing a separate field while keeping the detection reliable. If you think a separate per_component_quantization_config field would be clearer, happy to discuss — but I think the single-field approach is simpler for users.
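Based on that description, the heuristic amounts to roughly the following (a sketch; the real implementation may differ in details):

```python
def _is_per_component_dict(cfg: dict) -> bool:
    """Per-component only if at least one value is None or a nested dict carrying
    its own "method" key; a flat dict like {"activation_scheme": "static"} stays flat."""
    return any(
        value is None or (isinstance(value, dict) and "method" in value)
        for value in cfg.values()
    )
```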


self.language_model = Qwen2MoTForCausalLM(llm_config)
quant_config = od_config.quantization_config
self.language_model = Qwen2MoTForCausalLM(llm_config, quant_config=quant_config, prefix="bagel.language_model")
Contributor

Can you add a comment explaining why the bagel pipeline explicitly sets the prefixes here, while other pipelines don't?

Collaborator Author

Done — added a comment explaining that Bagel uses explicit prefixes because its HF config nests the language model under bagel.language_model rather than a top-level transformer key.

This handles vLLM's quantization methods that need to process weights
after loading (e.g., FP8 online quantization from BF16/FP16 weights).
"""
for _, module in model.named_modules():
Contributor

It may be helpful to validate somewhere and at least warn if a component quant config is provided but is invalid - I think currently on the diffusion path, since the model isn't passed earlier, prefixes won't be validated and the config will be silently unused here?

Collaborator Author

Good catch — added a validate_quant_config(config, model=model) call in diffusers_loader.py after model loading. This validates component prefixes against actual model modules and warns about mismatches.

config = self._resolve(prefix)
if config is None:
return None
return config.get_quant_method(layer, prefix)

vLLM has the concept of remapping from quantization prefixes to module prefixes. This system isn't perfect, but it's something to consider in this design (i.e., quantization config prefixes do not perfectly match vLLM model definition prefixes due to renaming, etc.)

Collaborator Author

Good point — added a note in the _resolve docstring about WeightsMapper prefix remapping. The validation in _validate_component_prefixes also mentions this caveat now. For now the prefix matching is top-level only, which should work for the common cases (Bagel, Qwen3-Omni). We can refine this if we hit edge cases with deeper remapping.


@classmethod
def get_min_capability(cls) -> int:
return 0

Do you mean to take a min across all component configs?

Collaborator Author

Yes — changed get_min_capability from a @classmethod returning 0 to an instance method that computes min() across all component configs (including the default).
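In sketch form (attribute names are assumptions about the component-routing wrapper's internals):

```python
def get_min_capability(self) -> int:
    """Minimum device capability across all component configs, default included."""
    configs = [
        cfg
        for cfg in (*self._component_configs.values(), self._default_config)
        if cfg is not None
    ]
    if not configs:
        return 0
    return min(cfg.get_min_capability() for cfg in configs)
```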


def _resolve(self, prefix: str) -> QuantizationConfig | None:
for comp_prefix in self._sorted_prefixes:
if prefix.startswith(comp_prefix):

Similarly here. If the quant config has a prefix that does not match a component prefix, this check will fail

Collaborator Author

Right — if the quant config prefix doesn't match any model prefix, _resolve returns None (falls through to default or unquantized). The _validate_component_prefixes function now warns about unmatched prefixes after model loading, so users get a heads-up. The docstring also notes the WeightsMapper caveat.
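The warning pass might look roughly like this (a sketch; the helper name comes from the discussion, everything else is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def _validate_component_prefixes(model, component_prefixes):
    """Warn for configured component prefixes that match no top-level module in the model."""
    top_level = {name.split(".")[0] for name, _ in model.named_modules() if name}
    for prefix in component_prefixes:
        if prefix.split(".")[0] not in top_level:
            logger.warning(
                "Quantization component prefix %r does not match any module in the "
                "loaded model; its config will be ignored.",
                prefix,
            )
```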

@yiliu30
Contributor

yiliu30 commented Mar 12, 2026

Synced offline with @lishunyang12, and the offline model support will be submitted in a separate PR.

@lishunyang12
Collaborator Author

lishunyang12 commented Mar 12, 2026

Addressed all review feedback — latest is 82c95ef. Changes since last round:

  • Consolidated quantization + quantization_config into single quantization_config field
  • validate_quant_config now logs warnings directly, called in both __post_init__ and diffusers_loader
  • Simplified _build_single factory, error messages now include inspect.signature
  • Removed "none" string support — pass None directly
  • get_min_capability computes min across all component configs
  • Moved inline imports to top-level in thinker/talker, added debug log
  • Added v0.3.0 deprecation timeline to diffusion/quantization/__init__.py
  • Added comment in Bagel explaining explicit prefix usage
  • Added WeightsMapper remapping note in _resolve docstring
  • Added pytestmark = [pytest.mark.core_model, pytest.mark.diffusion] test markers
  • Fixed tests for new validation API

@alex-jw-brooks @kylesayrs @yenuo26 ready for another look when you get a chance.

@lishunyang12
Collaborator Author

Quick note on merge order with related PRs:

Will rebase once #1470 lands. cc @alex-jw-brooks @kylesayrs

@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 12, 2026
Comment thread vllm_omni/quantization/factory.py Outdated
Comment on lines +61 to +66
try:
return config_cls(**config_kwargs)
except TypeError:
pass
try:
return config_cls.from_config(config_kwargs)
Member

Suggested change
try:
return config_cls(**config_kwargs)
except TypeError:
pass
try:
return config_cls.from_config(config_kwargs)
try:
return config_cls.from_config(config_kwargs)

I think we always expect that config_cls is initialized through config_cls.from_config, because it's an abstractmethod.

Collaborator Author

Done — removed the init fallback, now always goes through from_config().

Comment thread vllm_omni/quantization/validation.py Outdated
Comment on lines +31 to +35
if torch.cuda.is_available():
capability = torch.cuda.get_device_capability()
min_cap = config.get_min_capability()
device_cap = capability[0] * 10 + capability[1]
if device_cap < min_cap:
Member

We can use current_platform and has_device_capability here.

Collaborator Author

Done. Also removed validation.py entirely — it was duplicating checks that vLLM already does during model loading.

Comment thread vllm_omni/quantization/defaults.py Outdated
Comment on lines +7 to +29
COMPONENT_SKIP_DEFAULTS: dict[str, list[str]] = {
"diffusion": [
"norm",
"layer_norm",
"group_norm",
"time_embed",
"label_emb",
"pos_embed",
],
"audio": [
"norm",
"embed",
"codec",
],
"generic": [
"norm",
],
}


def get_default_skip_patterns(family: str = "generic") -> list[str]:
"""Get default skip patterns for a model family."""
return list(COMPONENT_SKIP_DEFAULTS.get(family, COMPONENT_SKIP_DEFAULTS["generic"]))
Member

Is this used to skip quantization for specific layers during online quantization?

Collaborator Author

It wasn't used anywhere. Removed.

@lishunyang12
Collaborator Author

@amy-why-3459 I am not familiar with the current benchmarks for Qwen3-Omni; can you point out useful ones that we can use for benchmarking quantization? Thanks

@amy-why-3459
Contributor

@amy-why-3459 I am not familiar with the current benchmark for qwen3 omni, can you help me to point out useful ones so that we can use for benchmarking quantization? Thanks

Of course! Please contact me when you're free, and I'll send you some baseline performance results.

@amy-why-3459
Contributor

vllm bench serve --omni --dataset-name random \
  --port 28889 --max-concurrency 10 \
  --model /home/models/Qwen3-Omni-30B-A3B-Instruct \
  --endpoint /v1/chat/completions \
  --backend openai-chat-omni \
  --num-prompts 100 \
  --random-input-len 100 \
  --ignore-eos \
  --percentile-metrics ttft,tpot,itl,e2el,audio_ttfp,audio_rtf \
  --random-output-len 100 \
  --extra_body '{"modalities": ["text", "audio"]}'

@david6666666 david6666666 added the high priority high priority issue, needs to be done asap label Mar 16, 2026
@Gaohan123 Gaohan123 enabled auto-merge (squash) March 24, 2026 09:13
@Gaohan123 Gaohan123 merged commit 6b93459 into vllm-project:main Mar 24, 2026
7 of 8 checks passed
@yjb767868009
Contributor

int8 test passed.

yiliu30 added a commit to yiliu30/vllm-omni-fork that referenced this pull request Mar 24, 2026
Integrate the unified quantization framework from main (vllm-project#1764) while
preserving our AutoRound W4A16 feature additions:

- Add INC/AutoRound to unified factory (_build_inc with bits→weight_bits
  mapping and checkpoint metadata filtering)
- Port auto-detection from TransformerConfig into OmniDiffusionConfig.__post_init__
- Port weight validation (_is_expected_quantized_weight, _check_unloaded_weights)
  to use self.od_config.quantization_config
- Remove old vllm_omni/diffusion/quantization/ wrapper hierarchy
- Add comprehensive INC/AutoRound unit tests using build_quant_config pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: yiliu30 <yi4.liu@intel.com>
zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026
zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026
lcukyfuture added a commit to lcukyfuture/vllm-omni that referenced this pull request Apr 9, 2026
Add FP8 quantization support to LongCat-Image and LongCat-Image-Edit
pipelines, following the unified quantization framework introduced in
vllm-project#1764.

Changes:
- Replace plain `nn.Linear` layers in `LongCatImageTransformer2DModel`
  with quantization-aware vLLM linear layers (`ReplicatedLinear`,
  `QKVParallelLinear`, `RowParallelLinear`, `ColumnParallelLinear`)
  and propagate `quant_config` through `FeedForward`,
  `LongCatImageAttention`, `LongCatImageTransformerBlock`, and
  `LongCatImageSingleTransformerBlock`
- Pass `quant_config=od_config.quantization_config` to the transformer
  in both `LongCatImagePipeline` and `LongCatImageEditPipeline`
- Fix `load_weights` in both pipelines to include VAE and text encoder
  parameters in the returned loaded-weights set, preventing spurious
  missing-weight warnings
- Fix `TypeError`: `LongCatImageSingleTransformerBlock.__init__` was
  receiving an unsupported `prefix` keyword argument, causing a crash
  on startup whenever any quantization config was set
- Document LongCat-Image in the FP8 quantization user guide

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>
pjh4993 added a commit to pjh4993/vllm-omni that referenced this pull request Apr 10, 2026
…arkers

The unified quantization framework (vllm-project#1764) consolidated source code at
vllm_omni/quantization/, but tests were still under tests/diffusion/quantization/,
and they had no Buildkite CI coverage.

This PR:

- Moves tests/diffusion/quantization/ to tests/quantization/ to mirror the
  source layout.
- Aligns pytest markers with the actual test type:
  * test_int8_config.py: core_model + cuda + L4 (GPU smoke test)
  * test_inc_config.py:  core_model + cpu (pure config builder)
  * test_fp8_config.py:  core_model + cpu (drop redundant diffusion marker)
  * test_gguf_config.py: core_model + cpu (drop redundant diffusion marker)
- Updates the test docstring and contributing doc to reference the new path.

After this change, the existing CUDA Unit Test with single card step
(pytest -m 'core_model and cuda and L4 and not distributed_cuda') will
automatically pick up the GPU quantization tests, and the Simple Unit
Test step will pick up the CPU ones — so no dedicated Buildkite step
is needed.

Fixes vllm-project#2614

Signed-off-by: pjh4993 <pjh4993@naver.com>

Development

Successfully merging this pull request may close these issues.

[RFC]: Unified Quantization Framework for all models/all platforms/all methods