[Core] Unified quantization framework #1764
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b7cde8058
return None

# Wrap in compat shim
wrapper = DiffusionQuantizationConfig(config)
Return concrete legacy wrappers from compat factory
get_diffusion_quant_config() now always returns the generic DiffusionQuantizationConfig wrapper, so legacy fields/methods on concrete wrappers are lost (for example, FP8 callers can no longer access activation_scheme/ignored_layers on the returned object). This breaks existing compatibility usage that previously received DiffusionFp8Config/DiffusionGgufConfig instances, despite the shim claiming old API preservation.
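A rough sketch of the dispatch Codex is describing, using stub classes in place of the real wrappers; class and field names are taken from the comment above rather than the actual vllm_omni code:

from dataclasses import dataclass, field


@dataclass
class DiffusionQuantizationConfig:  # generic compat shim
    config: dict


@dataclass
class DiffusionFp8Config(DiffusionQuantizationConfig):  # legacy FP8 wrapper
    activation_scheme: str = "dynamic"
    ignored_layers: list[str] = field(default_factory=list)


_LEGACY_WRAPPERS = {"fp8": DiffusionFp8Config}


def get_diffusion_quant_config(method: str, **kwargs):
    """Return the concrete legacy wrapper when one exists, else the generic shim."""
    wrapper_cls = _LEGACY_WRAPPERS.get(method)
    if wrapper_cls is not None:
        # Concrete wrappers keep legacy fields such as activation_scheme
        # and ignored_layers available to existing callers.
        return wrapper_cls(config={"method": method, **kwargs}, **kwargs)
    return DiffusionQuantizationConfig(config={"method": method, **kwargs})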
| if "method" not in config_dict and self.quantization is not None: | ||
| config_dict["method"] = self.quantization |
Preserve per-component configs when quantization is set
When quantization_config is a per-component mapping and quantization is also provided, injecting a top-level method key forces build_quant_config() down the single-method path, so component overrides (e.g. {"vae": None}) are silently ignored. In this scenario, users expecting mixed per-component behavior will get global quantization instead, which can change accuracy/memory characteristics unexpectedly.
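A sketch of the guard this implies; the helper name is hypothetical and the per-component detection mirrors the heuristic discussed later in this thread:

def merge_cli_quantization(config_dict: dict, quantization: str | None) -> dict:
    """Only inject the top-level "method" for flat configs, leaving
    per-component mappings like {"vae": None} untouched."""
    is_per_component = any(
        value is None or (isinstance(value, dict) and "method" in value)
        for value in config_dict.values()
    )
    if quantization is not None and "method" not in config_dict and not is_per_component:
        config_dict = {**config_dict, "method": quantization}
    return config_dict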
elif isinstance(value, str):
    config = _build_single(value)
elif isinstance(value, dict):
    method = value.pop("method", None)
Stop mutating nested per-component config dictionaries
_build_component_config() calls value.pop("method", None) on each component dict, but build_quant_config() only shallow-copies the top-level mapping, so nested dicts from the caller are mutated in place. Reusing the same config object (for retries or multiple model inits) can then fail because method has been removed after the first call.
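A minimal sketch of the fix, assuming the _build_single helper named in the diff above (its exact signature is assumed); the key point is copying each nested dict before popping:

def _build_component_config(value, _build_single):
    if isinstance(value, str):
        return _build_single(value)
    if isinstance(value, dict):
        local = dict(value)  # copy so the caller's nested dict is not mutated
        method = local.pop("method", None)
        return _build_single(method, **local) if method else None
    return None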
Force-pushed e9b4bc9 to c94fe49
from vllm_omni.diffusion.quantization import get_diffusion_quant_config

config = get_diffusion_quant_config("fp8")
def test_build_quant_config_fp8():
Please add a pytest mark; you can refer to the mark description: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/tests_markers.md
Will do, thanks.
Done — added pytestmark = [pytest.mark.core_model, pytest.mark.diffusion] at module level.
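For reference, the module-level marker line mentioned in the reply looks like this (marker names are taken from the reply; see the linked tests_markers.md for the full list):

import pytest

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion]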
# ---------------------------------------------------------------------------


def build_quant_config(
Hi @lishunyang12, nice work unifying the quantization framework and reusing vLLM's existing infrastructure, the config factory is much cleaner.
One thing we'll need is support for loading quantization configs from disk, i.e., auto-detecting the config embedded in the model config by quantization tools like AutoRound. For example, FLUX.1-dev-AutoRound-w4a16 stores the quantization config in transformer/config.json:
"quantization_config": {
"quant_method": "auto-round",
"bits": 4,
"group_size": 128,
"packing_format": "auto_round:auto_gptq",
...
}Could you take a look at the resolve_quantization part in #1777? (The rest of that PR will be refactored once yours lands.)
It would be great if the unified framework could:
- read tf_model_config.quantization_config from disk, and
- route it through build_quant_config() automatically,

so users don't need to specify quantization configs manually.
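A hedged sketch of that flow: build_quant_config comes from this PR, while the surrounding helper and the way the HF config is passed in are illustrative only.

def resolve_quantization_from_hf_config(tf_model_config: dict, build_quant_config):
    """Route an on-disk HF-style quantization_config through the unified factory."""
    hf_quant = tf_model_config.get("quantization_config")
    if hf_quant is None:
        return None
    cfg = dict(hf_quant)                     # copy so the HF config is not mutated
    cfg["method"] = cfg.pop("quant_method")  # e.g. "auto-round"
    return build_quant_config(cfg)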
Thanks for the pointer! I went through resolve_quantization in #1777 — will incorporate disk-based auto-detection into this PR. Plan is to read tf_model_config.quantization_config and route it through build_quant_config() so users don't need to specify configs manually.
A few things I'll improve on vs #1777's current approach:
- Copy the dict before popping quant_method (avoid mutating tf_model_config)
- Wire up maybe_update_config properly instead of leaving it commented out
- Integrate prefix propagation for checkpoint-driven methods (AutoRound/GPTQ/AWQ)
I'm also extending coverage to qwen3-omni (per-component quantization) and bagel (quant_config threading through the transformer). Will push updates soon.
Great! Thanks a lot!
Integrate prefix propagation for checkpoint-driven methods
I'd suggest we do it in a separate PR so this one can move forward quickly. But it’s totally up to you.
Sounds good, let's split prefix propagation into a follow-up.
david6666666
left a comment
Review Summary (Second Pass)
Great progress on addressing the meta tensor crash and threading quant_config through Bagel and Qwen3-Omni! The per-component routing is working well. However, several critical gaps remain before this can be merged.
🚫 Blocking Issues
1. Still Missing Integration Tests for Per-Component Quantization
The test file was expanded from 88 to 204 lines, but all tests are still unit tests for config building. We need at least one integration test that actually loads a model with per-component quantization:
def test_bagel_per_component_quantization():
    """Verify Bagel loads with transformer at FP8 and VAE unquantized."""
    config = OmniDiffusionConfig(
        model="ByteDance-Seed/BAGEL-7B-MoT",
        quantization_config={
            "language_model": {"method": "fp8"},
            "vae": None,
        }
    )
    # Load model and verify quantization is applied correctly
    model = ...  # actual model loading
    # Check that language_model layers have quantization
    # Check that VAE layers don't have quantization

Without this, we can't verify the per-component routing actually works end-to-end. This is the core feature for multi-stage models (Bagel, Qwen-Omni, etc.), so it needs validation.
⚠️ High Priority
2. New validate_quant_config() is Unused
The validation utility in quantization/validation.py is a great addition, but it's not called anywhere. At minimum, it should be invoked in OmniDiffusionConfig.__post_init__():
# diffusion/data.py
if self.quantization_config is not None:
    warnings = validate_quant_config(
        self.quantization_config,
        dtype=self.torch_dtype,
    )
    for warning in warnings:
        logger.warning(warning)

Otherwise, this code will drift and become stale.
3. Error Messages Still Lack Parameter Signatures
When build_quant_config() fails to instantiate a method, users still get unhelpful errors:
# factory.py:78-92
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Check the constructor or from_config() signature."
)

This doesn't tell users what parameters are expected. Please add:
import inspect

sig = inspect.signature(config_cls.__init__)
raise TypeError(
    f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. "
    f"Expected signature: {sig}. "
    f"Supported methods: {SUPPORTED_QUANTIZATION_METHODS}"
)

📝 Medium Priority
4. Documentation Still Missing
The __init__.py docstring is a good start, but we need:
- Migration guide for users of the old diffusion/quantization/ API
- Example configurations for common models (Flux, Bagel, Qwen-Omni)
- List of supported methods with platform compatibility notes
Example migration guide:
## Migration Guide

### Before (v0.14.0)

```python
from vllm_omni.diffusion.quantization import get_diffusion_quant_config
config = get_diffusion_quant_config("fp8", activation_scheme="static")
```

### After (v0.16.0+)

```python
from vllm_omni.quantization import build_quant_config
config = build_quant_config({"method": "fp8", "activation_scheme": "static"})
```
---
### 📊 Summary
| Issue | First Review | Second Review | Status |
|-------|-------------|---------------|--------|
| Meta tensor crash | ❌ Not fixed | ✅ Fixed | ✅ |
| Per-component routing | ❌ Not implemented | ✅ Implemented | ✅ |
| Integration tests | ⚠️ Missing | ⚠️ Still missing | ❌ |
| Validation usage | ❌ Not exists | ✅ Created but unused | ⚠️ |
| Error messages | ⚠️ Unclear | ⚠️ Still unclear | ❌ |
| Documentation | ❌ Missing | ❌ Still missing | ❌ |
---
### ✅ What to Do Next
1. **Add at least 1 integration test** for per-component quantization (blocking)
2. **Call `validate_quant_config()`** in `OmniDiffusionConfig.__post_init__()`
3. **Improve error messages** with parameter signatures
4. **Add migration guide** to docs or README
Once these are addressed, I'm happy to approve. Thanks for the great work on unifying the quantization framework! 🙏
Force-pushed 900553a to 59436ee
PTAL @david6666666 @yiliu30
alex-jw-brooks
left a comment
Thanks for the PR! Some thoughts
if self.quantization_config is not None:
    warnings = validate_quant_config(
        self.quantization_config,
        dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
We should pass self.dtype here instead, because it's already normalized to a torch dtype (with a bfloat16 fallback) right above this.
Done — using self.dtype now since it's already normalized above.
        dtype=self.dtype if isinstance(self.dtype, torch.dtype) else torch.bfloat16,
    )
    for warning in warnings:
        logger.warning(warning)
I think it would be better to just warn in the validation function instead of returning the warnings here
Done — validate_quant_config now logs warnings directly via logger.warning() instead of returning a list.
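Roughly what that shape looks like; the check shown is illustrative only, the real validation in validate_quant_config does more.

import logging

logger = logging.getLogger(__name__)


def validate_quant_config(quant_config, dtype) -> None:
    """Warn in place instead of returning a list of warning strings."""
    if quant_config is None:
        return
    if not hasattr(quant_config, "get_quant_method"):
        logger.warning(
            "quantization_config %r does not look like a QuantizationConfig; "
            "it will be ignored.", quant_config,
        )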
    architectures=["Qwen3MoeForCausalLM"],
)
if language_quant_config is not quant_config:
    from dataclasses import replace
Can you move inline imports to the top of files, unless they explicitly need to be inline to avoid things like optional and circular dependency issues?
I think adding a debug log here would also be helpful
Done — moved both imports to top-level and added a debug log when per-component quant resolves.
talker_config.text_config.rope_parameters["rope_theta"] = talker_config.text_config.rope_theta
self.quant_config = vllm_config.quant_config
quant_config = vllm_config.quant_config
from vllm_omni.quantization.component_config import ComponentQuantizationConfig
Same comment about inline imports
if kwargs:
    try:
        return config_cls(**kwargs)
    except TypeError:
        pass
    try:
        return config_cls.from_config(kwargs)
    except (TypeError, KeyError, ValueError):
        sig = inspect.signature(config_cls.__init__)
        raise TypeError(
            f"Cannot instantiate {config_cls.__name__} with kwargs {kwargs}. Expected signature: {sig}"
        ) from None

try:
    return config_cls()
except TypeError:
    pass
try:
    return config_cls.from_config({})
except (TypeError, KeyError, ValueError):
    sig = inspect.signature(config_cls.__init__)
    raise TypeError(
        f"Cannot instantiate {config_cls.__name__} without arguments. "
        f"Expected signature: {sig}. "
        f"Provide constructor kwargs via dict config."
    ) from None
This is pretty redundant
Suggested change (collapse the two paths into one):

config_kwargs = kwargs if kwargs else {}
try:
    return config_cls(**config_kwargs)
except TypeError:
    pass
try:
    return config_cls.from_config(config_kwargs)
except (TypeError, KeyError, ValueError):
    sig = inspect.signature(config_cls.__init__)
    raise TypeError(
        f"Cannot instantiate {config_cls.__name__} with kwargs {config_kwargs}. Expected signature: {sig}"
    ) from None
Done — merged the two paths into a single try/except chain as suggested.
if isinstance(spec, str):
    if spec.lower() == "none":
        return None
I think we should remove this and just have None be passed as a type instead of a string
Done — removed "none" string support. Callers should pass None directly.
| "get_vllm_quant_config_for_layers", | ||
| "SUPPORTED_QUANTIZATION_METHODS", | ||
| ] | ||
| raise ImportError("vllm_omni.diffusion.quantization has been removed. Use vllm_omni.quantization instead.") |
I think we should also add a timeline for when this will be deleted, to avoid keeping an empty package like this around for too long.
Done — added "This stub will be removed in v0.3.0" to the docstring.
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
quantization_config: "DiffusionQuantizationConfig | dict[str, Any] | None" = None
quantization_config: QuantizationConfig | dict[str, Any] | None = None
I think it would be cleaner to only support one kwarg for the actual config; maybe just have one attr that can be either a str or a QuantConfig etc., since we can just resolve the str to {"method": <str>}, right?
This will also avoid situations where quantization / quantization_config are in conflict.
Done — consolidated into a single quantization_config field that accepts str | QuantizationConfig | dict | None. The str case is resolved to {"method": <str>} internally. Removed the separate quantization field.
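A sketch of the normalization described here; the helper name is illustrative, the real logic lives in the PR's config handling.

from typing import Any


def normalize_quantization_config(value: Any):
    if value is None:
        return None
    if isinstance(value, str):
        return {"method": value}  # "fp8" -> {"method": "fp8"}
    return value                  # dict or QuantizationConfig passes through unchanged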
# Supported methods: "fp8", "gguf" (more via vllm_omni.quantization)
# Can be a string ("fp8"), dict ({"method": "fp8", ...}), or per-component
# dict ({"transformer": "fp8", "vae": None}).
quantization: str | None = None
Also, I wonder if there should be a better heuristic for indicating whether something is a component dict or not. One option would be to allow either quantization_config or a per_component_quantization_config (but not both together). The current way of checking whether something is a flat config seems opaque, and knowing directly would be clearer.
Or we could have a flag in the quant config dict that explicitly specifies is_component. Checking for method feels strange, since some people may assume setting the method at the top level for a dict with multiple components would just use the same method for all of them
The current heuristic in _is_per_component_dict requires at least one value to be None or a dict with a "method" key — so a flat dict like {"activation_scheme": "static"} won't be misdetected as per-component. This avoids needing a separate field while keeping the detection reliable. If you think a separate per_component_quantization_config field would be clearer, happy to discuss — but I think the single-field approach is simpler for users.
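Concretely, these are the shapes the heuristic distinguishes (examples only):

per_component = {"transformer": {"method": "fp8"}, "vae": None}   # detected as per-component
flat_config = {"method": "fp8", "activation_scheme": "static"}    # flat single-method config
flat_no_method = {"activation_scheme": "static"}                  # no None / nested "method" -> not per-component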
self.language_model = Qwen2MoTForCausalLM(llm_config)
quant_config = od_config.quantization_config
self.language_model = Qwen2MoTForCausalLM(llm_config, quant_config=quant_config, prefix="bagel.language_model")
Can you add a comment explaining why the bagel pipeline explicitly sets the prefixes here, while other pipelines don't?
Done — added a comment explaining that Bagel uses explicit prefixes because its HF config nests the language model under bagel.language_model rather than a top-level transformer key.
This handles vLLM's quantization methods that need to process weights
after loading (e.g., FP8 online quantization from BF16/FP16 weights).
"""
for _, module in model.named_modules():
It may be helpful to validate somewhere and at least warn if a component quant config is provided but invalid; I think currently on the diffusion path, since the model isn't passed in earlier, prefixes won't be validated and an invalid config will be silently unused here?
Good catch — added a validate_quant_config(config, model=model) call in diffusers_loader.py after model loading. This validates component prefixes against actual model modules and warns about mismatches.
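A sketch of what that prefix check can look like; the helper name matches later replies in this thread, but the matching logic here is simplified.

import logging

import torch.nn as nn

logger = logging.getLogger(__name__)


def _validate_component_prefixes(component_configs: dict, model: nn.Module) -> None:
    module_names = {name for name, _ in model.named_modules()}
    for prefix in component_configs:
        matched = any(
            name == prefix or name.startswith(prefix + ".") for name in module_names
        )
        if not matched:
            logger.warning(
                "Quantization prefix %r does not match any module in the model; "
                "its component config will be ignored.", prefix,
            )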
config = self._resolve(prefix)
if config is None:
    return None
return config.get_quant_method(layer, prefix)
vLLM has the concept of remapping from quantization prefixes to module prefixes. This system isn't perfect, but it's something to consider in this design (i.e., quantization config prefixes do not perfectly match vLLM model definition prefixes due to renaming, etc.).
Good point — added a note in the _resolve docstring about WeightsMapper prefix remapping. The validation in _validate_component_prefixes also mentions this caveat now. For now the prefix matching is top-level only, which should work for the common cases (Bagel, Qwen3-Omni). We can refine this if we hit edge cases with deeper remapping.
@classmethod
def get_min_capability(cls) -> int:
    return 0
Do you mean to take a min across all component configs?
Yes — changed get_min_capability from a @classmethod returning 0 to an instance method that computes min() across all component configs (including the default).
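A small sketch of that aggregation in free-function form; the attribute layout of the real ComponentQuantizationConfig may differ.

def component_min_capability(component_configs: dict, default_config=None) -> int:
    """Minimum GPU capability required across all per-component quant configs."""
    configs = [cfg for cfg in component_configs.values() if cfg is not None]
    if default_config is not None:
        configs.append(default_config)
    return min((cfg.get_min_capability() for cfg in configs), default=0)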
def _resolve(self, prefix: str) -> QuantizationConfig | None:
    for comp_prefix in self._sorted_prefixes:
        if prefix.startswith(comp_prefix):
Similarly here. If the quant config has a prefix that does not match a component prefix, this check will fail
Right — if the quant config prefix doesn't match any model prefix, _resolve returns None (falls through to default or unquantized). The _validate_component_prefixes function now warns about unmatched prefixes after model loading, so users get a heads-up. The docstring also notes the WeightsMapper caveat.
Synced offline with @lishunyang12, and the offline model support will be submitted in a separate PR.
Addressed all review feedback — latest is 82c95ef. Changes since last round:
@alex-jw-brooks @kylesayrs @yenuo26 ready for another look when you get a chance.
Quick note on merge order with related PRs:
Will rebase once #1470 lands. cc @alex-jw-brooks @kylesayrs
try:
    return config_cls(**config_kwargs)
except TypeError:
    pass
try:
    return config_cls.from_config(config_kwargs)
Suggested change:

try:
    return config_cls.from_config(config_kwargs)
I think we always expect that config_cls is initialized through config_cls.from_config, because it's an abstractmethod.
Done — removed the init fallback, now always goes through from_config().
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    min_cap = config.get_min_capability()
    device_cap = capability[0] * 10 + capability[1]
    if device_cap < min_cap:
We can use current_platform and has_device_capability here.
Done. Also removed validation.py entirely — it was duplicating checks that vLLM already does during model loading.
COMPONENT_SKIP_DEFAULTS: dict[str, list[str]] = {
    "diffusion": [
        "norm",
        "layer_norm",
        "group_norm",
        "time_embed",
        "label_emb",
        "pos_embed",
    ],
    "audio": [
        "norm",
        "embed",
        "codec",
    ],
    "generic": [
        "norm",
    ],
}


def get_default_skip_patterns(family: str = "generic") -> list[str]:
    """Get default skip patterns for a model family."""
    return list(COMPONENT_SKIP_DEFAULTS.get(family, COMPONENT_SKIP_DEFAULTS["generic"]))
Is this used to skip quantization for specific layers during online quantization?
It wasn't used anywhere. Removed.
@amy-why-3459 I am not familiar with the current benchmarks for Qwen3-Omni; can you help point out useful ones that we can use for benchmarking quantization? Thanks
Of course! Please contact me when you're free, and I'll send you some baseline performance results.
The int8 test passed.
Integrate the unified quantization framework from main (vllm-project#1764) while preserving our AutoRound W4A16 feature additions:
- Add INC/AutoRound to unified factory (_build_inc with bits→weight_bits mapping and checkpoint metadata filtering)
- Port auto-detection from TransformerConfig into OmniDiffusionConfig.__post_init__
- Port weight validation (_is_expected_quantized_weight, _check_unloaded_weights) to use self.od_config.quantization_config
- Remove old vllm_omni/diffusion/quantization/ wrapper hierarchy
- Add comprehensive INC/AutoRound unit tests using build_quant_config pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Add FP8 quantization support to LongCat-Image and LongCat-Image-Edit pipelines, following the unified quantization framework introduced in vllm-project#1764.

Changes:
- Replace plain `nn.Linear` layers in `LongCatImageTransformer2DModel` with quantization-aware vLLM linear layers (`ReplicatedLinear`, `QKVParallelLinear`, `RowParallelLinear`, `ColumnParallelLinear`) and propagate `quant_config` through `FeedForward`, `LongCatImageAttention`, `LongCatImageTransformerBlock`, and `LongCatImageSingleTransformerBlock`
- Pass `quant_config=od_config.quantization_config` to the transformer in both `LongCatImagePipeline` and `LongCatImageEditPipeline`
- Fix `load_weights` in both pipelines to include VAE and text encoder parameters in the returned loaded-weights set, preventing spurious missing-weight warnings
- Fix `TypeError`: `LongCatImageSingleTransformerBlock.__init__` was receiving an unsupported `prefix` keyword argument, causing a crash on startup whenever any quantization config was set
- Document LongCat-Image in the FP8 quantization user guide

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>
…arkers

The unified quantization framework (vllm-project#1764) consolidated source code at vllm_omni/quantization/, but tests were still under tests/diffusion/quantization/, and they had no Buildkite CI coverage.

This PR:
- Moves tests/diffusion/quantization/ to tests/quantization/ to mirror the source layout.
- Aligns pytest markers with the actual test type:
  * test_int8_config.py: core_model + cuda + L4 (GPU smoke test)
  * test_inc_config.py: core_model + cpu (pure config builder)
  * test_fp8_config.py: core_model + cpu (drop redundant diffusion marker)
  * test_gguf_config.py: core_model + cpu (drop redundant diffusion marker)
- Updates the test docstring and contributing doc to reference the new path.

After this change, the existing CUDA Unit Test with single card step (pytest -m 'core_model and cuda and L4 and not distributed_cuda') will automatically pick up the GPU quantization tests, and the Simple Unit Test step will pick up the CPU ones — so no dedicated Buildkite step is needed.

Fixes vllm-project#2614
Signed-off-by: pjh4993 <pjh4993@naver.com>
Purpose
Close #1763.
Test Plan
Unit Tests
E2E FP8 Quantization Results (1×A100 80GB)
All 8 generation tests passed. LPIPS quality comparison below.
Z-Image —
Tongyi-MAI/Z-Image-Turbo (1024×1024, 50 steps, seed=42)
Visual comparison
Qwen-Image —
Qwen/Qwen-Image (1024×1024, 50 steps, seed=142)
Visual comparison
Flux.1-dev —
black-forest-labs/FLUX.1-dev (1024×1024, 20 steps, seed=42)
Visual comparison
Bagel-7B-MoT —
ByteDance-Seed/BAGEL-7B-MoT (50 steps, seed=52, cfg_text=4.0)
FP8 applied to diffusion stage (Stage-1) only. LLM stage (Stage-0) remains BF16 — verified by quantization=None in engine config and finish_reason=stop.
Visual comparison
Qwen3-Omni — Pre-quantized ModelOpt FP8 (1×H200 141GB)
Tested loading pre-quantized FP8 checkpoint (asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8) via the unified quantization framework. FP8 is auto-detected from thinker_config.text_config.quantization_config and scoped to language_model only — audio encoder, vision encoder, talker, and code2wav remain BF16.
Thinker-only benchmark (max_model_len=8192, enforce_eager=True, Triton FP8 MoE backend)
Key findings: quantization=modelopt for thinker, quantization=None for talker/code2wav
Full pipeline (thinker + talker + code2wav) on single GPU within 64GB budget
All 3 stages running on a single GPU with memory constrained to ~54 GiB total:
Full pipeline produces text + audio output end-to-end. BF16 full pipeline requires ~66+ GiB (thinker alone is 59.26 GiB) — impossible on 64GB without FP8.
Sample output (text + audio)
Prompt: "Explain the system architecture for a scalable audio generation pipeline. Answer in 15 words."
Text output: "Distributed microservices, real-time queuing, GPU-accelerated models, containerized inference, dynamic scaling, and low-latency APIs ensure high-throughput, flexible audio generation."
Audio output: output_0_78a82c12-09dc-4da9-bf61-d76b859df275.wav
Note: Default FLASHINFER_CUTLASS FP8 MoE backend requires long first-run kernel compilation on H200. Use VLLM_USE_FLASHINFER_MOE_FP8=0 to select Triton backend for immediate startup.
NVFP4 Trial — Qwen3-Omni on NVIDIA B200 (Blackwell)
We attempted to create and load a pre-quantized NVFP4 (W4A8) checkpoint of Qwen3-Omni-30B-A3B-Instruct on an NVIDIA B200 (183 GiB, SM 100+). The goal was to validate the unified quantization framework with NVFP4 and demonstrate single-GPU deployment on RTX 5090 (32 GiB).
What was done
Quantized checkpoint created using NVIDIA ModelOpt v0.42.0 (
NVFP4QTensor.quantize)gate_up_proj/down_projinto per-expert format, packed to NVFP4mtq.quantizewith 256 promptsFramework changes for pre-quantized checkpoint loading (kept in this PR — benefits FP8 too):
Qwen3OmniMoeAudioEncoderto acceptquant_configfor audio tower linear layersmodelopt,modelopt_fp4,modelopt_mxfp8) to routequant_configto all thinker subcomponents (audio_tower, visual, language_model)--quantization fp8) still scopes tolanguage_modelonly — no regressionrope_scaling→rope_parametersfallback for newer transformerscode2wavstagehf_config_name:thinker_config→code2wav_configinit_timeoutnot passed from CLI toOmni()omni.stage_list→omni.num_stagesin end2end exampleFull 3-stage pipeline loaded and ran on B200:
What failed
Text generation produced garbage output (all
!tokens, 1024 repetitions). Audio was also unintelligible. The full pipeline completed without runtime errors — the issue is purely numerical accuracy.Root cause analysis
weight_scale_2mismatch (fixed but didn't resolve quality): Initially,gate_projandup_projper-expert had independent global scales (up to 4.35x ratio across 11,728 of 12,288 expert pairs). vLLM's FusedMoE kernel requiresw1_weight_scale_2 == w3_weight_scale_2. Fixed by usingshared_ws2 = max(gate.abs().max(), up.abs().max()) / (6.0 * 448.0). Verified 0 mismatches post-fix — output still garbled.Likely remaining issues:
FLASHINFER_TRTLLM) may have accuracy issues with Qwen3-Omni's small expert dimensions (intermediate_size=768,hidden_size=2048) — these kernels were primarily validated on larger experts (e.g. Qwen3.5-397B withintermediate_size=2048)mtq.quantize) — insufficient for the multimodal thinker's audio/vision encoder pathwaysquantize_linear_weights_nvfp4function uses simpleabs().max()scaling without calibration — proper calibrated quantization (SmoothQuant-style or ModelOpt's full calibration pipeline) may be requiredConclusion
NVFP4 for Qwen3-Omni MoE requires either:
abs().max() scaling)
The framework changes from this trial are retained as they are needed for FP8 and future quantization methods.
How to Reproduce
Running Qwen3-Omni FP8 on a Single 64GB GPU
The FP8 quantized checkpoint enables running the full Qwen3-Omni pipeline (thinker + talker + code2wav with audio output) on a single 64GB GPU. This is impossible with BF16 (thinker alone requires 59.26 GiB).
Step 1: Download the FP8 model
pip install modelscope
python -c "from modelscope import snapshot_download; print(snapshot_download('asdazd/Qwen3-Omni-30B-A3B-Instruct_modelopt_FP8'))"
Step 2: Use the stage config
Use the included fp8_full_pipeline_64gb.yaml file, updating the model path in all 3 stages to point to your downloaded model.
The gpu_memory_utilization values in the file are tuned for H200 (141GB) simulating a 64GB budget. On an actual 64GB card, use these values:
- gpu_memory_utilization: 0.65
- gpu_memory_utilization: 0.20
- gpu_memory_utilization: 0.05
Step 3: Run
cc @Isotr0py @hsliuustc0106 @alex-jw-brooks @yiliu30 @kylesayrs