[Quantization] feat: add FP8 for Omnigen2 #2441
lishunyang12 merged 18 commits into vllm-project:main
Conversation
Signed-off-by: Zhang <jianmusings@gmail.com>
I will run it another day.
Hey @lishunyang12, this PR is now ready for review. Please take a look when you are free, thank you!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2d61c30023
Purpose
Follows #1854.
Add FP8 online quantization support for the OmniGen2 model, covering the attention and MLP layers.
On top of that, OmniGen2 has `hidden_size=2520`, which is not a multiple of 16 (2520 % 16 == 8). This causes vLLM's `cutlass_scaled_mm` to fall back to a slow Triton `scaled_mm_kernel` for every FP8 linear layer (QKV, attention output, `gate_up_proj`, `down_proj`). To fix that, we pad the weight tensors to multiples of 16 in `omnigen2_transformer.py`, so the native CUTLASS FP8 tensor-core kernel is used instead of the Triton fallback.
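As a quick illustration of the alignment arithmetic (`round_up` is a hypothetical helper for this sketch, not the PR's actual code — in practice the padded rows would be zero-filled and the extra outputs discarded after the matmul):

```python
def round_up(n: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

hidden_size = 2520
print(hidden_size % 16)        # 8  -> misaligned; cutlass_scaled_mm falls back to Triton
print(round_up(hidden_size))   # 2528 -> 16-aligned padded size, eligible for CUTLASS
```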
Quantization layer details

Layers quantized (FP8)
- `attn.to_qkv`
- `attn.to_out`
- `feed_forward.gate_up_proj`
- `feed_forward.down_proj`

Layers kept at full precision
- `norm1.linear`: produces `scale_msa`, `gate_msa`, `scale_mlp`, `gate_mlp` via `tanh()`. These multiplicative control signals are precision-sensitive; FP8 quantization errors compound across 38 blocks (32 main + 6 refiner) and visibly degrade generation quality.
- Embedders (`x_embedder`, `ref_image_patch_embedder`, `caption_embedder`), the timestep MLP, and the output norm projections: small and precision-sensitive, consistent with the existing diffusion model quantization policy.
Test Plan

Image editing task

We have two sample images as shown below, and the prompt is:

Test Result
GEdit-Bench Evaluation
We ran ~10% of GEdit-Bench (55 samples, 5 per task group, English only) using a local Qwen2.5-VL-7B-Instruct judge. Generation used 20 inference steps at 512×512, seed 42.
FP8 overall score is within 1.3% of BF16, confirming minimal quality degradation from quantization.
Per-task breakdown (Q_O overall score)
Why minimal VRAM savings?
The OmniGen2 diffusion transformer is ~3.6B params (the full model, including the Qwen2.5-VL backbone, is ~11B, but only the transformer is FP8-quantized). The transformer weights are therefore a smaller fraction of total VRAM, and the dominant contributors are activations, the VAE decoder, and CUDA context overhead. Additionally, FP8 online quantization loads BF16 weights from disk and converts them to FP8 at runtime, so peak allocation during loading is still BF16-sized. Significant VRAM reduction requires FP8-serialized checkpoints (pre-quantized weights stored as FP8 on disk).
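A back-of-envelope calculation from the figures above (assuming 2 bytes/param for BF16 and 1 byte/param for FP8) shows the best-case weight saving:

```python
transformer_params = 3.6e9                # only the transformer is quantized
bf16_gb = transformer_params * 2 / 1e9    # BF16: 2 bytes per parameter
fp8_gb = transformer_params * 1 / 1e9     # FP8: 1 byte per parameter
print(bf16_gb, fp8_gb)                    # 7.2 3.6

# With online quantization, the BF16 copy still exists during loading, so this
# ~3.6 GB saving applies only to steady-state usage, not the peak allocation.
```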
Why is the FP8 image not pixel-identical to BF16?

FP8 has 3–4 bits of mantissa vs BF16's 7 bits, so every quantized matmul introduces small rounding errors. These errors compound across 4 quantized linears × 38 blocks (32 main + 6 refiner) × 30 denoising steps = ~4,560 quantized matmuls per image. Diffusion models are sensitive to early-step perturbations, so small numerical differences can steer the denoising trajectory slightly differently. The output is semantically equivalent (same composition, quality, and identity) but not numerically identical; this is expected behavior for FP8 quantization, consistent with other FP8 diffusion models (FLUX, SD3, HunyuanImage).
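The matmul count and the rounding-step comparison behind the FP8-vs-BF16 discussion work out as follows (the mantissa-step framing is an illustrative approximation, not a statement from the PR):

```python
# Counting the quantized matmuls per generated image.
linears_per_block = 4          # to_qkv, to_out, gate_up_proj, down_proj
blocks = 32 + 6                # main + refiner transformer blocks
steps = 30                     # denoising steps
matmuls = linears_per_block * blocks * steps
print(matmuls)                 # 4560

# Relative rounding step scales roughly as 2^-(mantissa_bits + 1), so FP8 E4M3
# (3 mantissa bits) rounds ~16x more coarsely than BF16 (7 mantissa bits).
fp8_e4m3_step = 2.0 ** -(3 + 1)
bf16_step = 2.0 ** -(7 + 1)
print(fp8_e4m3_step / bf16_step)   # 16.0
```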