[Feat] FP8 quantization support for LongCat-Image and LongCat-Image-Edit by lcukyfuture · Pull Request #2633 · vllm-project/vllm-omni

lcukyfuture · 2026-04-09T08:15:51Z

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Add FP8 quantization support to LongCat-Image and LongCat-Image-Edit pipelines, following the unified quantization framework introduced in #1764.

Test Plan

1 * NVIDIA RTX6000 Ada (48G)
1024×1024, 50 steps, seed=42, LPIPS

Test Result

LongCat-Image (text-to-image)

Config	Avg Time	Speedup	Memory (GiB)	Mem Reduction	Mean LPIPS
BF16 baseline	19.87s	1.00x	33.46	—	(ref)
fp8	13.24s	1.50x	29.56	12%	0.0767
fp8 + `proj_out`	15.08s	1.32x	30.46	9%	0.0192

BF16 baseline	fp8	fp8 + skip `proj_out`

LongCat-Image-Edit (image editing)

Config	Avg Time	Speedup	Mean LPIPS
BF16 baseline	41.49s	1.00x	(ref)
fp8	30.35s	1.37x	0.0162

	BF16 baseline	fp8
Base
Edit

Findings

LongCat's bottleneck is its unquantizable text encoder(Qwen2.5_VL).

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Add FP8 quantization support to LongCat-Image and LongCat-Image-Edit pipelines, following the unified quantization framework introduced in vllm-project#1764. Changes: - Replace plain `nn.Linear` layers in `LongCatImageTransformer2DModel` with quantization-aware vLLM linear layers (`ReplicatedLinear`, `QKVParallelLinear`, `RowParallelLinear`, `ColumnParallelLinear`) and propagate `quant_config` through `FeedForward`, `LongCatImageAttention`, `LongCatImageTransformerBlock`, and `LongCatImageSingleTransformerBlock` - Pass `quant_config=od_config.quantization_config` to the transformer in both `LongCatImagePipeline` and `LongCatImageEditPipeline` - Fix `load_weights` in both pipelines to include VAE and text encoder parameters in the returned loaded-weights set - Fix `TypeError`: `LongCatImageSingleTransformerBlock.__init__` was receiving an unsupported `prefix` keyword argument, causing a crash on startup with any quantization config Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

chatgpt-codex-connector · 2026-04-09T08:16:02Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

…-image Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

Signed-off-by: Lingfeng Zhang <48312954+lcukyfuture@users.noreply.github.com>

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

lishunyang12 · 2026-04-11T17:32:15Z


        self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
-        self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)
+        self.proj_out = ReplicatedLinear(


Your own numbers show quantizing this final proj_out regresses LPIPS from 0.0192 to 0.0767 (~4x worse) for only a ~12% speed gain. Flux keeps the final proj_out as nn.Linear for the same reason — let's match that here and skip quantization on this layer by default, rather than making users discover the ignored_layers flag.

lishunyang12 · 2026-04-11T17:32:15Z

+        help="Task type: t2i (text-to-image), t2v (text-to-video), or image_edit (image editing).",
    )
    parser.add_argument(
        "--quantization",


Dropping nargs="+" is a breaking change — the earlier docstring example (--quantization fp8 int8 bitsandbytes) no longer works. Can you keep multi-method support?

lishunyang12

Review: FP8 quantization support for LongCat-Image and LongCat-Image-Edit

Good work overall. The quantization wiring follows the established pattern from Flux and other models, the benchmark results are solid (1.50x speedup at acceptable LPIPS), and the test coverage additions are appropriate. I have a few issues to flag, one of which is likely to cause problems with the ignored_layers feature.

Issue 1 (Medium): Incomplete prefix propagation for quantization layer matching

In LongCatImageSingleTransformerBlock, the ReplicatedLinear layers use bare prefixes:

self.proj_mlp = ReplicatedLinear(..., prefix="proj_mlp")
self.proj_out = ReplicatedLinear(..., prefix="proj_out")

But in the Flux reference implementation, full hierarchical prefixes are passed:

self.proj_mlp = ReplicatedLinear(..., prefix=f"{prefix}.proj_mlp")
self.proj_out = ReplicatedLinear(..., prefix=f"{prefix}.proj_out")

And the parent model passes indexed prefixes like prefix=f"single_transformer_blocks.{i}" when creating blocks.

The LongCat PR does not pass prefix to LongCatImageTransformerBlock or LongCatImageSingleTransformerBlock, and internally uses bare names. This means:

Basic FP8 (quantize all linears) works fine, as confirmed by the benchmark.
The ignored_layers feature advertised in the benchmark script (e.g., {"method":"fp8","ignored_layers":["proj_out"]}) may behave incorrectly, since the quantization framework matches layers by their full prefix path, not bare names.

Similarly, in LongCatImageAttention, the QKVParallelLinear and RowParallelLinear use bare prefix="to_qkv", prefix="to_out", etc.

Recommendation: Propagate the prefix parameter from the transformer model down through each block and sub-module, following the Flux pattern, so that ignored_layers matching works correctly. The benchmark PR description shows fp8 + skip proj_out as a tested config, so this should be functional.

Issue 2 (Nit): Benchmark switches from `torch.cuda.max_memory_allocated()` to `pynvml` process-wide memory

The _get_gpu_memory_gib() function uses pynvml.nvmlDeviceGetMemoryInfo(handle).used, which reports device-wide GPU memory usage (all processes), not just the current process. The previous code used torch.cuda.max_memory_allocated() which is process-scoped and reports peak allocation.

These measure fundamentally different things. On a shared GPU, the pynvml number will include memory from other processes. Also, the old metric was peak memory; the new one is instantaneous memory at the time of the call (post-generation), which may miss peak usage during inference.

The test file (test_quantization_quality.py) still correctly uses torch.cuda.max_memory_allocated() for its _generate_image_edit function, which is good. But the benchmark script's numbers in the PR description (memory column) may be slightly misleading.

This is fine if the benchmark is intended to run on a dedicated GPU, but worth a comment in the code.

Issue 3 (Minor): `load_weights` change marks VAE/text_encoder params as "loaded" without actually loading them

loaded_weights |= {f"vae.{name}" for name, _ in self.vae.named_parameters()}
loaded_weights |= {f"text_encoder.{name}" for name, _ in self.text_encoder.named_parameters()}

This follows the pattern from pipeline_z_image.py and pipeline_flux2_klein.py, so it's consistent with the codebase convention. Just confirming: these weights are loaded separately via from_pretrained() in __init__, so marking them as loaded prevents the weight loader from warning about unloaded parameters. This is correct.

Minor notes:

The .contiguous() calls added before quantized linear layers (lines 215, 232, 333-334 in the transformer) are a reasonable defensive measure for FP8 kernels that require contiguous inputs. This matches what other models do.
The benchmark refactoring from multi-quantization loop to single-quantization is a simplification that reduces complexity. The removal of the "Multiple quantization methods" example from the docstring is consistent.
Test configs use num_inference_steps=20 which is good for CI speed while still being meaningful for quality checks.

Summary: The core FP8 quantization integration is correct and well-tested. The main actionable item is fixing the prefix propagation (Issue 1) to ensure ignored_layers works as intended. The rest is solid.

lcukyfuture · 2026-04-17T06:23:16Z

@lishunyang12 , Thanks for your review. I have some important things to take care of in these two weeks, and I will fix these issues later(before the end of next week). Sorry for the delay.

hsliuustc0106 · 2026-04-17T06:34:02Z

please explain which part do you quantize

lcukyfuture · 2026-04-27T02:43:37Z

Quantize Layers:

LongCat Component	Quantized Layers
Transformer input projections	context_embedder, x_embedder
Dual-stream attention	to_qkv, add_kv_proj, to_out, to_add_out
Dual-stream FFN	ff.w_in, ff.w_out, ff_context.w_in, ff_context.w_out
Single-stream attention	attn.to_qkv
Single-stream MLP	proj_mlp, proj_out

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

lcukyfuture requested a review from hsliuustc0106 as a code owner April 9, 2026 08:15

[Style] Fix ruff E501 line-too-long errors in longcat_image models

93611b7

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

lcukyfuture force-pushed the feat/fp8-longcat-image branch 2 times, most recently from 7512391 to d9340fa Compare April 9, 2026 08:50

[Feat] FP8 quantization: add benchmarks and quality tests for longcat…

90c7ebe

…-image Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

lcukyfuture force-pushed the feat/fp8-longcat-image branch from d9340fa to 90c7ebe Compare April 9, 2026 08:51

lcukyfuture and others added 2 commits April 9, 2026 17:32

Merge branch 'main' into feat/fp8-longcat-image

1de308b

Signed-off-by: Lingfeng Zhang <48312954+lcukyfuture@users.noreply.github.com>

[Style] Fix ruff import sort and format in benchmarks and test

991a11e

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

lishunyang12 reviewed Apr 11, 2026

View reviewed changes

xiaohajiayou mentioned this pull request Apr 14, 2026

[CI Failure]: Diffusion · Other · Function Test with H100, bagel run error, ValueError: Unknown distributed executor backend: None #2662

Closed

1 task

wtomin mentioned this pull request Apr 15, 2026

[RFC]: Continuous Diffusion Model Acceleration Support #1217

Open

1 task

lishunyang12 mentioned this pull request Apr 15, 2026

[RFC]: Continuous Quantization Support #1854

Open

lishunyang12 added the quantization Code related to quantization label Apr 15, 2026

lishunyang12 reviewed Apr 16, 2026

View reviewed changes

lcukyfuture added 2 commits April 27, 2026 10:52

Address LongCat FP8 quantization review

0850670

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

Format LongCat FP8 review fix

325ca3b

Signed-off-by: lcukyfuture <zlf994478451@outlook.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feat] FP8 quantization support for LongCat-Image and LongCat-Image-Edit#2633

[Feat] FP8 quantization support for LongCat-Image and LongCat-Image-Edit#2633
lcukyfuture wants to merge 7 commits into
vllm-project:mainfrom
lcukyfuture:feat/fp8-longcat-image

lcukyfuture commented Apr 9, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented Apr 9, 2026

Uh oh!

lishunyang12 Apr 11, 2026

Uh oh!

lishunyang12 Apr 11, 2026

Uh oh!

lishunyang12 left a comment

Uh oh!

lcukyfuture commented Apr 17, 2026

Uh oh!

hsliuustc0106 commented Apr 17, 2026

Uh oh!

lcukyfuture commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lcukyfuture commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

LongCat-Image (text-to-image)

LongCat-Image-Edit (image editing)

Findings

Uh oh!

chatgpt-codex-connector Bot commented Apr 9, 2026

Uh oh!

lishunyang12 Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

lishunyang12 Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

lishunyang12 left a comment

Choose a reason for hiding this comment

Review: FP8 quantization support for LongCat-Image and LongCat-Image-Edit

Issue 1 (Medium): Incomplete prefix propagation for quantization layer matching

Issue 2 (Nit): Benchmark switches from torch.cuda.max_memory_allocated() to pynvml process-wide memory

Issue 3 (Minor): load_weights change marks VAE/text_encoder params as "loaded" without actually loading them

Minor notes:

Uh oh!

lcukyfuture commented Apr 17, 2026

Uh oh!

hsliuustc0106 commented Apr 17, 2026

Uh oh!

lcukyfuture commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lcukyfuture commented Apr 9, 2026 •

edited

Loading

Issue 2 (Nit): Benchmark switches from `torch.cuda.max_memory_allocated()` to `pynvml` process-wide memory

Issue 3 (Minor): `load_weights` change marks VAE/text_encoder params as "loaded" without actually loading them