[CI] Add Flux2 Klein Tests #2027

Merged
hsliuustc0106 merged 4 commits into vllm-project:main from alex-jw-brooks:flux2_tests
Mar 22, 2026

Conversation

@alex-jw-brooks
Contributor

Purpose

Adds Flux2 tests for #1832

Test Plan

For the 4B model, I'm not 100% sure the tests can run on the L4 cards, but it's very close, so I think it's worth a try. I also took a pass at the buildkite config, since it doesn't look like L4 card models have been added to it yet.

Tests added are (see the sketch after this list):

  • cache_dit + cpu offload (1 L4 card)
  • cache_dit + ring 2 + ulysses 2 + fp8 (4 L4 cards)
  • cache_dit + tp + cfg parallel + gguf (4 L4 cards)
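
A minimal sketch of what one of these cases looks like as an arg list, modeled on the flag style quoted later in the review; only `--ulysses-degree` and `--quantization` appear in the diff, so the other flag spellings are assumptions, not the actual test code:

```python
# Illustrative only: mirrors the list-of-CLI-args style quoted in the review.
# "--ulysses-degree" and "--quantization" come from the diff; the other flag
# names are assumed and may not match the real test file.
FP8_PARALLEL_CASE = [
    "--ring-degree", "2",      # assumed spelling for the ring attention degree
    "--ulysses-degree", "2",   # from the quoted diff
    "--quantization", "fp8",   # from the quoted diff
]
```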

Test Result

Tests pass locally

@fhfuih @nuclearwu can you please take a look? I think this should cover everything for Flux2 Klein except HSDP at the moment, but I can also reduce the cases if needed.

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@hsliuustc0106
Collaborator

please change the test label from test-nightly to test-merge so that we can test it before merging.

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks
Contributor Author

Sure, thanks @hsliuustc0106! I moved it into test-merge and made it depend on upload-merge-pipeline for now.

@fhfuih
Contributor

fhfuih commented Mar 20, 2026

I wonder whether an L4 can hold this model at runtime, but let's see. Waiting for a "ready" label to get CI started.

"--ulysses-degree",
"2",
"--quantization",
"fp8",
Contributor


Does Flux 2 Klein support FP8? As per #1217, I only see GGUF. Correct me if I'm wrong and there has been a recent update.

Contributor Author


Yes, it does, at least for some layers. I think it was added as a side effect of this PR, since it pushes the vLLM quant config down through the DiT layers. I checked that the quant post-processing is called for the DiT layers, and memory with it on is ~5 GB lower 🙂
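
As a hedged illustration of that kind of check (not the actual verification run here), assuming the DiT is reachable as a plain torch.nn.Module, one could count how many parameters actually landed in FP8 after load:

```python
import torch

def count_fp8_params(module: torch.nn.Module) -> int:
    """Rough proxy for whether the quant config reached the DiT layers:
    counts parameters stored in FP8 dtypes (requires torch >= 2.1)."""
    fp8_dtypes = (torch.float8_e4m3fn, torch.float8_e5m2)
    return sum(p.numel() for p in module.parameters() if p.dtype in fp8_dtypes)
```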

NEGATIVE_PROMPT = "blurry, low quality"


# Currently Flux2 tests target Flux2 Klein.
Contributor


Could you also add an HSDP test case (#1900 merged just 4 days ago)? Thanks!

Contributor Author


Yup! I just went ahead and combined it with the CPU offload test

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks
Contributor Author

alex-jw-brooks commented Mar 20, 2026

Thanks @fhfuih, I think it'll be close, so it's worth a try. From my local testing, memory seems to float around 18-22 GB of VRAM for this model when quant is not used 🤞

Since the tests in this effort mostly focus on making sure the models run without exploding under different acceleration methods/combinations, we could also consider running them with randomly initialized models for tests that aren't explicitly testing the outputs, to keep the CI from getting too heavy. What do you think?

This is similar to how libraries like transformers use a small, randomly initialized model created from a config for their basic tests, with heavy @slow tests to actually verify outputs (example).
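
For reference, the transformers pattern alluded to here looks roughly like this (Llama is just an arbitrary example architecture; the tiny config values are placeholders):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny, randomly initialized model built purely from a config: fast to
# construct, good enough for "does it run" tests, useless for output quality.
config = LlamaConfig(
    hidden_size=16,
    intermediate_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    vocab_size=1000,
)
model = LlamaForCausalLM(config)
```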

@wtomin added the ready label to trigger buildkite CI on Mar 20, 2026
@wtomin
Collaborator

wtomin commented Mar 20, 2026

Since the tests in this effort mostly focus on making sure the models run without exploding under different acceleration methods/combinations, we could also consider running them with randomly initialized models for tests that aren't explicitly testing the outputs, to keep the CI from getting too heavy. What do you think?

I think it is a good idea. I guess we can reuse the Hugging Face internal testing checkpoints, for example https://huggingface.co/hf-internal-testing/tiny-flux2-klein.

Only functionality will be checked in these tests, not speed/accuracy, so I think we can tolerate random weights. @fhfuih, what do you think?

@fhfuih
Contributor

fhfuih commented Mar 20, 2026

Only functionality will be checked in these tests, not speed/accuracy, so I think we can tolerate random weights. @fhfuih, what do you think?

Yeah, I also agree. Random weights should not break the diffusion features.


@alex-jw-brooks, apologies for any confusion, but after some internal discussion we decided to reduce the number of test cases for non-high-priority models. Could you settle on a recommended feature combination for this model and edit the test script to include only that combination? If you need help finding a good combination of diffusion features, see whether the AI skill in hsliuustc0106/vllm-omni-skills#19 can help, or search this repo for the PRs that introduced this model or the relevant features (for example code snippets).

CC @yangjianjuan

Signed-off-by: Alex Brooks <albrooks@redhat.com>
@alex-jw-brooks
Contributor Author

alex-jw-brooks commented Mar 21, 2026

Sounds good, thanks @fhfuih @wtomin. For Flux2, Ulysses SP and FP8 with cache_dit generally give a good speedup (a bit under 2x), so I've updated the test to cover that combination with TP2 for coverage, since TP only adds a little overhead for this configuration. It should also fit on L4s, since memory looks to be around 17 GB per GPU 🤞

For tiny-model testing, maybe let's save it for a follow-up? I think it would be better to add it generically in a separate PR so that new models can be added easily, parametrizing some common test configurations over those models to validate them (kind of like our common multimodal tests in vLLM).
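
A loose sketch of what that follow-up could look like; every name here is a placeholder, not existing vllm-omni code:

```python
import pytest

# Placeholder model list; tiny checkpoints like the one linked above
# could slot in here once a generic harness exists.
TINY_MODELS = ["hf-internal-testing/tiny-flux2-klein"]

@pytest.mark.parametrize("model_id", TINY_MODELS)
def test_pipeline_runs(model_id):
    # Hypothetical shared helper from the imagined follow-up PR: build the
    # pipeline with default acceleration settings and run one tiny generation.
    ...
```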

@hsliuustc0106
Collaborator

common multimodal tests

I think https://github.com/vllm-project/vllm/blob/main/tests/models/multimodal/generation/test_common.py will help a lot; please discuss under RFC #1623.

Collaborator

@hsliuustc0106 left a comment


lgtm

@hsliuustc0106 merged commit a5574a2 into vllm-project:main Mar 22, 2026
8 checks passed
spencerr221 added a commit to spencerr221/vllm-omni that referenced this pull request Mar 24, 2026
Add e2e tests for Stable Diffusion 3.5 medium model following the same
pattern as Flux2 Klein tests. This PR adds test coverage for SD3.5 with
commonly used acceleration features.

Tests added:
* cache_dit + cfg_parallel + tp (4 L4 cards)

The test configuration uses:
- Cache-DiT for faster inference
- CFG-Parallel (size=2) for classifier-free guidance parallelization
- Tensor-Parallel (size=2) for distributed inference

This combination provides good performance improvements while keeping
memory usage reasonable (~18-22GB VRAM per GPU based on similar models).

Test parameters:
- Resolution: 1024x1024 (standard for SD3.5)
- Inference steps: 28 (recommended for quality)
- Guidance scale: 4.5 (SD3.5 default)

Following the pattern from PR vllm-project#2027, only one test case is included
to keep CI lightweight while ensuring the model works with key
acceleration features.

Signed-off-by: LiuBingyu <liubingyu62@gmail.com>
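
For illustration, the SD3.5 case described in that commit would look roughly like this in the arg-list style used earlier in this thread; only the numeric values come from the commit message, and the flag spellings are assumptions:

```python
# Hypothetical arg list for the SD3.5 test case sketched in the commit above.
SD35_CASE = [
    "--tensor-parallel-size", "2",    # TP size from the commit message
    "--cfg-parallel-size", "2",       # assumed spelling for CFG-Parallel
    "--height", "1024",               # 1024x1024, standard for SD3.5
    "--width", "1024",
    "--num-inference-steps", "28",    # recommended for quality
    "--guidance-scale", "4.5",        # SD3.5 default
]
```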