[diffusion][CI]: Add individual component accuracy CI for diffusion models#18709
BBuf merged 35 commits into sgl-project:main
Summary of Changes
Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive accuracy testing framework for diffusion models within SGLang. It establishes a robust CI process by enabling component-level comparisons against established Hugging Face baselines. The framework includes an extensible engine for loading and aligning model weights, along with dedicated adapters for various diffusion model architectures, ensuring consistent and reliable accuracy validation across different configurations and GPU setups.
|
Code Review
This pull request introduces a comprehensive and well-structured framework for component-level accuracy testing of diffusion models. The separation of concerns into adapters, configuration, and a core engine is excellent. My feedback focuses on improving maintainability and robustness by refactoring for conciseness, reducing reliance on fragile string-based logic, and hardening utility functions against unexpected inputs.
|
Nicely done |
- replace model-specific adapter flow with generic hook-based component profiles
- move component comparison execution into a unified native-hook accuracy engine
- wire all 1-GPU/2-GPU accuracy suites to the native hook execution path
- add topology-aware parallel orchestration for mixed component test suites
|
bravo. |
Yes, I ran it locally and it works fine. I can attach logs after rerunning once more; there are 4 test files to run, and the 2-GPU runs take some time. |
|
Hey @mickqian, my bad for the late reply, I got caught up in some other work. Here are the logs for all 4 test files below: 1 GPU A & B:
2 GPU A & B:
|
d594a1c to 322be95
|
@mickqian I think the PR is ready now. Do you have any other suggestions? |
|
We should try to keep CI runtime short — ideally each test completes within a few minutes. |
Yes, I think the text encoders are the main cause of that. I will skip similar models that share the same components. |
|
/tag-and-rerun-ci |
|
/rerun-failed-ci |
|
TODO: wire the tests into run_suite.py. |
|
could you add a sleep after each server shutdown, to make CI more reliable? |
Yes, will do it in this PR #21960. Thanks. |




Motivation
This PR introduces component-level accuracy testing for the diffusion runtime in sglang.multimodal_gen. Related: #12987
This PR adds a dedicated component-accuracy framework for validating SGLang runtime components against their reference Hugging Face implementations.
For individual component accuracy, the reference side is:
- transformer and vae: diffusers
- text_encoder: transformers
That is the core value of this PR. It gives us a targeted correctness layer between raw checkpoint loading and full serving tests.
This is especially important for diffusion runtime components because they are often not raw HF modules. In practice they may differ in:
A component-level parity framework is therefore necessary to validate the actual runtime module implementation instead of relying only on end-to-end pipeline behavior.
Modification
Add a component-accuracy framework for diffusion runtime modules
This PR adds a dedicated accuracy flow for comparing runtime-loaded SGLang components against Hugging Face reference components.
The framework supports the main component families used by the diffusion runtime:
The comparison flow is:
ServerArgs
This gives us a stable component-level correctness signal that can run inside CI.
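As a rough illustration of what a component-level correctness signal can look like, the sketch below computes two common parity metrics between a reference output and a runtime output. It is not the framework's actual code: the function names, the choice of metrics, and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
import math
from typing import Dict, Sequence


def compare_outputs(reference: Sequence[float], candidate: Sequence[float]) -> Dict[str, float]:
    """Return simple parity metrics between two flattened outputs (hypothetical helper)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if not reference:
        raise ValueError("outputs must not be empty")

    # Largest element-wise deviation between the two outputs.
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))

    # Cosine similarity of the two outputs treated as vectors.
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    if norm_ref == 0.0 or norm_cand == 0.0:
        # Cosine is undefined for zero vectors; treat two zero vectors as identical.
        cosine = 1.0 if norm_ref == norm_cand else 0.0
    else:
        cosine = dot / (norm_ref * norm_cand)

    return {"max_abs_diff": max_abs_diff, "cosine": cosine}


def passes(metrics: Dict[str, float], max_abs_tol: float, min_cosine: float) -> bool:
    """A case passes only if both metrics are within their (assumed) thresholds."""
    return metrics["max_abs_diff"] <= max_abs_tol and metrics["cosine"] >= min_cosine
```

Checking both an absolute-difference bound and a cosine bound is one possible design: the first catches isolated large deviations, the second catches a global drift in direction.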
Use the real SGLang runtime loader path
The SGLang side of the comparison is loaded through the actual runtime component loader stack.
That means the framework exercises the same runtime-side loading logic used by the codebase rather than constructing fake
test-only modules. This is important because the goal is to validate the real runtime implementation, including:
Compare against Hugging Face reference implementations
The reference side is loaded from the corresponding Hugging Face implementation for each component family:
- transformer and vae use diffusers
- text_encoder uses transformers
This is intentionally a component-level reference path, so that the framework can compare one runtime component at a time and localize issues much more precisely than end-to-end tests allow.
Add weight-alignment logic for runtime parity testing
A core part of this PR is the weight-alignment path used before comparison.
This exists because the runtime component and the Hugging Face reference component do not always expose weights in the same form. The framework therefore handles:
This allows the SGLang runtime component to be compared in the layout it actually uses in the codebase, while still
grounding the comparison in the Hugging Face reference weights.
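To make the idea of weight alignment concrete, here is a minimal sketch of two kinds of transformation such a step might perform: renaming parameters and fusing several reference tensors into one. The specific key names, the `RENAMES` and `FUSED` tables, and the helper itself are hypothetical and are not taken from this PR; lists of floats stand in for tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

# Hypothetical mapping tables; real parameter names depend on the model.
RENAMES: Dict[str, str] = {"attn.to_out.0.": "attn.proj_out."}
FUSED: Dict[str, List[str]] = {
    "attn.qkv.weight": ["attn.to_q.weight", "attn.to_k.weight", "attn.to_v.weight"],
}


def align_reference_weights(
    reference: Weights, renames: Dict[str, str], fused: Dict[str, List[str]]
) -> Weights:
    """Rewrite reference weights into the layout an (assumed) runtime module expects."""
    aligned: Weights = {}
    consumed = set()

    # Concatenate groups of reference tensors into a single fused tensor.
    for target, sources in fused.items():
        missing = [s for s in sources if s not in reference]
        if missing:
            raise KeyError(f"missing source weights for {target}: {missing}")
        merged: List[float] = []
        for source in sources:
            merged.extend(reference[source])
        aligned[target] = merged
        consumed.update(sources)

    # Copy everything else, applying the first matching prefix rename.
    for name, values in reference.items():
        if name in consumed:
            continue
        new_name = name
        for old_prefix, new_prefix in renames.items():
            if name.startswith(old_prefix):
                new_name = new_prefix + name[len(old_prefix):]
                break
        if new_name in aligned:
            raise ValueError(f"duplicate target weight name: {new_name}")
        aligned[new_name] = list(values)

    return aligned
```

Failing loudly on missing or duplicate names is a deliberate choice in this sketch, since a silent mismatch would otherwise surface later as an unexplained accuracy gap.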
Add forward adapters for component-level comparison
The PR introduces native hook/input adapters that build deterministic synthetic inputs for component comparisons.
These adapters exist so that the framework can produce a valid shared input bundle for both:
This is needed because the two sides do not always expose identical forward signatures. The adapters normalize those differences so the framework can run a clean parity comparison.
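The adapter idea described above could be sketched as follows: one seeded generator builds a shared input bundle, and two small adapters map that bundle onto each side's forward keyword names. The keyword names (`sample`, `hidden_states`, `timestep`), the bundle layout, and the helper names are assumptions for illustration only, and plain lists stand in for tensors.

```python
import random
from typing import Dict, List

Bundle = Dict[str, List[float]]


def make_shared_inputs(seed: int, length: int) -> Bundle:
    """Build a deterministic synthetic input bundle from a fixed seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return {
        "latents": [rng.uniform(-1.0, 1.0) for _ in range(length)],
        "timestep": [500.0],
    }


def to_reference_kwargs(shared: Bundle) -> Bundle:
    """Map the shared bundle onto the (assumed) reference forward signature."""
    return {"sample": shared["latents"], "timestep": shared["timestep"]}


def to_runtime_kwargs(shared: Bundle) -> Bundle:
    """Map the shared bundle onto the (assumed) runtime forward signature."""
    return {"hidden_states": shared["latents"], "timestep": shared["timestep"]}
```

Because both adapters read from the same bundle, any output difference can be attributed to the components rather than to the inputs.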
Add staged low-memory execution for selected 1-GPU cases
Some large 1-GPU cases cannot safely keep both the runtime component and the reference component resident at the same time.
For these cases, the framework supports staged execution:
This keeps the framework usable for memory-constrained cases without changing the comparison contract.
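One way to picture staged execution is the sketch below, which never holds both components at once: it runs the reference, keeps only its output, releases it, and only then loads the runtime component. The loader callables and the `run_staged` helper are hypothetical, and the actual staging in this PR may differ.

```python
import gc
from typing import Callable, List, Sequence, Tuple

Component = Callable[[Sequence[float]], Sequence[float]]


def run_staged(
    load_reference: Callable[[], Component],
    load_runtime: Callable[[], Component],
    inputs: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run the two sides one after another so only one component is resident at a time."""
    # Stage 1: load the reference, run it, and keep only a copy of its output.
    reference = load_reference()
    reference_out = list(reference(inputs))
    del reference
    gc.collect()  # with PyTorch on GPU, torch.cuda.empty_cache() would typically follow

    # Stage 2: load the runtime component only after the reference has been released.
    runtime = load_runtime()
    runtime_out = list(runtime(inputs))
    del runtime
    gc.collect()

    return reference_out, runtime_out
```

The comparison itself is unchanged by staging: the same two outputs are produced, just not at the same time.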
Add distributed-aware runtime setup and cleanup
The framework initializes the same distributed/model-parallel context needed for the selected testcase topology,
including 1-GPU and 2-GPU accuracy runs.
It also performs explicit cleanup after each case so that:
Add testcase policy for thresholds and unsupported boundaries
The framework also includes testcase policy for:
This keeps the reported results meaningful and avoids mixing together:
In short, this PR adds the first component-accuracy framework for the diffusion runtime and grounds it against the
corresponding Hugging Face reference implementations.
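A testcase policy of this kind might be expressed as a small lookup table, as in the sketch below. The model name, the default threshold values, the skip reason, and the `CasePolicy` type are all made up for illustration and do not reflect the thresholds used in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CasePolicy:
    """Per-testcase thresholds plus an optional reason to skip the case (hypothetical)."""
    min_cosine: float = 0.999
    max_abs_diff: float = 1e-2
    skip_reason: Optional[str] = None


DEFAULT_POLICY = CasePolicy()

# Hypothetical overrides keyed by (model, component).
POLICIES: Dict[Tuple[str, str], CasePolicy] = {
    ("example-model", "text_encoder"): CasePolicy(min_cosine=0.99),
    ("example-model", "vae"): CasePolicy(skip_reason="hypothetical unsupported configuration"),
}


def evaluate(model: str, component: str, cosine: float, max_abs_diff: float) -> str:
    """Classify a result as 'skipped', 'passed', or 'failed' under the case policy."""
    policy = POLICIES.get((model, component), DEFAULT_POLICY)
    if policy.skip_reason is not None:
        # Unsupported boundaries are reported separately from real accuracy failures.
        return "skipped"
    ok = cosine >= policy.min_cosine and max_abs_diff <= policy.max_abs_diff
    return "passed" if ok else "failed"
```

Keeping "skipped" distinct from "failed" is what lets a report separate known unsupported cases from genuine regressions.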
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci