
[diffusion][CI]: Add individual component accuracy CI for diffusion models #18709

Merged
BBuf merged 35 commits into sgl-project:main from Ratish1:feat/accuracy-test
Apr 1, 2026

Conversation

Collaborator

@Ratish1 Ratish1 commented Feb 12, 2026

Motivation

This PR introduces component-level accuracy testing for the diffusion runtime in sglang.multimodal_gen. Related: #12987

This PR adds a dedicated component-accuracy framework for validating SGLang runtime components against their reference Hugging Face implementations.

For individual component accuracy, the reference side is:

  • diffusers for transformer and vae
  • transformers for text_encoder

That is the core value of this PR. It gives us a targeted correctness layer between raw checkpoint loading and full
serving tests.

This is especially important for diffusion runtime components because they are often not raw HF modules. In practice they may differ in:

  • parameter naming
  • fused versus split tensor layouts
  • tensor-parallel sharding
  • runtime-specific initialization and loading paths

A component-level parity framework is therefore necessary to validate the actual runtime module implementation instead of relying only on end-to-end pipeline behavior.

Modification

Add a component-accuracy framework for diffusion runtime modules

This PR adds a dedicated accuracy flow for comparing runtime-loaded SGLang components against Hugging Face reference components.

The framework supports the main component families used by the diffusion runtime:

  • transformer
  • VAE
  • text encoder

The comparison flow is:

  1. resolve the concrete component source for a testcase
  2. build runtime-style ServerArgs
  3. initialize the required distributed/runtime state
  4. load the SGLang component through the actual runtime loader path
  5. load the corresponding Hugging Face reference component
  6. align reference weights into the runtime component
  7. run both components
  8. compare outputs with cosine similarity

This gives us a stable component-level correctness signal that can run inside CI.
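The final step of the flow above can be sketched in miniature. Everything here (the helper names, the 0.999 threshold) is illustrative rather than the PR's actual API, and the real framework compares flattened tensors rather than Python lists:

```python
import math

def cosine_similarity(a, b):
    # Flattened component outputs are compared as plain vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def check_parity(runtime_out, reference_out, threshold=0.999):
    # Pass when the runtime and reference outputs are nearly parallel.
    sim = cosine_similarity(runtime_out, reference_out)
    return sim >= threshold, sim

ok, sim = check_parity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Cosine similarity is a reasonable parity metric here because it is insensitive to small uniform scale differences while still catching layout or loading mistakes, which tend to destroy directional agreement.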

Use the real SGLang runtime loader path

The SGLang side of the comparison is loaded through the actual runtime component loader stack.

That means the framework exercises the same runtime-side loading logic used by the codebase rather than constructing fake
test-only modules. This is important because the goal is to validate the real runtime implementation, including:

  • runtime model class resolution
  • distributed/model-parallel setup
  • component-specific loader logic
  • runtime-specific tensor layouts

Compare against Hugging Face reference implementations

The reference side is loaded from the corresponding Hugging Face implementation for each component family:

  • transformer and vae use diffusers
  • text_encoder uses transformers

This is intentionally a component-level reference path, so that the framework can compare one runtime component at a time and localize issues much more precisely than end-to-end tests allow.

Add weight-alignment logic for runtime parity testing

A core part of this PR is the weight-alignment path used before comparison.

This exists because the runtime component and the Hugging Face reference component do not always expose weights in the same form. The framework therefore handles:

  • parameter name remapping
  • known fused tensor layouts such as QKV and gate/up projections
  • tensor-parallel-aware copying into sharded runtime parameters

This allows the SGLang runtime component to be compared in the layout it actually uses in the codebase, while still
grounding the comparison in the Hugging Face reference weights.
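As a toy illustration of the alignment step, with 1-D lists standing in for weight tensors (the real code operates on named PyTorch parameters, and the actual fused-QKV split rules are more involved than an even slice):

```python
def fuse_qkv(q, k, v):
    # Concatenate separate HF q/k/v weights into a single fused QKV
    # tensor, matching a runtime layout that stores one projection.
    return q + k + v

def shard_for_tp(fused, tp_rank, tp_size):
    # Copy the slice of a fused weight belonging to one tensor-parallel
    # rank, assuming an even split along the output dimension.
    chunk = len(fused) // tp_size
    return fused[tp_rank * chunk:(tp_rank + 1) * chunk]

fused = fuse_qkv([1, 2], [3, 4], [5, 6])
rank0 = shard_for_tp(fused, 0, 2)
```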

Add forward adapters for component-level comparison

The PR introduces native hook/input adapters that build deterministic synthetic inputs for component comparisons.

These adapters exist so that the framework can produce a valid shared input bundle for both:

  • the SGLang runtime component
  • the Hugging Face reference component

This is needed because the two sides do not always expose identical forward signatures. The adapters normalize those differences so the framework can run a clean parity comparison.
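A minimal sketch of the deterministic-input idea (the helper name and shapes are hypothetical; the framework builds real tensors, but the seeding principle is the same):

```python
import random

def make_deterministic_inputs(seed, rows, cols):
    # A fixed seed guarantees both sides receive identical inputs, so
    # any output difference is attributable to the components themselves.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

sglang_input = make_deterministic_inputs(42, 2, 4)
reference_input = make_deterministic_inputs(42, 2, 4)
```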

Add staged low-memory execution for selected 1-GPU cases

Some large 1-GPU cases cannot safely keep both the runtime component and the reference component resident at the same time.

For these cases, the framework supports staged execution:

  • run the SGLang side first
  • move the output to CPU
  • release the first stage
  • run the reference side
  • compare CPU outputs

This keeps the framework usable for memory-constrained cases without changing the comparison contract.
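The staged flow reduces to a simple sequential harness; the callables and names below are illustrative, and the real framework additionally moves outputs to CPU and frees GPU memory between the two stages:

```python
import gc

def run_staged(run_sglang, run_reference, compare):
    # Stage 1: run the runtime component and keep only its output.
    runtime_out = run_sglang()
    gc.collect()  # stand-in for releasing the first stage's GPU memory
    # Stage 2: only now run the reference component.
    reference_out = run_reference()
    # Compare on CPU so neither model needs to stay resident.
    return compare(runtime_out, reference_out)

passed = run_staged(lambda: [1.0, 2.0],
                    lambda: [1.0, 2.0],
                    lambda a, b: a == b)
```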

Add distributed-aware runtime setup and cleanup

The framework initializes the same distributed/model-parallel context needed for the selected testcase topology,
including 1-GPU and 2-GPU accuracy runs.

It also performs explicit cleanup after each case so that:

  • runtime state does not leak between cases
  • multi-GPU runs exit cleanly
  • failures are easier to localize per testcase
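The per-case cleanup guarantee boils down to a try/finally pattern, sketched here with hypothetical setup/teardown hooks (in the real code, teardown destroys the distributed process group):

```python
def run_case(setup, teardown, body):
    # Teardown always runs, so a failing case cannot leak runtime state
    # into the next one, and failures stay localized per testcase.
    state = setup()
    try:
        return body(state)
    finally:
        teardown(state)

events = []
result = run_case(lambda: events.append("setup") or "ctx",
                  lambda s: events.append("teardown"),
                  lambda s: events.append("body") or "ok")
```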

Add testcase policy for thresholds and unsupported boundaries

The framework also includes testcase policy for:

  • threshold overrides where distributed/runtime variance is acceptable
  • unsupported reference-loading boundaries that should be skipped rather than reported as false failures

This keeps the reported results meaningful and avoids mixing together:

  • real runtime correctness issues
  • unsupported comparison boundaries
  • expected minor variance in multi-GPU parity runs
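A sketch of what such a policy table can look like; the model names are borrowed from the adapters in this PR, but every threshold, override, and skip reason below is made up for illustration:

```python
DEFAULT_THRESHOLD = 0.999

# Relax thresholds where multi-GPU variance is expected (hypothetical values).
THRESHOLD_OVERRIDES = {
    ("flux", "transformer", 2): 0.995,
}

# Skip cases whose reference side cannot be loaded, rather than fail them.
SKIPS = {
    ("wan", "text_encoder"): "reference loading unsupported",
}

def resolve_policy(model, component, tp_size):
    # Returns ("skip", reason) or ("run", threshold) for one testcase.
    if (model, component) in SKIPS:
        return ("skip", SKIPS[(model, component)])
    threshold = THRESHOLD_OVERRIDES.get((model, component, tp_size),
                                        DEFAULT_THRESHOLD)
    return ("run", threshold)
```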

In short, this PR adds the first component-accuracy framework for the diffusion runtime and grounds it against the
corresponding Hugging Face reference implementations.

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the diffusion SGLang Diffusion label Feb 12, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive accuracy testing framework for diffusion models within SGLang. It establishes a robust CI process by enabling component-level comparisons against established Hugging Face baselines. The framework includes an extensible engine for loading and aligning model weights, along with dedicated adapters for various diffusion model architectures, ensuring consistent and reliable accuracy validation across different configurations and GPU setups.

Highlights

  • Component-level Accuracy Tests: Added component-level accuracy tests for VAE, transformer, and text encoder components of SGLang diffusion models, comparing them against Hugging Face Diffusers/Transformers baselines.
  • Deterministic Inputs and Centralized Configuration: Implemented deterministic inputs and centralized thresholds/skips to ensure CI stability and manage model-specific variations.
  • New Accuracy Engine: Introduced a new accuracy engine responsible for component loading, weight alignment, and cosine-similarity checks between SGLang and reference models.
  • Model-Specific Adapters: Developed model-specific adapters for Flux, Wan, Qwen, ZImage, and Hunyuan models to handle input generation for accuracy tests.
  • Test Suites: Created 1-GPU and 2-GPU test suites (test_accuracy_*) to cover various testing scenarios.


Changelog
  • python/sglang/multimodal_gen/test/server/accuracy_adapters.py
    • Added a new file defining abstract ComponentAdapter and concrete implementations for Flux, Wan, Qwen, ZImage, and Hunyuan models to manage input generation and model execution for accuracy tests.
  • python/sglang/multimodal_gen/test/server/accuracy_config.py
    • Added a new file to centralize default accuracy thresholds, per-case overrides, and explicit skip reasons for VAE, Transformer, and Text Encoder components.
  • python/sglang/multimodal_gen/test/server/accuracy_utils.py
    • Added a new file providing utility functions for deterministic random number generation, tensor statistics logging, and robust extraction of output tensors from model results.
  • python/sglang/multimodal_gen/test/server/component_accuracy.py
    • Added a new file implementing the core AccuracyEngine responsible for loading SGLang and reference model components, performing TP-aware weight transfer, and conducting cosine similarity checks.
  • python/sglang/multimodal_gen/test/server/test_accuracy_1_gpu_a.py
    • Added a new file containing pytest-based accuracy tests for VAE, Transformer, and Text Encoder components on a single GPU for a specific set of diffusion models.
  • python/sglang/multimodal_gen/test/server/test_accuracy_1_gpu_b.py
    • Added a new file containing pytest-based accuracy tests for VAE, Transformer, and Text Encoder components on a single GPU for another specific set of diffusion models.
  • python/sglang/multimodal_gen/test/server/test_accuracy_2_gpu_a.py
    • Added a new file containing pytest-based accuracy tests for VAE, Transformer, and Text Encoder components across two GPUs for a specific set of diffusion models, utilizing torchrun.
  • python/sglang/multimodal_gen/test/server/test_accuracy_2_gpu_b.py
    • Added a new file containing pytest-based accuracy tests for VAE, Transformer, and Text Encoder components across two GPUs for another specific set of diffusion models, utilizing torchrun.
  • test/registered/diffusion/test_diffusion_component_accuracy_1gpu_a.py
    • Added a new CI registration file to run the 1-GPU accuracy test suite A.
  • test/registered/diffusion/test_diffusion_component_accuracy_1gpu_b.py
    • Added a new CI registration file to run the 1-GPU accuracy test suite B.
  • test/registered/diffusion/test_diffusion_component_accuracy_2gpu_a.py
    • Added a new CI registration file to run the 2-GPU accuracy test suite A.
  • test/registered/diffusion/test_diffusion_component_accuracy_2gpu_b.py
    • Added a new CI registration file to run the 2-GPU accuracy test suite B.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive and well-structured framework for component-level accuracy testing of diffusion models. The separation of concerns into adapters, configuration, and a core engine is excellent. My feedback focuses on improving maintainability and robustness by refactoring for conciseness, reducing reliance on fragile string-based logic, and hardening utility functions against unexpected inputs.

@zhaochenyang20
Collaborator

Nicely done.

- replace model-specific adapter flow with generic hook-based component profiles
- move component comparison execution into a unified native-hook accuracy engine
- wire all 1-GPU/2-GPU accuracy suites to the native hook execution path
- add topology-aware parallel orchestration for mixed component test suites
@Ratish1 Ratish1 marked this pull request as ready for review March 2, 2026 18:38
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@mickqian
Collaborator

mickqian commented Mar 8, 2026

Bravo.
Is it working fine now? Could you attach some logs?

@Ratish1
Collaborator Author

Ratish1 commented Mar 8, 2026

Bravo. Is it working fine now? Could you attach some logs?

Yes, I ran it locally and it works fine. I can attach logs after rerunning once more, since there are 4 test files to run and the 2-GPU runs take some time.

@Ratish1
Collaborator Author

Ratish1 commented Mar 9, 2026

Hey @mickqian, my bad for the late reply; I got caught up in some other work. Here are the logs for all 4 test files below:

1 GPU A & B:

[screenshots: 1-GPU suite A and B test logs]

2 GPU A & B:

[screenshots: 2-GPU suite A and B test logs]

Collaborator

@BBuf BBuf left a comment


LGTM now.

@BBuf
Collaborator

BBuf commented Mar 28, 2026

@mickqian I think the PR is ready now. Do you have any other suggestions?

@BBuf
Collaborator

BBuf commented Mar 28, 2026

We should try to keep CI runtime short — ideally each test completes within a few minutes.


@Ratish1
Collaborator Author

Ratish1 commented Mar 28, 2026

We should try to keep CI runtime short — ideally each test completes within a few minutes.

Yes, I think the text encoders are the problem there. I'll skip similar models that share the same components.

@Ratish1
Collaborator Author

Ratish1 commented Mar 28, 2026

Fixed CI runtime @BBuf @mickqian

@BBuf
Collaborator

BBuf commented Mar 30, 2026

/tag-and-rerun-ci

@ping1jing2
Collaborator

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Apr 1, 2026

/rerun-failed-ci

@Ratish1
Collaborator Author

Ratish1 commented Apr 1, 2026

/rerun-failed-ci


@Ratish1
Collaborator Author

Ratish1 commented Apr 1, 2026

TODO: wire the tests into run_suite.py.

@mickqian
Collaborator

mickqian commented Apr 3, 2026

could you add a sleep after each server shutdown, to make CI more reliable?

@Ratish1
Collaborator Author

Ratish1 commented Apr 3, 2026

could you add a sleep after each server shutdown, to make CI more reliable?

Yes, will do it in this PR #21960 . Thanks.


Labels

diffusion (SGLang Diffusion), Multi-modal (multi-modal language model), run-ci
