[diffusion] Extract post-training weight APIs into mixins and add tensor update/checker paths#22817

Draft
MikukuOvO wants to merge 13 commits into sgl-project:main from MikukuOvO:dev/diffusion-post-training-mixin
Conversation

@MikukuOvO
Contributor

Motivation

This PR consolidates diffusion post-training weight management under dedicated mixins and completes the tensor-based update/checker path for diffusion RL/post-training workflows.

Before this change, diffusion post-training weight operations were split across scheduler and worker implementations, and the tensor-based update/verification flow did not go through the same post-training mixin structure as the disk update path. This PR makes the post-training API surface more consistent and easier to extend while keeping the runtime behavior explicit.

Modifications

  • Extracted diffusion post-training worker logic into GPUWorkerPostTrainingMixin.
  • Extracted diffusion post-training scheduler handlers into SchedulerPostTrainingMixin.
  • Moved the existing diffusion update_weights_from_disk and get_weights_checksum paths onto the new mixin-based structure.
  • Added tensor-based diffusion weight update support through:
    • UpdateWeightFromTensorReqInput
    • POST /update_weights_from_tensor
    • scheduler dispatch + worker handling for deserializing per-rank tensor payloads
  • Added tensor-update verification support through:
    • UpdateWeightFromTensorCheckerReqInput
    • POST /update_weights_from_tensor_checker
    • UpdateWeightFromTensorChecker utility for live transformer verification
  • Extended WeightsUpdater with:
    • update_weights_from_tensor
    • module-scoped payload resolution
    • flattened_bucket reconstruction
    • weight_loader-aware loading
    • DTensor-aware copy/update handling
  • Added TP-aware payload selection in the worker mixin and a TP barrier in the scheduler tensor-update path so success is returned only after all TP ranks finish the update.
  • Added SHA-256-based transformer verification logic that:
    • hashes live tensors in a stable way
    • supports DTensor local shards
    • reconstructs/checks TP-sharded tensors on the root rank when needed
    • returns clearer mismatch / missing-tensor diagnostics
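The mixin extraction above can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code; the class names are taken from the description, but the method bodies, the `weights_updater` attribute, and the stand-in classes are assumptions for demonstration.

```python
# Sketch of consolidating post-training weight APIs into a worker mixin.
# GPUWorkerPostTrainingMixin is named in the PR; everything else here
# (weights_updater, _FakeUpdater, DemoWorker) is illustrative.

class GPUWorkerPostTrainingMixin:
    """Groups post-training weight APIs in one place so the worker class
    stays focused on serving while RL/post-training surfaces evolve together."""

    def update_weights_from_disk(self, model_path: str) -> dict:
        # Delegate to an updater object owned by the worker (assumed attribute).
        self.weights_updater.load_from_disk(model_path)
        return {"success": True, "path": model_path}

    def update_weights_from_tensor(self, named_tensors: dict) -> dict:
        self.weights_updater.load_from_tensors(named_tensors)
        return {"success": True, "updated": sorted(named_tensors)}


class _FakeUpdater:
    """Stand-in for WeightsUpdater, for demonstration only."""
    def __init__(self):
        self.loaded = {}

    def load_from_disk(self, path):
        self.loaded["disk_path"] = path

    def load_from_tensors(self, tensors):
        self.loaded.update(tensors)


class DemoWorker(GPUWorkerPostTrainingMixin):
    def __init__(self):
        self.weights_updater = _FakeUpdater()
```

The scheduler-side SchedulerPostTrainingMixin would follow the same shape, dispatching request objects to these worker methods.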
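The flattened_bucket reconstruction step can be illustrated with a small round-trip: per-rank payloads arrive as one flat buffer plus (name, shape, dtype) metadata, and are sliced back into named tensors before loading. This sketch uses NumPy arrays in place of torch tensors, and the function and field names are assumptions, not the PR's identifiers.

```python
# Illustrative flattened-bucket pack/unpack, using NumPy instead of torch.
import numpy as np

def flatten_bucket(named_tensors):
    """Pack tensors into one flat buffer plus reconstruction metadata."""
    meta, chunks = [], []
    for name in sorted(named_tensors):  # stable order across ranks
        t = np.ascontiguousarray(named_tensors[name])
        meta.append({"name": name, "shape": t.shape, "dtype": str(t.dtype)})
        chunks.append(t.ravel())
    return np.concatenate(chunks), meta

def unflatten_bucket(flat, meta):
    """Slice the flat buffer back into named tensors using the metadata."""
    out, offset = {}, 0
    for m in meta:
        n = int(np.prod(m["shape"]))
        out[m["name"]] = flat[offset:offset + n].reshape(m["shape"]).astype(m["dtype"])
        offset += n
    return out

tensors = {"a.weight": np.arange(6.0).reshape(2, 3), "b.bias": np.ones(4)}
flat, meta = flatten_bucket(tensors)
restored = unflatten_bucket(flat, meta)
```

In the real path, the reconstructed tensors would then go through weight_loader-aware loading and DTensor-aware copy/update handling rather than a plain dict.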
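The SHA-256-based verification can be sketched as hashing live tensors in a stable order, folding each tensor's name, shape, dtype, and raw bytes into one digest. This is a simplified stand-in (NumPy instead of torch; the real path would also gather DTensor local shards and reconstruct TP-sharded tensors on the root rank before hashing), and the function name is illustrative.

```python
# Illustrative stable checksum over named weights, NumPy in place of torch.
import hashlib
import numpy as np

def weights_checksum(named_tensors):
    """SHA-256 over all tensors, iterated in sorted-name order so the
    digest is independent of dict insertion order."""
    h = hashlib.sha256()
    for name in sorted(named_tensors):
        t = np.ascontiguousarray(named_tensors[name])
        h.update(name.encode())
        h.update(str(t.shape).encode())   # shape/dtype guard against
        h.update(str(t.dtype).encode())   # byte-identical reinterpretations
        h.update(t.tobytes())             # raw tensor bytes
    return h.hexdigest()

weights = {"ffn_norm1.weight": np.zeros((2, 2), dtype=np.float32)}
before = weights_checksum(weights)
weights["ffn_norm1.weight"] = np.ones((2, 2), dtype=np.float32)
after = weights_checksum(weights)  # differs from `before` after the update
```

A checksum diff before/after an update is exactly the signal used in the server-level validation below: the digest changes when, and only when, some tensor's bytes change.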

Accuracy Tests

This PR does not modify model forward math or inference kernels. It adds/refactors post-training weight-management and verification paths.

Manual functional validation run in miles-diffusion:

  • conda run -n miles-diffusion python /tmp/test_diffusion_post_training_tensor_checker.py
  • conda run -n miles-diffusion python /tmp/test_diffusion_post_training_server_tensor_checker.py

Results:

  • worker-level update_weights_from_tensor -> checker: PASS
  • server-level POST /update_weights_from_tensor: PASS
  • server-level POST /update_weights_from_tensor_checker success path: PASS
  • server-level POST /update_weights_from_tensor_checker mismatch path: PASS (expected failure)
  • server-level POST /get_weights_checksum: PASS

Example server-level validation on Tongyi-MAI/Z-Image-Turbo:

  • updated tensor: noise_refiner.0.ffn_norm1.weight
  • transformer checksum before update:
    • 760694bc3805aac30827f46150f505facb973810599961ffd05aa3a6a2fdaa2e
  • transformer checksum after update:
    • 74da79e1a3225a2f25e17572c987ff14197f0c97d810687bdc54e55cb8f9bfa4
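A client-side sketch of the kind of payload such a validation sends: a named tensor serialized into a JSON-safe form for POST /update_weights_from_tensor. The actual request schema (UpdateWeightFromTensorReqInput) is not shown in this PR description and likely uses torch serialization; this base64-over-JSON shape, the field names, and the helper functions are illustrative assumptions only.

```python
# Hypothetical client-side tensor payload for the tensor-update endpoint.
import base64
import json
import numpy as np

def encode_tensor_payload(name, tensor):
    """Serialize one tensor into a JSON-safe dict (illustrative schema)."""
    t = np.ascontiguousarray(tensor)
    return {
        "name": name,
        "shape": list(t.shape),
        "dtype": str(t.dtype),
        "data": base64.b64encode(t.tobytes()).decode("ascii"),
    }

def decode_tensor_payload(payload):
    """Server-side inverse: rebuild the tensor from the payload."""
    raw = base64.b64decode(payload["data"])
    return np.frombuffer(raw, dtype=payload["dtype"]).reshape(payload["shape"])

payload = encode_tensor_payload(
    "noise_refiner.0.ffn_norm1.weight",
    np.zeros((4, 4), dtype=np.float32),
)
body = json.dumps({"named_tensors": [payload]})  # would be POSTed to the server
```

On the server side, the scheduler would dispatch this to each TP rank's worker, which decodes its own slice of the payload before loading.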

Speed Tests and Profiling

N/A.

This PR only affects diffusion post-training weight update / verification paths and does not change the normal inference hot path. No dedicated speed benchmarking or profiling was run.

Related PRs

This PR builds on the diffusion post-training weight-update work introduced earlier:

  • #18306 [Feature] Implement update_weights_from_disk for SGLang-D
  • #20464 Add update_weights_from_tensor pipeline to Diffusion
  • #21106 [diffusion] Add update_weights_from_tensor checker

If this stack is being split/reviewed incrementally, the local branch also contains these related commits:

  • 6cdd0b542 refactor: extract diffusion disk weight update mixins
  • 5dfbefac3 feat: add diffusion tensor weight update mixin path
  • 75e0cdbb3 refactor: extract diffusion weight checksum mixins
  • cbb0f690f feat: add diffusion tensor weight checker mixin path

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

MikukuOvO and others added 4 commits April 14, 2026 17:02
Co-authored-by: dreamyang-liu <nikolaliu@icloud.com>
Co-authored-by: Xiaole Guo <vera0315@connect.hku.hk>
Co-authored-by: dreamyang-liu <nikolaliu@icloud.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: MikukuOvO <118185781+MikukuOvO@users.noreply.github.com>
@github-actions github-actions bot added the diffusion SGLang Diffusion label Apr 14, 2026
@MikukuOvO MikukuOvO changed the title Dev/diffusion post training mixin [diffusion] Extract post-training weight APIs into mixins and add tensor update/checker paths Apr 14, 2026