[diffusion] Extract post-training weight APIs into mixins and add tensor update/checker paths#22817

Draft
MikukuOvO wants to merge 13 commits into sgl-project:main from MikukuOvO:dev/diffusion-post-training-mixin
Conversation

@MikukuOvO
Contributor

Motivation

This PR consolidates diffusion post-training weight management under dedicated mixins and completes the tensor-based update/checker path for diffusion RL/post-training workflows.

Before this change, diffusion post-training weight operations were split across scheduler and worker implementations, and the tensor-based update/verification flow did not go through the same post-training mixin structure as the disk update path. This PR makes the post-training API surface more consistent and easier to extend while keeping the runtime behavior explicit.

Modifications

  • Extracted diffusion post-training worker logic into GPUWorkerPostTrainingMixin.
  • Extracted diffusion post-training scheduler handlers into SchedulerPostTrainingMixin.
  • Moved the existing diffusion update_weights_from_disk and get_weights_checksum paths onto the new mixin-based structure.
  • Added tensor-based diffusion weight update support through:
    • UpdateWeightFromTensorReqInput
    • POST /update_weights_from_tensor
    • scheduler dispatch + worker handling for deserializing per-rank tensor payloads
  • Added tensor-update verification support through:
    • UpdateWeightFromTensorCheckerReqInput
    • POST /update_weights_from_tensor_checker
    • UpdateWeightFromTensorChecker utility for live transformer verification
  • Extended WeightsUpdater with:
    • update_weights_from_tensor
    • module-scoped payload resolution
    • flattened_bucket reconstruction
    • weight_loader-aware loading
    • DTensor-aware copy/update handling
  • Added TP-aware payload selection in the worker mixin and a TP barrier in the scheduler tensor-update path so success is returned only after all TP ranks finish the update.
  • Added SHA-256-based transformer verification logic that:
    • hashes live tensors in a stable way
    • supports DTensor local shards
    • reconstructs/checks TP-sharded tensors on the root rank when needed
    • returns clearer mismatch / missing-tensor diagnostics
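The mixin extraction above can be sketched as follows. This is a minimal illustration of the pattern, not the PR's actual code; the class names are taken from the description, but the method bodies, the `weights_updater` attribute, and the stand-in classes are assumptions for demonstration.

```python
# Sketch of consolidating post-training weight APIs into a worker mixin.
# GPUWorkerPostTrainingMixin is named in the PR; everything else here
# (weights_updater, _FakeUpdater, DemoWorker) is illustrative.

class GPUWorkerPostTrainingMixin:
    """Groups post-training weight APIs in one place so the worker class
    stays focused on serving while RL/post-training surfaces evolve together."""

    def update_weights_from_disk(self, model_path: str) -> dict:
        # Delegate to an updater object owned by the worker (assumed attribute).
        self.weights_updater.load_from_disk(model_path)
        return {"success": True, "path": model_path}

    def update_weights_from_tensor(self, named_tensors: dict) -> dict:
        self.weights_updater.load_from_tensors(named_tensors)
        return {"success": True, "updated": sorted(named_tensors)}


class _FakeUpdater:
    """Stand-in for WeightsUpdater, for demonstration only."""
    def __init__(self):
        self.loaded = {}

    def load_from_disk(self, path):
        self.loaded["disk_path"] = path

    def load_from_tensors(self, tensors):
        self.loaded.update(tensors)


class DemoWorker(GPUWorkerPostTrainingMixin):
    def __init__(self):
        self.weights_updater = _FakeUpdater()
```

The scheduler-side SchedulerPostTrainingMixin would follow the same shape, dispatching request objects to these worker methods.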
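The flattened_bucket reconstruction step can be illustrated with a small round-trip: per-rank payloads arrive as one flat buffer plus (name, shape, dtype) metadata, and are sliced back into named tensors before loading. This sketch uses NumPy arrays in place of torch tensors, and the function and field names are assumptions, not the PR's identifiers.

```python
# Illustrative flattened-bucket pack/unpack, using NumPy instead of torch.
import numpy as np

def flatten_bucket(named_tensors):
    """Pack tensors into one flat buffer plus reconstruction metadata."""
    meta, chunks = [], []
    for name in sorted(named_tensors):  # stable order across ranks
        t = np.ascontiguousarray(named_tensors[name])
        meta.append({"name": name, "shape": t.shape, "dtype": str(t.dtype)})
        chunks.append(t.ravel())
    return np.concatenate(chunks), meta

def unflatten_bucket(flat, meta):
    """Slice the flat buffer back into named tensors using the metadata."""
    out, offset = {}, 0
    for m in meta:
        n = int(np.prod(m["shape"]))
        out[m["name"]] = flat[offset:offset + n].reshape(m["shape"]).astype(m["dtype"])
        offset += n
    return out

tensors = {"a.weight": np.arange(6.0).reshape(2, 3), "b.bias": np.ones(4)}
flat, meta = flatten_bucket(tensors)
restored = unflatten_bucket(flat, meta)
```

In the real path, the reconstructed tensors would then go through weight_loader-aware loading and DTensor-aware copy/update handling rather than a plain dict.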
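The SHA-256-based verification can be sketched as hashing live tensors in a stable order, folding each tensor's name, shape, dtype, and raw bytes into one digest. This is a simplified stand-in (NumPy instead of torch; the real path would also gather DTensor local shards and reconstruct TP-sharded tensors on the root rank before hashing), and the function name is illustrative.

```python
# Illustrative stable checksum over named weights, NumPy in place of torch.
import hashlib
import numpy as np

def weights_checksum(named_tensors):
    """SHA-256 over all tensors, iterated in sorted-name order so the
    digest is independent of dict insertion order."""
    h = hashlib.sha256()
    for name in sorted(named_tensors):
        t = np.ascontiguousarray(named_tensors[name])
        h.update(name.encode())
        h.update(str(t.shape).encode())   # shape/dtype guard against
        h.update(str(t.dtype).encode())   # byte-identical reinterpretations
        h.update(t.tobytes())             # raw tensor bytes
    return h.hexdigest()

weights = {"ffn_norm1.weight": np.zeros((2, 2), dtype=np.float32)}
before = weights_checksum(weights)
weights["ffn_norm1.weight"] = np.ones((2, 2), dtype=np.float32)
after = weights_checksum(weights)  # differs from `before` after the update
```

A checksum diff before/after an update is exactly the signal used in the server-level validation below: the digest changes when, and only when, some tensor's bytes change.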

Accuracy Tests

This PR does not modify model forward math or inference kernels. It adds/refactors post-training weight-management and verification paths.

Manual functional validation run in miles-diffusion:

  • conda run -n miles-diffusion python /tmp/test_diffusion_post_training_tensor_checker.py
  • conda run -n miles-diffusion python /tmp/test_diffusion_post_training_server_tensor_checker.py

Results:

  • worker-level update_weights_from_tensor -> checker: PASS
  • server-level POST /update_weights_from_tensor: PASS
  • server-level POST /update_weights_from_tensor_checker success path: PASS
  • server-level POST /update_weights_from_tensor_checker mismatch path: PASS (expected failure)
  • server-level POST /get_weights_checksum: PASS

Example server-level validation on Tongyi-MAI/Z-Image-Turbo:

  • updated tensor: noise_refiner.0.ffn_norm1.weight
  • transformer checksum before update:
    • 760694bc3805aac30827f46150f505facb973810599961ffd05aa3a6a2fdaa2e
  • transformer checksum after update:
    • 74da79e1a3225a2f25e17572c987ff14197f0c97d810687bdc54e55cb8f9bfa4
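A client-side sketch of the kind of payload such a validation sends: a named tensor serialized into a JSON-safe form for POST /update_weights_from_tensor. The actual request schema (UpdateWeightFromTensorReqInput) is not shown in this PR description and likely uses torch serialization; this base64-over-JSON shape, the field names, and the helper functions are illustrative assumptions only.

```python
# Hypothetical client-side tensor payload for the tensor-update endpoint.
import base64
import json
import numpy as np

def encode_tensor_payload(name, tensor):
    """Serialize one tensor into a JSON-safe dict (illustrative schema)."""
    t = np.ascontiguousarray(tensor)
    return {
        "name": name,
        "shape": list(t.shape),
        "dtype": str(t.dtype),
        "data": base64.b64encode(t.tobytes()).decode("ascii"),
    }

def decode_tensor_payload(payload):
    """Server-side inverse: rebuild the tensor from the payload."""
    raw = base64.b64decode(payload["data"])
    return np.frombuffer(raw, dtype=payload["dtype"]).reshape(payload["shape"])

payload = encode_tensor_payload(
    "noise_refiner.0.ffn_norm1.weight",
    np.zeros((4, 4), dtype=np.float32),
)
body = json.dumps({"named_tensors": [payload]})  # would be POSTed to the server
```

On the server side, the scheduler would dispatch this to each TP rank's worker, which decodes its own slice of the payload before loading.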

Speed Tests and Profiling

N/A.

This PR only affects diffusion post-training weight update / verification paths and does not change the normal inference hot path. No dedicated speed benchmarking or profiling was run.

Related PRs

This PR builds on the diffusion post-training weight-update work introduced earlier:

  • #18306 [Feature] Implement update_weights_from_disk for SGLang-D
  • #20464 Add update_weights_from_tensor pipeline to Diffusion
  • #21106 [diffusion] Add update_weights_from_tensor checker

If this stack is being split/reviewed incrementally, the local branch also contains these related commits:

  • 6cdd0b542 refactor: extract diffusion disk weight update mixins
  • 5dfbefac3 feat: add diffusion tensor weight update mixin path
  • 75e0cdbb3 refactor: extract diffusion weight checksum mixins
  • cbb0f690f feat: add diffusion tensor weight checker mixin path

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

MikukuOvO and others added 4 commits April 14, 2026 17:02
Co-authored-by: dreamyang-liu <nikolaliu@icloud.com>
Co-authored-by: Xiaole Guo <vera0315@connect.hku.hk>
Co-authored-by: dreamyang-liu <nikolaliu@icloud.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: MikukuOvO <118185781+MikukuOvO@users.noreply.github.com>
@github-actions github-actions bot added the diffusion SGLang Diffusion label Apr 14, 2026
@MikukuOvO MikukuOvO changed the title Dev/diffusion post training mixin [diffusion] Extract post-training weight APIs into mixins and add tensor update/checker paths Apr 14, 2026