[diffusion] Extract post-training weight APIs into mixins and add tensor update/checker paths #22817
Draft
MikukuOvO wants to merge 13 commits into sgl-project:main
Co-authored-by: dreamyang-liu <nikolaliu@icloud.com>
Co-authored-by: Xiaole Guo <vera0315@connect.hku.hk>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: MikukuOvO <118185781+MikukuOvO@users.noreply.github.com>
Motivation
This PR consolidates diffusion post-training weight management under dedicated mixins and completes the tensor-based update/checker path for diffusion RL/post-training workflows.
Before this change, diffusion post-training weight operations were split across scheduler/worker implementations, and the tensor update / verification flow was not routed through the same post-training mixin structure as the disk update path. This PR makes the post-training API surface more consistent and easier to extend while keeping the runtime behavior explicit.
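As a rough illustration of the consolidation described above, the sketch below routes a disk path and a tensor path through one shared routine on a mixin. `ToyPostTrainingMixin`, `ToyWorker`, `_apply`, and all signatures are hypothetical and use plain Python lists in place of tensors; only the method names `update_weights_from_disk` and `update_weights_from_tensor` come from this PR, and the real mixins may be structured differently.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Weights = Dict[str, List[float]]


class ToyPostTrainingMixin:
    """Hypothetical mixin: every weight-update entry point funnels into _apply."""

    weights: Weights  # supplied by the host class

    def _apply(self, named: Iterable[Tuple[str, List[float]]]) -> Tuple[bool, str]:
        staged: Weights = {}
        for name, values in named:
            if name not in self.weights:
                return False, f"unknown parameter: {name}"
            if len(values) != len(self.weights[name]):
                return False, f"shape mismatch: {name}"
            staged[name] = list(values)
        # Commit only after every entry has been validated.
        self.weights.update(staged)
        return True, "ok"

    def update_weights_from_disk(self, load: Callable[[], Weights]) -> Tuple[bool, str]:
        # `load` stands in for reading a checkpoint; no file I/O in this toy.
        return self._apply(load().items())

    def update_weights_from_tensor(
        self, named: Iterable[Tuple[str, List[float]]]
    ) -> Tuple[bool, str]:
        return self._apply(named)


class ToyWorker(ToyPostTrainingMixin):
    """Hypothetical host class that owns the weights."""

    def __init__(self, weights: Weights) -> None:
        self.weights = dict(weights)
```

Because both entry points share `_apply`, a validation rule added there covers the disk and tensor paths at once, which is the kind of consistency the mixin split is aiming for.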
Modifications
- Add `GPUWorkerPostTrainingMixin`.
- Add `SchedulerPostTrainingMixin`.
- Move the `update_weights_from_disk` and `get_weights_checksum` paths onto the new mixin-based structure.
- `UpdateWeightFromTensorReqInput`
- `POST /update_weights_from_tensor`
- `UpdateWeightFromTensorCheckerReqInput`
- `POST /update_weights_from_tensor_checker`
- `UpdateWeightFromTensorChecker` utility for live transformer verification
- Extend `WeightsUpdater` with:
  - `update_weights_from_tensor`
  - `flattened_bucket` reconstruction
  - `weight_loader`-aware loading

Accuracy Tests
This PR does not modify model forward math or inference kernels. It adds/refactors post-training weight-management and verification paths.
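To make the update-then-verify flow more concrete, here is a minimal sketch of packing named weights into one flat buffer, reconstructing them from offsets, and comparing per-parameter checksums. Everything in it is an assumption for illustration: the helper names (`flatten`, `reconstruct`, `checksum`, `check`), the metadata layout, and the choice of SHA-256 over float32 bytes are not taken from this PR, and plain lists stand in for tensors.

```python
import hashlib
import struct
from typing import Dict, List, Tuple

Meta = List[Tuple[str, int, int]]  # (name, start offset, element count)


def flatten(named: Dict[str, List[float]]) -> Tuple[List[float], Meta]:
    """Concatenate all values into one flat buffer and record where each lives."""
    flat: List[float] = []
    meta: Meta = []
    for name, values in named.items():
        meta.append((name, len(flat), len(values)))
        flat.extend(values)
    return flat, meta


def reconstruct(flat: List[float], meta: Meta) -> Dict[str, List[float]]:
    """Slice the flat buffer back into named weights using the metadata."""
    return {name: flat[start:start + count] for name, start, count in meta}


def checksum(values: List[float]) -> str:
    """Hex digest over little-endian float32 bytes (an assumed encoding)."""
    return hashlib.sha256(struct.pack(f"<{len(values)}f", *values)).hexdigest()


def check(
    live: Dict[str, List[float]], expected: Dict[str, List[float]]
) -> Tuple[bool, List[str]]:
    """Return (all_match, names whose live checksum differs from expected)."""
    mismatched = [
        name
        for name, values in expected.items()
        if name not in live or checksum(live[name]) != checksum(values)
    ]
    return not mismatched, mismatched
```

A usage pass would flatten the weights on the sender side, call `reconstruct` on the receiver, and then run `check`; the mismatch case mirrors the "expected failure" path listed in the results.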
Manual functional validation run in `miles-diffusion`:
- `conda run -n miles-diffusion python /tmp/test_diffusion_post_training_tensor_checker.py`
- `conda run -n miles-diffusion python /tmp/test_diffusion_post_training_server_tensor_checker.py`

Results:
- `update_weights_from_tensor` -> checker: PASS
- `POST /update_weights_from_tensor`: PASS
- `POST /update_weights_from_tensor_checker` success path: PASS
- `POST /update_weights_from_tensor_checker` mismatch path: PASS (expected failure)
- `POST /get_weights_checksum`: PASS

Example server-level validation on `Tongyi-MAI/Z-Image-Turbo`:
- `noise_refiner.0.ffn_norm1.weight`
- `760694bc3805aac30827f46150f505facb973810599961ffd05aa3a6a2fdaa2e`
- `74da79e1a3225a2f25e17572c987ff14197f0c97d810687bdc54e55cb8f9bfa4`

Speed Tests and Profiling
N/A.
This PR only affects diffusion post-training weight update / verification paths and does not change the normal inference hot path. No dedicated speed benchmarking or profiling was run.
Related PRs
This PR builds on the diffusion post-training weight-update work introduced earlier:
- `update_weights_from_disk` for SGLang-D
- `update_weights_from_tensor` pipeline to Diffusion
- `update_weights_from_tensor` checker

If this stack is being split/reviewed incrementally, the local branch also contains these related commits:
- `6cdd0b542` refactor: extract diffusion disk weight update mixins
- `5dfbefac3` feat: add diffusion tensor weight update mixin path
- `75e0cdbb3` refactor: extract diffusion weight checksum mixins
- `cbb0f690f` feat: add diffusion tensor weight checker mixin path

Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`