(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Bucket) #5932
AmineDiro wants to merge 11 commits into
- Add `huggingface-hub` as dependency
- Introduce sparse weight patching via `DeltaWeightTransferEngine`
- Add `ULPChangeDetector` for optimizer-level change tracking
- Add config parameters for delta sync control (repo, anchor interval, checksum verification)
- Support both anchor checkpoints and delta patches via HF Hub (Xet storage)

Add delta weight synchronization support to AsyncGRPO

Implements two-phase delta sync workflow: non-blocking upload to HF Hub while inference continues, followed by a signal to vLLM to fetch and apply. Adds ULP change detection to selectively sync only modified parameters with element-level masks. Simplifies delta engine API by removing anchor/checksum logic; now uses HF Hub directly without intermediate configuration objects.
Remove ULP prediction logic, diagnostic logging config, and checkpoint chain reconstruction. Keep only ground-truth bf16 change detection via optimizer hooks and sparse patch metadata.
- Move anchor/delta decision from trainer to rollout worker
- Remove change detector from streaming iter; only check for validated masks
- Migrate from `HfApi` to `bucket_id` and HF Bucket APIs
- Simplify upload/download paths and remove revision parameter
- Refactor `_send_weights_delta` with clearer empty/non-empty logic
- Retry weight sync requests with exponential backoff (up to 5 attempts)
- Handle environment reset failures gracefully, skipping affected slots
- Handle generation task exceptions without crashing, collecting partial results
- Add environment snapshots to rollout groups for reward computation
- Skip delta updates for missing parameters in weight snapshots
…port, in-place vLLM apply
    logger.info(f"Weight sync: resuming vLLM... (transfer took {t_transfer - t_barrier:.1f}s)")
    # Phase 4: Resume
    logger.info(f"Weight sync: resuming vLLM... (apply took {t_transfer - t_barrier:.1f}s)")
Version bumps after failed apply
High Severity
If signaling vLLM to apply the uploaded patch fails, the trainer logs a warning and continues, but it still increments `model_version` and reports the new version to the rollout worker. Staleness filtering then assumes rollouts match a policy version that vLLM never received.
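One way to close this gap, sketched under assumptions: bump the version only after the apply call reports success. `SyncState`, `finish_weight_sync`, and the `apply_patch` callable are hypothetical names for illustration, not the PR's actual trainer code.

```python
from dataclasses import dataclass


@dataclass
class SyncState:
    """Hypothetical trainer-side bookkeeping; not the PR's actual fields."""
    model_version: int = 0


def finish_weight_sync(state: SyncState, apply_patch) -> bool:
    """Bump the version only after the inference server confirms the apply.

    `apply_patch` is a stand-in callable that returns True on success and
    raises (or returns False) on failure.
    """
    try:
        applied = bool(apply_patch())
    except Exception:
        applied = False
    if applied:
        state.model_version += 1  # rollouts may now be tagged with the new version
    return applied


def failing_apply():
    raise ConnectionError("vLLM did not acknowledge the patch")


state = SyncState()
ok_first = finish_weight_sync(state, failing_apply)   # version stays at 0
ok_second = finish_weight_sync(state, lambda: True)   # version becomes 1
```

With this shape, a failed apply leaves the rollout worker on the previous version, so staleness filtering keeps comparing against weights vLLM actually holds.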
Reviewed by Cursor Bugbot for commit 67a3d17.
    "--host", "0.0.0.0", \
    "--port", "7860", \
    "--worker-extension-cls", "trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension", \
    "--weight-transfer-config", "{\"backend\":\"nccl\"}", \
Space vLLM uses NCCL backend
High Severity
The Wordle Space image registers `DeltaWorkerExtension` but starts vLLM with `"backend":"nccl"`. `async_wordle.py` enables delta bucket sync and HTTP sparse updates, so the trainer and server use incompatible weight-transfer paths.
Reviewed by Cursor Bugbot for commit 67a3d17.
    "--max-model-len", "32768", \
    "--enforce-eager", \
    "--gpu-memory-utilization", "0.8", \
    "--logprobs-mode", "processed_logprobs"]
Space missing transformers impl
High Severity
The vLLM Space entrypoint omits `--model-impl transformers` and `VLLM_USE_V2_MODEL_RUNNER=0`, which the delta sync docs require so that param names match the trainer and sparse in-place apply is available on the V1 runner.
Reviewed by Cursor Bugbot for commit 67a3d17.
…ero-change step can't trigger an early apply
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit 3253247.
    self._delta_model_version += 1
    is_anchor = self._delta_model_version == 1 or self._delta_model_version % self._delta_sync_anchor_interval == 0
    if is_anchor:
        iterator = ((name, tensor, None) for name, tensor, _mask in iterator)  # strip masks -> full tensors
Periodic anchors omit unchanged params
High Severity
After the first sync, `_streaming_iter` only yields parameters the change detector flagged, but periodic anchor uploads only strip the masks and still consume that sparse iterator. vLLM then receives a "dense" checkpoint that is missing the untouched weights, so anchors cannot reset drift from missed sparse updates.
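A minimal sketch of the distinction described here, assuming parameters live in a name-to-tensor mapping: anchors walk every parameter, deltas walk only the flagged ones. `select_sync_items` and its arguments are hypothetical, and plain lists stand in for tensors.

```python
def select_sync_items(all_params, changed_masks, is_anchor):
    """Yield (name, tensor, mask) triples for one sync.

    all_params: dict name -> full tensor (plain lists here).
    changed_masks: dict name -> element mask, only for params flagged as changed.
    """
    if is_anchor:
        # Anchor: walk every parameter, changed or not, and send it dense.
        for name, tensor in all_params.items():
            yield name, tensor, None
    else:
        # Delta: only flagged parameters, each with its element-level mask.
        for name, mask in changed_masks.items():
            yield name, all_params[name], mask


params = {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}
masks = {"w1": [True, False]}  # only w1 changed this step

anchor_names = [name for name, _, _ in select_sync_items(params, masks, is_anchor=True)]
delta_names = [name for name, _, _ in select_sync_items(params, masks, is_anchor=False)]
```

The point is that the anchor branch draws from the full parameter set rather than from the already-filtered iterator, so `w2` is included even though no change was flagged for it.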
Reviewed by Cursor Bugbot for commit 3253247.


What does this PR do?
🔴 This is a follow-up to the earlier `delta-weight-sync` draft (#5417), rebuilt on vLLM's sparse weight-transfer API, GPU-resident extraction, and a clean encoding enum. WIP (needs a rebase on main).

Adds delta (sparse) weight synchronization to the experimental `AsyncGRPO`. Instead of broadcasting the full policy to vLLM on every weight sync, the trainer detects which bf16 weights actually changed after the optimizer step, encodes only those as a sparse safetensors patch, and pushes it to an HF Storage Bucket. On the inference engine side, vLLM applies the patch in place, with no full-model broadcast, using the new `SparseWeightPatch` from vLLM (still waiting for a release).

At steady state, a delta is ~1–10% of the model, and sparsity rises as the LR decays.
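To make the idea concrete, here is a minimal CPU sketch of a sparse patch, with plain Python lists standing in for bf16 tensors. `extract_patch` and `apply_patch` are illustrative names, not functions from this PR; the real path runs on the GPU and serializes to safetensors.

```python
def extract_patch(old, new):
    """Return (idx, val): positions where `new` differs from `old`, and the new values."""
    idx = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    val = [new[i] for i in idx]
    return idx, val


def apply_patch(weights, idx, val):
    """Overwrite only the patched positions, in place."""
    for i, v in zip(idx, val):
        weights[i] = v


trainer_before = [0.5, -1.25, 2.0, 0.0, 3.5]
trainer_after = [0.5, -1.0, 2.0, 0.0, 3.75]   # two elements moved after the step
server_copy = list(trainer_before)             # inference-side weights, one version behind

idx, val = extract_patch(trainer_before, trainer_after)
apply_patch(server_copy, idx, val)
```

Only two index/value pairs cross the wire here instead of five weights; the inference copy ends up identical to the trainer's post-step weights.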
How it works
Trainer side
- `LowByteChangeDetector` hooks into the optimizer and snapshots only the low byte of each weight's bf16 pattern (1 B/elem, half a full clone). A flipped low byte ⊆ a changed bf16 value, so precision is 1.0 by construction; recall (measured) is 1.0 in normal training. Misses cause inference drift, bounded by periodic full anchors.
- `delta_codec.py` does the sparse extraction on the GPU: `extract_sparse_batched` runs a single `nonzero` (one device sync) split across all changed params, instead of one `nonzero` per param (~16× less than the old dense-D2H path).
- Index encodings: `raw` (int32), `gap_delta` (uint16 gaps, 2×), `nvcomp_cascaded` (GPU Cascaded delta+bitpack, ~3×; optional, adds a dependency).
- `DeltaWeightTransferEngine.upload` writes one safetensors patch (anchor = full tensors; delta = `{name}.idx` + `{name}.val`) and pushes it to the bucket. The format is self-describing: names come from the `.val` keys, the encoding from a global header field, and the gap-delta width from the index dtype.

Lifecycle: the trainer drives vLLM's `start_weight_update` / `update_weights` / `finish_weight_update` HTTP routes; the change detector is created in `compute_loss` before the first `optimizer.step`.

The NCCL path is untouched.
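The low-byte check and the gap-delta index encoding described above can be sketched in plain Python. This is an illustration under assumptions, not the PR's GPU implementation: weights are given directly as 16-bit integer bit patterns, all function names are hypothetical, and picking the narrowest unsigned width that fits the largest gap is a guess at what "gap-delta width from the index dtype" implies.

```python
from array import array


def low_bytes(bits):
    """Snapshot the low byte of each 16-bit weight pattern (1 byte per element)."""
    return bytes(b & 0xFF for b in bits)


def changed_indices(snapshot, bits_after):
    """Indices whose low byte differs from the snapshot.

    Every hit is a real change; a change that leaves the low byte
    untouched (only the high byte moved) is missed.
    """
    return [i for i, b in enumerate(bits_after) if (b & 0xFF) != snapshot[i]]


def gap_encode(indices):
    """Store sorted indices as gaps, in the narrowest unsigned width that fits (assumed rule)."""
    gaps = [cur - prev for prev, cur in zip([0] + indices[:-1], indices)]
    typecode = "H" if max(gaps, default=0) <= 0xFFFF else "I"
    return array(typecode, gaps)


def gap_decode(gaps):
    """Rebuild absolute indices with a running sum."""
    out, pos = [], 0
    for g in gaps:
        pos += g
        out.append(pos)
    return out


# Contrived bit patterns: element 0 changes its low byte, element 3 only its high byte.
before = [0x3F80, 0x4000, 0xBF00, 0x3E12]
after = [0x3F81, 0x4000, 0xBF00, 0x3F12]

snapshot = low_bytes(before)
hits = changed_indices(snapshot, after)       # element 3 is a miss
encoded = gap_encode([3, 40_000, 40_010])     # all gaps fit in 16 bits
decoded = gap_decode(encoded)
```

Element 3 is the kind of miss that the periodic anchors are meant to bound: the value changed, but the low byte did not, so the detector never flags it.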
Requirements/constraints
- `main` after v0.22.0, not in a release yet; install from nightly. New `delta_weight_sync` extra added to `pyproject.toml`.
- `--model-impl transformers` and `VLLM_USE_V2_MODEL_RUNNER=0` (`apply_sparse_weight_patches` exists only on the V1 runner 😢). Example:

    CUDA_VISIBLE_DEVICES=1 VLLM_SERVER_DEV_MODE=1 VLLM_USE_V2_MODEL_RUNNER=0 \
    vllm serve Qwen/Qwen3-1.7B \
      --model-impl transformers \
      --worker-extension-cls trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension \
      --weight-transfer-config '{"backend":"delta"}' \
      --max-model-len 2560

Tests
`raw`/`gap_delta`/`nvcomp_cascaded`, and `extract_sparse_batched` decode+apply ~1 s.

References
AI writing disclosure
Note
High Risk
Changes the training–inference weight sync path (sparse detection, bucket I/O, vLLM apply); missed low-byte changes or failed applies can leave stale weights until the next anchor, though NCCL mode is unchanged.
Overview
Adds delta (sparse) weight sync to experimental `AsyncGRPO`: only changed bf16 weights are encoded as safetensors patches, stored in an HF Hub bucket, and applied in place on vLLM (vLLM sparse weight transfer / `DeltaWorkerExtension`). Full NCCL broadcast remains when delta sync is off.

Trainer / worker: `AsyncGRPOConfig` gains `delta_sync_*` flags (repo, anchor interval, index encoding). `LowByteChangeDetector` hooks the optimizer so sync can stream masked params; weight sync becomes upload while inference runs → pause → apply → resume, with periodic anchor full checkpoints between sparse deltas. New modules: `delta_codec` (GPU sparse extract + `raw`/`gap_delta`/`nvcomp` index encoding), `delta_engine` (bucket upload/download + vLLM `"delta"` backend registration), `weight_diff` (metadata + detectors).

Rollout: `upload_weights` / `apply_weights`; reward scoring can pass `environments`; tougher handling for env reset / generation failures.

Packaging / examples: `delta_weight_sync` optional extra in `pyproject.toml`; GSM8K script `async_grpo_delta.py`; OpenEnv Wordle training script plus HF Space Docker assets for vLLM and Wordle env.

Reviewed by Cursor Bugbot for commit 3253247. Bugbot is set up for automated code reviews on this repo.