(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Bucket) by AmineDiro · Pull Request #5932 · huggingface/trl

AmineDiro · 2026-06-03T14:37:11Z

What does this PR do?

🔴 This is a follow-up to the earlier delta-weight-sync draft (#5417), rebuilt on vLLM's sparse
weight-transfer API, GPU-resident extraction, and a clean encoding enum. WIP (needs a rebase on main).

Adds delta (sparse) weight synchronization to the experimental AsyncGRPO. Instead of broadcasting the full policy to vLLM on every weight sync, the trainer detects which bf16 weights actually changed after the optimizer step, encodes only those as a sparse safetensors patch, and pushes it to an HF Storage Bucket. On the inference engine side, vLLM applies the patch in place, with no full-model broadcast. This uses the new SparseWeightPatch from vLLM (still waiting for a release).

At steady state, a delta is ~1–10% of the model, and sparsity rises as the LR decays.
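The description does not say whether "~1–10%" counts changed elements or bytes on the wire. As a rough sizing sketch only, assume each changed element costs one index plus one 2-byte bf16 value, and ignore file headers and any further compression. The function names and the parameter count below are illustrative, not taken from the PR.

```python
def patch_bytes(num_params: int, changed_fraction: float,
                index_bytes: float, value_bytes: int = 2) -> float:
    """Estimated payload of one sparse patch: one index and one bf16 value per changed element."""
    return num_params * changed_fraction * (index_bytes + value_bytes)


def dense_bytes(num_params: int, value_bytes: int = 2) -> int:
    """Payload of a full bf16 copy of the model."""
    return num_params * value_bytes


if __name__ == "__main__":
    n = 1_700_000_000  # illustrative parameter count, not a measured value
    for frac in (0.01, 0.05, 0.10):
        # 4 bytes per index for raw int32, 2 bytes per index for uint16 gaps.
        for name, idx in (("raw int32", 4), ("uint16 gaps", 2)):
            ratio = patch_bytes(n, frac, idx) / dense_bytes(n)
            print(f"{frac:>5.0%} changed, {name:<11}: {ratio:.1%} of a dense sync")
```

Under these assumptions a patch with 4-byte indices stops paying off once more than a third of the elements change, which is one reason the index encoding matters.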

How it works

Trainer side

LowByteChangeDetector hooks into the optimizer and snapshots only the low byte of each weight's bf16 pattern (1 B/elem, half the size of a full clone). A flipped low byte implies a changed bf16 value, so precision is 1.0 by construction; measured recall is 1.0 in normal training. Misses cause inference drift, bounded by periodic full anchors.
delta_codec.py does the sparse extraction on the GPU: extract_sparse_batched runs a single nonzero (one device sync) split across all changed params, instead of one nonzero per param (~16× less than the old dense-D2H path).
Index encodings (the main difference from the previous implementation): raw (int32), gap_delta (uint16 gaps, 2×), nvcomp_cascaded (GPU Cascaded delta + bitpack, ~3×; optional, adds a dependency).
DeltaWeightTransferEngine.upload writes one safetensors patch (anchor = full tensors; delta = {name}.idx + {name}.val) and pushes it to the bucket. The format is self-describing: names come from the .val keys, the encoding from a global header field, and the gap-delta width from the index dtype.
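The low-byte idea can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the PR's LowByteChangeDetector: it derives bf16 bit patterns from float32 with round-to-nearest-even (NaN and inf are not handled), keeps one byte per element, and compares against the full 16-bit diff. All function names here are hypothetical.

```python
import struct


def bf16_bits(x: float) -> int:
    """16-bit bf16 pattern of x, rounding float32 to nearest-even (NaN/inf not handled)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    bias = 0x7FFF + ((u32 >> 16) & 1)
    return ((u32 + bias) >> 16) & 0xFFFF


def snapshot_low_bytes(weights: list[float]) -> bytes:
    """Keep only the low byte of each bf16 pattern: 1 byte per element."""
    return bytes(bf16_bits(w) & 0xFF for w in weights)


def changed_mask(snapshot: bytes, weights: list[float]) -> list[bool]:
    """Flag elements whose low byte differs from the snapshot."""
    return [old != (bf16_bits(w) & 0xFF) for old, w in zip(snapshot, weights)]


before = [1.0, 0.5, -2.0, 1.0]
# Element 1 moves by one bf16 step; element 3 flips sign (high byte only).
after = [1.0, 0.50390625, -2.0, -1.0]

mask = changed_mask(snapshot_low_bytes(before), after)
truth = [bf16_bits(a) != bf16_bits(b) for a, b in zip(before, after)]
print("low-byte mask:", mask)
print("full bf16 diff:", truth)
```

Every flagged element is a real change, which is the precision argument. The sign flip in the last element changes only the high byte and is missed, which is the kind of case the periodic full anchors are there to bound.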

Lifecycle: the trainer drives vLLM's start_weight_update / update_weights / finish_weight_update HTTP routes; the change detector is created in compute_loss before the first optimizer.step.
The NCCL path is untouched.

Requirements/constraints

vLLM with sparse weight transfer ([Frontend][Core] Add sparse NCCL weight transfer support for in-place updates, vllm-project/vllm#40096): merged to main after v0.22.0 but not in a release yet, so install from nightly. A new delta_weight_sync extra is added to pyproject.toml.

Serve with --model-impl transformers and VLLM_USE_V2_MODEL_RUNNER=0 (apply_sparse_weight_patches exists only on the V1 runner 😢 ). Example:

CUDA_VISIBLE_DEVICES=1 VLLM_SERVER_DEV_MODE=1 VLLM_USE_V2_MODEL_RUNNER=0 \
vllm serve Qwen/Qwen3-1.7B \
      --model-impl transformers \
      --worker-extension-cls trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension \
      --weight-transfer-config '{"backend":"delta"}' \
      --max-model-len 2560

🔴🔴 Sparse apply is TP=1 / PP=1 (enforced by vLLM). Dense/small models today; sharded (TP>1 / EP /
fused-MoE) is future work.

Tests

Low-byte detector: recall 1.0 / precision 1.0 vs the full bf16 diff
Codec/file round-trip: bit-exact for raw / gap_delta / nvcomp_cascaded, and extract_sparse_batched
End-to-end AsyncGRPO (Qwen3-1.7B, GSM8K): sparse deltas apply with 0 failures (~91→99% sparse), and reward improves 0.10 → 0.50 through the delta path; on the receiver, anchors are download-bound and vLLM-side decode + apply takes ~1 s.
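A round-trip check for a gap-style index encoding might look like the sketch below. It is not the PR's delta_codec: it assumes sorted flat indices, stores the first index as a gap from zero, and picks a 2-byte width when every gap fits in uint16 and 4 bytes otherwise. The width rule and the function names are assumptions made for illustration.

```python
from itertools import accumulate


def gap_encode(indices: list[int]) -> tuple[list[int], int]:
    """Turn sorted flat indices into gaps and report the byte width needed per gap."""
    gaps = [cur - prev for prev, cur in zip([0] + indices[:-1], indices)]
    if any(g < 0 for g in gaps):
        raise ValueError("indices must be sorted and non-negative")
    width = 2 if max(gaps, default=0) <= 0xFFFF else 4
    return gaps, width


def gap_decode(gaps: list[int]) -> list[int]:
    """Rebuild absolute indices with a running sum over the gaps."""
    return list(accumulate(gaps))


dense_idx = [3, 10, 11, 4000]
sparse_idx = [3, 10, 70_000]

for idx in (dense_idx, sparse_idx, []):
    gaps, width = gap_encode(idx)
    assert gap_decode(gaps) == idx  # bit-exact round trip
    print(f"{len(idx)} indices -> {width}-byte gaps")
```

When changed elements are close together, every gap fits in 2 bytes, half the size of a raw int32 index, which matches the 2× figure quoted for gap_delta.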

References

vLLM sparse weight transfer: [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates vllm-project/vllm#40096
PULSE (sparse delta / selective overwrite): https://huggingface.co/papers/2602.03839

AI writing disclosure

AI-assisted: parts were suggested/iterated with an AI tool, written and reviewed by a human.

Note

High Risk
Changes the training–inference weight sync path (sparse detection, bucket I/O, vLLM apply); missed low-byte changes or failed applies can leave stale weights until the next anchor, though NCCL mode is unchanged.

Overview
Adds delta (sparse) weight sync to experimental AsyncGRPO: only changed bf16 weights are encoded as safetensors patches, stored in an HF Hub bucket, and applied in place on vLLM (vLLM sparse weight transfer / DeltaWorkerExtension). Full NCCL broadcast remains when delta sync is off.

Trainer / worker: AsyncGRPOConfig gains delta_sync_* flags (repo, anchor interval, index encoding). LowByteChangeDetector hooks the optimizer so sync can stream masked params; weight sync becomes upload while inference runs → pause → apply → resume, with periodic anchor full checkpoints between sparse deltas. New modules: delta_codec (GPU sparse extract + raw / gap_delta / nvcomp index encoding), delta_engine (bucket upload/download + vLLM "delta" backend registration), weight_diff (metadata + detectors).

Rollout: upload_weights / apply_weights; reward scoring can pass environments; tougher handling for env reset / generation failures.

Packaging / examples: delta_weight_sync optional extra in pyproject.toml; GSM8K script async_grpo_delta.py; OpenEnv Wordle training script plus HF Space Docker assets for vLLM and Wordle env.

Reviewed by Cursor Bugbot for commit 3253247.

- Add `huggingface-hub` as dependency
- Introduce sparse weight patching via `DeltaWeightTransferEngine`
- Add `ULPChangeDetector` for optimizer-level change tracking
- Add config parameters for delta sync control (repo, anchor interval, checksum verification)
- Support both anchor checkpoints and delta patches via HF Hub (Xet storage)

Add delta weight synchronization support to AsyncGRPO

Implements two-phase delta sync workflow: non-blocking upload to HF Hub while inference continues, followed by a signal to vLLM to fetch and apply. Adds ULP change detection to selectively sync only modified parameters with element-level masks. Simplifies delta engine API by removing anchor/checksum logic; now uses HF Hub directly without intermediate configuration objects.

Remove ULP prediction logic, diagnostic logging config, and checkpoint chain reconstruction. Keep only ground-truth bf16 change detection via optimizer hooks and sparse patch metadata.

- Move anchor/delta decision from trainer to rollout worker
- Remove change detector from streaming iter; only check for validated masks
- Migrate from HfApi to bucket_id and HF Bucket APIs
- Simplify upload/download paths and remove revision parameter
- Refactor _send_weights_delta with clearer empty/non-empty logic

- Retry weight sync requests with exponential backoff (up to 5 attempts)
- Handle environment reset failures gracefully, skipping affected slots
- Handle generation task exceptions without crashing, collecting partial results
- Add environment snapshots to rollout groups for reward computation
- Skip delta updates for missing parameters in weight snapshots

cursor · 2026-06-03T14:39:20Z


-        logger.info(f"Weight sync: resuming vLLM... (transfer took {t_transfer - t_barrier:.1f}s)")
+        # Phase 4: Resume
+        logger.info(f"Weight sync: resuming vLLM... (apply took {t_transfer - t_barrier:.1f}s)")


Version bumps after failed apply

High Severity

If signaling vLLM to apply the uploaded patch fails, the trainer logs a warning and continues, but still increments model_version and tells the rollout worker. Staleness filtering then assumes rollouts match a policy version vLLM never received.
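One way to address this, sketched under assumptions: retry the apply signal with exponential backoff and advance the version only when an attempt succeeds. The helper names, the `apply_patch` callable, and the delay values are hypothetical and do not come from the trainer code.

```python
import time
from typing import Callable


def apply_with_retry(apply_patch: Callable[[], None], attempts: int = 5,
                     base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call apply_patch until it succeeds; return False if every attempt raised."""
    for attempt in range(attempts):
        try:
            apply_patch()
            return True
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False


def next_model_version(current: int, applied: bool) -> int:
    """Advance the version only when the server confirmed the patch."""
    return current + 1 if applied else current


def always_down() -> None:
    raise ConnectionError("vLLM did not accept the patch")


# The version stays at 7 because no attempt succeeded.
version = next_model_version(
    7, apply_with_retry(always_down, attempts=2, sleep=lambda _s: None)
)
print(version)
```

Keeping the version unchanged on failure means staleness filtering keeps comparing rollouts against the last policy the server is known to hold.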

Reviewed by Cursor Bugbot for commit 67a3d17.

cursor · 2026-06-03T14:39:20Z

+    "--host", "0.0.0.0", \
+    "--port", "7860", \
+    "--worker-extension-cls", "trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension", \
+    "--weight-transfer-config", "{\"backend\":\"nccl\"}", \


Space vLLM uses NCCL backend

High Severity

The Wordle Space image registers DeltaWorkerExtension but starts vLLM with {"backend":"nccl"}. async_wordle.py enables delta bucket sync and HTTP sparse updates, so the trainer and server use incompatible weight-transfer paths.

Reviewed by Cursor Bugbot for commit 67a3d17.

cursor · 2026-06-03T14:39:20Z

+    "--max-model-len", "32768", \
+    "--enforce-eager", \
+    "--gpu-memory-utilization", "0.8", \
+    "--logprobs-mode", "processed_logprobs"]


Space missing transformers impl

High Severity

The vLLM Space entrypoint omits --model-impl transformers and VLLM_USE_V2_MODEL_RUNNER=0, which the delta sync docs require so param names match the trainer and sparse in-place apply is available on the V1 runner.
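Both Space findings are launch-configuration mismatches, so a small preflight check could catch them before training starts. The sketch below is a hypothetical helper, not part of TRL or vLLM; it only inspects an argv list and an environment mapping for the three settings named in the PR description.

```python
import json


def check_delta_launch(argv: list[str], env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the launch looks delta-ready."""

    def value_of(flag: str):
        # Value following `flag`, or None when the flag is absent or last.
        if flag in argv:
            pos = argv.index(flag) + 1
            if pos < len(argv):
                return argv[pos]
        return None

    problems = []
    if value_of("--model-impl") != "transformers":
        problems.append("--model-impl transformers is missing")

    raw_cfg = value_of("--weight-transfer-config")
    try:
        cfg = json.loads(raw_cfg) if raw_cfg else {}
    except json.JSONDecodeError:
        cfg = {}
    backend = cfg.get("backend") if isinstance(cfg, dict) else None
    if backend != "delta":
        problems.append(f"weight-transfer backend is {backend!r}, expected 'delta'")

    if env.get("VLLM_USE_V2_MODEL_RUNNER") != "0":
        problems.append("VLLM_USE_V2_MODEL_RUNNER=0 is not set")
    return problems


# Arguments modeled on the Space entrypoint quoted in the review.
space_argv = ["--host", "0.0.0.0", "--port", "7860",
              "--weight-transfer-config", '{"backend":"nccl"}']
for problem in check_delta_launch(space_argv, env={}):
    print("launch problem:", problem)
```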

Reviewed by Cursor Bugbot for commit 67a3d17.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 3253247.

cursor · 2026-06-03T14:51:58Z

+        self._delta_model_version += 1
+        is_anchor = self._delta_model_version == 1 or self._delta_model_version % self._delta_sync_anchor_interval == 0
+        if is_anchor:
+            iterator = ((name, tensor, None) for name, tensor, _mask in iterator)  # strip masks -> full tensors


Periodic anchors omit unchanged params

High Severity

After the first sync, _streaming_iter only yields parameters the change detector flagged, but periodic anchor uploads only strip masks and still consume that sparse iterator. vLLM then receives a dense checkpoint missing untouched weights, so anchors cannot reset drift from missed sparse updates.
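A possible shape for the fix, as a sketch only: choose the source of parameters based on the anchor decision, so an anchor walks every parameter and a delta walks only the flagged ones. Plain lists stand in for tensors and masks, and `patch_entries`, `all_params`, and `changed` are hypothetical names rather than the trainer's actual API.

```python
from typing import Iterator, Optional, Tuple

Entry = Tuple[str, list, Optional[list]]  # (name, tensor, mask or None)


def is_anchor(version: int, interval: int) -> bool:
    """Same rule as the diff above: first sync, then every `interval` versions."""
    return version == 1 or version % interval == 0


def patch_entries(version: int, interval: int,
                  all_params: dict[str, list],
                  changed: dict[str, list]) -> Iterator[Entry]:
    """Yield full tensors for every parameter on anchors, masked tensors otherwise."""
    if is_anchor(version, interval):
        for name, tensor in all_params.items():
            yield name, tensor, None  # no mask: send the whole tensor
    else:
        for name, mask in changed.items():
            yield name, all_params[name], mask


params = {"embed": [0.1, 0.2], "mlp": [0.3], "head": [0.4]}
flagged = {"mlp": [True]}

print([name for name, _, _ in patch_entries(2, 4, params, flagged)])  # delta step
print([name for name, _, _ in patch_entries(4, 4, params, flagged)])  # anchor step
```

With this split, an anchor no longer depends on what the change detector flagged, so it can overwrite weights that an earlier sparse update missed.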

Additional Locations (1)

trl/experimental/async_grpo/async_grpo_trainer.py#L586-L601

Reviewed by Cursor Bugbot for commit 3253247.

AmineDiro and others added 10 commits March 31, 2026 09:46

Simplify weight diff detector and remove unused features

592734f

Remove ULP prediction logic, diagnostic logging config, and checkpoint chain reconstruction. Keep only ground-truth bf16 change detection via optimizer hooks and sparse patch metadata.

Remove unnecessary section comments from delta_engine.py

9f2e55f

Merge branch 'main' into delta-weight-sync

a7f0a86

working spaces openenv

d6504b7

Add delta weight sync to AsyncGRPO: GPU sparse codec, HF bucket trans…

019ac73

…port, in-place vLLM apply

Add AsyncGRPO delta weight sync example

67a3d17

AmineDiro requested review from McPatate, lewtun and qgallouedec June 3, 2026 14:37

cursor Bot reviewed Jun 3, 2026

Make delta sync phases explicit (upload_weights/apply_weights) so a z…

3253247

…ero-change step can't trigger an early apply

cursor Bot reviewed Jun 3, 2026

AmineDiro changed the title ~~(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Storage Bucket)~~ (WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Bucket) Jun 3, 2026

AmineDiro closed this Jun 4, 2026

AmineDiro wants to merge 11 commits into main from delta-weight-sync-v2
