
(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Bucket)#5932

Closed
AmineDiro wants to merge 11 commits into main from delta-weight-sync-v2

@AmineDiro AmineDiro commented Jun 3, 2026

What does this PR do?

🔴 This is a follow-up to the earlier delta-weight-sync draft (#5417), rebuilt on vLLM's sparse
weight-transfer API, GPU-resident extraction, and a clean encoding enum. WIP (needs a rebase on main).

Adds delta (sparse) weight synchronization to the experimental AsyncGRPO. Instead of broadcasting the full policy to vLLM on every weight sync, the trainer detects which bf16 weights actually changed after the optimizer step, encodes only those as a sparse safetensors patch, and pushes it to an HF Storage Bucket. On the inference side, vLLM applies the patch in place, with no full-model broadcast, using the new SparseWeightPatch from vLLM (not yet in a release).

At steady state, a delta is ~1–10% of the model, and sparsity rises as the LR decays.
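
As a rough illustration of what these fractions mean in bytes, the sketch below estimates patch size from the share of changed elements. It assumes 2-byte bf16 values, 4-byte `raw` indices, and 2-byte `gap_delta` indices as described in this PR. The `patch_bytes` helper and the round 1.7B parameter count are illustrative assumptions, and safetensors header overhead is ignored.

```python
def patch_bytes(num_params: int, changed_percent: int, index_bytes: int) -> int:
    """Approximate sparse-patch payload: one index plus one bf16 value per changed element."""
    changed = num_params * changed_percent // 100  # integer math avoids float rounding
    return changed * (index_bytes + 2)             # 2 bytes per bf16 value

NUM_PARAMS = 1_700_000_000   # assumed round figure, not a measured count
full_bytes = NUM_PARAMS * 2  # dense bf16 weights

for percent in (1, 10):
    raw = patch_bytes(NUM_PARAMS, percent, index_bytes=4)
    gap = patch_bytes(NUM_PARAMS, percent, index_bytes=2)
    print(f"{percent}% changed: raw={raw / 1e9:.2f} GB, "
          f"gap_delta={gap / 1e9:.2f} GB, full={full_bytes / 1e9:.2f} GB")
```

Under these assumptions the index array is a large share of the patch, which is why the index encoding matters.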

How it works

Trainer side

  • LowByteChangeDetector hooks into the optimizer and snapshots only the low byte of each weight's bf16 pattern (1 B/elem, half the size of a full clone). A flipped low byte implies a changed bf16 value, so precision is 1.0 by construction; measured recall is 1.0 in normal training. Any miss causes inference drift, which is bounded by periodic full anchors.
  • delta_codec.py does the sparse extraction on the GPU: extract_sparse_batched runs a single nonzero (and therefore a single device sync) over all changed params and splits the result per param, instead of one nonzero per param (~16× less than the old dense-D2H path).
  • Index encodings (the main difference from the previous implementation): raw (int32), gap_delta (uint16 gaps, 2×), nvcomp_cascaded (GPU Cascaded delta + bitpack, ~3×, optional because it adds a dependency).
  • DeltaWeightTransferEngine.upload writes one safetensors patch (anchor = full tensors; delta = {name}.idx + {name}.val) and pushes it to the bucket. The format is self-describing: names come from the .val keys, the encoding from a global header field, and the gap-delta width from the index dtype.
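
The low-byte idea in the first bullet can be sketched on the CPU with plain Python. This is not the PR's `LowByteChangeDetector`: the helper names are hypothetical, bf16 patterns are obtained by truncating float32 (an assumption about rounding), and the real detector works on GPU tensors inside an optimizer hook.

```python
import struct

def bf16_bits(x: float) -> int:
    """16-bit bf16 pattern of x, obtained by truncating its float32 encoding."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16

def snapshot_low_bytes(weights: list[float]) -> bytes:
    """Keep 1 byte per element: the low byte of each bf16 pattern."""
    return bytes(bf16_bits(w) & 0xFF for w in weights)

def low_byte_changed(snapshot: bytes, weights: list[float]) -> list[bool]:
    """True where the current low byte differs from the snapshot."""
    return [old != (bf16_bits(w) & 0xFF) for old, w in zip(snapshot, weights)]

before = [1.0, -0.5, 0.25, 1.0]
after = [1.0, -0.5078125, 0.25, 4.0]

mask = low_byte_changed(snapshot_low_bytes(before), after)
truth = [bf16_bits(a) != bf16_bits(b) for a, b in zip(before, after)]

print(mask)   # flags element 1 only
print(truth)  # the full bf16 diff also flags element 3
```

Every flagged element really changed, which is the precision guarantee. Element 3 (1.0 to 4.0) changes only the high byte and is missed; such a jump in one step is contrived, and it only shows why recall is measured rather than guaranteed and why periodic anchors are kept.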

Lifecycle: the trainer drives vLLM's start_weight_update / update_weights / finish_weight_update HTTP routes; the change detector is created in compute_loss before the first optimizer.step.
The NCCL path is untouched.
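
The ordering of that lifecycle can be sketched with injected callables. Only the three route names come from this PR; the `sync_once` helper, the `upload` and `post` callables, the payload shapes, and the patch file name are hypothetical placeholders rather than the vLLM or TRL API.

```python
from typing import Callable

def sync_once(
    upload: Callable[[bytes], dict],
    post: Callable[[str, dict], None],
    patch: bytes,
) -> None:
    """One delta sync: upload first, then walk the three weight-update routes in order."""
    patch_ref = upload(patch)          # pushed to the bucket while inference keeps running
    post("start_weight_update", {})    # empty payload is a placeholder
    post("update_weights", patch_ref)  # payload shape is hypothetical
    post("finish_weight_update", {})

log = []
sync_once(
    upload=lambda blob: log.append("upload") or {"patch": "delta-000003.safetensors"},
    post=lambda route, payload: log.append(route),
    patch=b"...",
)
print(log)
```

Passing the transport in as a callable keeps the ordering testable without a running server.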

Requirements/constraints

  • vLLM with sparse weight transfer (vllm-project/vllm#40096, "[Frontend][Core] Add sparse NCCL weight transfer support for in-place updates"), which was merged to main after v0.22.0 and is not in a release yet; install from nightly. A new delta_weight_sync extra is added to pyproject.toml.
  • Serve with --model-impl transformers and VLLM_USE_V2_MODEL_RUNNER=0 (apply_sparse_weight_patches exists only on the V1 runner 😢 ). Example:
    CUDA_VISIBLE_DEVICES=1 VLLM_SERVER_DEV_MODE=1 VLLM_USE_V2_MODEL_RUNNER=0 \
    vllm serve Qwen/Qwen3-1.7B \
          --model-impl transformers \
          --worker-extension-cls trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension \
          --weight-transfer-config '{"backend":"delta"}' \
          --max-model-len 2560

🔴🔴 Sparse apply is TP=1 / PP=1 (enforced by vLLM). Dense/small models today; sharded (TP>1 / EP /
fused-MoE) is future work.

Tests

  • Low-byte detector: recall 1.0 / precision 1.0 vs the full bf16 diff
  • Codec/file round-trip: bit-exact for raw / gap_delta / nvcomp_cascaded, and extract_sparse_batched
  • End-to-end AsyncGRPO (Qwen3-1.7B, GSM8K): sparse deltas apply with 0 failures (~91→99% sparse), and reward improves 0.10 → 0.50 through the delta path. Receiver timing: anchors are download-bound; vLLM-side decode + apply takes ~1 s.
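
A bit-exact round-trip check of the kind listed above can be sketched for a gap-delta index encoding. This is an illustration, not the code in `delta_codec.py`: `gap_encode` and `gap_decode` are hypothetical names, and falling back to a wider type when a gap exceeds 65535 is an assumption suggested by the "gap-delta width from the index dtype" note.

```python
from array import array
from itertools import accumulate

def gap_encode(indices: list[int]) -> array:
    """Encode sorted flat indices as gaps (the first gap is the first index).

    Uses 2-byte gaps when every gap fits, otherwise a wider unsigned type.
    """
    gaps = [cur - prev for prev, cur in zip([0] + indices[:-1], indices)]
    typecode = "H" if max(gaps, default=0) <= 0xFFFF else "I"
    return array(typecode, gaps)

def gap_decode(gaps: array) -> list[int]:
    """Recover the absolute indices with a running sum."""
    return list(accumulate(gaps))

dense = [3, 10, 11, 500]    # every gap fits in 16 bits
wide = [3, 10, 11, 70_000]  # one gap exceeds 65535

for indices in (dense, wide):
    encoded = gap_encode(indices)
    assert gap_decode(encoded) == indices  # round trip is exact
    print(encoded.typecode, list(encoded))
```

The decoder needs no side channel for the width because the array's type already carries it.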

References

AI writing disclosure

  • AI-assisted: parts were suggested/iterated with an AI tool, written and reviewed by a human.

Note

High Risk
Changes the training–inference weight sync path (sparse detection, bucket I/O, vLLM apply); missed low-byte changes or failed applies can leave stale weights until the next anchor, though NCCL mode is unchanged.

Overview
Adds delta (sparse) weight sync to experimental AsyncGRPO: only changed bf16 weights are encoded as safetensors patches, stored in an HF Hub bucket, and applied in place on vLLM (vLLM sparse weight transfer / DeltaWorkerExtension). Full NCCL broadcast remains when delta sync is off.

Trainer / worker: AsyncGRPOConfig gains delta_sync_* flags (repo, anchor interval, index encoding). LowByteChangeDetector hooks the optimizer so sync can stream masked params; weight sync becomes upload while inference runs → pause → apply → resume, with periodic anchor full checkpoints between sparse deltas. New modules: delta_codec (GPU sparse extract + raw / gap_delta / nvcomp index encoding), delta_engine (bucket upload/download + vLLM "delta" backend registration), weight_diff (metadata + detectors).

Rollout: upload_weights / apply_weights; reward scoring can pass environments; tougher handling for env reset / generation failures.

Packaging / examples: delta_weight_sync optional extra in pyproject.toml; GSM8K script async_grpo_delta.py; OpenEnv Wordle training script plus HF Space Docker assets for vLLM and Wordle env.

Reviewed by Cursor Bugbot for commit 3253247. Bugbot is set up for automated code reviews on this repo. Configure here.

AmineDiro and others added 10 commits March 31, 2026 09:46
- Add `huggingface-hub` as dependency
- Introduce sparse weight patching via `DeltaWeightTransferEngine`
- Add `ULPChangeDetector` for optimizer-level change tracking
- Add config parameters for delta sync control (repo, anchor interval, checksum verification)
- Support both anchor checkpoints and delta patches via HF Hub (Xet storage)
Add delta weight synchronization support to AsyncGRPO

Implements two-phase delta sync workflow: non-blocking upload to HF Hub while inference continues, followed by a signal to vLLM to fetch and apply. Adds ULP change detection to selectively sync only modified parameters with element-level masks. Simplifies delta engine API by removing anchor/checksum logic; now uses HF Hub directly without intermediate configuration objects.
Remove ULP prediction logic, diagnostic logging config, and checkpoint
chain reconstruction. Keep only ground-truth bf16 change detection via
optimizer hooks and sparse patch metadata.
- Move anchor/delta decision from trainer to rollout worker
- Remove change detector from streaming iter; only check for validated
  masks
- Migrate from HfApi to bucket_id and HF Bucket APIs
- Simplify upload/download paths and remove revision parameter
- Refactor _send_weights_delta with clearer empty/non-empty logic
- Retry weight sync requests with exponential backoff (up to 5 attempts)
- Handle environment reset failures gracefully, skipping affected slots
- Handle generation task exceptions without crashing, collecting partial
  results
- Add environment snapshots to rollout groups for reward computation
- Skip delta updates for missing parameters in weight snapshots
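
The retry behaviour named in the commit list (exponential backoff, up to 5 attempts) might look roughly like the sketch below. The helper name, the base delay, and the injected `sleep` are assumptions for illustration; the PR's actual retry code is not shown on this page.

```python
import time

def retry_with_backoff(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake sleep so nothing actually waits.
delays = []
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "ok"])

def flaky_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

Re-raising on the final attempt lets the caller decide whether a failed sync is fatal.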
Comment thread trl/experimental/async_grpo/async_rollout_worker.py Outdated

logger.info(f"Weight sync: resuming vLLM... (transfer took {t_transfer - t_barrier:.1f}s)")
# Phase 4: Resume
logger.info(f"Weight sync: resuming vLLM... (apply took {t_transfer - t_barrier:.1f}s)")

Version bumps after failed apply

High Severity

If signaling vLLM to apply the uploaded patch fails, the trainer logs a warning and continues, but still increments model_version and tells the rollout worker. Staleness filtering then assumes rollouts match a policy version vLLM never received.
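
One way to address this, sketched under assumptions: bump the version only after the apply signal succeeds. `SyncState` and `signal_apply` are hypothetical names and do not reflect the trainer's real attributes beyond `model_version`.

```python
class SyncState:
    """Tracks the policy version the inference server is known to hold."""

    def __init__(self) -> None:
        self.model_version = 0

    def sync(self, signal_apply) -> bool:
        """Advance model_version only if the apply signal succeeds."""
        try:
            signal_apply()
        except Exception as exc:
            print(f"apply failed, keeping version {self.model_version}: {exc}")
            return False
        self.model_version += 1
        return True

def failing_apply():
    raise RuntimeError("vLLM unreachable")

state = SyncState()
first = state.sync(lambda: None)   # apply succeeds
second = state.sync(failing_apply) # apply fails, version must not move
print(first, second, state.model_version)
```

With this ordering, staleness filtering never sees a version the server did not receive.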


Reviewed by Cursor Bugbot for commit 67a3d17. Configure here.

"--host", "0.0.0.0", \
"--port", "7860", \
"--worker-extension-cls", "trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension", \
"--weight-transfer-config", "{\"backend\":\"nccl\"}", \

Space vLLM uses NCCL backend

High Severity

The Wordle Space image registers DeltaWorkerExtension but starts vLLM with "backend":"nccl". async_wordle.py enables delta bucket sync and HTTP sparse updates, so the trainer and server use incompatible weight-transfer paths.


Reviewed by Cursor Bugbot for commit 67a3d17. Configure here.

"--max-model-len", "32768", \
"--enforce-eager", \
"--gpu-memory-utilization", "0.8", \
"--logprobs-mode", "processed_logprobs"]

Space missing transformers impl

High Severity

The vLLM Space entrypoint omits --model-impl transformers and VLLM_USE_V2_MODEL_RUNNER=0, which the delta sync docs require so param names match the trainer and sparse in-place apply is available on the V1 runner.


Reviewed by Cursor Bugbot for commit 67a3d17. Configure here.

…ero-change step can't trigger an early apply

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit 3253247. Configure here.

self._delta_model_version += 1
is_anchor = self._delta_model_version == 1 or self._delta_model_version % self._delta_sync_anchor_interval == 0
if is_anchor:
iterator = ((name, tensor, None) for name, tensor, _mask in iterator) # strip masks -> full tensors

Periodic anchors omit unchanged params

High Severity

After the first sync, _streaming_iter only yields parameters the change detector flagged, but periodic anchor uploads only strip masks and still consume that sparse iterator. vLLM then receives a dense checkpoint missing untouched weights, so anchors cannot reset drift from missed sparse updates.
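
A possible shape for the fix, as a sketch: keep the anchor schedule from the snippet above, but build anchors from the full parameter set rather than from the sparse iterator. `build_payload` and its arguments are hypothetical, and plain lists stand in for tensors and masks.

```python
def is_anchor(version: int, interval: int) -> bool:
    """Same schedule as the reviewed snippet: first sync, then every `interval` syncs."""
    return version == 1 or version % interval == 0

def build_payload(version, interval, all_params, changed_masks):
    """Anchors carry every parameter with no mask; deltas carry only flagged ones."""
    if is_anchor(version, interval):
        return [(name, tensor, None) for name, tensor in all_params.items()]
    return [(name, all_params[name], mask) for name, mask in changed_masks.items()]

all_params = {"embed": [0.1, 0.2], "lm_head": [0.3, 0.4]}
changed_masks = {"lm_head": [True, False]}  # the detector flagged one parameter

delta = build_payload(3, 4, all_params, changed_masks)
anchor = build_payload(4, 4, all_params, changed_masks)
print([name for name, _, _ in delta])
print([name for name, _, _ in anchor])
```

Because the anchor branch ignores the detector output, an anchor can reset drift even for weights the detector never flagged.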


Reviewed by Cursor Bugbot for commit 3253247. Configure here.

@AmineDiro AmineDiro changed the title from "(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Storage Bucket)" to "(WIP) Delta weight sync for AsyncGRPO (sparse patches over an HF Bucket)" on Jun 3, 2026
@AmineDiro AmineDiro closed this Jun 4, 2026