[algo] feat: add GRPO-Guard support for Qwen-Image training by zhtmike · Pull Request #48 · verl-project/verl-omni

zhtmike · 2026-04-30T09:17:33Z

What does this PR do?

Adds GRPO-Guard support for Qwen-Image training.

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, vllm_omni, rollout, trainer, ci, training_utils, recipe, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, diffusion, omni, tests, docker
- If this PR involves multiple modules, separate them with , like [diffusion, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][diffusion, fsdp] feat: new rollout scheduler

Test

The pg_clipfrac_lower and pg_clipfrac_higher metrics are now symmetric, as expected.
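To make the symmetry claim concrete, here is a minimal sketch of how two such clip fractions could be computed. It assumes, without confirmation from this PR, that the metrics count importance ratios falling below 1 - eps and above 1 + eps; the helper name `clip_fractions`, the `clip_eps` value, and the sample ratios are all hypothetical.

```python
def clip_fractions(ratios, clip_eps=0.2):
    """Return (fraction below 1 - eps, fraction above 1 + eps).

    Assumption: this mirrors what pg_clipfrac_lower / pg_clipfrac_higher
    measure; the actual definitions in the repository may differ.
    """
    n = len(ratios)
    lower = sum(r < 1.0 - clip_eps for r in ratios) / n
    higher = sum(r > 1.0 + clip_eps for r in ratios) / n
    return lower, higher


# Illustrative ratios only (not measured data).
left_shifted = [0.70, 0.75, 0.90, 1.00, 1.05, 1.10]  # mass piled up below 1
centered = [0.70, 0.90, 1.00, 1.00, 1.10, 1.30]      # balanced around 1

shifted_fracs = clip_fractions(left_shifted)
centered_fracs = clip_fractions(centered)
print("left-shifted:", shifted_fracs)
print("centered:    ", centered_fracs)
```

With a left-shifted ratio distribution only the lower bound is ever hit; once the ratios are centered around 1, both bounds are hit equally often, which is the behaviour the curves are meant to show.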

[Figure: critic metrics]

[Figure: validation score]

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...

gemini-code-assist

Code Review

This pull request implements the GRPO-Guard algorithm, an extension of Flow-GRPO designed to stabilize importance-ratio estimates in policy loss. The changes include the core implementation of the grpo_guard loss function, updates to the diffusion scheduler and training adapters to support the required sqrt_dt and proposal mean drift terms, and the addition of comprehensive documentation, example scripts, and unit tests. I have no feedback to provide as there were no review comments to evaluate.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Cheung Ka Wai <zhtmike@gmail.com>

Signed-off-by: Cheung Ka Wai <zhtmike@gmail.com>

SamitHuang · 2026-05-11T09:17:30Z

@gemini review

SamitHuang · 2026-05-11T09:18:45Z

Why do the critic reward mean and the validation reward drop sharply at around step 125?

gemini-code-assist

Code Review

This pull request implements the GRPO-Guard algorithm, an extension of Flow-GRPO designed to stabilize importance-ratio estimates in diffusion-model RL. The implementation includes the core loss function in diffusion_algos.py, updates to the training pipeline and schedulers to handle additional parameters like sqrt_dt and old_prev_sample_mean, and the addition of new metrics, documentation, and a Qwen-Image OCR training example. Review feedback suggests refactoring the diffusion_loss utility to reduce tight coupling with specific algorithms by passing arguments more generically.
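For context, the quantities named in this summary are enough to evaluate a per-step Gaussian log-ratio. The sketch below is not taken from diffusion_algos.py: it assumes a one-dimensional Gaussian transition whose standard deviation is std_dev_t * sqrt_dt, and the function names and numbers are hypothetical. It illustrates that when samples come from the old policy, the expected log-ratio is negative, namely -(new_mean - old_mean)^2 / (2 * std^2), so the ratio distribution is not centered at 1 without some correction.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    return (
        -((x - mean) ** 2) / (2 * std**2)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


def mean_log_ratio(old_mean, new_mean, std_dev_t, sqrt_dt, n=20000, seed=0):
    """Monte Carlo estimate of E[log p_new(x) - log p_old(x)], x ~ p_old."""
    rng = random.Random(seed)
    std = std_dev_t * sqrt_dt  # assumed per-step transition std
    total = 0.0
    for _ in range(n):
        x = old_mean + std * rng.gauss(0.0, 1.0)  # sample drawn at rollout time
        total += gaussian_log_prob(x, new_mean, std) - gaussian_log_prob(x, old_mean, std)
    return total / n


estimate = mean_log_ratio(old_mean=0.0, new_mean=0.05, std_dev_t=0.5, sqrt_dt=0.2)
expected = -(0.05**2) / (2 * (0.5 * 0.2) ** 2)  # closed form under the same assumptions
print(f"monte carlo: {estimate:.4f}, closed form: {expected:.4f}")
```

The bias grows as std_dev_t * sqrt_dt shrinks, which is one plausible reason a loss would need the proposal mean and the per-step coefficients as explicit inputs.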

gemini-code-assist · 2026-05-11T09:23:09Z

+    if loss_mode == "grpo_guard":
+        # GRPO-Guard requires the rollout-time SDE proposal mean and the per-step
+        # diffusion coefficient terms; pass them through alongside the standard inputs.
+        policy_loss_kwargs.update(
+            old_prev_sample_mean=data["old_prev_sample_mean"],
+            prev_sample_mean=model_output["prev_sample_mean"],
+            std_dev_t=model_output["std_dev_t"],
+            sqrt_dt=model_output["sqrt_dt"],
+        )


The explicit check for grpo_guard to pass extra arguments makes this utility function tightly coupled with specific algorithm implementations. It would be more maintainable to pass all available keys from data and model_output as keyword arguments to the registered loss function, allowing the registry to handle the signature matching.

Agreed, it has been noted. We will refactor it once the algorithm becomes more complex.
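One way the suggested refactor could look is sketched below. This is an illustration only, not code from the PR: `call_with_matching_kwargs` and the two toy loss functions are hypothetical, and the real registered losses have different bodies and possibly different signatures. The idea is to merge everything available from data and model_output and let signature inspection decide which keys each loss receives.

```python
import inspect


def call_with_matching_kwargs(fn, available):
    """Call fn with only the keyword arguments its signature accepts."""
    params = inspect.signature(fn).parameters
    # A function that declares **kwargs can take everything as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(**available)
    accepted = {name: value for name, value in available.items() if name in params}
    return fn(**accepted)


# Toy stand-ins for registered losses (hypothetical bodies).
def flow_grpo_loss(log_prob, old_log_prob, advantages):
    return "flow_grpo", 3


def grpo_guard_loss(log_prob, old_log_prob, advantages,
                    prev_sample_mean, old_prev_sample_mean, std_dev_t, sqrt_dt):
    return "grpo_guard", 7


data = {"old_log_prob": -1.0, "advantages": 0.5, "old_prev_sample_mean": 0.0}
model_output = {"log_prob": -0.9, "prev_sample_mean": 0.1, "std_dev_t": 0.5, "sqrt_dt": 0.2}
available = {**data, **model_output}

flow_result = call_with_matching_kwargs(flow_grpo_loss, available)
guard_result = call_with_matching_kwargs(grpo_guard_loss, available)
print(flow_result, guard_result)
```

The trade-off is that a misspelled key is silently dropped rather than raising at the call site, so a variant that validates required parameters may be preferable in practice.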

zhtmike · 2026-05-11T09:26:40Z

Why do the critic reward mean and the validation reward drop sharply at around step 125?

I think it is because nearly all rewards are close to 1 by the end of training. The reward signal then diminishes, the GRPO group std goes to 0, and training becomes unstable.

Possible mitigations: 1. use a larger PPO batch size (more GPUs) to provide an effective reward signal; 2. use a harder reward that does not saturate so easily; 3. add a KL term.
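The saturation effect described above can be reproduced with a few lines. This is a minimal sketch assuming the usual group-relative normalization (reward minus group mean, divided by group std plus a small epsilon); the function name, the epsilon value, the use of the population std, and the reward values are illustrative rather than taken from the repository.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_advantages([0.2, 0.5, 0.9, 1.0])        # healthy spread of rewards
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])    # every sample scores 1
nearly = group_advantages([1.0, 1.0, 1.0, 0.999])     # one sample is 0.001 lower

print("mixed:    ", [round(a, 3) for a in mixed])
print("saturated:", saturated)
print("nearly:   ", [round(a, 3) for a in nearly])
```

A fully saturated group yields all-zero advantages, so it contributes no gradient, while a nearly saturated group turns a 0.001 reward gap into an advantage of magnitude greater than 1. Both cases are consistent with the instability described in the comment.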

SamitHuang · 2026-05-11T09:34:39Z

        advantages=advantages,
        config=config,
    )
+    if loss_mode == "grpo_guard":


Is loss_mode the same as actor_rollout_ref.model.algorithm?

actor_rollout_ref.model.algorithm -> selects the registered components (trainer side and rollout side).
loss_mode -> selects the loss.

Here, we use algorithm="flowgrpo" to select the components and loss_mode="grpo_guard" to select the GRPO-Guard loss.
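The split between the two settings can be pictured as two independent lookups. The sketch below is hypothetical: the registry dictionaries, their contents, and `resolve` are invented for illustration, and only the key names "flowgrpo", "flow_grpo", and "grpo_guard" come from this thread.

```python
# Hypothetical registries; the real ones live in the repository and hold
# classes and functions rather than strings.
COMPONENT_REGISTRY = {
    "flowgrpo": {
        "trainer": "flowgrpo trainer-side adapter",
        "rollout": "flowgrpo rollout-side sampler",
    },
}

LOSS_REGISTRY = {
    "flow_grpo": "flow-grpo policy loss",
    "grpo_guard": "grpo-guard policy loss",
}


def resolve(algorithm, loss_mode):
    """Pick the components by algorithm and the loss by loss_mode, independently."""
    return COMPONENT_REGISTRY[algorithm], LOSS_REGISTRY[loss_mode]


components, loss = resolve(algorithm="flowgrpo", loss_mode="grpo_guard")
print(components["trainer"], "|", components["rollout"], "|", loss)
```

Because the lookups are independent, a new loss can reuse existing components without registering a new algorithm name.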

SamitHuang · 2026-05-11T09:35:08Z

+    data.val_files=$ocr_test_path \
+    data.train_batch_size=32 \
+    data.max_prompt_length=256 \
+    actor_rollout_ref.model.path=$model_name \


Should we set actor_rollout_ref.model.algorithm to grpo_guard?

GRPO-Guard builds on flowgrpo, and the only difference is the loss, so here we simply reuse the components from flowgrpo.

AndyZhou952

Might need to update the algo part for clarity later (i.e. highlight the key difference and motivation compared to GRPO, unify notation, etc.).

zhtmike · 2026-05-12T07:49:16Z

Might need to update the algo part for clarity later (i.e. highlight the key difference and motivation compared to GRPO, unify notation, etc.).

noted. Thanks.

Conflicts:
- examples/flowgrpo_trainer/README.md: kept upstream's new Ulysses-SP and full-weight Qwen-Image variant blurbs together with our BAGEL recipe section.

Additional fix:
- verl_omni/pipelines/bagel_flow_grpo/diffusers_training_adapter.py: ``forward_and_sample_previous_step`` now returns the new 4-tuple ``(log_prob, prev_sample_mean, std_dev_t, sqrt_dt)`` to match the GRPO-Guard plumbing introduced upstream in verl-project#48 (BAGEL still trains with ``loss_mode=flow_grpo`` so ``sqrt_dt`` is unused, but the engine layer now unpacks 4-tuples unconditionally).

Co-authored-by: GitHub Copilot
Signed-off-by: princepride <wangzhipeng628@gmail.com>

add GRPO-Guard Algo

e5721ba

zhtmike changed the title ~~[algo].feat: add GRPO-Guard support for Qwen-Image training~~ [algo] feat: add GRPO-Guard support for Qwen-Image training Apr 30, 2026

gemini-code-assist Bot reviewed Apr 30, 2026

zhtmike added 2 commits April 30, 2026 17:33

fix readme

16910a6

clean test

5a1b2bd

SamitHuang mentioned this pull request May 6, 2026

[RFC] v0.1 Release Tracker #47

Open

27 tasks

zhtmike and others added 11 commits May 8, 2026 07:41

Merge branch 'main' into grpo_guard

19619c7

mv scripts

5987f95

revert chagne

f0b0c67

update script

4a5a308

fix merge

cd96340

clean comment

40c493d

update metric & documents

65e583b

update metrics

0134cd0

update

5276914

Potential fix for pull request finding

8ff16f2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Cheung Ka Wai <zhtmike@gmail.com>

Merge branch 'main' into grpo_guard

4a06bb3

zhtmike marked this pull request as ready for review May 11, 2026 02:57

zhtmike requested a review from SamitHuang as a code owner May 11, 2026 02:57

Merge branch 'main' into grpo_guard

83e6fd7

zhtmike requested a review from AndyZhou952 May 11, 2026 02:59

Merge branch 'main' into grpo_guard

d52176d

Signed-off-by: Cheung Ka Wai <zhtmike@gmail.com>

gemini-code-assist Bot reviewed May 11, 2026

SamitHuang reviewed May 11, 2026

SamitHuang added the ready-for-ci read for running CI label May 11, 2026

Merge branch 'main' into grpo_guard

3d09b7a

github-actions Bot removed the ready-for-ci read for running CI label May 12, 2026

AndyZhou952 approved these changes May 12, 2026

zhtmike added the ready-for-ci read for running CI label May 12, 2026

AndyZhou952 merged commit 75227f2 into verl-project:main May 12, 2026
16 of 18 checks passed
