[diffusion, rollout, trainer] feat: add BAGEL FlowGRPO support#66
[diffusion, rollout, trainer] feat: add BAGEL FlowGRPO support#66timzsu wants to merge 18 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for the BAGEL (Mixture-of-Thought) model within the FlowGRPO training pipeline. Key additions include the BagelForTraining module, corresponding training and rollout adapters, and example configurations and scripts for OCR-based reward training. The PR also integrates a global profiling system into the diffusion trainer and updates various components to handle multi-stage model configurations and renamed prompt parameters. Feedback focuses on improving the efficiency of the Bagel forward pass by avoiding per-sample loops, ensuring correct handling of multiple samples per prompt in trajectory metadata, avoiding hardcoded token IDs, and optimizing network session management in the reward function.
zhtmike
left a comment
There was a problem hiding this comment.
Thank you for your PR!
Looks good with few comments.
Btw can you show the reward curve in your PR descriptions
|
@knlnguyen1802 Please take a look of vllm-omni related change. Thanks! |
Signed-off-by: Wang, Zhipeng | RASIA <zhipeng.wang@rakuten.com>
* fsdp/diffusers_impl: extract registry-based custom model loading into ``_build_module_from_registry`` helper; ``_build_module`` now simply delegates to it and falls back to ``AutoModel`` when no custom loader is registered. * vllm_rollout/utils: drop the ``VERL_OMNI_ENABLE_WORKER_DEATH_SIGNAL`` env gate and always call ``set_death_signal()`` (restores the original upstream behavior). Signed-off-by: Wang, Zhipeng | RASIA <zhipeng.wang@rakuten.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Wang, Zhipeng | RASIA <zhipeng.wang@rakuten.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Wang, Zhipeng | RASIA <zhipeng.wang@rakuten.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
65cd845 to
e02f22c
Compare
- Drop examples/flowgrpo_trainer/reward_fn.py: superseded by verl_omni/utils/reward_score/genrm_ocr.py. - Drop examples/flowgrpo_trainer/test_bagel_train.py: private FSDP + CFG smoke test, not maintained as a recipe. - run_bagel_flowgrpo.sh: point reward_path at the new genrm_ocr.py. - examples/flowgrpo_trainer/README.md: add a "BAGEL recipe" section describing prerequisites, launch command and what differs from the Qwen-Image recipe. - tests/.../test_vllm_omni_bagel_generate.py: collapse the three non-LoRA tests (test_generate / test_generate_with_logprobs / test_generate_concurrent) into one concurrent SDE+logprobs test. The LoRA test is kept separate since it exercises a distinct adapter-loading code path. - workers/engine/fsdp/diffusers_impl.py::_build_module_from_registry: add a docstring warning that hooks (attention processors, gradient-checkpointing, LoRA, dtype upcast) may be partially effective or silently inactive on custom-loaded modules, plus a TODO to migrate registered architectures into a first-class training engine and drop this escape hatch. Emit a runtime warning log line when a custom loader is taken. Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Conflicts: - examples/flowgrpo_trainer/README.md: kept upstream's new Ulysses-SP and full-weight Qwen-Image variant blurbs together with our BAGEL recipe section. Additional fix: - verl_omni/pipelines/bagel_flow_grpo/diffusers_training_adapter.py: ``forward_and_sample_previous_step`` now returns the new 4-tuple ``(log_prob, prev_sample_mean, std_dev_t, sqrt_dt)`` to match the GRPO-Guard plumbing introduced upstream in verl-project#48 (BAGEL still trains with ``loss_mode=flow_grpo`` so ``sqrt_dt`` is unused, but the engine layer now unpacks 4-tuples unconditionally). Co-authored-by: GitHub Copilot Signed-off-by: princepride <wangzhipeng628@gmail.com>
- examples/flowgrpo_trainer/run_bagel_flowgrpo_local.sh: removed. Personal workstation launcher (hard-coded /proj-tango-pvc paths, debug-only env vars). Kept locally via .git/info/exclude, the same way test_bagel_train.py is handled per reviewer feedback. - .pre-commit-config.yaml: reverted; adding ``.venv`` to the check-naming-conventions grep is unrelated to BAGEL FlowGRPO. - tests/workers/rollout/rollout_vllm/test_vllm_omni_generate.py: reverted; switching the Qwen-Image fixture to ``scope='module'`` is an unrelated test optimization. Per AGENTS.md "No low-value busywork PRs" — these mechanical/personal changes should land in their own PR if needed. Co-authored-by: GitHub Copilot Signed-off-by: princepride <wangzhipeng628@gmail.com>
There was a problem hiding this comment.
Two coupled changes, both required to keep Qwen-Image working alongside BAGEL:
-
Rename
prompt_ids→prompt_token_ids. vllm-omni 0.20+'sOmniCustomPromptstandardizes onprompt_token_ids(matching vLLM'sTokensPrompt). The server-side patch writes that key now; without the matching rename here,custom_prompt.get("prompt_ids", ...)isNoneandforward()silently falls into the warmup/dummy branch, returning an emptyDiffusionOutput— Qwen-Image rollout would degrade to empty batches without raising. -
Move the
[0]batch-dim squeeze from server to pipeline. BAGEL is multi-stage and itscustom_outputisn't shaped[1, ...], so the server can't blindly index[0]. New contract: each pipeline returns per-sample tensors; the server passes them through. Net shape for Qwen-Image consumers is unchanged.
|
@zhtmike @SamitHuang PTAL |
zhtmike
left a comment
There was a problem hiding this comment.
looks good, with few small suggestions
And one question for moficatiion on verl_omni/trainer/diffusion/ray_diffusion_trainer.py.
|
|
||
| extra_fields["raw_prompt"] = kwargs["raw_prompt"] | ||
|
|
||
| # ``return_attention_mask=True`` is required by token-aware adapters (e.g. BAGEL). |
There was a problem hiding this comment.
| # ``return_attention_mask=True`` is required by token-aware adapters (e.g. BAGEL). |
| def _compute_reward_colocate(self, batch: DataProto) -> tuple[torch.Tensor, dict[str, Any]] | torch.Tensor: | ||
| """ | ||
| compute reward use colocate reward model | ||
| def _compute_reward_colocate(self, batch: DataProto) -> DataProto: | ||
| """Compute per-sample diffusion reward via the colocated reward loop. | ||
|
|
||
| Bypasses ``RewardLoopManager.compute_rm_score`` (LLM-only: assumes | ||
| ``responses`` has a token axis and reads ``attention_mask``) and | ||
| assembles a ``[B, 1]`` ``rm_scores`` tensor directly. | ||
| """ | ||
| assert self.reward_loop_manager is not None, "RewardLoopManager is None" | ||
| batch_reward = self.reward_loop_manager.compute_rm_score(batch) | ||
| return batch_reward | ||
| manager = self.reward_loop_manager | ||
|
|
||
| if manager.reward_model_manager is not None: | ||
| manager.reward_model_manager.wake_up() | ||
|
|
||
| chunks = batch.chunk(len(manager.reward_loop_workers)) | ||
| outputs = ray.get( | ||
| [ | ||
| worker.compute_score_batch.remote(chunk) | ||
| for worker, chunk in zip(manager.reward_loop_workers, chunks, strict=True) | ||
| ] | ||
| ) | ||
| outputs_flat = [item for sublist in outputs for item in sublist] | ||
|
|
||
| scores = [item["reward_score"] for item in outputs_flat] | ||
| rm_scores = torch.tensor(scores, dtype=torch.float32).unsqueeze(-1) | ||
| reward_batch = TensorDict({"rm_scores": rm_scores}, batch_size=len(batch)) | ||
|
|
||
| reward_extra_infos = [output.get("reward_extra_info", {}) for output in outputs_flat] | ||
| reward_extra_keys = list(reward_extra_infos[0].keys()) if reward_extra_infos else [] | ||
| non_tensor_batch = { | ||
| key: np.array([info[key] for info in reward_extra_infos]) for key in reward_extra_keys | ||
| } | ||
|
|
||
| if manager.reward_model_manager is not None: | ||
| manager.reward_model_manager.sleep() | ||
|
|
||
| return DataProto( | ||
| batch=reward_batch, | ||
| non_tensor_batch=non_tensor_batch, | ||
| meta_info={"reward_extra_keys": reward_extra_keys}, | ||
| ) |
| (attention processors, gradient checkpointing, LoRA, dtype upcast) | ||
| may be silently inactive on the returned module. | ||
|
|
||
| TODO: drop this function once the model is integrated into a |
There was a problem hiding this comment.
| TODO: drop this function once the model is integrated into a | |
| # TODO (princepride): drop this function once the model is integrated into a |
| def __new__(cls, **kwargs): | ||
| set_death_signal() | ||
|
|
||
| # Do NOT call verl's ``set_death_signal``: ``PR_SET_PDEATHSIG`` is |
There was a problem hiding this comment.
This is fixed in vllm-omni main branch
| def _preprocess_engine_kwargs(self, engine_kwargs: dict) -> None: | ||
| # No-op: ``deploy_config`` is a vllm-omni CLI flag and must reach the parser. | ||
| return |
There was a problem hiding this comment.
| def _preprocess_engine_kwargs(self, engine_kwargs: dict) -> None: | |
| # No-op: ``deploy_config`` is a vllm-omni CLI flag and must reach the parser. | |
| return |
There was a problem hiding this comment.
it's better to report the reference performance in the Performance Reference doc
| - Passes the deploy-config YAML to vllm-omni via | ||
| `+actor_rollout_ref.rollout.engine_kwargs.vllm_omni.deploy_config`. The | ||
| legacy `stage_configs_path` entrypoint is **not** supported: it routes | ||
| through vllm-omni 0.20's deprecated stage-args loader, which silently |
There was a problem hiding this comment.
Should we update vllm-omni version pin for 0.20 in the installation doc?
There was a problem hiding this comment.
Should we update vllm-omni version pin for 0.20 in the installation doc?
Let us do it in separate PR
| def _to_token_list(token_ids: Any) -> list[int] | None: | ||
| if token_ids is None: | ||
| return None | ||
| if isinstance(token_ids, torch.Tensor): | ||
| token_ids = token_ids.detach().cpu().tolist() | ||
| if token_ids and isinstance(token_ids[0], list): | ||
| token_ids = token_ids[0] | ||
| return [int(token_id) for token_id in token_ids] | ||
|
|
||
|
|
||
| def _extract_prompt_text(decoded: str) -> str: | ||
| if "<|im_start|>" in decoded: | ||
| user_chunks = [] | ||
| for segment in decoded.split("<|im_start|>"): | ||
| if not segment.startswith("user"): | ||
| continue | ||
| content = segment[len("user") :].lstrip("\n") | ||
| content = content.split("<|im_end|>", 1)[0] | ||
| user_chunks.append(content) | ||
| if user_chunks: | ||
| decoded = user_chunks[-1] | ||
|
|
||
| for marker in _CHAT_MARKERS: | ||
| decoded = decoded.replace(marker, "") | ||
| return decoded.replace("<|im_start|>", "").replace("<|im_end|>", "").strip() | ||
|
|
||
|
|
||
| def _to_cpu_tensor(v): | ||
| """Convert to a single CPU tensor, stacking a list of tensors if needed.""" | ||
| if isinstance(v, torch.Tensor): | ||
| return v.detach().cpu() | ||
| if isinstance(v, list): | ||
| tensors = [x.detach().cpu() if isinstance(x, torch.Tensor) else torch.tensor(x) for x in v] | ||
| return torch.stack(tensors) if tensors else None | ||
| return v |
There was a problem hiding this comment.
Please move this into a utils.py file
| """ | ||
|
|
||
| def __new__(cls, **kwargs): | ||
| set_death_signal() |
There was a problem hiding this comment.
It is not necessary to remove this anymore since the bug is fixed on vllm-omni. If it is for stable run with vllm-omni 0.20.0 please leave it as TODO to add it back later
What does this PR do?
Ports BAGEL FlowGRPO support from verl-project/verl#5947 into verl-omni and aligns the integration with the existing Qwen image FlowGRPO path.
This PR adds BAGEL-specific pipeline adapters, rollout tests, example FlowGRPO training config/scripts, and shared diffusion rollout/training plumbing for models that do not use the Qwen image prompt-embedding path.
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,vllm_omni,rollout,trainer,ci,training_utils,recipe,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,diffusion,omni,tests,docker,like[diffusion, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][diffusion, fsdp] feat: new rollout schedulerTest
GPU rollout tests:
Result:
API and Usage Example
Design & Code Changes
verl_omni.pipelines.bagel_flow_grpowith BAGEL model loading, rollout, and diffusers training adapters.examples/flowgrpo_trainer.main.Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysSome training diagram