verl-project · timzsu · May 9, 2026 · May 11, 2026 · May 11, 2026 · May 11, 2026
@@ -1,6 +1,9 @@
 # FlowGRPO Trainer
 
-This example shows how to post-train `Qwen-Image` with FlowGRPO on an OCR-style image generation task using `vllm-omni` rollout and a visual generative reward model (`Qwen3-VL-8B-Instruct` in this example).
+This example shows how to post-train `Qwen-Image` (and, in a separate
+recipe, `BAGEL-7B-MoT`) with FlowGRPO on an OCR-style image generation
+task using `vllm-omni` rollout and a visual generative reward model
+(`Qwen3-VL-8B-Instruct` in this example).
 
 For the full installation and quickstart guide, see `docs/start/flowgrpo_quickstart.md`. For algorithm details and rule-based reward training (e.g. JPEG incompressibility), see `docs/algo/flowgrpo.md`.
 
@@ -104,6 +107,61 @@ We have provided a script to enable non-cfg full-weight Qwen-Image OCR training.
 bash examples/flowgrpo_trainer/run_qwen_image_ocr.sh
 ```
 
+## BAGEL recipe
+
+`run_bagel_flowgrpo.sh` post-trains `BAGEL-7B-MoT` (Mixture-of-Transformers)
+with the same OCR reward. BAGEL is registered through the
+`verl_omni.pipelines.bagel_flow_grpo` adapter pair as the architecture
+`OmniBagelForConditionalGeneration`, and the rollout uses a
+single-stage vllm-omni pipeline whose schema is described in
+[`bagel_deploy_config.yaml`](bagel_deploy_config.yaml).
+
+Prerequisites in addition to the Qwen-Image recipe:
+
+- A local copy of `BAGEL-7B-MoT` (HF repo `ByteDance-Seed/BAGEL-7B-MoT`).
+- The same `Qwen3-VL-8B-Instruct` reward model and OCR parquet files
+  produced above.
+
+Launch:
+
+```bash
+export BAGEL_MODEL_PATH=/path/to/BAGEL-7B-MoT
+export REWARD_MODEL_PATH=/path/to/Qwen3-VL-8B-Instruct
+export OCR_TRAIN_PATH=$WORKSPACE/data/ocr/train.parquet
+export OCR_TEST_PATH=$WORKSPACE/data/ocr/test.parquet
+
+bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh
+```
+
+Notable differences from the Qwen-Image recipe:
+
+- Uses `+actor_rollout_ref.model.architecture=OmniBagelForConditionalGeneration`
+  to bypass the `model_index.json` lookup (BAGEL ships as a single
+  custom checkpoint, not a `diffusers` pipeline).
+- LoRA `target_modules` are the BAGEL MoT generation projections
+  (`q_proj_moe_gen`, `k_proj_moe_gen`, `v_proj_moe_gen`,
+  `o_proj_moe_gen`).
+- Passes the deploy-config YAML to vllm-omni via
+  `+actor_rollout_ref.rollout.engine_kwargs.vllm_omni.deploy_config`. The
+  legacy `stage_configs_path` entrypoint is **not** supported: it routes
+  through vllm-omni 0.20's deprecated stage-args loader, which silently
+  kills the BAGEL `DiffusionWorker` subprocess after warmup. Always use
+  the `deploy_config` schema documented at
+  [`bagel_deploy_config.yaml`](bagel_deploy_config.yaml).
+- Defaults to `trainer.n_gpus_per_node=4` with
+  `actor_rollout_ref.rollout.tensor_model_parallel_size=1` (4 TP=1
+  rollout replicas), matching the Qwen-Image recipe. Be aware of a
+  TOCTOU race in vllm-omni's per-process `MASTER_PORT` picker
+  (`OmniDiffusionConfig.__post_init__` →`settle_port` in
+  [`vllm_omni/diffusion/data.py`](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/diffusion/data.py)):
+  every concurrent `vLLMOmniHttpServer` Ray actor independently calls
+  `is_port_available(p)` and may pick the same port before any of them
+  actually `bind`s. Birthday-paradox collision probability is roughly 4%
+  at 4 actors and 18% at 8 in the default 100-port window, and is
+  amplified further when retries land inside the prior run's TIME_WAIT
+  window (≈60s). If a launch dies during `init_distributed_environment`
+  with `EADDRINUSE` on a port in 30005–30105, wait ~60s and re-launch.
+
 
 ## Performance
 

@@ -0,0 +1,25 @@
+# Single-stage BAGEL deploy config for FlowGRPO training with colocated workers.
+#
+# Uses vllm-omni 0.20+'s ``--deploy-config`` schema (``pipeline`` topology
+# marker + flat ``stages`` list). The legacy ``--stage-configs-path`` schema
+# (``stage_args`` + ``runtime`` block) silently kills the BAGEL
+# ``DiffusionWorker`` after warmup on vllm-omni 0.20, so we don't use it.
+#
+# Mirrors vllm-omni's reference single-stage BAGEL config at
+# ``vllm_omni/deploy/bagel_single_stage.yaml``: the DiT stage owns the full
+# LLM (Qwen2-MoT), ViT, VAE, and tokenizer, so a single stage covers all
+# four modalities (text2img, img2img, img2text, text2text) plus think mode.
+
+pipeline: bagel_single_stage
+async_chunk: false
+
+stages:
+  - stage_id: 0
+    max_num_batched_tokens: 32768
+    max_num_seqs: 1
+    enforce_eager: true
+    trust_remote_code: true
+    enable_prefix_caching: false
+    devices: "0"
+    default_sampling_params:
+      seed: 52
@@ -0,0 +1,117 @@
+# Bagel LoRA RL, vllm_omni rollout (FlowGRPO)
+#
+# Prerequisites:
+#   1. A Bagel model (e.g. BAGEL-7B-MoT) at $BAGEL_MODEL_PATH
+#   2. A vllm-omni deploy-config YAML at $BAGEL_DEPLOY_CONFIG (we ship one
+#      next to this script at ``bagel_deploy_config.yaml``)
+#   3. ``BagelDiffusion`` registered as ``OmniBagelForConditionalGeneration``
+#      via ``verl_omni.pipelines.bagel_flow_grpo`` (auto-imported)
+#   4. A reward VLM model at $REWARD_MODEL_PATH
+#   5. OCR training data at $OCR_TRAIN_PATH / $OCR_TEST_PATH
+#      (generate via: ``python examples/flowgrpo_trainer/data_process/qwenimage_ocr.py``)
+#
+# Usage:
+#   export BAGEL_MODEL_PATH=/path/to/BAGEL-7B-MoT
+#   export REWARD_MODEL_PATH=/path/to/Qwen3-VL-8B-Instruct
+#   bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh
+#
+#   # Override any param via CLI:
+#   bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh trainer.n_gpus_per_node=8
+#
+# Default uses 4 GPUs with ``tensor_model_parallel_size=1`` (4 single-GPU
+# rollout replicas) to mirror the Qwen-Image recipe. Be aware of a TOCTOU
+# race in vllm-omni's per-actor ``MASTER_PORT`` picker (``settle_port`` in
+# ``vllm_omni/diffusion/data.py``): every concurrent ``vLLMOmniHttpServer``
+# Ray actor independently calls ``is_port_available(p)`` and may pick the
+# same port before any of them ``bind()``s, with collision probability
+# scaling by the number of concurrent actors (~4% at 4, ~18% at 8 in the
+# default 100-port window) and amplified further when retries land inside
+# the prior run's TIME_WAIT window (≈60s). If a launch dies during
+# ``init_distributed_environment`` with ``EADDRINUSE`` on a port in
+# 30005-30105, wait ~60s and re-launch; the upstream bug is tracked at
+# vllm-project/vllm-omni#TBD.
+
+set -x
+
+# --------------- Paths (override via environment) ---------------
+BAGEL_MODEL_PATH=${BAGEL_MODEL_PATH:-$HOME/models/BAGEL-7B-MoT}
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+BAGEL_DEPLOY_CONFIG=${BAGEL_DEPLOY_CONFIG:-$SCRIPT_DIR/bagel_deploy_config.yaml}
+
+REWARD_MODEL_PATH=${REWARD_MODEL_PATH:-$HOME/models/Qwen3-VL-8B-Instruct}
+
+ocr_train_path=${OCR_TRAIN_PATH:-$HOME/data/ocr/train.parquet}
+ocr_test_path=${OCR_TEST_PATH:-$HOME/data/ocr/test.parquet}
+
+ENGINE=vllm_omni
+REWARD_ENGINE=vllm
+
+reward_path=verl_omni/utils/reward_score/genrm_ocr.py
+
+python3 -m verl_omni.trainer.diffusion.main_flowgrpo \
+    algorithm.adv_estimator=flow_grpo \
+    data.train_files=$ocr_train_path \
+    data.val_files=$ocr_test_path \
+    data.train_batch_size=16 \
+    data.max_prompt_length=256 \
+    data.trust_remote_code=True \
+    actor_rollout_ref.model.path=$BAGEL_MODEL_PATH \
+    actor_rollout_ref.model.tokenizer_path=$BAGEL_MODEL_PATH \
+    +actor_rollout_ref.model.architecture=OmniBagelForConditionalGeneration \
+    actor_rollout_ref.model.trust_remote_code=True \
+    actor_rollout_ref.model.pipeline.height=512 \
+    actor_rollout_ref.model.pipeline.width=512 \
+    actor_rollout_ref.model.pipeline.num_inference_steps=15 \
+    actor_rollout_ref.model.lora_rank=64 \
+    actor_rollout_ref.model.lora_alpha=128 \
+    actor_rollout_ref.model.target_modules="['q_proj_moe_gen','k_proj_moe_gen','v_proj_moe_gen','o_proj_moe_gen','mlp_moe_gen.gate_proj','mlp_moe_gen.up_proj','mlp_moe_gen.down_proj']" \
+    actor_rollout_ref.actor.optim.lr=1e-4 \
+    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
+    actor_rollout_ref.actor.ppo_epochs=1 \
+    actor_rollout_ref.actor.shuffle=False \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo \
+    actor_rollout_ref.actor.diffusion_loss.clip_ratio=1e-5 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=$ENGINE \
+    actor_rollout_ref.rollout.n=16 \
+    actor_rollout_ref.rollout.agent.num_workers=2 \
+    actor_rollout_ref.rollout.load_format=auto \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.pipeline.num_inference_steps=15 \
+    actor_rollout_ref.rollout.pipeline.max_sequence_length=256 \
+    actor_rollout_ref.rollout.algo.noise_level=1.3 \
+    actor_rollout_ref.rollout.algo.sde_type="sde" \
+    actor_rollout_ref.rollout.algo.sde_window_size=2 \
+    actor_rollout_ref.rollout.algo.sde_window_range="[0,7]" \
+    actor_rollout_ref.rollout.calculate_log_probs=True \
+    actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps=15 \
+    actor_rollout_ref.rollout.val_kwargs.algo.noise_level=0.0 \
+    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.deploy_config=$BAGEL_DEPLOY_CONFIG \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
+    reward.num_workers=1 \
+    reward.reward_model.enable=True \
+    reward.reward_model.model_path=$REWARD_MODEL_PATH \
+    reward.reward_model.rollout.name=$REWARD_ENGINE \
+    reward.reward_model.rollout.tensor_model_parallel_size=4 \
+    +reward.reward_model.rollout.engine_kwargs.vllm.mm_processor_cache_gb=0 \
+    reward.custom_reward_function.path=$reward_path \
+    reward.custom_reward_function.name=compute_score_ocr \
+    algorithm.global_std=False \
+    algorithm.bypass_mode=False \
+    trainer.logger='["console", "wandb"]' \
+    trainer.project_name=flow_grpo \
+    trainer.experiment_name=bagel_ocr_lora_orig_replica \
+    trainer.log_val_generations=4 \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.save_freq=10 \
+    trainer.test_freq=10 \
+    trainer.total_epochs=5 \
+    trainer.total_training_steps=300 "$@"