[diffusion, rollout, trainer] feat: add BAGEL FlowGRPO support #66
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,9 @@ | ||
| # FlowGRPO Trainer | ||
|
|
||
| This example shows how to post-train `Qwen-Image` with FlowGRPO on an OCR-style image generation task using `vllm-omni` rollout and a visual generative reward model (`Qwen3-VL-8B-Instruct` in this example). | ||
| This example shows how to post-train `Qwen-Image` (and, in a separate | ||
| recipe, `BAGEL-7B-MoT`) with FlowGRPO on an OCR-style image generation | ||
| task using `vllm-omni` rollout and a visual generative reward model | ||
| (`Qwen3-VL-8B-Instruct` in this example). | ||
|
|
||
| For the full installation and quickstart guide, see `docs/start/flowgrpo_quickstart.md`. For algorithm details and rule-based reward training (e.g. JPEG incompressibility), see `docs/algo/flowgrpo.md`. | ||
|
|
||
|
|
@@ -104,6 +107,61 @@ We have provided a script to enable non-cfg full-weight Qwen-Image OCR training. | |
| bash examples/flowgrpo_trainer/run_qwen_image_ocr.sh | ||
| ``` | ||
|
|
||
| ## BAGEL recipe | ||
|
|
||
| `run_bagel_flowgrpo.sh` post-trains `BAGEL-7B-MoT` (Mixture-of-Transformers) | ||
| with the same OCR reward. BAGEL is registered through the | ||
| `verl_omni.pipelines.bagel_flow_grpo` adapter pair as the architecture | ||
| `OmniBagelForConditionalGeneration`, and the rollout uses a | ||
| single-stage vllm-omni pipeline whose schema is described in | ||
| [`bagel_deploy_config.yaml`](bagel_deploy_config.yaml). | ||
|
|
||
| Prerequisites in addition to the Qwen-Image recipe: | ||
|
|
||
| - A local copy of `BAGEL-7B-MoT` (HF repo `ByteDance-Seed/BAGEL-7B-MoT`). | ||
| - The same `Qwen3-VL-8B-Instruct` reward model and OCR parquet files | ||
| produced above. | ||
|
|
||
| Launch: | ||
|
|
||
| ```bash | ||
| export BAGEL_MODEL_PATH=/path/to/BAGEL-7B-MoT | ||
| export REWARD_MODEL_PATH=/path/to/Qwen3-VL-8B-Instruct | ||
| export OCR_TRAIN_PATH=$WORKSPACE/data/ocr/train.parquet | ||
| export OCR_TEST_PATH=$WORKSPACE/data/ocr/test.parquet | ||
|
|
||
| bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh | ||
| ``` | ||
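Before launching, it can help to confirm that the four exported paths exist, so a typo fails fast rather than after model loading starts. The following is a minimal sketch; `require_path` and `preflight` are hypothetical helpers and are not part of this PR.

```bash
# Hypothetical preflight helper (not part of this PR): report whether a
# required path exists.
require_path() {
  # $1: human-readable label, $2: path to check
  if [ -e "$2" ]; then
    echo "ok: $1 -> $2"
  else
    echo "missing: $1 -> ${2:-<unset>}" >&2
    return 1
  fi
}

# Check every path and count failures instead of stopping at the first one.
# The return code is the number of missing paths (0 means all present).
preflight() {
  local missing=0
  require_path "BAGEL model" "${BAGEL_MODEL_PATH:-}" || missing=$((missing + 1))
  require_path "reward model" "${REWARD_MODEL_PATH:-}" || missing=$((missing + 1))
  require_path "OCR train parquet" "${OCR_TRAIN_PATH:-}" || missing=$((missing + 1))
  require_path "OCR test parquet" "${OCR_TEST_PATH:-}" || missing=$((missing + 1))
  return "$missing"
}
```

One possible use is `preflight && bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh`, which skips the launch when any path is missing.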
|
|
||
| Notable differences from the Qwen-Image recipe: | ||
|
|
||
| - Uses `+actor_rollout_ref.model.architecture=OmniBagelForConditionalGeneration` | ||
| to bypass the `model_index.json` lookup (BAGEL ships as a single | ||
| custom checkpoint, not a `diffusers` pipeline). | ||
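| - LoRA `target_modules` are the BAGEL MoT generation projections: | ||
| attention (`q_proj_moe_gen`, `k_proj_moe_gen`, `v_proj_moe_gen`, | ||
| `o_proj_moe_gen`) and MLP (`mlp_moe_gen.gate_proj`, | ||
| `mlp_moe_gen.up_proj`, `mlp_moe_gen.down_proj`). | ||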
| - Passes the deploy-config YAML to vllm-omni via | ||
| `+actor_rollout_ref.rollout.engine_kwargs.vllm_omni.deploy_config`. The | ||
| legacy `stage_configs_path` entrypoint is **not** supported: it routes | ||
| through vllm-omni 0.20's deprecated stage-args loader, which silently | ||
|
Collaborator
Should we update vllm-omni version pin for 0.20 in the installation doc?
Collaborator
Let us do it in separate PR
|
||
| kills the BAGEL `DiffusionWorker` subprocess after warmup. Always use | ||
| the `deploy_config` schema documented at | ||
| [`bagel_deploy_config.yaml`](bagel_deploy_config.yaml). | ||
| - Defaults to `trainer.n_gpus_per_node=4` with | ||
| `actor_rollout_ref.rollout.tensor_model_parallel_size=1` (4 TP=1 | ||
| rollout replicas), matching the Qwen-Image recipe. Be aware of a | ||
| TOCTOU race in vllm-omni's per-process `MASTER_PORT` picker | ||
| (`OmniDiffusionConfig.__post_init__` → `settle_port` in | ||
| [`vllm_omni/diffusion/data.py`](https://github.com/vllm-project/vllm-omni/blob/main/vllm_omni/diffusion/data.py)): | ||
| every concurrent `vLLMOmniHttpServer` Ray actor independently calls | ||
| `is_port_available(p)` and may pick the same port before any of them | ||
| actually `bind`s. Birthday-paradox collision probability is roughly 4% | ||
| at 4 actors and 18% at 8 in the default 100-port window, and is | ||
| amplified further when retries land inside the prior run's TIME_WAIT | ||
| window (≈60s). If a launch dies during `init_distributed_environment` | ||
| with `EADDRINUSE` on a port in 30005–30105, wait ~60s and re-launch. | ||
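The "wait ~60s and re-launch" advice above can be automated with a small wrapper. This is a sketch under stated assumptions, not part of the recipe: `retry_on_port_clash`, `MAX_ATTEMPTS`, and `RETRY_WAIT_SECONDS` are hypothetical names, and the match on `EADDRINUSE` or "address already in use" assumes that text appears in the failing launch's output.

```bash
# Hypothetical wrapper (not part of this PR): re-run a command when its output
# shows a port clash, sleeping between attempts to outlast TIME_WAIT.
# Assumption: the failure text contains "EADDRINUSE" or "address already in use".
retry_on_port_clash() {
  local max_attempts=${MAX_ATTEMPTS:-3}
  local wait_seconds=${RETRY_WAIT_SECONDS:-60}
  local attempt=1 status=0 log_file status_file
  log_file=$(mktemp)
  status_file=$(mktemp)
  while [ "$attempt" -le "$max_attempts" ]; do
    # Stream output live while keeping a copy and the command's exit code.
    { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" >"$status_file"; } | tee "$log_file"
    status=$(cat "$status_file")
    if [ "$status" -eq 0 ]; then
      break
    fi
    # Retry only for the port clash; surface any other failure immediately.
    if ! grep -qiE 'EADDRINUSE|address already in use' "$log_file"; then
      break
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "attempt $attempt hit a port clash; retrying in ${wait_seconds}s" >&2
      sleep "$wait_seconds"
    fi
    attempt=$((attempt + 1))
  done
  rm -f "$log_file" "$status_file"
  return "$status"
}
```

A launch would then look like `retry_on_port_clash bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh`. Failures that do not mention the port clash are returned on the first attempt, so unrelated errors are not masked by retries.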
|
|
||
|
|
||
| ## Performance | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Single-stage BAGEL deploy config for FlowGRPO training with colocated workers. | ||
| # | ||
| # Uses vllm-omni 0.20+'s ``--deploy-config`` schema (``pipeline`` topology | ||
| # marker + flat ``stages`` list). The legacy ``--stage-configs-path`` schema | ||
| # (``stage_args`` + ``runtime`` block) silently kills the BAGEL | ||
| # ``DiffusionWorker`` after warmup on vllm-omni 0.20, so we don't use it. | ||
| # | ||
| # Mirrors vllm-omni's reference single-stage BAGEL config at | ||
| # ``vllm_omni/deploy/bagel_single_stage.yaml``: the DiT stage owns the full | ||
| # LLM (Qwen2-MoT), ViT, VAE, and tokenizer, so a single stage covers all | ||
| # four modalities (text2img, img2img, img2text, text2text) plus think mode. | ||
|
|
||
| pipeline: bagel_single_stage | ||
| async_chunk: false | ||
|
|
||
| stages: | ||
| - stage_id: 0 | ||
| max_num_batched_tokens: 32768 | ||
| max_num_seqs: 1 | ||
| enforce_eager: true | ||
| trust_remote_code: true | ||
| enable_prefix_caching: false | ||
| devices: "0" | ||
| default_sampling_params: | ||
| seed: 52 |
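The header comments above distinguish the deploy-config schema (`pipeline` marker plus a flat `stages` list) from the legacy one (`stage_args` plus a `runtime` block). A rough way to tell which schema a file uses is sketched below. `deploy_schema_kind` is a hypothetical helper, and it is a text heuristic based only on the key names mentioned in those comments, not a YAML parser or a vllm-omni API.

```bash
# Hypothetical check (not part of this PR): classify a vllm-omni YAML by the
# keys named in the config comments. Text heuristic only.
deploy_schema_kind() {
  local file=$1
  if grep -qE '^[[:space:]]*stage_args:' "$file"; then
    # Legacy --stage-configs-path schema, which this recipe avoids.
    echo "legacy"
  elif grep -qE '^pipeline:' "$file" && grep -qE '^stages:' "$file"; then
    # Deploy-config schema: top-level pipeline marker and stages list.
    echo "deploy"
  else
    echo "unknown"
  fi
}
```

For example, `deploy_schema_kind examples/flowgrpo_trainer/bagel_deploy_config.yaml` should print `deploy` for the file shown above.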
|
princepride marked this conversation as resolved.
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,117 @@ | ||
| # Bagel LoRA RL, vllm_omni rollout (FlowGRPO) | ||
| # | ||
| # Prerequisites: | ||
| # 1. A Bagel model (e.g. BAGEL-7B-MoT) at $BAGEL_MODEL_PATH | ||
| # 2. A vllm-omni deploy-config YAML at $BAGEL_DEPLOY_CONFIG (we ship one | ||
| # next to this script at ``bagel_deploy_config.yaml``) | ||
| # 3. ``BagelDiffusion`` registered as ``OmniBagelForConditionalGeneration`` | ||
| # via ``verl_omni.pipelines.bagel_flow_grpo`` (auto-imported) | ||
| # 4. A reward VLM model at $REWARD_MODEL_PATH | ||
| # 5. OCR training data at $OCR_TRAIN_PATH / $OCR_TEST_PATH | ||
| # (generate via: ``python examples/flowgrpo_trainer/data_process/qwenimage_ocr.py``) | ||
| # | ||
| # Usage: | ||
| # export BAGEL_MODEL_PATH=/path/to/BAGEL-7B-MoT | ||
| # export REWARD_MODEL_PATH=/path/to/Qwen3-VL-8B-Instruct | ||
| # bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh | ||
| # | ||
| # # Override any param via CLI: | ||
| # bash examples/flowgrpo_trainer/run_bagel_flowgrpo.sh trainer.n_gpus_per_node=8 | ||
| # | ||
| # Default uses 4 GPUs with ``tensor_model_parallel_size=1`` (4 single-GPU | ||
| # rollout replicas) to mirror the Qwen-Image recipe. Be aware of a TOCTOU | ||
| # race in vllm-omni's per-actor ``MASTER_PORT`` picker (``settle_port`` in | ||
| # ``vllm_omni/diffusion/data.py``): every concurrent ``vLLMOmniHttpServer`` | ||
| # Ray actor independently calls ``is_port_available(p)`` and may pick the | ||
| # same port before any of them ``bind()``s, with collision probability | ||
| # scaling by the number of concurrent actors (~4% at 4, ~18% at 8 in the | ||
| # default 100-port window) and amplified further when retries land inside | ||
| # the prior run's TIME_WAIT window (≈60s). If a launch dies during | ||
| # ``init_distributed_environment`` with ``EADDRINUSE`` on a port in | ||
| # 30005-30105, wait ~60s and re-launch; the upstream bug is tracked at | ||
| # vllm-project/vllm-omni#TBD. | ||
|
|
||
| set -x | ||
|
|
||
| # --------------- Paths (override via environment) --------------- | ||
| BAGEL_MODEL_PATH=${BAGEL_MODEL_PATH:-$HOME/models/BAGEL-7B-MoT} | ||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
| BAGEL_DEPLOY_CONFIG=${BAGEL_DEPLOY_CONFIG:-$SCRIPT_DIR/bagel_deploy_config.yaml} | ||
|
|
||
| REWARD_MODEL_PATH=${REWARD_MODEL_PATH:-$HOME/models/Qwen3-VL-8B-Instruct} | ||
|
|
||
| ocr_train_path=${OCR_TRAIN_PATH:-$HOME/data/ocr/train.parquet} | ||
| ocr_test_path=${OCR_TEST_PATH:-$HOME/data/ocr/test.parquet} | ||
|
|
||
| ENGINE=vllm_omni | ||
| REWARD_ENGINE=vllm | ||
|
|
||
| reward_path=verl_omni/utils/reward_score/genrm_ocr.py | ||
|
|
||
| python3 -m verl_omni.trainer.diffusion.main_flowgrpo \ | ||
| algorithm.adv_estimator=flow_grpo \ | ||
| data.train_files="$ocr_train_path" \ | ||
| data.val_files="$ocr_test_path" \ | ||
| data.train_batch_size=16 \ | ||
| data.max_prompt_length=256 \ | ||
| data.trust_remote_code=True \ | ||
| actor_rollout_ref.model.path="$BAGEL_MODEL_PATH" \ | ||
| actor_rollout_ref.model.tokenizer_path="$BAGEL_MODEL_PATH" \ | ||
| +actor_rollout_ref.model.architecture=OmniBagelForConditionalGeneration \ | ||
| actor_rollout_ref.model.trust_remote_code=True \ | ||
| actor_rollout_ref.model.pipeline.height=512 \ | ||
| actor_rollout_ref.model.pipeline.width=512 \ | ||
| actor_rollout_ref.model.pipeline.num_inference_steps=15 \ | ||
| actor_rollout_ref.model.lora_rank=64 \ | ||
| actor_rollout_ref.model.lora_alpha=128 \ | ||
| actor_rollout_ref.model.target_modules="['q_proj_moe_gen','k_proj_moe_gen','v_proj_moe_gen','o_proj_moe_gen','mlp_moe_gen.gate_proj','mlp_moe_gen.up_proj','mlp_moe_gen.down_proj']" \ | ||
| actor_rollout_ref.actor.optim.lr=1e-4 \ | ||
| actor_rollout_ref.actor.optim.weight_decay=0.0001 \ | ||
| actor_rollout_ref.actor.ppo_mini_batch_size=8 \ | ||
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \ | ||
| actor_rollout_ref.actor.ppo_epochs=1 \ | ||
| actor_rollout_ref.actor.shuffle=False \ | ||
| actor_rollout_ref.actor.fsdp_config.param_offload=False \ | ||
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ | ||
| actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \ | ||
| actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo \ | ||
| actor_rollout_ref.actor.diffusion_loss.clip_ratio=1e-5 \ | ||
| actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \ | ||
| actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ | ||
| actor_rollout_ref.rollout.name=$ENGINE \ | ||
| actor_rollout_ref.rollout.n=16 \ | ||
| actor_rollout_ref.rollout.agent.num_workers=2 \ | ||
| actor_rollout_ref.rollout.load_format=auto \ | ||
| actor_rollout_ref.rollout.layered_summon=True \ | ||
| actor_rollout_ref.rollout.pipeline.num_inference_steps=15 \ | ||
| actor_rollout_ref.rollout.pipeline.max_sequence_length=256 \ | ||
| actor_rollout_ref.rollout.algo.noise_level=1.3 \ | ||
| actor_rollout_ref.rollout.algo.sde_type="sde" \ | ||
| actor_rollout_ref.rollout.algo.sde_window_size=2 \ | ||
| actor_rollout_ref.rollout.algo.sde_window_range="[0,7]" \ | ||
| actor_rollout_ref.rollout.calculate_log_probs=True \ | ||
| actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps=15 \ | ||
| actor_rollout_ref.rollout.val_kwargs.algo.noise_level=0.0 \ | ||
| +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.deploy_config="$BAGEL_DEPLOY_CONFIG" \ | ||
| actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \ | ||
| reward.num_workers=1 \ | ||
| reward.reward_model.enable=True \ | ||
| reward.reward_model.model_path="$REWARD_MODEL_PATH" \ | ||
| reward.reward_model.rollout.name=$REWARD_ENGINE \ | ||
| reward.reward_model.rollout.tensor_model_parallel_size=4 \ | ||
| +reward.reward_model.rollout.engine_kwargs.vllm.mm_processor_cache_gb=0 \ | ||
| reward.custom_reward_function.path=$reward_path \ | ||
| reward.custom_reward_function.name=compute_score_ocr \ | ||
| algorithm.global_std=False \ | ||
| algorithm.bypass_mode=False \ | ||
| trainer.logger='["console", "wandb"]' \ | ||
| trainer.project_name=flow_grpo \ | ||
| trainer.experiment_name=bagel_ocr_lora_orig_replica \ | ||
| trainer.log_val_generations=4 \ | ||
| trainer.val_before_train=False \ | ||
| trainer.n_gpus_per_node=4 \ | ||
| trainer.nnodes=1 \ | ||
| trainer.save_freq=10 \ | ||
| trainer.test_freq=10 \ | ||
| trainer.total_epochs=5 \ | ||
| trainer.total_training_steps=300 "$@" |
it's better to report the reference performance in the Performance Reference doc