[rollout, vllm] feat: Add BAGEL RL rollout support via vLLMOmniHttpServer #5947
timzsu wants to merge 10 commits into
Code Review
This pull request introduces support for multi-stage pipelines, specifically for the BAGEL model, within the vLLM-Omni rollout infrastructure. Key changes include the implementation of a custom BAGEL pipeline with an SDE scheduler for RL rollouts, updates to the asynchronous server to handle multi-stage engine configurations, and enhancements to the generation logic to support LoRA adapters and correctly unbatch outputs for different pipeline types. I have no feedback to provide as there were no review comments.
@SamitHuang @princepride This PR is related to RFC vllm-project/vllm-omni#1904. Feel free to suggest improvements.
# Target it specifically to avoid errors on the LLM stage.
if is_multi_stage:
    diffusion_stage_id = len(default_params_list) - 1
    results = await self.engine.collective_rpc("list_loras", stage_ids=[diffusion_stage_id])
This assumes that only the diffusion stage uses LoRA in a multi-stage pipeline. If a user trains the LLM/thinker stage of BAGEL with LoRA, won't this hardcoded stage ID fail to locate or apply the adapter?
This now goes back to a direct call to list_loras.
# Single-stage pipelines (Qwen-Image) batch outputs with a leading batch
# dim that should be stripped. Multi-stage (BAGEL) returns un-batched tensors.
def _unbatch(v):
    if v is None or is_multi_stage:
Relying on is_multi_stage to decide whether _unbatch(v) strips the batch dimension is fragile: it breaks if a future single-stage or multi-stage model changes its output shape. Could this check the actual tensor dimensions, or use an explicit configuration flag for batching behavior, instead of implicitly tying it to the stage count?
Since we currently have only two pipelines, I changed the server to require the unbatched shape. The Qwen-Image pipeline is adjusted to produce the unbatched shape directly.
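The contract described in this reply (pipelines squeeze the batch dimension, the server only validates) could look roughly like the sketch below. It is not the code from this PR: `squeeze_batch`, `require_unbatched`, and the example shape are hypothetical, and shapes are plain tuples so the snippet runs without a tensor library.

```python
from __future__ import annotations


def squeeze_batch(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Pipeline side: drop a leading batch dim of size 1 (rollout uses batch=1)."""
    if not shape or shape[0] != 1:
        raise ValueError(f"expected a leading batch dim of 1, got shape {shape}")
    return shape[1:]


def require_unbatched(shape: tuple[int, ...], expected_rank: int) -> tuple[int, ...]:
    """Server side: pass the value through, but fail loudly on an unexpected rank."""
    if len(shape) != expected_rank:
        raise ValueError(f"expected rank {expected_rank}, got shape {shape}")
    return shape


# Hypothetical latent shape: (batch, channels, height, width).
batched = (1, 16, 64, 64)
unbatched = squeeze_batch(batched)
checked = require_unbatched(unbatched, expected_rank=3)
```

Failing on an unexpected rank, rather than branching on the stage count, surfaces a shape change in a new pipeline as an immediate error instead of a silently mis-shaped tensor.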
# -----------------------------------------------------------------------

async def run_server(self, args: argparse.Namespace):
    engine_args = OmniEngineArgs.from_cli_args(args)
I think it is better to build all engine_args through OmniEngineArgs.from_cli_args.
I agree. I have created a PR (vllm-project/vllm-omni#2684) that enables the stage configs in the CLI, which simplifies the verl side.
@knlnguyen1802 please take a look.
custom_prompt = req.prompts[0] if req.prompts else {}
if isinstance(custom_prompt, dict):
    prompt_ids = custom_prompt.get("prompt_ids", prompt_ids)
    prompt_ids = custom_prompt.get("prompt_token_ids", prompt_ids)
I think we do not need to rename this. Keeping the old name stays compatible with the qwen-image pipeline and might make debugging easier.
vLLM uses prompt_token_ids as the input key, so if we keep the old name, the engine has to handle both prompt_ids and prompt_token_ids. From my perspective, that would make the logic "dirtier". Do you think we should keep the original name?
I prefer to keep it consistent. If you want to change it to prompt_token_ids, it is better to rename every prompt_ids to prompt_token_ids.
I have systematically renamed all prompt_ids in the vllm rollout path.
@knlnguyen1802 Can you please have a look at the current version?
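To make the trade-off in this thread concrete, here is a small sketch of the two options: a tolerant reader that accepts both keys versus a single-key reader after a full rename. The helper names are hypothetical and only illustrate the discussion; they are not functions from verl or vLLM.

```python
from __future__ import annotations

from typing import Any


def read_prompt_token_ids_compat(prompt: dict[str, Any], default: list[int]) -> list[int]:
    """Tolerant variant: prefer the new key, fall back to the legacy one."""
    if "prompt_token_ids" in prompt:
        return prompt["prompt_token_ids"]
    return prompt.get("prompt_ids", default)


def read_prompt_token_ids(prompt: dict[str, Any], default: list[int]) -> list[int]:
    """Unified variant: a single key, matching vLLM's input naming."""
    return prompt.get("prompt_token_ids", default)


legacy = {"prompt_ids": [1, 2, 3]}
renamed = {"prompt_token_ids": [1, 2, 3]}
```

The unified variant silently ignores the legacy key, which is why a full rename across the rollout path (rather than a partial one) keeps behavior predictable.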
# processor which requires "prompt_token_ids" and "modalities".
# Single-stage (e.g. Qwen-Image) reads "prompt_ids" directly.
if len(default_params_list) > 1:
    custom_prompt: OmniCustomPrompt = {"prompt_token_ids": prompt_ids, "modalities": ["image"]}
I suggest unifying prompt_token_ids and prompt_ids. If you decide to switch to prompt_token_ids, please change the qwenimage pipeline as well :)
I have systematically renamed all prompt_ids in the vLLM rollout path :)
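For reference, the request-building logic discussed here (one prompt key for all pipelines, `modalities` for multi-stage, caller parameters applied to the last stage) might be sketched as follows. `build_request` is a hypothetical helper, sampling parameters are modeled as plain dicts, and merging the caller parameters over the last stage's defaults is an assumption rather than the PR's exact behavior.

```python
from __future__ import annotations

from typing import Any


def build_request(
    prompt_token_ids: list[int],
    default_params_list: list[dict[str, Any]],
    caller_params: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return (prompt, per-stage sampling params) for one rollout request."""
    if not default_params_list:
        raise ValueError("at least one stage is required")

    prompt: dict[str, Any] = {"prompt_token_ids": prompt_token_ids}
    if len(default_params_list) > 1:
        # Multi-stage (e.g. BAGEL): the input processor also needs modalities.
        prompt["modalities"] = ["image"]

    # Earlier stages keep their defaults; the last stage is assumed to be
    # the diffusion stage and receives the caller's parameters.
    stage_params = [dict(p) for p in default_params_list]
    stage_params[-1] = {**stage_params[-1], **caller_params}
    return prompt, stage_params


prompt, params = build_request(
    [1, 2],
    [{"max_tokens": 8}, {"num_inference_steps": 50}],
    {"num_inference_steps": 10},
)
```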
@SamitHuang May I ask what you mean by precision check? My current test verifies that the LoRA introduces a noticeable perturbation without corrupting the image (the difference between images generated with and without LoRA is bounded), which I think is the best I can do with a random LoRA. A real "precision check" would probably need a real LoRA adapter. Any recommendations?
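The bounded-perturbation check described in this comment can be illustrated with a minimal sketch. The function names and both thresholds are hypothetical placeholders, not the values used in the actual test, and images are flattened to lists of floats in [0, 1] to keep the example dependency-free.

```python
from __future__ import annotations


def mean_abs_diff(a: list[float], b: list[float]) -> float:
    """Mean absolute per-pixel difference between two flattened images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def lora_effect_is_bounded(
    base: list[float],
    with_lora: list[float],
    lower: float = 1e-3,  # hypothetical: below this the LoRA had no visible effect
    upper: float = 0.5,   # hypothetical: above this the image is likely corrupted
) -> bool:
    """True if the LoRA changed the image noticeably but not destructively."""
    return lower < mean_abs_diff(base, with_lora) < upper
```

This only shows that the adapter is applied and does not break generation; it says nothing about numerical agreement with a reference implementation.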
BAGEL (4 tests, ~345s) takes about 5x as long as qwenimage, which means it may take roughly 3m44s x 5 on the CI server (https://github.com/verl-project/verl/actions/runs/23841025230/job/69496889448). Can we reduce this test time further?
Can you provide a deterministic check with a random checkpoint for reproduction, where the ground truth is obtained from code that has been verified against a real LoRA adapter with a precision check?
@SamitHuang Previously, each test started its own engine; now all tests in a module share the same engine. Both test suites are up to roughly 4x faster than before.
I couldn't find an open-source real LoRA adapter for BAGEL. Do you have one?
…ttpServer

Add multi-stage BAGEL (thinker + DiT) integration to verl's rollout server, following the existing Qwen-Image pattern. Includes custom vllm-omni pipeline with SDE scheduler for log-probability recording, multi-stage engine initialization, per-request LoRA on the diffusion stage, and E2E tests with synthetic LoRA adapters.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
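Recording a log-probability for an SDE step generally means evaluating a Gaussian density at the sampled next latent. The snippet below is a generic sketch of that computation, assuming the scheduler exposes the step's mean and a scalar standard deviation; the function name and the per-element mean reduction are assumptions and are not taken from the pipeline in this PR.

```python
from __future__ import annotations

import math


def gaussian_step_log_prob(next_sample: list[float], mean: list[float], std: float) -> float:
    """Mean per-element log N(next_sample; mean, std^2)."""
    if std <= 0:
        raise ValueError("std must be positive")
    if len(next_sample) != len(mean) or not mean:
        raise ValueError("next_sample and mean must be non-empty and equal in length")
    # log of the Gaussian normalizer: log(std * sqrt(2 * pi))
    log_norm = math.log(std) + 0.5 * math.log(2.0 * math.pi)
    total = sum(
        -((x - m) ** 2) / (2.0 * std**2) - log_norm
        for x, m in zip(next_sample, mean)
    )
    return total / len(mean)
```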
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
- Unify run_server: single OmniEngineArgs.from_cli_args path, inject stage_configs_path and custom_pipeline_args after parsing
- LoRA: use engine.list_loras() for all stages instead of targeting diffusion stage specifically
- Remove _unbatch from server: squeeze batch dim in Qwen-Image pipeline instead, so both pipelines return unbatched tensors
- Use len(default_params_list) > 1 for multi-stage checks instead of separate flag
- Update BAGEL test model path

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lout path

Unify naming with vLLM convention across the rollout server, pipelines, and tests. This eliminates the need for per-pipeline dict key branching.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change init_server fixture from function scope to module scope for both BAGEL and QwenImage rollout tests. The vllm-omni server now starts once per test module instead of once per test function.

Results (local, 2x RTX 6000 Ada):
- BAGEL (4 tests): 343s → 92s (3.7x faster)
- QwenImage (3 tests): 63s → 25s (2.5x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
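The scope change in this commit follows the standard pytest pattern sketched below. `FakeServer` is a hypothetical stand-in for the real rollout server, used only to show that a module-scoped fixture is constructed once per test file.

```python
from __future__ import annotations

import pytest


class FakeServer:
    """Stand-in for the rollout server; counts how often it is started."""

    starts = 0

    def __init__(self) -> None:
        FakeServer.starts += 1

    def generate(self, prompt: str) -> str:
        return f"image-for:{prompt}"


@pytest.fixture(scope="module")
def init_server():
    # Module scope: built once and reused by every test in this file.
    server = FakeServer()
    yield server
    # Teardown for a real server (shutdown, freeing GPUs) would go here.


def test_generate(init_server):
    assert init_server.generate("a cat") == "image-for:a cat"


def test_generate_concurrent(init_server):
    # Still the same instance as in test_generate: only one start so far.
    assert FakeServer.starts == 1
```

The cost of sharing is that tests are no longer isolated from each other, so any state a test leaves on the server (for example a loaded LoRA adapter) is visible to later tests in the module.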
- Add DiffusionModelBase implementation for Bagel (bagel.py) with custom model loading, scheduler, and forward pass
- Add build_module hook in DiffusionModelBase and DiffusersFSDPEngine to support non-standard model loading (e.g. Bagel's BagelForTraining)
- Fix scheduler device/precision mismatch in FlowMatchSDEDiscreteScheduler (index_for_timestep nearest-neighbor, move sigmas to sample device)
- Fix pipeline_bagel to stack trajectory tensors for proper batching
- Add system prompt and negative prompt to OCR training data
- Add error handling in reward_fn for vLLM reward model failures
- Guard prompt_embeds access for models that don't produce embeddings
- Add profiler support to diffusion trainer (config + start/stop hooks)
- Fix prompt_ids -> prompt_token_ids rename in agent loop server call
- Update run_bagel_flowgrpo.sh with tuned hyperparameters

Co-authored-by: Claude
Co-authored-by: Cursor <cursoragent@cursor.com>
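The nearest-neighbor lookup mentioned for index_for_timestep can be illustrated as follows. This is a standalone sketch of the idea, not the FlowMatchSDEDiscreteScheduler code: an exact-equality search can miss when a timestep has been round-tripped through a lower-precision dtype, whereas picking the closest schedule entry tolerates small numerical drift.

```python
from __future__ import annotations


def index_for_timestep(timestep: float, schedule: list[float]) -> int:
    """Return the index of the schedule entry closest to `timestep`."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return min(range(len(schedule)), key=lambda i: abs(schedule[i] - timestep))


# Hypothetical schedule; 749.9997 mimics a value perturbed by a dtype cast.
schedule = [1000.0, 750.0, 500.0, 250.0]
drifted_index = index_for_timestep(749.9997, schedule)
exact_index = index_for_timestep(500.0, schedule)
```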
Enable Bagel FlowGRPO LoRA training with vLLM-omni rollout
What does this PR do?
Add multi-stage BAGEL (thinker + DiT) RL rollout support to verl's vLLMOmniHttpServer, following the existing Qwen-Image pattern. This enables RL training for BAGEL image generation models through verl's diffusion rollout pipeline.

Depends on vllm-omni features:
- stage_configs_path support in OmniEngineArgs ([Feat] Override single stage CLI args when stage_configs_path is set in OmniEngineArgs, vllm-project/vllm-omni#2684)

Checklist Before Starting
- [{modules}] {type}: {description} (This will be checked by the CI)

Test
All tests pass. The tests in each module now share a single engine, which speeds them up by 3-4x:
- BAGEL: test_generate, test_generate_with_logprobs, test_generate_concurrent, test_generate_with_lora
- QwenImage: test_generate, test_generate_with_logprobs, test_generate_concurrent

Both suites run in parallel on separate GPUs.
API and Usage Example
Design & Code Changes
New files:
- examples/flowgrpo_trainer/vllm_omni/pipeline_bagel.py: Custom vllm-omni pipeline for BAGEL RL rollouts. Wraps FlowMatchSDEDiscreteScheduler with _BagelSchedulerAdapter, which bridges BAGEL's 4-arg step(v_t, sigma, x_t, dt) to diffusers' 3-arg convention. Computes per-request shifted sigmas matching BAGEL's internal schedule. Returns all_latents, all_log_probs, all_timesteps in custom_output.
- tests/workers/rollout/rollout_vllm/test_vllm_omni_bagel_generate.py: E2E tests through verl's rollout server, aligned with Qwen-Image test structure. Uses zhengyuansu/bagel-tiny-random (~376MB) for fast CI.

Modified files:
- verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py: Multi-stage support:
  - run_server: inject stage_configs_path from engine_kwargs into OmniEngineArgs (single unified code path for both single-stage and multi-stage).
  - generate: multi-stage sampling params (defaults for non-diffusion stages, caller params for diffusion stage), prompt_token_ids + modalities for vLLM input processor, caller-supplied lora_request/lora_scale.
- examples/flowgrpo_trainer/vllm_omni/pipeline_qwenimage.py: Squeeze batch dim in custom_output (batch=1 in rollout) so both pipelines return unbatched tensors. Server passes through without shape manipulation.

Checklist Before Submitting
- pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- tests/workers/rollout/rollout_vllm/test_vllm_omni_bagel_generate.py, configurable via BAGEL_STAGE_CONFIG env var.
- ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.
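As a rough illustration of the scheduler adapter listed under the new files, the sketch below bridges a 4-argument step(v_t, sigma, x_t, dt) call to a 3-argument step(model_output, timestep, sample) scheduler. Everything here is a toy under stated assumptions: `ToyEulerScheduler` and `SchedulerAdapter` are hypothetical classes, the timestep is assumed to equal sigma times 1000, and a plain Euler update stands in for the SDE step.

```python
from __future__ import annotations


class ToyEulerScheduler:
    """Toy scheduler with a 3-arg step(model_output, timestep, sample)."""

    def __init__(self, sigmas: list[float], num_train_timesteps: int = 1000) -> None:
        self.sigmas = sigmas
        self.timesteps = [s * num_train_timesteps for s in sigmas[:-1]]

    def step(self, model_output: list[float], timestep: float, sample: list[float]) -> list[float]:
        # Locate the current step, then move to the next sigma in the schedule.
        i = min(range(len(self.timesteps)), key=lambda k: abs(self.timesteps[k] - timestep))
        dt = self.sigmas[i + 1] - self.sigmas[i]
        return [x + dt * v for x, v in zip(sample, model_output)]


class SchedulerAdapter:
    """Expose a 4-arg step(v_t, sigma, x_t, dt) on top of a 3-arg scheduler."""

    def __init__(self, scheduler: ToyEulerScheduler, num_train_timesteps: int = 1000) -> None:
        self.scheduler = scheduler
        self.num_train_timesteps = num_train_timesteps

    def step(self, v_t: list[float], sigma: float, x_t: list[float], dt: float) -> list[float]:
        # dt is intentionally unused: the wrapped scheduler derives the step
        # size from its own sigma schedule.
        timestep = sigma * self.num_train_timesteps
        return self.scheduler.step(v_t, timestep, x_t)


adapter = SchedulerAdapter(ToyEulerScheduler(sigmas=[1.0, 0.5, 0.0]))
x_next = adapter.step(v_t=[2.0], sigma=1.0, x_t=[1.0], dt=0.5)
```

Ignoring the caller's dt only works if the wrapped scheduler's sigma schedule matches the caller's, which is presumably why the pipeline computes per-request shifted sigmas to match BAGEL's internal schedule.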