[rollout, vllm] feat: Add BAGEL RL rollout support via vLLMOmniHttpServer by timzsu · Pull Request #5947 · verl-project/verl

timzsu · 2026-04-09T13:35:15Z

What does this PR do?

Add multi-stage BAGEL (thinker + DiT) RL rollout support to verl's vLLMOmniHttpServer, following the existing Qwen-Image pattern. This enables RL training for BAGEL image generation models through verl's diffusion rollout pipeline.

Depends on vllm-omni features:

Trajectory recording + scheduler/scheduler_kwargs on BagelPipeline
Per-request LoRA propagation through orchestrator
stage_configs_path support in OmniEngineArgs ([Feat] Override single stage CLI args when stage_configs_path is set in OmniEngineArgs vllm-project/vllm-omni#2684)

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+bagel+OR+diffusion+OR+vllm-omni
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)

Test

# BAGEL (1 GPU, tiny-random model ~376MB)
BAGEL_STAGE_CONFIG=/path/to/bagel_sharedmemory_ci.yaml \
  pytest tests/workers/rollout/rollout_vllm/test_vllm_omni_bagel_generate.py -v -s

# Qwen-Image (1 GPU, tiny-random model ~30MB)
pytest tests/workers/rollout/rollout_vllm/test_vllm_omni_generate.py -v -s

All tests pass. Every test in a module now shares the same engine, which makes the suites roughly 2.5-3.7x faster:

BAGEL (4 tests, ~92s): test_generate, test_generate_with_logprobs, test_generate_concurrent, test_generate_with_lora
Qwen-Image (3 tests, ~26s): test_generate, test_generate_with_logprobs, test_generate_concurrent

Both suites run in parallel on separate GPUs.

API and Usage Example

# Config (rollout_cfg.engine_kwargs)
"engine_kwargs": {
    "vllm_omni": {
        "custom_pipeline": "examples.flowgrpo_trainer.vllm_omni.pipeline_bagel.BagelPipelineWithLogProb",
        "stage_configs_path": "/path/to/bagel_sharedmemory_ci.yaml",
    }
}

# Generate with RL artifacts
output = server.generate(
    prompt_ids=token_ids,
    sampling_params={
        "num_inference_steps": 10,
        "noise_level": 0.7,
        "sde_type": "sde",
        "logprobs": True,
    },
    request_id="req_001",
)
# output.log_probs, output.extra_fields["all_latents"], output.extra_fields["all_timesteps"]

# Generate with LoRA
from vllm_omni.lora.request import LoRARequest
output = server.generate(
    prompt_ids=token_ids,
    sampling_params={"num_inference_steps": 10},
    request_id="req_002",
    lora_request=LoRARequest(lora_name="policy", lora_int_id=1, lora_path="/path/to/adapter"),
    lora_scale=1.0,
)

Design & Code Changes

New files:

examples/flowgrpo_trainer/vllm_omni/pipeline_bagel.py — Custom vllm-omni pipeline for BAGEL RL rollouts. Wraps FlowMatchSDEDiscreteScheduler with _BagelSchedulerAdapter that bridges BAGEL's 4-arg step(v_t, sigma, x_t, dt) to diffusers' 3-arg convention. Computes per-request shifted sigmas matching BAGEL's internal schedule. Returns all_latents, all_log_probs, all_timesteps in custom_output.
tests/workers/rollout/rollout_vllm/test_vllm_omni_bagel_generate.py — E2E tests through verl's rollout server, aligned with Qwen-Image test structure. Uses zhengyuansu/bagel-tiny-random (~376MB) for fast CI.
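
To make the adapter idea above concrete, here is a minimal, dependency-free sketch. It is not the code in pipeline_bagel.py: ToyEulerScheduler, SchedulerAdapter and shifted_sigmas are hypothetical names, the scheduler is a deterministic Euler stand-in rather than FlowMatchSDEDiscreteScheduler, and the shift formula is the common flow-matching time shift, assumed here to resemble BAGEL's schedule.

```python
from typing import List, Sequence


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Linear sigmas from 1.0 down to 0.0, warped by a time shift (assumed formula)."""
    raw = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in raw]


class ToyEulerScheduler:
    """Stand-in with a diffusers-like 3-arg step(model_output, timestep, sample)."""

    def __init__(self, sigmas: Sequence[float]) -> None:
        self.sigmas = list(sigmas)

    def step(self, model_output, timestep, sample):
        # Nearest-neighbour lookup tolerates small float mismatches in `timestep`.
        i = min(range(len(self.sigmas) - 1), key=lambda k: abs(self.sigmas[k] - timestep))
        delta = self.sigmas[i + 1] - self.sigmas[i]  # negative: sigma decreases
        return [x + delta * v for x, v in zip(sample, model_output)]


class SchedulerAdapter:
    """Exposes a 4-arg step(v_t, sigma, x_t, dt) on top of a 3-arg scheduler."""

    def __init__(self, inner: ToyEulerScheduler) -> None:
        self.inner = inner

    def step(self, v_t, sigma, x_t, dt):
        # `dt` is accepted for interface compatibility only; the inner
        # scheduler derives the step size from its own sigma table.
        return self.inner.step(v_t, sigma, x_t)


sigmas = shifted_sigmas(num_steps=4, shift=3.0)
adapter = SchedulerAdapter(ToyEulerScheduler(sigmas))
x_next = adapter.step([0.5, -0.5], sigmas[0], [1.0, 2.0], sigmas[0] - sigmas[1])
```

In this sketch the adapter ignores dt because the wrapped scheduler already knows the step size. A real SDE scheduler would additionally inject noise and return a log-probability for each step.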

Modified files:

verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py — Multi-stage support:
- run_server: inject stage_configs_path from engine_kwargs into OmniEngineArgs (single unified code path for both single-stage and multi-stage).
- generate: multi-stage sampling params (defaults for non-diffusion stages, caller params for diffusion stage), prompt_token_ids + modalities for vLLM input processor, caller-supplied lora_request/lora_scale.
examples/flowgrpo_trainer/vllm_omni/pipeline_qwenimage.py — Squeeze batch dim in custom_output (batch=1 in rollout) so both pipelines return unbatched tensors. Server passes through without shape manipulation.
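
The per-stage handling described for generate can be sketched as follows. This is an illustration under stated assumptions, not the server code: build_request is a hypothetical helper, plain dicts stand in for the real sampling-parameter objects, and the diffusion stage is assumed to be the last entry of default_params_list.

```python
from typing import Any, Dict, List, Sequence, Tuple


def build_request(
    prompt_token_ids: Sequence[int],
    default_params_list: List[Dict[str, Any]],
    caller_params: Dict[str, Any],
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Return (prompt, per-stage sampling params) for a single- or multi-stage engine."""
    # Copy the defaults so the engine-level list is never mutated per request.
    stage_params = [dict(p) for p in default_params_list]
    # Assumption: the diffusion stage is last; only it receives caller overrides.
    stage_params[-1].update(caller_params)

    prompt: Dict[str, Any] = {"prompt_token_ids": list(prompt_token_ids)}
    if len(default_params_list) > 1:
        # Multi-stage input goes through the vLLM input processor,
        # which also needs the target modalities.
        prompt["modalities"] = ["image"]
    return prompt, stage_params


# Hypothetical defaults for a thinker stage followed by a DiT stage.
defaults = [{"max_tokens": 1}, {"num_inference_steps": 50}]
prompt, params = build_request([1, 2, 3], defaults, {"num_inference_steps": 10, "logprobs": True})
```

The non-diffusion stage keeps its defaults, while the caller's num_inference_steps and logprobs only reach the final stage.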

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: E2E test requires 1 GPU + tiny-random model. Test file provided at tests/workers/rollout/rollout_vllm/test_vllm_omni_bagel_generate.py, configurable via BAGEL_STAGE_CONFIG env var.
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.

gemini-code-assist

Code Review

This pull request introduces support for multi-stage pipelines, specifically for the BAGEL model, within the vLLM-Omni rollout infrastructure. Key changes include the implementation of a custom BAGEL pipeline with an SDE scheduler for RL rollouts, updates to the asynchronous server to handle multi-stage engine configurations, and enhancements to the generation logic to support LoRA adapters and correctly unbatch outputs for different pipeline types. I have no feedback to provide as there were no review comments.

timzsu · 2026-04-09T13:45:04Z

@SamitHuang @princepride This PR is related to RFC vllm-project/vllm-omni#1904. Feel free to suggest how to improve.

CLAassistant · 2026-04-09T14:23:01Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ timzsu
❌ princepride

SamitHuang · 2026-04-09T14:06:14Z

+            # Target it specifically to avoid errors on the LLM stage.
+            if is_multi_stage:
+                diffusion_stage_id = len(default_params_list) - 1
+                results = await self.engine.collective_rpc("list_loras", stage_ids=[diffusion_stage_id])


This assumes that only the diffusion stage uses LoRA in a multi-stage pipeline. If a user trains the LLM/thinker stage of BAGEL with LoRA, won't this hardcoded behavior fail to locate or apply the adapter?

I have switched back to a direct call to list_loras.

SamitHuang · 2026-04-10T02:15:31Z

+        # Single-stage pipelines (Qwen-Image) batch outputs with a leading batch
+        # dim that should be stripped. Multi-stage (BAGEL) returns un-batched tensors.
+        def _unbatch(v):
+            if v is None or is_multi_stage:


Relying on is_multi_stage to decide whether _unbatch(v) strips the batch dimension is fragile: it breaks if a future single-stage or multi-stage model changes its output shape. Could this check the actual tensor dimensions, or use an explicit configuration flag for batching behavior, rather than implicitly tying it to the stage count?

Since we currently have only two pipelines, I have changed the server to require the unbatched shape. The Qwen-Image pipeline is adjusted to produce the unbatched shape directly.
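
A minimal sketch of squeezing the batch dimension on the pipeline side, assuming batch size 1 during rollout. Nested lists stand in for tensors here, and squeeze_batch and unbatch_custom_output are hypothetical helpers. With real torch tensors the equivalent operation would be tensor.squeeze(0).

```python
from typing import Any, Dict, Optional


def squeeze_batch(value: Optional[list]) -> Optional[list]:
    """Drop a leading batch dimension of size 1; pass None through unchanged."""
    if value is None:
        return None
    if len(value) != 1:
        raise ValueError(f"expected batch size 1, got {len(value)}")
    return value[0]


def unbatch_custom_output(custom_output: Dict[str, Any]) -> Dict[str, Any]:
    """Apply squeeze_batch to every entry so the server can pass values through."""
    return {key: squeeze_batch(val) for key, val in custom_output.items()}


batched = {
    "all_latents": [[[0.1, 0.2], [0.3, 0.4]]],  # shape (1, 2, 2)
    "all_log_probs": [[-1.0, -2.0]],            # shape (1, 2)
    "all_timesteps": None,                      # optional field, left as-is
}
unbatched = unbatch_custom_output(batched)
```

Failing loudly on a batch size other than 1 keeps the contract explicit instead of silently returning a differently shaped value.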

zhtmike · 2026-04-10T02:28:44Z

    # -----------------------------------------------------------------------

    async def run_server(self, args: argparse.Namespace):
-        engine_args = OmniEngineArgs.from_cli_args(args)


I think it is better to unify all engine_args under OmniEngineArgs.from_cli_args.

I agree. I have created a PR (vllm-project/vllm-omni#2684) that enables stage configs in the CLI, which simplifies the verl side.

zhtmike · 2026-04-10T02:32:19Z

@knlnguyen1802 please take a look

knlnguyen1802 · 2026-04-10T03:02:45Z

        custom_prompt = req.prompts[0] if req.prompts else {}
        if isinstance(custom_prompt, dict):
-            prompt_ids = custom_prompt.get("prompt_ids", prompt_ids)
+            prompt_ids = custom_prompt.get("prompt_token_ids", prompt_ids)


I don't think we need to rename this. Keeping the old name stays compatible with the Qwen-Image pipeline and might make debugging easier.

vLLM uses prompt_token_ids as the input key, so if we keep the old name, the engine has to handle both prompt_ids and prompt_token_ids. In my view, that would make the logic messier. Do you think we should keep the original name?

I prefer consistency. If you want to change it to prompt_token_ids, it is better to rename every prompt_ids to prompt_token_ids.

I have systematically renamed all prompt_ids in the vllm rollout path.

@knlnguyen1802 Can you please have a look at the current version?

zhtmike · 2026-04-10T13:34:37Z

+        # processor which requires "prompt_token_ids" and "modalities".
+        # Single-stage (e.g. Qwen-Image) reads "prompt_ids" directly.
+        if len(default_params_list) > 1:
+            custom_prompt: OmniCustomPrompt = {"prompt_token_ids": prompt_ids, "modalities": ["image"]}


I suggest unifying prompt_token_ids and prompt_ids. If you decide to switch to prompt_token_ids, please change the Qwen-Image pipeline as well :)

I have systematically renamed all prompt_ids in the vLLM rollout path :)

SamitHuang · 2026-04-13T03:55:25Z

test_generate_with_lora is a new test compared to Qwen-Image. How is the generation precision with LoRA checked? Can we add a precision check to the test as well?

timzsu · 2026-04-13T03:58:56Z

@SamitHuang May I ask what you mean by a precision check? The current test verifies that the LoRA introduces a noticeable perturbation without corrupting the image (the difference between the images generated with and without LoRA is bounded), which I think is the best I can do with a random LoRA. A true precision check would probably need a real LoRA adapter. Any recommendations?
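
The bounded-perturbation check described above could look roughly like this. It is a sketch only: the helper names and both thresholds are hypothetical, and flat lists of pixel values in [0, 1] stand in for real image tensors.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two flattened images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def lora_perturbation_ok(
    base: Sequence[float],
    with_lora: Sequence[float],
    min_diff: float = 1e-4,  # hypothetical: below this, the LoRA had no visible effect
    max_diff: float = 0.5,   # hypothetical: above this, the image is likely corrupted
) -> bool:
    """True if the LoRA output differs from the base output, but only by a bounded amount."""
    return min_diff < mean_abs_diff(base, with_lora) < max_diff


base_image = [0.10, 0.50, 0.90, 0.30]
lora_image = [0.12, 0.47, 0.91, 0.33]  # slightly perturbed copy
```

A check like this only shows that the adapter is applied and does not break generation. It says nothing about numerical agreement with a reference implementation.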

SamitHuang · 2026-04-13T04:25:10Z

The BAGEL suite (4 tests, ~345s) takes about 5x as long as the Qwen-Image suite, which means it may take 3m44s x 5 on the CI server (https://github.com/verl-project/verl/actions/runs/23841025230/job/69496889448). Can we reduce this test time further?

SamitHuang · 2026-04-13T04:27:13Z

Can you provide a deterministic check with a random checkpoint for reproducibility? The ground truth should come from code that has been verified against a real LoRA adapter.

timzsu · 2026-04-13T15:51:45Z

@SamitHuang Previously, each test started its own engine. Now all tests in a module share the same engine. Locally, the BAGEL suite is about 3.7x faster and the Qwen-Image suite about 2.5x faster than before.

timzsu · 2026-04-13T15:53:42Z

I didn't find an open-sourced real LoRA adapter for BAGEL. Do you have one?

…ttpServer
Add multi-stage BAGEL (thinker + DiT) integration to verl's rollout server, following the existing Qwen-Image pattern. Includes custom vllm-omni pipeline with SDE scheduler for log-probability recording, multi-stage engine initialization, per-request LoRA on the diffusion stage, and E2E tests with synthetic LoRA adapters.
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

- Unify run_server: single OmniEngineArgs.from_cli_args path, inject stage_configs_path and custom_pipeline_args after parsing
- LoRA: use engine.list_loras() for all stages instead of targeting diffusion stage specifically
- Remove _unbatch from server: squeeze batch dim in Qwen-Image pipeline instead, so both pipelines return unbatched tensors
- Use len(default_params_list) > 1 for multi-stage checks instead of separate flag
- Update BAGEL test model path
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lout path
Unify naming with vLLM convention across the rollout server, pipelines, and tests. This eliminates the need for per-pipeline dict key branching.
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Change init_server fixture from function scope to module scope for both BAGEL and QwenImage rollout tests. The vllm-omni server now starts once per test module instead of once per test function.
Results (local, 2x RTX 6000 Ada):
- BAGEL (4 tests): 343s → 92s (3.7x faster)
- QwenImage (3 tests): 63s → 25s (2.5x faster)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add DiffusionModelBase implementation for Bagel (bagel.py) with custom model loading, scheduler, and forward pass
- Add build_module hook in DiffusionModelBase and DiffusersFSDPEngine to support non-standard model loading (e.g. Bagel's BagelForTraining)
- Fix scheduler device/precision mismatch in FlowMatchSDEDiscreteScheduler (index_for_timestep nearest-neighbor, move sigmas to sample device)
- Fix pipeline_bagel to stack trajectory tensors for proper batching
- Add system prompt and negative prompt to OCR training data
- Add error handling in reward_fn for vLLM reward model failures
- Guard prompt_embeds access for models that don't produce embeddings
- Add profiler support to diffusion trainer (config + start/stop hooks)
- Fix prompt_ids -> prompt_token_ids rename in agent loop server call
- Update run_bagel_flowgrpo.sh with tuned hyperparameters
Co-authored-by: Claude
Co-authored-by: Cursor <cursoragent@cursor.com>

Enable Bagel FlowGRPO LoRA training with vLLM-omni rollout

timzsu requested review from PeterSH6, chenhaiq and wuxibin89 as code owners April 9, 2026 13:35

gemini-code-assist Bot reviewed Apr 9, 2026

SamitHuang reviewed Apr 10, 2026

zhtmike reviewed Apr 10, 2026

knlnguyen1802 reviewed Apr 10, 2026

timzsu force-pushed the bagel-pipeline branch from 327be4d to 2ecc470 Compare April 10, 2026 13:30

zhtmike reviewed Apr 10, 2026

timzsu requested review from SamitHuang, knlnguyen1802 and zhtmike April 10, 2026 14:18

timzsu mentioned this pull request Apr 10, 2026

[Feat] Override single stage CLI args when stage_configs_path is set in OmniEngineArgs vllm-project/vllm-omni#2684

Merged

5 tasks

zhtmike approved these changes Apr 10, 2026

knlnguyen1802 approved these changes Apr 13, 2026

timzsu mentioned this pull request Apr 15, 2026

[RFC]: Reinforcement learning support for multi-stage models (Bagel) in vLLM-Omni vllm-project/vllm-omni#1904

Open

5 tasks

timzsu and others added 5 commits April 17, 2026 12:18

Update test

82e05a5

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

[rollout] refactor: Minimize run_server diff against main

8d52f8b

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

timzsu and others added 2 commits April 17, 2026 12:18

Update model path for bagel-tiny-random

ba1d306

timzsu force-pushed the bagel-pipeline branch from 1f546df to a534b40 Compare April 17, 2026 05:19

princepride and others added 3 commits April 20, 2026 09:29

Add standalone BAGEL FSDP training module and test script

f25abc9

Merge pull request #1 from timzsu/bagel-fsdp-training

83ef93b

Enable Bagel FlowGRPO LoRA training with vLLM-omni rollout

princepride requested review from ArronHZG, eric-haibin-lin, tongyx361 and vermouth1992 as code owners May 3, 2026 03:31

timzsu mentioned this pull request May 10, 2026

[diffusion, rollout, trainer] feat: add BAGEL FlowGRPO support verl-project/verl-omni#66

Open

6 tasks

6 participants
