[vllm] feat: Support online quant for rollout with torchao #3084
wuxibin89 merged 1 commit into verl-project:main
Conversation
Code Review
This pull request introduces support for on-the-fly quantization for vLLM rollouts using torchao. The changes involve adding configuration options for quantization and implementing the quantization logic within the FSDP sharding manager.
My review identified a critical issue in the implementation where model weights would fail to load if quantization is disabled. I have provided a code suggestion to fix this. Additionally, I've pointed out that the current method for selecting layers to quantize is too specific and may miss some linear layers, which could lead to unexpected behavior.
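To make the layer-selection concern concrete, here is an illustrative sketch (not the PR's actual code; the filter names, module names, and stand-in classes are all hypothetical) of why matching layers by type is more robust than matching by name: a name-based filter silently skips linear layers registered under unexpected names, while an `isinstance` check catches every `Linear` subclass.

```python
# Hypothetical stand-ins for torch.nn modules, so the sketch runs without torch.
class Linear:              # stand-in for torch.nn.Linear
    pass

class GateLinear(Linear):  # a Linear subclass a name-based filter would miss
    pass

class Embedding:           # should not be quantized here
    pass

def name_based_filter(name, module):
    # Fragile: only matches a hard-coded list of projection names.
    return name.endswith(("q_proj", "k_proj", "v_proj", "o_proj"))

def type_based_filter(name, module):
    # Robust: matches every Linear, including subclasses.
    return isinstance(module, Linear)

modules = {
    "layers.0.self_attn.q_proj": Linear(),
    "layers.0.mlp.gate": GateLinear(),   # missed by the name-based filter
    "embed_tokens": Embedding(),
}

name_hits = [n for n, m in modules.items() if name_based_filter(n, m)]
type_hits = [n for n, m in modules.items() if type_based_filter(n, m)]
```

The type-based filter selects both linear layers; the name-based one drops `layers.0.mlp.gate`, which is the failure mode the review warns about.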
Please let me know if the API makes sense; I can clean up both PRs after confirmation.
I also wonder how torchao fp8 quantization compares to vllm's own impl. And also, how does the approach in this PR compare to the FlashRL approach (which also patches vllm)?
We haven't officially compared with them yet, but we are integrating with fbgemm kernels, which should be SOTA.
Also haven't compared yet, but we can discuss what API would make the most sense, I think.
```
# Specify served_model_name to avoid displaying overly long model paths in Grafana
served_model_name: ${oc.select:actor_rollout_ref.model.path,null}
quantization: null
```
Please add comments above `quantization` and `quantization_config_file`.
Summary: Only supports quantizing all linear layers with a torchao config for now. See the vllm PR for how to generate the quantization file. Also requires vllm changes: vllm-project/vllm#23014

Test Plan: `sh examples/ppo_trainer/run_deepseek7b_llm.sh`
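One way the requested comments could look (the two field names come from this PR; the comment wording below is only a suggestion, not the merged text):

```yaml
rollout:
  # Online quantization backend for rollout. Currently only "torchao" is
  # supported, and only quantizing all linear layers; null disables it.
  quantization: null
  # Path to a serialized torchao config (JSON). See vllm-project/vllm#23014
  # for how to generate it. Ignored when quantization is null.
  quantization_config_file: null
```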
### What does this PR do?

Add support for torchao online quantization for vllm; the type of quantization is configured by serializing a config file. See the vllm PR for how to generate the quantization file (vllm-project/vllm#23014).

Requires vllm changes: vllm-project/vllm#23014 and vllm-project/vllm#26327

### Test

1. Generate the torchao config file `torchao_config.json` (can change to other configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize):

```
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.core.config import config_to_dict
import json

config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
with open("torchao_config.json", "w") as f:
    f.write(json.dumps(config_to_dict(config)))
# LLM(..., quantization="torchao", hf_overrides={"quantization_config_file": "torchao_config.json"})
```

This is fp8 dynamic quant:

```
{"_type": "Float8DynamicActivationFloat8WeightConfig", "_version": 2, "_data": {"activation_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "weight_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "granularity": [{"_type": "PerRow", "_version": 1, "_data": {}}, {"_type": "PerRow", "_version": 1, "_data": {}}], "mm_config": {"_type": "Float8MMConfig", "_version": 1, "_data": {"emulate": false, "use_fast_accum": true, "pad_inner_dim": false}}, "activation_value_lb": null, "activation_value_ub": null, "kernel_preference": {"_type": "KernelPreference", "_data": "AUTO"}, "set_inductor_config": true}}
```

2. Add the following to `examples/ppo_trainer/run_deepseek7b_llm.sh`:

```
actor_rollout_ref.rollout.quantization=torchao \
actor_rollout_ref.rollout.quantization_config_file=torchao_config.json \
```

3. Run the test:

```
VLLM_DISABLE_COMPILE_CACHE=1 sh examples/ppo_trainer/run_deepseek7b_llm.sh
```

```
# baseline
(TaskRunner pid=539843) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=539843) "0.6717210007581501, 'val-core/openai/gsm8k/acc/mean@1': 0.6717210007581501, "
(TaskRunner pid=539843) "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=539843) "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=539843) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6717210007581501 - val-core/openai/gsm8k/acc/mean@1:0.6717210007581501 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0

# fp8
(TaskRunner pid=3763210) validation generation end
(TaskRunner pid=3763210) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=3763210) "0.6739954510993177, 'val-core/openai/gsm8k/acc/mean@1': 0.6739954510993177, "
(TaskRunner pid=3763210) "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=3763210) "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=3763210) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6739954510993177 - val-core/openai/gsm8k/acc/mean@1:0.6739954510993177 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0
```

Docs: no docs added yet, since I didn't find a place to add quantized-rollout docs in https://github.com/volcengine/verl/blob/main/docs/workers/fsdp_workers.rst; happy to add later when there are more docs.

We can add simple string options (e.g. fp8_tensorwise, fp8_rowwise, fp8_blockwise) in the future if needed.

### Checklist Before Starting

- [x] Search for similar PRs: https://github.com/volcengine/verl/pulls?q=sort%3Aupdated-desc+is%3Apr+is%3Aopen+quantization+
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
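For intuition about what the per-row fp8 config above does, here is a simplified, torch-free sketch. This is not the torchao implementation: real kernels cast to `float8_e4m3fn`, while this sketch only mimics per-row dynamic scaling with rounding to an integer grid. The point it illustrates is that giving each row its own scale keeps an outlier row from destroying precision in small-magnitude rows.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_row(row):
    """Per-row dynamic quantization: scale the row so its max-abs value
    maps to E4M3_MAX, then round (a crude stand-in for fp8 casting)."""
    amax = max(abs(x) for x in row) or 1.0
    scale = amax / E4M3_MAX
    q = [round(x / scale) for x in row]
    return q, scale

def dequantize_row(q, scale):
    return [x * scale for x in q]

weight = [
    [0.01, -0.02, 0.03],     # small-magnitude row
    [100.0, -250.0, 448.0],  # outlier row; its own scale isolates it
]

recon = []
for row in weight:
    q, s = quantize_row(row)
    recon.append(dequantize_row(q, s))

# Per-row scaling keeps relative error small in *both* rows.
for row, rec in zip(weight, recon):
    for a, b in zip(row, rec):
        assert abs(a - b) <= max(abs(a), 1e-6) * 0.05
```

With a single per-tensor scale, the first row would collapse to near-zero values; the per-row scheme reconstructs it to within a few percent.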