
[vllm] feat: Support online quant for rollout with torchao #3084

Merged
wuxibin89 merged 1 commit into verl-project:main from jerryzh168:on-the-fly-quant on Dec 23, 2025

Conversation

@jerryzh168 (Contributor) commented Aug 15, 2025

What does this PR do?

Add support for torchao online quantization for vLLM rollouts; the type of quantization is configured by serializing a config file. See the vLLM PR for how to generate the quantization file (vllm-project/vllm#23014).
Requires vLLM changes: vllm-project/vllm#23014 and vllm-project/vllm#26327

Test

  1. Generate the torchao config file as torchao_config.json (you can change to other configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize):

```python
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.core.config import config_to_dict
import json

config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())

with open("torchao_config.json", "w") as f:
    f.write(json.dumps(config_to_dict(config)))

# LLM(..., quantization="torchao", hf_overrides={"quantization_config_file": "torchao_config.json"})
```

This is fp8 dynamic quant:

```json
{"_type": "Float8DynamicActivationFloat8WeightConfig", "_version": 2, "_data": {"activation_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "weight_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "granularity": [{"_type": "PerRow", "_version": 1, "_data": {}}, {"_type": "PerRow", "_version": 1, "_data": {}}], "mm_config": {"_type": "Float8MMConfig", "_version": 1, "_data": {"emulate": false, "use_fast_accum": true, "pad_inner_dim": false}}, "activation_value_lb": null, "activation_value_ub": null, "kernel_preference": {"_type": "KernelPreference", "_data": "AUTO"}, "set_inductor_config": true}}
```
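As a quick sanity check, the serialized config can be inspected with the standard json module (a minimal sketch; the string below is the fp8 dump shown above, and the field names are read directly from it):

```python
import json

# The fp8 dynamic-quant config dumped above, verbatim.
raw = '{"_type": "Float8DynamicActivationFloat8WeightConfig", "_version": 2, "_data": {"activation_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "weight_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "granularity": [{"_type": "PerRow", "_version": 1, "_data": {}}, {"_type": "PerRow", "_version": 1, "_data": {}}], "mm_config": {"_type": "Float8MMConfig", "_version": 1, "_data": {"emulate": false, "use_fast_accum": true, "pad_inner_dim": false}}, "activation_value_lb": null, "activation_value_ub": null, "kernel_preference": {"_type": "KernelPreference", "_data": "AUTO"}, "set_inductor_config": true}}'

cfg = json.loads(raw)
assert cfg["_type"] == "Float8DynamicActivationFloat8WeightConfig"

# Both activations and weights are quantized to float8_e4m3fn, per-row.
print(cfg["_data"]["activation_dtype"]["_data"])  # float8_e4m3fn
print(cfg["_data"]["granularity"][0]["_type"])    # PerRow
```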
  2. Add the following to `sh examples/ppo_trainer/run_deepseek7b_llm.sh`:

```
actor_rollout_ref.rollout.quantization=torchao \
actor_rollout_ref.rollout.quantization_config_file=torchao_config.json \
```
  3. Run the test:

```
VLLM_DISABLE_COMPILE_CACHE=1 sh examples/ppo_trainer/run_deepseek7b_llm.sh
```
```
# baseline
(TaskRunner pid=539843) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=539843)  "0.6717210007581501, 'val-core/openai/gsm8k/acc/mean@1': 0.6717210007581501, "
(TaskRunner pid=539843)  "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=539843)  "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=539843) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6717210007581501 - val-core/openai/gsm8k/acc/mean@1:0.6717210007581501 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0

# fp8
(TaskRunner pid=3763210) validation generation end
(TaskRunner pid=3763210) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=3763210)  "0.6739954510993177, 'val-core/openai/gsm8k/acc/mean@1': 0.6739954510993177, "
(TaskRunner pid=3763210)  "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=3763210)  "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=3763210) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6739954510993177 - val-core/openai/gsm8k/acc/mean@1:0.6739954510993177 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0
```
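A quick check on the numbers in the logs above shows the fp8 rollout matches the baseline within noise:

```python
# GSM8K validation accuracy at step 105, copied from the logs above.
baseline_acc = 0.6717210007581501  # bf16 rollout
fp8_acc = 0.6739954510993177       # torchao fp8 rollout

delta = fp8_acc - baseline_acc
print(f"absolute delta: {delta:+.4f}")  # fp8 happens to be marginally higher here
print(f"relative delta: {delta / baseline_acc:+.2%}")
```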

Docs:

No docs added yet, since I didn't find a place to add quantized-rollout docs in https://github.com/volcengine/verl/blob/main/docs/workers/fsdp_workers.rst; happy to add them later when there are more docs.

We can add simple string options (e.g. fp8_tensorwise, fp8_rowwise, fp8_blockwise etc.) in the future if needed.

Reviewers:

Subscribers:

Tasks:

Tags:

Checklist Before Starting

  • Search for similar PRs. https://github.com/volcengine/verl/pulls?q=sort%3Aupdated-desc+is%3Apr+is%3Aopen+quantization+
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide (https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation (https://github.com/volcengine/verl/tree/main/docs).
  • Add unit or end-to-end test(s) to the CI workflow (https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
  • Once your PR is ready for CI, send a message in the ci-request channel (https://verl-project.slack.com/archives/C091TCESWB1) in the verl Slack workspace.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for on-the-fly quantization for vLLM rollouts using torchao. The changes involve adding configuration options for quantization and implementing the quantization logic within the FSDP sharding manager.

My review identified a critical issue in the implementation where model weights would fail to load if quantization is disabled. I have provided a code suggestion to fix this. Additionally, I've pointed out that the current method for selecting layers to quantize is too specific and may miss some linear layers, which could lead to unexpected behavior.

@jerryzh168 jerryzh168 marked this pull request as draft August 16, 2025 00:02
@jerryzh168 (Contributor, Author):

Please let me know if the API makes sense; I can clean up both PRs after confirmation.


vadimkantorov commented Aug 18, 2025

I also wonder how torchao fp8 quantization compares to vllm's own impl for quantization="fp8" (and related) - and curious, why does vllm not use torchao for this (IIUC there are several "quantization backends" in vllm?), why do they have to use CUDA kernels https://github.com/vllm-project/vllm/tree/main/csrc/quantization/fp8 (why Triton is not sufficient) and why are these kernels not upstreamed to PyTorch core :) Hoping that at least on kernel level, fragmentation can be reduced by upstreaming more of these https://github.com/vllm-project/vllm/tree/main/csrc/quantization :)

And also, how does the approach in this PR compare to FlashRL approach (which also patches vllm)

@jerryzh168 (Contributor, Author):

> I also wonder how torchao fp8 quantization compares to vllm's own impl for quantization="fp8" (and related) - and curious, why does vllm not use torchao for this (IIUC there are several "quantization backends" in vllm?), why do they have to use CUDA kernels vllm-project/vllm@main/csrc/quantization/fp8 (why Triton is not sufficient) and why are these kernels not upstreamed to PyTorch core :) Hoping that at least on kernel level, fragmentation can be reduced by upstreaming more of these vllm-project/vllm@main/csrc/quantization :)

We haven't officially compared with them yet, but we are integrating with fbgemm kernels, which should be SOTA.
Why does vllm not use torchao: torchao was integrated into vllm only recently (a few months ago): https://docs.vllm.ai/en/latest/features/quantization/torchao.html, and we are actively working on improving torchao for vllm users. I don't have context on vllm-project/vllm@main/csrc/quantization/fp8 though.

> And also, how does the approach in this PR compare to FlashRL approach (which also patches vllm)

Also haven't compared yet, but I think we can discuss what API would make the most sense.

@jerryzh168 force-pushed the on-the-fly-quant branch 2 times, most recently from d06a31d to 3bf8cd5 on October 3, 2025
@jerryzh168 force-pushed the on-the-fly-quant branch 2 times, most recently from 353c1eb to 8ac8d67 on November 3, 2025
@jerryzh168 changed the title from "[vllm] feat: Support on the fly quant for rollout with torchao" to "[vllm] feat: Support quant for rollout with torchao" on Nov 3, 2025
@jerryzh168 force-pushed the on-the-fly-quant branch 2 times, most recently from df0ec96 to 266420a on November 13, 2025
@jerryzh168 changed the title from "[vllm] feat: Support quant for rollout with torchao" to "[vllm,fsdp] feat: Support quant for rollout with torchao" on Nov 13, 2025
@jerryzh168 changed the title from "[vllm,fsdp] feat: Support quant for rollout with torchao" to "[vllm, fsdp] feat: Support quant for rollout with torchao" on Nov 13, 2025
@jerryzh168 force-pushed the on-the-fly-quant branch 2 times, most recently from 108b3ce to ddd405e on November 19, 2025

CLAassistant commented Nov 19, 2025

CLA assistant check
All committers have signed the CLA.

@jerryzh168 changed the title from "[vllm, fsdp] feat: Support quant for rollout with torchao" to "[vllm, fsdp] feat: Support online quant for rollout with torchao" on Nov 19, 2025
@jerryzh168 marked this pull request as ready for review November 19, 2025 01:20
@jerryzh168 changed the title from "[vllm, fsdp] feat: Support online quant for rollout with torchao" to "[vllm] feat: Support online quant for rollout with torchao" on Nov 19, 2025
```yaml
# Specify served_model_name to avoid displaying overly long model paths in Grafana
served_model_name: ${oc.select:actor_rollout_ref.model.path,null}

quantization: null
```
Collaborator:

Please add a comment above quantization and quantization_config_file.
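A sketch of what such comments could look like (the wording and the null default for quantization_config_file are assumptions; the keys come from this PR's config overrides):

```yaml
# Rollout weight quantization backend; only "torchao" is supported for now.
# When set, all linear layers are quantized on the fly during rollout.
quantization: null
# Path to a serialized torchao quantization config (JSON); see
# vllm-project/vllm#23014 for how to generate it. Only used when quantization is set.
quantization_config_file: null
```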

Summary:
Only quantizing all linear layers with a torchao config is supported for now. See the vLLM PR for how to generate the quantization file.
Also requires vllm changes: vllm-project/vllm#23014

Test Plan:
sh examples/ppo_trainer/run_deepseek7b_llm.sh

@wuxibin89 wuxibin89 merged commit 45e707a into verl-project:main Dec 23, 2025
55 of 56 checks passed
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026