[vllm] feat: Support online quant for rollout with torchao #3084
wuxibin89 merged 1 commit into verl-project:main
Conversation
Code Review
This pull request introduces support for on-the-fly quantization for vLLM rollouts using torchao. The changes involve adding configuration options for quantization and implementing the quantization logic within the FSDP sharding manager.
My review identified a critical issue in the implementation where model weights would fail to load if quantization is disabled. I have provided a code suggestion to fix this. Additionally, I've pointed out that the current method for selecting layers to quantize is too specific and may miss some linear layers, which could lead to unexpected behavior.
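To make the layer-selection concern concrete, here is an illustrative sketch (not the PR's actual code; the filter names, module names, and stand-in classes are all hypothetical) of why matching layers by type is more robust than matching by name: a name-based filter silently skips linear layers registered under unexpected names, while an `isinstance` check catches every `Linear` subclass.

```python
# Hypothetical stand-ins for torch.nn modules, so the sketch runs without torch.
class Linear:              # stand-in for torch.nn.Linear
    pass

class GateLinear(Linear):  # a Linear subclass a name-based filter would miss
    pass

class Embedding:           # should not be quantized here
    pass

def name_based_filter(name, module):
    # Fragile: only matches a hard-coded list of projection names.
    return name.endswith(("q_proj", "k_proj", "v_proj", "o_proj"))

def type_based_filter(name, module):
    # Robust: matches every Linear, including subclasses.
    return isinstance(module, Linear)

modules = {
    "layers.0.self_attn.q_proj": Linear(),
    "layers.0.mlp.gate": GateLinear(),   # missed by the name-based filter
    "embed_tokens": Embedding(),
}

name_hits = [n for n, m in modules.items() if name_based_filter(n, m)]
type_hits = [n for n, m in modules.items() if type_based_filter(n, m)]
```

The type-based filter selects both linear layers; the name-based one drops `layers.0.mlp.gate`, which is the failure mode the review warns about.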
Please let me know if the API makes sense; I can clean up both PRs after confirmation.
I also wonder how torchao fp8 quantization compares to vllm's own impl. And also, how does the approach in this PR compare to the FlashRL approach (which also patches vllm)?
We haven't officially compared with them yet, but we are integrating with fbgemm kernels, which should be SOTA.
Also haven't compared yet, but we can discuss what API would make the most sense, I think.
```
# Specify served_model_name to avoid displaying overly long model paths in Grafana
served_model_name: ${oc.select:actor_rollout_ref.model.path,null}
quantization: null
```
Please add comments above `quantization` and `quantization_config_file`.
Summary: Only supports quantizing all linear layers with a torchao config for now. See the vllm PR for how to generate the quantization file. Also requires vllm changes: vllm-project/vllm#23014

Test Plan: `sh examples/ppo_trainer/run_deepseek7b_llm.sh`
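One way the requested comments could look (the two field names come from this PR; the comment wording below is only a suggestion, not the merged text):

```yaml
rollout:
  # Online quantization backend for rollout. Currently only "torchao" is
  # supported, and only quantizing all linear layers; null disables it.
  quantization: null
  # Path to a serialized torchao config (JSON). See vllm-project/vllm#23014
  # for how to generate it. Ignored when quantization is null.
  quantization_config_file: null
```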
### What does this PR do?

Add support for torchao online quantization for vllm; the type of quantization is configured by serializing a config file. See the vllm PR for how to generate the quantization file (vllm-project/vllm#23014).

Requires vllm changes: vllm-project/vllm#23014 and vllm-project/vllm#26327

### Test

1. Generate the torchao config file `torchao_config.json` (can change to other configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize):

```
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.core.config import config_to_dict
import json

config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
with open("torchao_config.json", "w") as f:
    f.write(json.dumps(config_to_dict(config)))
# LLM(..., quantization="torchao", hf_overrides={"quantization_config_file": "torchao_config.json"})
```

This is fp8 dynamic quant:

```
{"_type": "Float8DynamicActivationFloat8WeightConfig", "_version": 2, "_data": {"activation_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "weight_dtype": {"_type": "torch.dtype", "_data": "float8_e4m3fn"}, "granularity": [{"_type": "PerRow", "_version": 1, "_data": {}}, {"_type": "PerRow", "_version": 1, "_data": {}}], "mm_config": {"_type": "Float8MMConfig", "_version": 1, "_data": {"emulate": false, "use_fast_accum": true, "pad_inner_dim": false}}, "activation_value_lb": null, "activation_value_ub": null, "kernel_preference": {"_type": "KernelPreference", "_data": "AUTO"}, "set_inductor_config": true}}
```

2. Add the following to `examples/ppo_trainer/run_deepseek7b_llm.sh`:

```
actor_rollout_ref.rollout.quantization=torchao \
actor_rollout_ref.rollout.quantization_config_file=torchao_config.json \
```

3. Run the test:

```
VLLM_DISABLE_COMPILE_CACHE=1 sh examples/ppo_trainer/run_deepseek7b_llm.sh
```

```
# baseline
(TaskRunner pid=539843) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=539843) "0.6717210007581501, 'val-core/openai/gsm8k/acc/mean@1': 0.6717210007581501, "
(TaskRunner pid=539843) "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=539843) "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=539843) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6717210007581501 - val-core/openai/gsm8k/acc/mean@1:0.6717210007581501 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0

# fp8
(TaskRunner pid=3763210) validation generation end
(TaskRunner pid=3763210) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
(TaskRunner pid=3763210) "0.6739954510993177, 'val-core/openai/gsm8k/acc/mean@1': 0.6739954510993177, "
(TaskRunner pid=3763210) "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
(TaskRunner pid=3763210) "'val-aux/num_turns/mean': 2.0}")
(TaskRunner pid=3763210) step:105 - val-aux/openai/gsm8k/reward/mean@1:0.6739954510993177 - val-core/openai/gsm8k/acc/mean@1:0.6739954510993177 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0
```

Docs: no docs added yet, since I didn't find a place to add quantized-rollout docs in https://github.com/volcengine/verl/blob/main/docs/workers/fsdp_workers.rst; happy to add later when there are more docs.

We can add simple string options (e.g. fp8_tensorwise, fp8_rowwise, fp8_blockwise) in the future if needed.

### Checklist Before Starting

- [x] Search for similar PRs: https://github.com/volcengine/verl/pulls?q=sort%3Aupdated-desc+is%3Apr+is%3Aopen+quantization+
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
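For intuition about what the per-row fp8 config above does, here is a simplified, torch-free sketch. This is not the torchao implementation: real kernels cast to `float8_e4m3fn`, while this sketch only mimics per-row dynamic scaling with rounding to an integer grid. The point it illustrates is that giving each row its own scale keeps an outlier row from destroying precision in small-magnitude rows.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_row(row):
    """Per-row dynamic quantization: scale the row so its max-abs value
    maps to E4M3_MAX, then round (a crude stand-in for fp8 casting)."""
    amax = max(abs(x) for x in row) or 1.0
    scale = amax / E4M3_MAX
    q = [round(x / scale) for x in row]
    return q, scale

def dequantize_row(q, scale):
    return [x * scale for x in q]

weight = [
    [0.01, -0.02, 0.03],     # small-magnitude row
    [100.0, -250.0, 448.0],  # outlier row; its own scale isolates it
]

recon = []
for row in weight:
    q, s = quantize_row(row)
    recon.append(dequantize_row(q, s))

# Per-row scaling keeps relative error small in *both* rows.
for row, rec in zip(weight, recon):
    for a, b in zip(row, rec):
        assert abs(a - b) <= max(abs(a), 1e-6) * 0.05
```

With a single per-tensor scale, the first row would collapse to near-zero values; the per-row scheme reconstructs it to within a few percent.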