
[Quantization][RL] Support Online Blockwise FP8 Quantization #15440

Open
AniZpZ wants to merge 28 commits into sgl-project:main from AniZpZ:dev/blockwise-fp8-rollout

Conversation

@AniZpZ (Collaborator) commented Dec 19, 2025

Motivation

Following #9650, this PR adds support for blockwise FP8 rollout with flashrl.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist


@AniZpZ AniZpZ changed the title [Quantization][RL] Support Online Blockwise FP8 Quantization [WIP][Quantization][RL] Support Online Blockwise FP8 Quantization Dec 19, 2025
@AniZpZ AniZpZ changed the title [WIP][Quantization][RL] Support Online Blockwise FP8 Quantization [Quantization][RL] Support Online Blockwise FP8 Quantization Dec 22, 2025
@Wilboludriver (Contributor) commented Dec 22, 2025

Experimental Details

Model: Qwen/Qwen3-8B-Base
Training Recipe: DAPO
Configuration:

  • Training dataset: DAPO-Math-17k
  • Quantization scheme: dynamic blockwise FP8
  • Validation set: AIME-2024
  • Prompt batch size: 32, n = 16
  • Rollout batch size: 32316
  • train_batch_size & ppo_mini_batch_size: 32
  • Token-level TIS, C = 2
  • Max response length: 20K
  • Hardware: 8*H20; framework: veRL; CUDA 12.9

Results (2026.01.12 Updated)

Observations and Outlook

Accuracy of Quantization: The current blockwise FP8 rollout implementation, which converts weights by FP32 -> BF16 -> FP8, shows only minor training-inference discrepancies and maintains training metrics consistent with the BF16 baseline. In contrast, per-channel FP8 quantization leads to notable precision loss during text generation. Further experiments indicate that direct FP32-to-FP8 quantization results in a larger performance gap and an elevated final validation score, which is attributed to longer generated responses.
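The FP32 -> BF16 -> FP8 path above can be illustrated with a minimal NumPy sketch. This is not the PR's implementation: the BF16 step is simulated with a bit trick, the final cast to float8_e4m3 is omitted, and only the per-block scale computation (one fp32 scale per 128x128 block) is shown; function names here are illustrative.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3

def to_bf16(x):
    """Simulate the FP32 -> BF16 step by rounding away the low 16 mantissa
    bits (round-to-nearest-even), without needing a native bf16 dtype."""
    u = x.astype(np.float32).view(np.uint32)
    u = (u + np.uint32(0x7FFF) + ((u >> np.uint32(16)) & np.uint32(1))) & np.uint32(0xFFFF0000)
    return u.view(np.float32)

def blockwise_fp8_scales(w, block=(128, 128)):
    """Compute one fp32 scale per (128, 128) weight block, the core of
    blockwise FP8 quantization; the actual cast to float8 is omitted here."""
    m, n = w.shape
    bm, bn = block
    assert m % bm == 0 and n % bn == 0
    # Group the weight into (m // bm, n // bn) blocks of shape (bm, bn)
    blocks = w.reshape(m // bm, bm, n // bn, bn).transpose(0, 2, 1, 3)
    amax = np.abs(blocks.astype(np.float32)).max(axis=(2, 3))
    return amax / FP8_E4M3_MAX  # shape (m // bm, n // bn)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
# FP32 -> BF16 -> per-block scale, mirroring the conversion order in this PR
scales = blockwise_fp8_scales(to_bf16(w))
```

Quantizing the rounded BF16 values (rather than the raw FP32 weights) keeps the rollout weights consistent with what a BF16 trainer actually holds, which matches the observation that this path tracks the BF16 baseline most closely.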

Gen. Throughput: FP8 rollout initially delivers slightly higher throughput than BF16 but is later surpassed as training progresses. In additional runs with a maximum response length of 10K (compared to the current 20K setting), FP8 rollout achieves significantly higher throughput than BF16 when long-tail generation is constrained.

@zhaochenyang20 (Collaborator)

/rerun-failed-ci

@Hecate0821 (Contributor)

TODOs:

  • Investigate why FP8 achieves higher accuracy and determine if it is purely due to noise.
  • Analyze why FP8 throughput becomes lower than BF16 in subsequent steps.

@zhaochenyang20 (Collaborator)

TODOs:

  • Investigate why FP8 achieves higher accuracy and determine if it is purely due to noise.
  • Analyze why FP8 throughput becomes lower than BF16 in subsequent steps.

Shall we finish these todos, then merge this PR?

@AniZpZ (Collaborator, Author) commented Jan 12, 2026

TODOs:

  • Investigate why FP8 achieves higher accuracy and determine if it is purely due to noise.
  • Analyze why FP8 throughput becomes lower than BF16 in subsequent steps.

Shall we finish these todos, then merge this PR?

@zhaochenyang20 @Hecate0821 @FlamingoPg The TODOs above have been resolved and CI has passed.

Comment on lines +882 to +883
# Note: only [128, 128] block size is available for now
default_block_size = [128, 128]
Collaborator:

default_block_size = [128, 128] is set twice. We should set it only once, with [128, 128] as the default value.

if quant_method is not None:
quant_method.process_weights_after_loading(module)
logger.info(
f"[QuantizedRL] Fllback to per-channel quantization for module: {name}; "
Collaborator:

Typo: "Fllback" should be "Fallback".



# Adapt from https://github.com/volcengine/verl/pull/4415/files#diff-79538cec3426fe5c75d07b39a15e90971f19e98404755792f9b28859b8902ae1
def scaled_fp8_blockwise(
Collaborator:

Could we add dedicated comments and a return type hint to this function?

logger.debug(
f"[QuantizedRL] Set quant_method weight_block_size={default_block_size} for module: {name}"
)
except Exception as e:
Collaborator:

Please do not catch errors this broadly; a bare except Exception may swallow unexpected errors. Could we catch only RuntimeError/ValueError?

# Permute to (BLK_M, BLK_N, BLOCK_SIZE_M, BLOCK_SIZE_N)
data_hp = data_hp.permute(0, 2, 1, 3)
# Flatten to (BLK_M, BLK_N, BLOCK_SIZE_M * BLOCK_SIZE_N)
data_hp = data_hp.to(torch.float32).contiguous().flatten(start_dim=2)
Collaborator:

Is converting to fp32 a must-have here?

Contributor:

Converting to fp32 ensures the precision for scale calculations, as the scales are also in fp32.
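For readers following along, the permute-and-flatten pattern under discussion can be sketched in NumPy. The 128x128 block shape comes from the PR, the variable names mirror the snippet, but the input shape is an arbitrary example:

```python
import numpy as np

BLOCK_M, BLOCK_N = 128, 128  # only [128, 128] block size is available for now

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 512)).astype(np.float16)  # example half-precision weight

blk_m, blk_n = w.shape[0] // BLOCK_M, w.shape[1] // BLOCK_N
data_hp = w.reshape(blk_m, BLOCK_M, blk_n, BLOCK_N)
# Permute to (BLK_M, BLK_N, BLOCK_SIZE_M, BLOCK_SIZE_N)
data_hp = data_hp.transpose(0, 2, 1, 3)
# Flatten to (BLK_M, BLK_N, BLOCK_SIZE_M * BLOCK_SIZE_N); upcast to fp32 so the
# per-block amax, and hence the scale, is computed at the scales' precision
data_hp = np.ascontiguousarray(data_hp).astype(np.float32).reshape(blk_m, blk_n, -1)
amax = np.abs(data_hp).max(axis=-1)  # one amax per block
```

After the flatten, each of the blk_m * blk_n rows holds exactly one block, so a single reduction over the last axis yields the per-block amax that the scales are derived from.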

)
logger.info(
"FP8 approach: Model loads with native SGLang FP8 quantization. "
"FP8 approach: Model loads and gets blockwise fp8 quantization on . "
Collaborator:

This log seems strange: the two messages are redundant, and the second one is truncated ("quantization on . ").


def _get_tp_sharded_scale(full_scale_tensor):
"""Get tp sharded scale from full scale tensor"""
def _get_tp_sharded_scale(full_scale_tensor, is_blockwise=False):
Collaborator:

This _get_tp_sharded_scale function is too long and seems to handle multiple conversions at once. Could we split it into several functions?

5 participants