[ROCm][Perf] Add Fused Shared Expert (FSE) support for Qwen3-Next #39280
There was a problem hiding this comment.
Code Review
This pull request implements Fused Shared Expert (FSE) support for ROCm AITER MoE kernels, specifically targeting Qwen3Next models. The changes include a new weight injection mechanism in the MoE runner and logic to remap shared expert weights to fused expert slots during model loading. Review feedback highlights two critical issues in qwen3_next.py: a crash-inducing logic error when unpacking the output of SharedFusedMoE in the default case, and a potential TypeError caused by passing None instead of 0 for the number of shared experts.
| if self.shared_expert is not None: | ||
| final_hidden_states = final_hidden_states[0] + final_hidden_states[1] | ||
| elif self.is_fse_enabled: | ||
| _, final_hidden_states = final_hidden_states |
There was a problem hiding this comment.
The current logic for unpacking the result from SharedFusedMoE is broken for the default case where both shared_expert and is_fse_enabled are False. Since SharedFusedMoE.forward always returns a tuple (either (shared, fused) or (None, fused)), final_hidden_states will remain a tuple if both conditions are False, causing a crash in the subsequent .view() call or all_gather operation. The logic should be simplified to always unpack the second element when shared_expert is None.
| if self.shared_expert is not None: | |
| final_hidden_states = final_hidden_states[0] + final_hidden_states[1] | |
| elif self.is_fse_enabled: | |
| _, final_hidden_states = final_hidden_states | |
| if self.shared_expert is not None: | |
| final_hidden_states = final_hidden_states[0] + final_hidden_states[1] | |
| else: | |
| _, final_hidden_states = final_hidden_states |
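The unpacking rule suggested above can be sketched in isolation. This is a minimal illustration, assuming (as the review comment states) that the MoE layer always returns a `(shared, fused)` pair whose first element may be `None`. The function name `combine_moe_output` is hypothetical, and floats stand in for hidden-state tensors.

```python
from typing import Optional, Tuple


def combine_moe_output(
    moe_output: Tuple[Optional[float], float],
    has_shared_expert: bool,
) -> float:
    """Collapse a (shared, fused) pair into a single value.

    Floats stand in for hidden-state tensors; the function name is
    hypothetical and only illustrates the branch structure.
    """
    shared, fused = moe_output
    if has_shared_expert:
        # A separate shared-expert MLP ran: add its output to the routed output.
        assert shared is not None
        return shared + fused
    # FSE enabled, or no shared expert at all: only the fused output matters.
    return fused
```

With a plain `else` branch there is no combination of flags that leaves the tuple un-unpacked, which is the failure mode the comment describes.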
| enable_eplb=self.enable_eplb, | ||
| num_redundant_experts=self.n_redundant_experts, | ||
| is_sequence_parallel=self.is_sequence_parallel, | ||
| n_shared_experts=1 if self.is_fse_enabled else None, |
There was a problem hiding this comment.
Passing None for n_shared_experts when FSE is disabled will cause a TypeError in moe_runner_base.py during the comparison if num_fused_shared > 0:. It should default to 0 instead of None to ensure compatibility with the runner's logic and the AITER metadata initialization.
| n_shared_experts=1 if self.is_fse_enabled else None, | |
| n_shared_experts=1 if self.is_fse_enabled else 0, |
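The `TypeError` mentioned above is ordinary Python 3 behavior: `None` cannot be ordered against an integer. The sketch below reproduces it with hypothetical helper names; only the `num_fused_shared > 0` comparison is taken from the review comment.

```python
def count_fused_shared(n_shared_experts):
    """Mimic a runner-side check of the form `if num_fused_shared > 0:`."""
    num_fused_shared = n_shared_experts
    if num_fused_shared > 0:  # raises TypeError when the value is None
        return num_fused_shared
    return 0


def safe_count_fused_shared(n_shared_experts):
    """Normalize None to 0 before comparing, so both spellings work."""
    num_fused_shared = n_shared_experts or 0
    return num_fused_shared if num_fused_shared > 0 else 0
```

Either fix works: pass `0` at the call site, as suggested, or normalize `None` on the receiving side.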
| assert shared_experts_input is not None | ||
| self._shared_experts.apply(shared_experts_input, order) | ||
|
|
||
| def _inject_fse_weights( |
There was a problem hiding this comment.
I'd prefer not to use injection. Also, the fused shared expert is not a new feature; it was first introduced for DeepSeekV3.
Can you try implementing this following the approach taken by DeepSeek (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/deepseek_v2.py)?
I would also like @robertgshaw2-redhat's feedback on this PR.
There was a problem hiding this comment.
Thanks for the feedback. Looking into refactoring this to use the same approach taken in Deepseek
There was a problem hiding this comment.
I'd vote for the DeepSeek approach as well.
@tjtanaa the PR has been revised and description updated. Could you review it again? @ChuanLi1101 @dllehr-amd could you also take a look? The PR now covers:
Coming back to your question about re-using the DeepSeekV3.2 shared expert fusion, the main difference is that Qwen3-Next has a learned Note on code placement. The changes follow the existing runner/router separation rather than living in
This pull request has merge conflicts that must be resolved before it can be |
@nholmber can you help rebase the PR? Thanks.
Rebased |
Signed-off-by: Doug Lehr <douglehr@amd.com>
Hi @nholmber, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
…outer Move the import of aiter_topK_meta_data from module level into the _compute_routing method body. The module-level import captured the initial None value and never saw the reassignment by init_aiter_topK_meta_data, causing shared expert weights to be silently dropped and a ~33 point accuracy regression on gsm8k. Also remove unused fse_fuse_gate variable in layer.py and fix E501 line length in router_factory.py. Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
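The import pitfall described in this commit message is general Python behavior: `from module import name` binds the value the name has at import time, so a later reassignment inside the module is not seen. The sketch below reproduces it with a throwaway module; `fake_meta`, `topk_meta`, and `init_meta` are hypothetical stand-ins for the real module and its initializer.

```python
import sys
import types

# Build a throwaway module that mimics the pattern: a global that starts
# as None and is reassigned later by an init function.
meta = types.ModuleType("fake_meta")
exec(
    "topk_meta = None\n"
    "def init_meta():\n"
    "    global topk_meta\n"
    "    topk_meta = 'initialized'\n",
    meta.__dict__,
)
sys.modules["fake_meta"] = meta

# Module-level style: binds the *current* value (None) to a local name.
from fake_meta import topk_meta as captured_at_import

meta.init_meta()


def read_late():
    # Function-body import: looks the attribute up at call time.
    from fake_meta import topk_meta
    return topk_meta
```

After `init_meta()` runs, `captured_at_import` is still `None`, while `read_late()` returns the reassigned value. Accessing the attribute through the module object (`fake_meta.topk_meta`) would avoid the problem as well.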
Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
| enable_eplb=self.enable_eplb, | ||
| num_redundant_experts=self.n_redundant_experts, | ||
| is_sequence_parallel=self.is_sequence_parallel, | ||
| n_shared_experts=1, |
There was a problem hiding this comment.
@tpopp We are binding the n_shared_experts and shared_expert_gate to FusedMoE now without any checks. This may be the cause of the fail?
When FSE is disabled (non-ROCm or env var off), the shared expert is handled by the model's own MLP. Passing shared_expert_gate to FusedMoE in that case caused _fse_fuse_gate to activate, fusing gate weights into [num_experts+1, hidden] and corrupting routing. Set shared_expert_gate=None and n_shared_experts=None in the non-FSE path so FusedMoE does not attempt gate fusion. Fixes test_hybrid[tiny-random/qwen3-next-moe] regression. Signed-off-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
| if n_shared_experts is not None and self.aiter_fmoe_shared_expert_enabled | ||
| else 0 | ||
| ) | ||
| self.shared_expert_gate = shared_expert_gate |
There was a problem hiding this comment.
seems unnecessary to have this attribute?
There was a problem hiding this comment.
I'm going to wait for CI to finish before pushing anything else. I'm happy to remove it. This is consistent with some other attributes that aren't used elsewhere and that was the reason for this. I thought there might be debugging or other reasons that most construction args are saved as attributes.
There was a problem hiding this comment.
i know, i hate all those old attrs since it makes it hard to tell what "owns" the object
| ) | ||
|
|
||
| shared_weights = torch.sigmoid(shared_logits) | ||
| topk_weights, topk_ids = inject_shared_expert_weights( |
There was a problem hiding this comment.
seems to me this inject_shared_expert_weights function should be defined in this file
| ) | ||
|
|
||
| if ( | ||
| num_fused_shared_experts > 0 |
There was a problem hiding this comment.
what happens if num_fused_shared_experts > 0 and either scoring_func != softmax or is not aiter?
should we just reject?
There was a problem hiding this comment.
Currently we fall back to FusedTopKRouter, which is what happened previously as well, so I think we're okay on that front. The router's behavior changes only when the three specific conditions here are met.
There was a problem hiding this comment.
please open a github issue to audit and guard this for future so we have a clear view of what does and does not work
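The dispatch behavior discussed in this thread can be sketched as a small selector. This is an illustration only: the function, its `strict` flag, and the returned labels are hypothetical, and the three conditions are inferred from the question above (fused shared experts requested, softmax scoring, AITER in use).

```python
def select_routing_path(
    num_fused_shared_experts: int,
    scoring_func: str,
    use_aiter: bool,
    strict: bool = False,
) -> str:
    """Pick a routing path label for a given configuration.

    Returns "fused_shared" only when all three conditions hold; otherwise
    falls back to the default router. With strict=True, an unsupported
    request for fused shared experts is rejected instead of silently
    falling back (the guard the reviewer asks to track in an issue).
    """
    wants_fse = num_fused_shared_experts > 0
    supported = scoring_func == "softmax" and use_aiter
    if wants_fse and supported:
        return "fused_shared"
    if wants_fse and strict:
        raise ValueError(
            "fused shared experts require softmax scoring and AITER"
        )
    return "default_router"
```

The default (`strict=False`) mirrors the fallback described in the reply; the strict mode shows what an explicit rejection could look like.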
|
@robertgshaw2-redhat I've created #42088. Can you or @dllehr-amd assign it to me? |
|
Test failures are unrelated. Passes all key MoE tests.
…lm-project#39280) Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Tres Popp <tres.popp@amd.com> Signed-off-by: Doug Lehr <douglehr@amd.com> Co-authored-by: nholmber <nholmber@users.noreply.github.com> Co-authored-by: Tres <tpopp@users.noreply.github.com> Co-authored-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Doug Lehr <douglehr@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
…lm-project#39280) Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Tres Popp <tres.popp@amd.com> Signed-off-by: Doug Lehr <douglehr@amd.com> Co-authored-by: nholmber <nholmber@users.noreply.github.com> Co-authored-by: Tres <tpopp@users.noreply.github.com> Co-authored-by: Tres Popp <tres.popp@amd.com> Co-authored-by: Doug Lehr <douglehr@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
## Purpose

Extend the AITER Fused Shared Expert (FSE) path - originally added for DeepSeek-V2/V3 (vllm-project#28540) and Qwen3-Next (vllm-project#39280) - to the GLM-4 MoE family (GLM-4.5, GLM-4.6, GLM-4.7). When `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1` the shared expert is folded into the AITER FusedMoE kernel as `n_shared_experts` extra expert slots, eliminating the separate shared-expert MLP forward pass at low/medium concurrency.

## Changes

Single-file model wiring in `vllm/model_executor/models/glm4_moe.py`, mirroring the canonical `deepseek_v2.py` FSE pattern:

* `Glm4MoE.__init__`
  - Cache `is_rocm_aiter_moe_enabled` and `is_fusion_moe_shared_experts_enabled` from `rocm_aiter_ops`.
  - When FSE is enabled, skip building the separate `shared_experts` MLP and pass `n_shared_experts=config.n_shared_experts` to `FusedMoE` so the AITER kernel routes the shared expert(s) as extra slots in the routed tensor.
  - Switch `apply_routed_scale_to_output` to `not self.is_rocm_aiter_moe_enabled`. AITER applies `routed_scaling_factor` internally, per routed slot; applying it again post-fusion would also scale the FSE shared-expert slot (which the kernel inserts with unit weight), producing a structural magnitude error in every MoE layer. This matches `deepseek_v2.py`. (`routed_scaling_factor=2.5` for GLM-4.7, so the unfixed path showed a ~48 pp gsm8k regression.)
* `Glm4MoeModel.get_expert_mapping`
  - Widen `num_experts` by `config.n_shared_experts` when FSE is on so the weight loader enumerates the appended slots.
* `Glm4MoeModel.load_weights`
  - Treat `mlp.shared_experts.{gate,up,down}_proj.*` as expert-style tensors when FSE is on (skip the stacked QKV/gate_up linear path).
  - Split each widened shared-expert tensor into `n_shared_experts` chunks along the intermediate-size axis (dim 0 for ColumnParallel gate/up_proj, dim 1 for RowParallel down_proj) and route each chunk to `mlp.experts.{n_routed_experts + j}.*` via the FusedMoE expert-aware weight loader.

No changes to FusedMoE / AITER plumbing - all of that landed earlier with vllm-project#39280 (Qwen3-Next FSE).

## Test Plan

* Model: `zai-org/GLM-4.7-FP8`
* Hardware: 1x MI355X node, TP=4
* Container: ROCm vLLM image (AITER >= v0.1.13.post1, PR vllm-project#44265)
* Accuracy: `lm_eval --tasks gsm8k --num_fewshot 5`
* Throughput: `vllm bench serve --dataset-name random` sweep over (ISL, OSL, MC) in {1000/100, 5000/500, 10000/1000} x {4, 16, 64}

Server launch:

```
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=<0|1> \
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 256
```

## Test Result

### Accuracy (gsm8k, 5-shot, exact_match)

| Config | flexible-extract | strict-match |
|---------------------|-----------------:|-----------------:|
| FSE=0 (baseline) | 0.9469 ± 0.0062 | 0.9439 ± 0.0063 |
| FSE=1 | 0.9439 ± 0.0063 | 0.9416 ± 0.0065 |

All deltas within standard error. No accuracy regression.

### Throughput (`vllm bench serve`, random)

| ISL | OSL | MC | TPOT mean (ms) FSE=0 -> FSE=1 (Δ) | TPOT p99 (ms) FSE=0 -> FSE=1 (Δ) | Output tok/s FSE=0 -> FSE=1 (Δ) | Total tok/s FSE=0 -> FSE=1 (Δ) |
|-----:|-----:|---:|----------------------------------:|---------------------------------:|--------------------------------:|-------------------------------:|
| 1000 | 100 | 4 | 17.76 -> 14.36 (**-19.2%**) | 19.43 -> 15.93 (**-18.0%**) | 199.4 -> 243.6 (**+22.1%**) | 2193.7 -> 2679.1 (**+22.1%**) |
| 1000 | 100 | 16 | 20.96 -> 18.48 (**-11.9%**) | 24.29 -> 22.77 (-6.3%) | 631.0 -> 673.4 (**+6.7%**) | 6940.6 -> 7407.9 (**+6.7%**) |
| 1000 | 100 | 64 | 30.74 -> 30.23 (-1.7%) | 42.85 -> 43.44 (+1.4%) | 1452.7 -> 1424.3 (-2.0%) | 15980.1 -> 15667.6 (-2.0%) |
| 5000 | 500 | 4 | 17.82 -> 14.50 (**-18.7%**) | 18.63 -> 15.50 (**-16.8%**) | 211.5 -> 253.5 (**+19.9%**) | 2326.1 -> 2788.7 (**+19.9%**) |
| 5000 | 500 | 16 | 22.73 -> 20.76 (**-8.7%**) | 25.38 -> 23.07 (**-9.1%**) | 619.1 -> 657.7 (**+6.2%**) | 6810.4 -> 7234.6 (**+6.2%**) |
| 5000 | 500 | 64 | 39.79 -> 40.15 (+0.9%) | 46.15 -> 46.78 (+1.4%) | 1363.8 -> 1339.1 (-1.8%) | 15001.9 -> 14730.4 (-1.8%) |
| 10000 | 1000 | 4 | 18.00 -> 14.70 (**-18.3%**) | 18.68 -> 15.50 (**-17.0%**) | 210.3 -> 251.8 (**+19.7%**) | 2313.5 -> 2769.4 (**+19.7%**) |
| 10000 | 1000 | 16 | 24.47 -> 22.87 (-6.5%) | 26.66 -> 25.56 (-4.1%) | 589.6 -> 615.1 (**+4.3%**) | 6485.6 -> 6766.2 (**+4.3%**) |
| 10000 | 1000 | 64 | 46.37 -> 46.33 (-0.1%) | 51.14 -> 51.78 (+1.3%) | 1233.6 -> 1211.9 (-1.8%) | 13570.0 -> 13330.7 (-1.8%) |

Verdict: FSE delivers +20-22% output throughput and -18-19% TPOT at low concurrency (MC=4), modest gains at MC=16, and is roughly break-even (<2% regression) at MC=64. No accuracy regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Olga Miroshnichenko <olga.miroshnichenko@amd.com>
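The double-scaling error described in the GLM-4 commit message can be checked with scalar arithmetic. The sketch below is a toy model, not the kernel: it assumes, as the message states, that the fused kernel scales the routed slots internally and adds the shared slot with unit weight. The function and parameter names are hypothetical.

```python
def moe_output(
    routed_sum: float,
    shared: float,
    scale: float,
    apply_scale_to_output: bool,
) -> float:
    """Scalar model of one MoE layer with a fused shared-expert slot.

    The kernel is assumed to scale routed slots internally and to add the
    shared slot with unit weight, as the commit message describes.
    """
    fused = scale * routed_sum + 1.0 * shared
    # Optional post-fusion scaling: this is the step that must be skipped
    # when the kernel has already applied the factor.
    return scale * fused if apply_scale_to_output else fused


# With routed_scaling_factor = 2.5 and unit contributions:
correct = moe_output(1.0, 1.0, 2.5, apply_scale_to_output=False)  # 2.5 + 1.0
wrong = moe_output(1.0, 1.0, 2.5, apply_scale_to_output=True)     # 2.5 * 3.5
```

Here `correct` is 3.5 and `wrong` is 8.75: the routed part is scaled twice and the shared part once, which is why the flag is tied to `not self.is_rocm_aiter_moe_enabled`.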
Purpose
Fuse shared expert into the AITER MoE kernel as an extra expert slot when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, eliminating the separate shared expert MLP forward pass and greatly improving decode throughput.

The router gate [num_experts, hidden] and shared expert gate [num_shared, hidden] weight matrices are fused into a single [num_experts + num_shared, hidden] matrix at init. One F.linear call produces combined logits, and the topk_softmax kernel applies routing softmax and shared expert activation (sigmoid) in a single launch; no extra kernel launches for the shared expert gate projection, activation, or buffer copy.

Changes:
- qwen3_next.py: Model-level FSE wiring (init, weight loading, expert mapping, forward tuple unpack for SharedFusedMoE compatibility)
- qwen3_next_mtp.py: MTP weight loading for fused expert slot
- moe_runner_base.py: Lazy gate weight fusion in forward_dispatch(); thread num_fused_shared_experts through routing
- _aiter_ops.py: Extend topk_softmax with num_shared_experts and shared_expert_scoring_func params; add runtime version check for graceful fallback with older AITER
- fused_topk_router.py: Fused kernel dispatch path + non-fused fallback (separate softmax, sigmoid, inject)
- base_router.py + router subclasses: Add num_fused_shared_experts param to _compute_routing() interface
- rocm_aiter_fused_moe.py: inject_shared_expert_weights() for merging routed topk results with the shared expert buffer

Test Plan
- vllm/vllm-openai-rocm:v0.19.0
- 02d8af55e (with 7-arg topk_softmax support), stock version that uses non-fused topk + sigmoid also tested
- vllm bench serve, random 1k input / 1k output at c4/c8/c16/c32

Sample commands
Test Result
Accuracy (lm_eval GSM8K 8-shot, flexible-extract)
Verdict: All deltas within standard error. No accuracy regression.
Throughput (output tok/s, 1k input / 1k output)
TP1:
TP2:
Verdict: FSE provides +16–24% output throughput improvement across
concurrency levels and TP configurations.
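The gate fusion described in the Purpose section can be sketched for a single token. This is a dependency-free illustration using plain lists instead of tensors; the function name is hypothetical. The PR does not say whether the kernel renormalizes the top-k weights, so this sketch does not.

```python
import math


def fused_gate_routing(hidden, router_w, shared_w, top_k):
    """Route one token through a fused [num_experts + num_shared, hidden] gate.

    hidden:   list of H floats
    router_w: num_experts rows of H floats
    shared_w: num_shared rows of H floats
    Returns (topk_ids, topk_weights) with the shared slots appended.
    """
    num_experts = len(router_w)
    fused_w = router_w + shared_w  # concatenated once, at init, in the PR

    # One matrix-vector product yields routed and shared logits together.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in fused_w]
    routed, shared = logits[:num_experts], logits[num_experts:]

    # Softmax over the routed experts only.
    peak = max(routed)
    exps = [math.exp(x - peak) for x in routed]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k routed experts by probability.
    ids = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in ids]

    # Shared experts are always selected; their weight is a sigmoid gate.
    for j, logit in enumerate(shared):
        ids.append(num_experts + j)
        weights.append(1.0 / (1.0 + math.exp(-logit)))
    return ids, weights


ids, weights = fused_gate_routing(
    hidden=[1.0, 0.0],
    router_w=[[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]],
    shared_w=[[0.0, 0.0]],
    top_k=2,
)
```

For this input the routed experts 1 and 0 are selected, and the shared expert occupies slot 3 (index `num_experts + 0`) with a sigmoid weight of 0.5, so the downstream fused kernel sees `top_k + num_shared` slots per token.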