
Conversation

@ilmarkov (Contributor) commented Sep 4, 2025

Second part of splitting #22086

Dynamic graph dispatch via compile_ranges: introduces a new configuration option, compile_ranges, as an alternative to compile_sizes. This enables dynamic dispatch to different compiled graphs based on the input batch size.
With this approach, when allreduce fusion is enabled, vLLM adds an additional compile-range split point to separate the graphs: one with fused allreduce for small-to-medium input shapes, and one with NCCL-based allreduce for large input shapes.

The existing compile_sizes feature is extended and generalized with compile_ranges. Defined by split points, these ranges allow vLLM to dynamically dispatch requests to specific pre-compiled graphs based on input batch size. For example, split points (32, 64) define three distinct ranges: [1, 32), [32, 64), and [64, max_num_batched_tokens). This provides granular control, allowing developers to statically enable or disable fusions within each graph to optimize performance for different batch sizes.

All compilation now goes through piecewise_backend.py. Every compilation is bounded to a specific compile range; unbounded dynamic-shape compilation is removed.
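
For illustration, a minimal sketch of how the split points from the example above could be configured, based on the test snippet quoted later in this thread (exact import paths may vary across vLLM versions):

from vllm.config import CompilationConfig, CompilationMode

# Split points (32, 64) yield three compile ranges:
#   [1, 32), [32, 64), and [64, max_num_batched_tokens).
config = CompilationConfig(
    mode=CompilationMode.VLLM_COMPILE,
    compile_ranges_split_points=[32, 64],
)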

Purpose

Corresponding RFC: #23113
The primary motivation for these changes is to enhance vLLM's performance and adaptability for diverse workloads. By supporting allreduce fusion without custom ops and introducing dynamic graph dispatch, we empower users to fine-tune vLLM for more efficient and scalable inference.

Test Plan

Added a new test: test_compile_ranges.py

Follow ups

  • Deal with the shape env being shared across all graphs, which can lead to one compilation constraining SymInts for the other compilations. This may need support from torch.compile, e.g. shapenv.assume_ranges, shapenv.do_error_at_specialize.
  • Put fusions under the O3 compilation level.
  • Share range info for the SymInt with Inductor (see comment).

Performance benchmarks:

Server:

    VLLM_ALLREDUCE_USE_SYMM_MEM=1 vllm serve {{model}} \
        --disable-log-requests --no-enable-prefix-caching -tp {{tp}} -dp 1 --max-num-seqs 256

To enable allreduce fusions:
--compilation-config '{"pass_config": {"enable_fusion": false, "enable_attn_fusion": false, "enable_noop": true, "enable_sequence_parallelism": false, "enable_async_tp": false, "enable_fi_allreduce_fusion": true}}'
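
The same pass configuration, expressed programmatically as a sketch (this assumes PassConfig is importable from vllm.config; the field names mirror the JSON flags above):

from vllm.config import CompilationConfig, PassConfig

# Disable the other fusion passes and enable only the FlashInfer-based
# allreduce fusion, as in the CLI flag above.
compilation_config = CompilationConfig(
    pass_config=PassConfig(
        enable_fusion=False,
        enable_attn_fusion=False,
        enable_noop=True,
        enable_sequence_parallelism=False,
        enable_async_tp=False,
        enable_fi_allreduce_fusion=True,
    ),
)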

Client: input length 1024, output length 128.

B200 TP=2, Llama-3.1-70B-Instruct-FP8

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 1   | 85.644         | 83.395           | 11.812         | 11.661           | 0.976                      |
| 5   | 125.548        | 88.135           | 16.611         | 15.562           | 4.878                      |
| 10  | 196.623        | 109.034          | 27.632         | 26.632           | 9.754                      |
| 15  | 291.392        | 146.879          | 46.534         | 46.904           | 14.544                     |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 1   | 71.489         | 70.008           | 10.725         | 10.647           | 0.978                      |
| 5   | 116.128        | 74.080           | 14.436         | 13.352           | 4.888                      |
| 10  | 183.171        | 91.187           | 23.219         | 20.959           | 9.776                      |
| 15  | 201.879        | 124.434          | 36.656         | 34.716           | 14.607                     |

B200 TP=4 Qwen3-Next-80B-A3B-Instruct, No EP

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 5   | 93.241         | 84.538           | 33.883         | 34.209           | 4.715                      |
| 10  | 106.084        | 96.828           | 41.167         | 41.103           | 9.431                      |
| 15  | 120.676        | 119.744          | 49.314         | 49.832           | 14.101                     |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 5   | 96.324         | 85.852           | 33.873         | 33.878           | 4.761                      |
| 10  | 103.219        | 91.413           | 39.743         | 39.887           | 9.436                      |
| 15  | 116.451        | 114.429          | 47.549         | 47.940           | 14.118                     |

B200 TP=8 DeepSeek-V3.1, No EP.

Baseline:

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 1   | 97.928         | 48.912           | 13.845         | 13.535           | 0.972                      |
| 5   | 68.071         | 51.548           | 16.486         | 16.476           | 4.864                      |
| 10  | 81.586         | 60.076           | 22.646         | 22.421           | 9.677                      |
| 15  | 102.587        | 73.730           | 27.765         | 27.719           | 14.442                     |

Allreduce + RMSNorm + QuantFp8

| QPS | Mean TTFT (ms) | Median TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | Request Throughput (req/s) |
|-----|----------------|------------------|----------------|------------------|----------------------------|
| 1   | 98.466         | 47.478           | 13.175         | 12.933           | 0.973                      |
| 5   | 67.292         | 51.342           | 15.711         | 15.695           | 4.869                      |
| 10  | 81.177         | 58.212           | 20.094         | 19.978           | 9.699                      |
| 15  | 97.646         | 73.333           | 25.690         | 25.834           | 14.486                     |

Startup time increase

This PR increases startup time because it adds more graph compilations.
For two-graph compilation (the typical case when allreduce fusions are enabled), a cold start for the DeepSeek-V3 model takes 181.91 s and a warm start takes 12.40 s.

Based on PR: #24604
First part: #24248

@ilmarkov changed the title from "[PERF] Introduce compile_ranges" to "[PERF] Conditional compilation. Introduce compile_ranges" on Sep 4, 2025
@ilmarkov changed the title from "[PERF] Conditional compilation. Introduce compile_ranges" to "[Compile] Conditional compilation. Introduce compile_ranges" on Sep 4, 2025
@mergify mergify bot added the ci/build label Sep 5, 2025
Comment on lines 102 to 122

def __call__(self, *args) -> Any:
Collaborator:

Btw, does this PR work, or is it mostly WIP? (Are you sure that the graph generated ends up being dynamic on the specific range that is passed?)

There's one problem that I don't know how to solve yet. Let's say we're compiling with ranges [2, 16] and (16, 4096]. Each compilation needs its own ShapeEnv (environment with symbols in it), which has the batch_size constrained to the particular range.

So what we should do is, for each range, take the current ShapeEnv (which thinks the batch_size is dynamic on the range [2, 4096]), clone it, constrain it to the current range (e.g. [2, 16]), and use this throughout the compilation.

I don't know how to "clone" ShapeEnvs. Is there anything else we can do here @laithsakka @bobrenjc93 ?

@ilmarkov (Contributor, Author) Sep 9, 2025:

It already works, leaving aside a PyTorch standalone_compile issue that should be fixed in the next PyTorch release by this commit. The graphs for each range are generated dynamically, and fusions are applied differently in each graph.

Collaborator:

Dynamo traces out a graph that is fully dynamic over the batch_size. We should tell torch.compile that we know things about the batch_size for each range, for example that the range is constrained to [2, 16]. This will help it generate better code. In order to do this, you'll need to grab the SymInt that is the batch_size and add constraints to it.
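
(As an illustrative sketch only, not part of the PR: constraining the traced batch_size SymInt to a range such as [2, 16] might look like the following, assuming torch._check is used to record the bounds.)

import torch

def constrain_batch_size(batch_size: torch.SymInt, lo: int, hi: int) -> None:
    # Record range facts about the symbolic batch size so that later
    # compiler queries (e.g. statically_known_true) can rely on them.
    torch._check(batch_size >= lo)
    torch._check(batch_size <= hi)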

Contributor (Author):

Ok, got it. These are the hints for torch.compile that I meant at the meeting. Thanks, I'll add the ShapeEnv here.

Contributor:

If we are using is_applicable_for_range (the current form of the PR), this is fine. If we want to go with the other approach [see my other comment on the PR], which is more complicated (I think if we do it we'd want a good reason), then yes, this is problematic.

Comment on lines 477 to 479
return compile_range is not None and (
    compile_range[0] == compile_range[1]) and (compile_range[1] % tp_size == 0)
@zou3519 (Collaborator) Sep 8, 2025:

The way I originally thought of doing this is something like:

return statically_known_true(batch_size % tp_size == 0)

If we are able to access the batch_size SymInt here, then we are able to query things about it.

cc @laithsakka @bobrenjc93 on if I'm butchering this API

Contributor (Author):

Could you elaborate on how statically_known_true is going to improve the existing approach? Is it more stable?

Collaborator:

Instead of implementing your own range analysis, PyTorch already encodes range information in the SymInts themselves. So this is more of a code-reuse thing.
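
An illustrative sketch of that reuse (hypothetical helper, not the PR's code), assuming the batch_size SymInt is accessible at this point:

from torch.fx.experimental.symbolic_shapes import statically_known_true

def divisible_over_range(batch_size, tp_size: int) -> bool:
    # Uses the range information PyTorch already tracks on the SymInt:
    # returns True only if divisibility by tp_size holds for the whole
    # range, and does not add a new guard.
    return statically_known_true(batch_size % tp_size == 0)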

Contributor:

So it really depends on the goal of those ranges. If the goal is solely/mainly to allow custom passes to branch on ranges, this is fine. In fact, it's simpler than mutating the shape env and having to fork it.
Also, we can then keep the invariant that Inductor itself does not specialize and run the same checks here (which we do not have yet).
On the other hand, if someone really thinks that Inductor itself can do significantly better if we actually specialize the shape env, then we would have to do something else.
But it sounds to me like the intention is the former?

@bobrenjc93:

@ilmarkov out of curiosity, do you have a sense of how much of a perf win you'll get out of this (and from which models)?

@mergify (bot) commented Sep 16, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 16, 2025
@ilmarkov (Contributor, Author):

@bobrenjc93 Without multiple graphs, our fallback (for large input sizes, i.e. when we don't use allreduce fusion) uses either custom ops or non-optimized PyTorch operations, which are slower than the Triton operations generated by torch.compile. I think a reasonable perf comparison was done in #19830.

if compile_range[0] == compile_range[1]:
    dynamic_shapes = "from_example_inputs"
else:
    dynamic_shapes = "from_graph"
Contributor:

both "from_graph" and "from_tracing_context" here have the same effect of getting the shape env we traced the DS graph with? if yes lets do less divergence.

Collaborator:

We want to get this PR over the line soon, could you take this on in a follow up?

@laithsakka (Contributor) left a comment:

One good side effect of this, beyond custom passes, is that each range is tuned in Inductor with a hint from that range, meaning we can also use range splitting to ensure that small inputs and large inputs are max-autotuned with separate hints.

This would also work for unbacked symbols, which is good! (Well, except that we would have to override the hint for unbacked symbols with the actual example value when we do the range compilations, cc @bobrenjc93.)

@laithsakka (Contributor):

Here is one concern with this: it will make the soundness story, with respect to the dynamic-shape specialization added by Inductor, harder. To explain: Inductor has the ability to specialize for dynamic shapes, and we now assume it does not. Maybe soon we will add a check that it actually does not (this could also be BC-breaking if it does).

Now, the ideal and only actually right fix is to use unbacked symbols, but unbacked comes with a perf hit. So then came the idea of using unbacked as a fallback: evaluate the Dynamo + Inductor guards on the input of the DS graph and call either the backed DS graph or the unbacked DS graph.

With this, we now have much more branching, so we would need to track Inductor guards for each of those compilations (Inductor can guard differently on each of those ranges based on the example input). So the fallback solution becomes more expensive and more complicated. cc @zou3519 @bobrenjc93 @jamesjwu

@mergify (bot) commented Nov 25, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 25, 2025
@mergify mergify bot removed the needs-rebase label Nov 25, 2025


def test_compile_ranges(use_fresh_inductor_cache):
    post_grad_range_checker = PostGradRangeChecker(
Collaborator:

How come this works without disabling the vllm cache?

Contributor (Author):

Probably the clean Inductor cache allows us to avoid cache hits in the vLLM cache.

Collaborator:

Hmm ok

compilation_config=CompilationConfig(
    mode=CompilationMode.VLLM_COMPILE,
    compile_ranges_split_points=[8, 32],
    compile_sizes=[16, 64, 128],
@laithsakka (Contributor) Nov 25, 2025:

nit: I wonder if we should now call those specialize sizes?

Contributor (Author):

compile_specialize_sizes? So that it is symmetrical to compile_ranges.

Collaborator:

Yeah, we can do that in a follow-up.

compilation_start_time = 0.0


class PiecewiseCompileInterpreter(torch.fx.Interpreter):
@laithsakka (Contributor) Nov 25, 2025:

I wonder if we should just replace this with an FX graph pass at this point (not in this PR).

Contributor (Author):

Could you explain in more details what you mean? We'll add it to the list of follow-ups in the PR description

Collaborator:

Yeah, I'm not sure I understand what you mean; this sounds like a big change. Could you open an RFC to explain it? Also cc @zou3519

# First we try to find the range entry for the concrete compile size
# If not found, we search for the range entry
# that contains the runtime shape.
if runtime_shape in self.compile_sizes:
Contributor:

I mean, do you need the branch here? The other path works for all cases, no?

Contributor (Author):

Yes, if we add the compile_sizes-based ranges to compile_ranges (at the moment they are in range_entries) and sort the keys. The current way makes it clearer that we first try to specialize for the compile_sizes and then search for ranges.

Collaborator:

Yeah, because a specific size can overlap with the ranges, so I think this is fine.
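
For context, a rough sketch of the dispatch order being discussed (hypothetical names and half-open ranges as in the PR description, not the PR's exact code): exact compile_sizes are checked first, then the range containing the runtime shape.

def select_range_entry(runtime_shape, compile_sizes, range_entries):
    # Exact-size specializations take priority over range graphs.
    if runtime_shape in compile_sizes:
        return range_entries[(runtime_shape, runtime_shape)]
    # Otherwise, fall back to the range that contains the runtime shape.
    for (start, end), entry in range_entries.items():
        if start <= runtime_shape < end:
            return entry
    raise RuntimeError(f"No compiled graph covers shape {runtime_shape}")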

all_sizes.update([x for x in warmup_sizes if isinstance(x, int)])
for compile_range in compile_ranges:
    if not any(x in compile_range for x in all_sizes):
        warmup_sizes.append(compile_range.end)
Contributor:

Hmm, I wonder what the best value to pass here is: end, start, or the midpoint? This will be used as the hint for the Inductor compilation (well, unless we get a cache hit).

This actually raises a new question: if we have two identical graphs, Inductor's internal cache will hit even if the ranges are different (the hint is different). Do we want to force a cache miss in that case, i.e. add the hint to the internal Inductor cache lookup?

@ProExpertProg (Collaborator) Nov 26, 2025:

We add that cache hint via the PostGradPassManager

Collaborator:

> mm wonder what is the best value to pass here? end, start or mid point

I think the end is the best value; kernel perf often changes after passing a power-of-two multiple (which is what end is).

@laithsakka (Contributor) left a comment:

Took another pass; looks good overall, just some nits.

@mergify (bot) commented Nov 26, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 26, 2025
@mergify (bot) commented Dec 1, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
@christian-pinto (Contributor):

@ProExpertProg I confirm that I can replicate the failure of the Prithvi tests. I don't know yet why this is failing. I will spend some time tomorrow to debug what is going on.

@ProExpertProg (Collaborator):

@christian-pinto sounds great, thanks for helping with this! Let us know if you need any help


Labels

ci/build, frontend, needs-rebase, performance (Performance-related issues), ready (ONLY add when PR is ready to merge / full CI is needed), tool-calling, torch.compile, v1


Development

Successfully merging this pull request may close these issues.

[RFC]: Enabling Multiple Graphs Based on pre-defined conditions

7 participants