
[CI][torch.compile] Reduce e2e fusion test time#33293

Merged
ProExpertProg merged 30 commits into vllm-project:main from neuralmagic:luka/fix-fusion-tests
Feb 5, 2026
Conversation


@ProExpertProg ProExpertProg commented Jan 29, 2026

Purpose

Fusion E2E tests are out of control: they have poor coverage but also take a long time in CI.

This PR simultaneously improves coverage, splits up the tests, and cuts running times by reducing n_hidden_layers and using dummy weights. The old E2E tests are removed completely in favor of a new fusions_e2e directory. We add utilities to make it easier to add models and fusions in the future.

In CI, the E2E fusion tests are now split into "quick" (all models, single config) and "sweep" (single model, all configs). "Quick" tests run on any change in vllm and are limited to <15 mins; sweep tests only run on specific changes to compilation or model-forward code.

Additionally, distributed compilation tests are pulled out of the distributed test group and into their own distributed-compilation test group.
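The quick/sweep split can be illustrated with a short sketch. Everything below (model ids, fusion names, backends) is hypothetical and not the actual test code; it only shows the shape of the two parametrizations.

```python
from itertools import product

# Hypothetical names for illustration only -- not the actual vllm test matrix.
MODELS = ["llama_fp8", "qwen_fp8"]        # placeholder model cases
FUSIONS = ["rms_quant", "attn_quant"]     # placeholder fusion passes
BACKENDS = ["inductor", "eager"]          # placeholder compile backends

def quick_cases():
    # "Quick": every model, one fixed representative config.
    # Cheap enough to run on any change in vllm.
    return [(m, FUSIONS[0], BACKENDS[0]) for m in MODELS]

def sweep_cases():
    # "Sweep": one model, the full cross-product of configs.
    # Only triggered by changes to compilation/model-forward code.
    return [(MODELS[0], f, b) for f, b in product(FUSIONS, BACKENDS)]

print(len(quick_cases()), len(sweep_cases()))  # 2 quick cases, 4 sweep cases
```

The point of the split is that the quick set grows linearly with models while the sweep set grows with the config product, so only the former is cheap enough to gate every PR.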

Follow-ups

  • Add ROCm test cases (just tp1, add attention backend cases and ROCm-specific fusions)
  • Improve what we are matching on
  • Fix broken fusions

Before

  • Distributed: [Screenshot 2026-01-31 at 9 37 16 PM]
  • Compile: [Screenshot 2026-01-31 at 9 37 52 PM]
  • PyTorch: [Screenshot 2026-01-31 at 9 38 13 PM]
  • SP: [Screenshot 2026-01-31 at 9 39 13 PM]

After

  • Distributed: [Screenshot 2026-01-31 at 9 40 33 PM]
  • Compile: [Screenshot 2026-01-31 at 7 09 28 PM]
  • PyTorch: [Screenshot 2026-02-01 at 9 43 07 PM]
  • SP: [Screenshot 2026-02-01 at 9 42 15 PM]

Test Plan

CI

Test Result

Looks good

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 29, 2026
@mergify mergify bot added the ci/build label Jan 29, 2026
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to reduce test time on Blackwell GPUs by modifying the CI pipeline. It changes the pytest command for test_fusion_attn.py to only run test_attention_quant_pattern, effectively disabling the test_attn_quant integration test. While this does reduce test time, I have a concern that disabling test_attn_quant may leave a gap in test coverage for attention quantization on full models for Blackwell. I've added a comment with more details.

@ProExpertProg ProExpertProg changed the title Reduce Blackwell test time [WIP] Reduce WIP test time Jan 29, 2026
@ProExpertProg ProExpertProg changed the title [WIP] Reduce WIP test time [WIP] Reduce e2e fusion test time Jan 29, 2026
@ProExpertProg ProExpertProg changed the title [WIP] Reduce e2e fusion test time [CI][torch.compile] Reduce e2e fusion test time Jan 30, 2026
@ProExpertProg ProExpertProg enabled auto-merge (squash) February 3, 2026 23:47
)
assert config.compilation_config.cudagraph_mode == CUDAGraphMode.NONE
assert config.compilation_config.pass_config.fuse_gemm_comms is True
assert config.compilation_config.pass_config.enable_qk_norm_rope_fusion is True

@tjtanaa tjtanaa Feb 4, 2026


Should we start standardizing the terms?
fuse_gemm_comms -> gemm_comms_fusion (I saw most of them are in the {custom_op_name1}_{custom_op_name2}_fusion format)
enable_qk_norm_rope_fusion -> qk_norm_rope_fusion?

Collaborator Author


For pass config we did standardize on fuse_<op>_<op>, but the norm-rope one landed during the renaming. We should just fix that one; I haven't had a chance yet.
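As a toy illustration of a fuse_<op>_<op> convention check (not actual vllm code), a simple regex flags the stray flag name from this thread:

```python
import re

# Toy check for the fuse_<op>_<op> pass-config naming convention.
# The flag list contains just the two names discussed above.
CONVENTION = re.compile(r"fuse(_[a-z0-9]+){2,}")

flags = ["fuse_gemm_comms", "enable_qk_norm_rope_fusion"]
nonconforming = [f for f in flags if not CONVENTION.fullmatch(f)]
print(nonconforming)  # ['enable_qk_norm_rope_fusion']
```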


mgoin commented Feb 4, 2026

This PR is confusing because there are so many more tests and in some cases the total time is increased, like the sequence parallel tests going from 1@1hr to now 2@40min. Can we just keep the same number of tests but make them faster?


ProExpertProg commented Feb 4, 2026

Where do you see 2h40mins? There are two tests 40mins each, 1 on L40 and 1 on h100 (nightly only), which is the same as before. Also before the tests were like 1h each because they also included the unit tests which were moved into distributed unit tests. To clarify, one of the SP tests came from distributed which is why that one is now 40mins faster.

For the E2E tests, they've been breaking a lot so I wanted some signal on all changes in vllm, and that's why I tried to keep them to under 20 mins. I don't think the old grouping made sense. And with test areas I feel like it's fine to have more ci tests because they're organized much better.

Do you have a specific proposal for grouping the tests? We can remove the L40 SP tests and run h100 in their place instead of on nightly if you prefer.

@ProExpertProg
Collaborator Author

@mgoin also see #33731 for further cleanup and reorganization, perhaps that helps


@tjtanaa tjtanaa left a comment


LGTM as well. The PassConfig is not related to this PR.

Comment on lines +769 to +772
if self.parallel_config.tensor_parallel_size == 1:
logger.warning("Sequence Parallelism requires TP>1, disabling")
self.compilation_config.pass_config.enable_sp = False
self.compilation_config.pass_config.fuse_gemm_comms = False
Collaborator


what happened before? an error?

Collaborator Author


Yes: because all-reduce is a no-op at TP=1, SP would match on just the RMS norm, and the lack of TP during replacement tracing would then cause an error.


# Disable compile cache to make sure custom passes run.
# Otherwise, we can't verify fusion happened through the logs.
monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")

@zou3519 zou3519 Feb 4, 2026


The test coverage looked reasonable to me. I think we do want e2e tests around so that we can be sure these things are working or not.

Some ideas to further reduce the test time (that we could pursue in the future):

  1. If the compile cache is off, this means that we have a cold compile each time. If the goal is to just check that the fusion happened, we could figure out how to disable Inductor triton kernel generation (which is a sizable chunk of the compile time)
  2. We can avoid checking logs (there's probably some I/O there?). Instead, there's a way to get the Inductor graphs. We just want to check that the custom pass ran successfully and that there is a new fused custom op in the graph, right?
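For context, the log-checking these tests currently rely on can be sketched roughly as below; the log line format is invented for illustration and is not vllm's actual output.

```python
import re

# Rough sketch of verifying fusion through logs (the approach idea 2 would
# replace). The log format here is made up for illustration; the real tests
# parse vllm's actual pass logs.
FUSION_LINE = re.compile(r"fused (\d+) (\w+) pattern")

def count_fusions(log_text: str) -> dict[str, int]:
    # Sum fused-pattern counts per fusion name across all matching log lines.
    counts: dict[str, int] = {}
    for m in FUSION_LINE.finditer(log_text):
        counts[m.group(2)] = counts.get(m.group(2), 0) + int(m.group(1))
    return counts

logs = """\
INFO compilation: fused 32 rms_quant patterns in 0.4s
INFO compilation: fused 16 attn_quant patterns in 0.2s
"""
print(count_fusions(logs))  # {'rms_quant': 32, 'attn_quant': 16}
```

Pulling the counts directly from the pass (or from the Inductor graph) instead would drop both the string parsing and the dependence on log formatting.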

Collaborator Author


Yeah that sounds good to me. I think we should still run the graph to check it's not broken but agreed that skipping triton would be nice. Also if we had a way to pass the counters between processes that would be much better instead of log parsing.

Are you okay with doing these in a follow-up?


@zou3519 zou3519 Feb 4, 2026


Yes, follow-up is good.

Also if we had a way to pass the counters between processes that would be much better instead of log parsing.

Setting the vllm multiprocessing env var to 0 seemed to work for me to retrieve PyTorch's counters (in test_cold_start.py).

Collaborator Author


That only works for TP=1 :/

@ProExpertProg
Collaborator Author

Waiting for H100 runner availability before merging

@ProExpertProg
Collaborator Author

Okay, found the culprit for the test_async_tp.py failure, so I feel good about merging.

@ProExpertProg ProExpertProg merged commit 4d95135 into vllm-project:main Feb 5, 2026
52 checks passed
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026