[CI][torch.compile] Reduce e2e fusion test time#33293
ProExpertProg merged 30 commits into vllm-project:main
Conversation
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Code Review
This pull request aims to reduce test time on Blackwell GPUs by modifying the CI pipeline. It changes the pytest command for `test_fusion_attn.py` to only run `test_attention_quant_pattern`, effectively disabling the `test_attn_quant` integration test. While this does reduce test time, I have a concern that disabling `test_attn_quant` may leave a gap in test coverage for attention quantization on full models for Blackwell. I've added a comment with more details.
```python
)
assert config.compilation_config.cudagraph_mode == CUDAGraphMode.NONE
assert config.compilation_config.pass_config.fuse_gemm_comms is True
assert config.compilation_config.pass_config.enable_qk_norm_rope_fusion is True
```
Should we start standardizing the terms?

`fuse_gemm_comms` -> `gemm_comms_fusion` (I saw most of them follow the `{custom_op_name1}_{custom_op_name2}_fusion` format)

`enable_qk_norm_rope_fusion` -> `qk_norm_rope_fusion`?
For pass config we did standardize on `fuse_<op>_<op>`, but the norm-rope one landed during the renaming. We should just fix that one; I haven't had a chance yet.
This PR is confusing because there are so many more tests, and in some cases the total time increased, like the sequence-parallel tests going from 1@1hr to 2@40min. Can we just keep the same number of tests but make them faster?
Where do you see 2h40mins? There are two tests at 40 mins each: one on L40 and one on H100 (nightly only), which is the same as before. Also, the tests used to be ~1h each because they also included the unit tests, which were moved into the distributed unit tests. To clarify, one of the SP tests came from distributed, which is why that one is now 40 mins faster.

For the E2E tests: they've been breaking a lot, so I wanted some signal on all changes in vLLM, and that's why I tried to keep them under 20 mins. I don't think the old grouping made sense, and with test areas I feel it's fine to have more CI tests because they're organized much better.

Do you have a specific proposal for grouping the tests? We can remove the L40 SP tests and run H100 in their place instead of on nightly if you prefer.
tjtanaa left a comment:

LGTM as well. The `PassConfig` discussion is not related to this PR.
```python
if self.parallel_config.tensor_parallel_size == 1:
    logger.warning("Sequence Parallelism requires TP>1, disabling")
    self.compilation_config.pass_config.enable_sp = False
    self.compilation_config.pass_config.fuse_gemm_comms = False
```
What happened before? An error?
Yes: because all-reduce is a no-op at TP=1, the SP pattern would match on just the RMSNorm, and during replacement tracing the lack of TP would cause an error.
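To make the guard concrete, here is a minimal standalone sketch of the TP=1 check quoted in the diff above: when tensor parallelism is off, all-reduce is a no-op, so the SP-related passes are disabled up front instead of being left to fail during replacement tracing. The `PassConfig` dataclass and `check_sp` helper here are illustrative stand-ins, not vLLM's actual API.

```python
# Illustrative sketch (not vLLM's real config classes): disable
# SP-related fusion passes when tensor parallelism is off, since
# all-reduce is a no-op at TP=1 and pattern replacement would fail.
from dataclasses import dataclass


@dataclass
class PassConfig:
    enable_sp: bool = True
    fuse_gemm_comms: bool = True


def check_sp(pass_config: PassConfig, tensor_parallel_size: int) -> PassConfig:
    """Disable SP passes when TP=1 (all-reduce would be a no-op)."""
    if tensor_parallel_size == 1:
        pass_config.enable_sp = False
        pass_config.fuse_gemm_comms = False
    return pass_config


cfg = check_sp(PassConfig(), tensor_parallel_size=1)
print(cfg.enable_sp, cfg.fuse_gemm_comms)  # -> False False
```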
```python
# Disable compile cache to make sure custom passes run.
# Otherwise, we can't verify fusion happened through the logs.
monkeypatch.setenv("VLLM_DISABLE_COMPILE_CACHE", "1")
```
The test coverage looked reasonable to me. I think we do want E2E tests around so that we can be sure these things are working or not.

Some ideas to further reduce the test time (that we could pursue in the future):
- If the compile cache is off, we have a cold compile each time. If the goal is just to check that the fusion happened, we could figure out how to disable Inductor Triton kernel generation (which is a sizable chunk of the compile time).
- We can avoid checking logs (there's probably some I/O there). Instead, there's a way to get the Inductor graphs. We just want to check that the custom pass ran successfully and that there is a new fused custom op in the graph, right?
Yeah, that sounds good to me. I think we should still run the graph to check it's not broken, but agreed that skipping Triton would be nice. Also, if we had a way to pass the counters between processes, that would be much better than log parsing.

Are you okay with doing these in a follow-up?
Yes, a follow-up is good.

> Also if we had a way to pass the counters between processes that would be much better instead of log parsing.

Setting the vLLM multiprocessing env var to 0 seemed to work for me to retrieve PyTorch's counters (in `test_cold_start.py`).
That only works for TP=1 :/
Waiting for H100 runner availability before merging.

Okay, found the culprit for the `test_async_tp.py` failure, so I feel good about merging.
Purpose
Fusion E2E tests are out of control: they have poor coverage but also take a long time in CI.
This PR simultaneously improves coverage, splits up the tests, and cuts running times by reducing `n_hidden_layers` and using dummy weights. The old E2E tests are removed completely in favor of a new `fusions_e2e` directory. We add utilities to make it easier to add models and fusions in the future.

In CI, the E2E fusion tests are now split into "quick" (all models, single config) and "sweep" (single model, sweeping all of the config). "Quick" tests run on any change in vLLM and are limited to <15 mins. Sweep tests only run on specific changes to compilation/model-forward code.
Additionally, distributed compilation tests are pulled out of the distributed suite and moved into the distributed compilation test group.
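The quick/sweep split described above can be illustrated with a tiny parametrization sketch: "quick" covers every model under one representative config, while "sweep" covers every config on one representative model, so the two matrices together stay far smaller than the full cross product. Model and config names below are placeholders, not the PR's actual test parameters.

```python
# Illustrative quick/sweep matrix split (placeholder names, not the
# PR's real models or configs): quick = all models x one config,
# sweep = one model x all configs, vs. the full cross product.
MODELS = ["llama", "qwen", "mixtral"]
CONFIGS = ["fp8", "int8", "none"]

quick = [(m, CONFIGS[0]) for m in MODELS]   # all models, single config
sweep = [(MODELS[0], c) for c in CONFIGS]   # single model, all configs
full = [(m, c) for m in MODELS for c in CONFIGS]

print(len(quick), len(sweep), len(full))  # -> 3 3 9
```

Each list could feed a `pytest.mark.parametrize` in the quick and sweep test files respectively, which is how the CI time stays bounded as models and configs are added.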
Follow-ups
Before
- Distributed
- Compile
- PyTorch
- SP

After
- Distributed
- Compile
- PyTorch
- SP
Test Plan
CI
Test Result
Looks good