Add quickreduce as alternative to custom allreduce #16804
ilmarkov wants to merge 28 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
What are the cases where custom allreduce performs better than quickreduce? It would be better if quickreduce surpassed custom allreduce in all cases; then we could use quickreduce as a drop-in replacement for custom allreduce without a new user-facing flag.
@youkaichao It is slower for smaller input sizes. We could take a similar approach to custom allreduce: use one-shot for small buffers and two-shot for larger ones.
That would be great, can you implement it? Then we can use either quickreduce or custom allreduce at the engine level, instead of dynamically switching based on the input size.
Yes, we can try to implement this approach. |
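The one-shot/two-shot switch discussed above could be sketched roughly as follows. This is a hedged illustration only: the function name and the 256 KiB threshold are made-up placeholders, not the crossover point the kernels actually use.

```python
# Hypothetical size-based algorithm selection, mirroring how custom
# allreduce picks between its one-shot and two-shot variants.
ONE_SHOT_MAX_BYTES = 256 * 1024  # placeholder threshold, not measured


def select_algorithm(num_bytes: int) -> str:
    """Return which kernel variant to launch for this input size."""
    if num_bytes <= ONE_SHOT_MAX_BYTES:
        # One-shot: each rank reads all peers' buffers directly;
        # lowest latency, but bandwidth-bound for large inputs.
        return "one_shot"
    # Two-shot: reduce-scatter followed by all-gather; better
    # bandwidth utilization on large inputs.
    return "two_shot"
```

With such a switch in place, the choice between quickreduce and custom allreduce could indeed be made once at the engine level rather than per call.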
you can use an environment variable, like |
closing in favor of #19744 |
Add quickreduce as an alternative to custom allreduce.
The collective is only enabled on AMD MI300, for fp16/bf16 inputs, and only when custom allreduce is enabled. The kernels support both full-precision and quantized (down to int4, symmetric with group size 32) all-reduce.
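The symmetric group-wise quantization described above can be illustrated with a small NumPy reference. This is a hedged sketch of the general scheme (one scale per group of 32, zero-point fixed at zero), not the actual HIP kernel; the function names are made up.

```python
# Illustrative reference for symmetric int4 quantization with group size 32.
import numpy as np

GROUP_SIZE = 32


def quantize_int4(x: np.ndarray):
    """Quantize a 1-D array (length divisible by 32) to int4 values in [-8, 7]."""
    groups = x.astype(np.float32).reshape(-1, GROUP_SIZE)
    # Symmetric: one scale per group, no zero-point. 7 is the largest
    # positive int4 value, so the group max maps to +/-7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)
```

In a quantized all-reduce, each rank would quantize its shard before exchanging it, then dequantize and accumulate, trading a small precision loss for much less interconnect traffic.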
Quickreduce can be enabled by setting the `VLLM_ROCM_QR_QUANT_REGIME=[NONE|FP|INT8|INT6|INT4]` env variable; quickreduce supports int8, int6, and int4 quantization. The PR provides fp16 and bf16 kernels, but given the lack of intrinsics for bf16 math operations, the bf16 kernels perform worse (see the kernel benchmark results below), so by default bf16 all-reduce inputs are converted to fp16. To disable this behavior, set the `VLLM_ROCM_QR_CAST_BF16_TO_FP16=0` env variable. Since quickreduce only shows performance benefits at medium and large input sizes (see the kernel benchmarks), vLLM keeps using custom allreduce for small inputs; the lower bounds for enabling quickreduce were chosen empirically.
The maximum input size for quickreduce is 2 GB.
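Putting the dispatch rules above together, the engine-level decision could look roughly like this. The env variable names and the 2 GB upper bound come from the PR; the 1 MiB lower bound and both function names are hypothetical placeholders (the real lower bounds were chosen empirically per regime).

```python
# Hedged sketch of the engine-level quickreduce dispatch described above.
import os

QR_MAX_BYTES = 2 * 1024 ** 3  # 2 GB upper bound (from the PR)
QR_MIN_BYTES = 1 * 1024 ** 2  # hypothetical lower bound, not the real one


def should_use_quickreduce(num_bytes: int, dtype: str) -> bool:
    regime = os.environ.get("VLLM_ROCM_QR_QUANT_REGIME", "NONE")
    if regime == "NONE":
        return False  # quickreduce disabled
    if dtype not in ("float16", "bfloat16"):
        return False  # only fp16/bf16 inputs are supported
    # Small inputs stay on custom allreduce; huge ones exceed the QR limit.
    return QR_MIN_BYTES <= num_bytes <= QR_MAX_BYTES


def effective_dtype(dtype: str) -> str:
    # bf16 kernels are slower (no bf16 math intrinsics), so cast to fp16
    # unless VLLM_ROCM_QR_CAST_BF16_TO_FP16=0.
    cast = os.environ.get("VLLM_ROCM_QR_CAST_BF16_TO_FP16", "1") != "0"
    if dtype == "bfloat16" and cast:
        return "float16"
    return dtype
```

The result would then be cast back to bf16 after the reduction when the input was originally bf16.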
Benchmark results (float16)
Server:

```shell
VLLM_USE_V1=1 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --block_size=32 --disable-log-requests --no-enable-prefix-caching -tp $tp --dtype float16
```

Client:

```shell
python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-70B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts 500 --request-rate 10 --ignore-eos
```

TP=8
TP=4
bfloat16 kernels (`--dtype bfloat16`; the fp16 kernel results in the table were obtained with `VLLM_ROCM_QR_CAST_BF16_TO_FP16=1`):
TP=4
Kernel benchmarks
TP=2
TP=4
Evaluation results on the MMLU benchmark (Llama 3.1 70B, TP=8)