[Kernel][Triton][AMD] Change default block size for triton_scaled_mm to 128 for 3-5x speedup #11698
Changed the default block size for triton_scaled_mm from 32x32x32 to 128x128x128 for better performance. This results in roughly a 3-5x speedup on the benchmarks below.
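For intuition: larger tiles mean each Triton program instance covers more of the output matrix, so far fewer programs are launched and each one reuses loaded data across more multiply-accumulates. Below is a standalone sketch of the grid-size arithmetic only (not the actual kernel-launch code); the block_size_m/block_size_n names just mirror the wrapper's keyword arguments, and the shapes are made up for the example.

```python
# Illustrative only: how the launch grid shrinks when the output tile grows.
# block_size_m / block_size_n mirror the wrapper's keyword arguments; the
# shapes below are made up for the example.
from math import ceil

def num_programs(M: int, N: int, block_size_m: int, block_size_n: int) -> int:
    """One Triton program per block_size_m x block_size_n output tile."""
    return ceil(M / block_size_m) * ceil(N / block_size_n)

M, N = 64 * 2048, 17920              # batch 64 x 2048 prefill tokens, example N
old = num_programs(M, N, 32, 32)     # previous default tile: 32x32
new = num_programs(M, N, 128, 128)   # new default tile: 128x128
print(old, new, old / new)           # 16x fewer programs, each doing 16x the work
```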
Latency benchmark:
python benchmarks/benchmark_latency.py --dtype bfloat16 --enable-chunked-prefill False --load-format dummy --batch-size 64 --num-iters-warmup 2 --num-iters 5 --input-len 2048 --output-len 128 --model /models/Phi-3-medium-128k-instruct-quantized.w8a8/
Before: Avg latency: 14.48 seconds
After: Avg latency: 5.52 seconds

Throughput benchmark:
python benchmarks/benchmark_throughput.py --dtype bfloat16 --enable-chunked-prefill False --load-format dummy --input-len 2048 --output-len 128 --model /models/Phi-3-medium-128k-instruct-quantized.w8a8/
Before: Throughput: 10269.32 tok/s
After: Throughput: 31150.8 tok/s
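To compare the two defaults directly, the wrapper can also be called with explicit tile sizes. The import path, function name, and block_size_* keyword arguments below are assumed from the pre-PR wrapper signature, and the shapes and scales are made up; treat this as a sketch, not a reproduction script.

```python
# Hedged sketch: calling triton_scaled_mm with explicit tile sizes on an
# int8 GEMM. The import path and keyword names are assumptions taken from
# the pre-PR wrapper signature; verify against triton_scaled_mm.py.
import torch
from vllm.model_executor.layers.quantization.compressed_tensors.triton_scaled_mm import (
    triton_scaled_mm)

M, K, N = 4096, 5120, 17920  # made-up shapes in the Phi-3-medium range
a = torch.randint(-128, 127, (M, K), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (K, N), dtype=torch.int8, device="cuda")
scale_a = torch.rand((M, 1), dtype=torch.float32, device="cuda")  # per-row scale
scale_b = torch.rand((N, 1), dtype=torch.float32, device="cuda")  # per-column scale

# Old 32x32x32 defaults vs. the 128x128x128 defaults introduced by this PR.
out_old = triton_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
                           block_size_m=32, block_size_n=32, block_size_k=32)
out_new = triton_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
                           block_size_m=128, block_size_n=128, block_size_k=128)

# Tile size should not change the math, only the speed.
torch.testing.assert_close(out_old, out_new, atol=1.0, rtol=1e-2)
```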