Optimize the fp-dequantizer to get high memory-BW utilization (#5373)
This PR removes the for loop inside the dequantizer kernel and instead launches as many threads and blocks as needed to dequantize the quantized matrix. The previous implementation processed one group per thread block, which reduces efficiency for smaller group sizes, and it also processed more data per thread than necessary; exposing more parallelism improves dequantization performance. Based on my testing, for a 4K by 4K matrix, dequantizing from fp8 to bf16 gives a 2.5x speedup (improving BW efficiency from 1 TB/s to 2.5 TB/s on an NVIDIA H100 GPU). --------- Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
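To illustrate the flat, element-per-thread approach described above, here is a minimal CUDA sketch. It is not the code from this commit: the kernel name, the separate `scales` array (one fp32 scale per quantization group), and the fp8 e4m3 input format are all assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp8.h>
#include <cuda_bf16.h>

// Sketch of a flat dequantization kernel: one thread per quantized element.
// Each thread derives its group index from its global element index, so no
// per-block loop over a group's elements is needed. The `scales` layout
// (one fp32 scale per group) is an assumption for illustration.
__global__ void dequant_fp8_to_bf16(__nv_bfloat16* out,
                                    const __nv_fp8_e4m3* in,
                                    const float* scales,
                                    int num_elems,
                                    int group_size)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_elems) {
        // Look up the per-group scale for this element's group.
        const float scale = scales[idx / group_size];
        // Convert fp8 -> fp32, apply the scale, then narrow to bf16.
        out[idx] = __float2bfloat16(static_cast<float>(in[idx]) * scale);
    }
}

// Host-side launch: cover the whole matrix with one thread per element,
// rather than dedicating one thread block to each quantization group.
void launch_dequant(__nv_bfloat16* out,
                    const __nv_fp8_e4m3* in,
                    const float* scales,
                    int num_elems,
                    int group_size,
                    cudaStream_t stream)
{
    const int threads = 256;
    const int blocks = (num_elems + threads - 1) / threads;
    dequant_fp8_to_bf16<<<blocks, threads, 0, stream>>>(
        out, in, scales, num_elems, group_size);
}
```

A production kernel chasing peak memory bandwidth would likely also use vectorized (e.g., 128-bit) loads and stores so each thread moves several elements per memory transaction; the scalar accesses above are kept for clarity.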
1 parent: cc9e7b9 · commit: a8b8215
Showing 1 changed file with 68 additions and 100 deletions.