Commit c9debf7
Optimize FlashMask v3 performance (PaddlePaddle#75737)
* tune bwd tile size
* tune bwd tile size for seqlen <= 8192
* fix CUDA error 700 caused by incorrect bwd tile size
* set scheduler_needs_semaphore to true
* update fa submodule
* update fa submodule
* update fa submodule
* update fa submodule

1 parent 33eff52 commit c9debf7
File tree
3 files changed, +6 -10 lines changed
- paddle/phi/kernels/gpu
- third_party

| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| | | | |
| 1026 | 1026 | | |
| 1027 | 1027 | | |
| 1028 | 1028 | | |
| 1029 | | - | |
| 1030 | | - | |
| 1031 | | - | |
| 1032 | | - | |
| 1033 | | - | |
| | 1029 | + | |
| | 1030 | + | |
| | 1031 | + | |
| 1034 | 1032 | | |
| 1035 | 1033 | | |
| 1036 | 1034 | | |
| | | | |
| 1040 | 1038 | | |
| 1041 | 1039 | | |
| 1042 | 1040 | | |
| 1043 | | - | |
| | 1041 | + | |
| 1044 | 1042 | | |
| 1045 | 1043 | | |
| 1046 | 1044 | | |

| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| | | | |
| 1762 | 1762 | | |
| 1763 | 1763 | | |
| 1764 | 1764 | | |
| 1765 | | - | |
| 1766 | | - | |
| 1767 | | - | |
| | 1765 | + | |
| 1768 | 1766 | | |
| 1769 | 1767 | | |
| 1770 | 1768 | | |

- csrc/CMakeLists.txt +5 -3
- csrc/flashmask_v2/flash.h +2 -1
- csrc/flashmask_v2/flash_api.cu +40 -38
- csrc/flashmask_v2/flash_bwd_launch_template.h +42 -39
- csrc/flashmask_v2/flash_fwd_kernel_sm90.h +57 -29
- csrc/flashmask_v2/flash_fwd_launch_template.h +24 -1
- csrc/flashmask_v2/flash_prepare_scheduler.cu +22 -1
- csrc/flashmask_v2/generate_kernels.py +9 -6
- csrc/flashmask_v2/mainloop_bwd_sm90_tma_gmma_ws.hpp +13 -9
- csrc/flashmask_v2/mainloop_fwd_sm90_tma_gmma_ws.hpp +369 -291
- csrc/flashmask_v2/named_barrier.hpp +17 -12
- csrc/flashmask_v2/static_switch.h +23
- csrc/flashmask_v2/tile_scheduler.hpp +287 -5