Falcon optimization #974
1. Add new args: use_flash_attention, flash_attention_recompute, flash_attention_causal_mask
2. Add an extra mark_step per decoder layer
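As a rough illustration of how the three new arguments might be exposed, here is a minimal sketch using `argparse`. It assumes they are boolean command-line switches; the PR text does not say how they are wired in, so the `build_parser` helper, the `--` flag spelling, and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the three argument names come from the PR."""
    parser = argparse.ArgumentParser(description="Illustrative Falcon generation options")
    # Assumption: each new argument is an on/off switch that defaults to off.
    parser.add_argument(
        "--use_flash_attention",
        action="store_true",
        help="Assumed meaning: route attention through the fused/flash kernel.",
    )
    parser.add_argument(
        "--flash_attention_recompute",
        action="store_true",
        help="Assumed meaning: enable recompute mode in the flash attention kernel.",
    )
    parser.add_argument(
        "--flash_attention_causal_mask",
        action="store_true",
        help="Assumed meaning: let the flash attention kernel apply a causal mask.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Using `store_true` keeps the existing behavior as the default, so the new code paths would only run when a flag is passed explicitly.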
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Measurement:
Actual run:
For 2048->2048, Falcon 40B with BS 8:
With BS 12:
…um_kv_heads, not num_attention_heads. Improved performance by removing the broadcast, as HPU can handle broadcasting in FusedSDPA.
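To make the idea behind removing the broadcast concrete, here is a minimal plain-Python sketch, not the HPU kernel. It assumes a grouped-query layout in which several query heads share one key/value head, and it compares explicitly expanding the KV heads to one per query head with indexing the shared KV head directly. All function names are hypothetical.

```python
import math
import random


def attend(q, k, v):
    """Single-head scaled dot-product attention on nested lists.

    q is [q_len][d]; k and v are [kv_len][d]. Returns [q_len][d].
    """
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total for c in range(d)])
    return out


def attention_with_expanded_kv(q_heads, k_heads, v_heads):
    """Expand KV to one entry per query head first (stands in for the explicit broadcast)."""
    group = len(q_heads) // len(k_heads)
    k_expanded = [k for k in k_heads for _ in range(group)]
    v_expanded = [v for v in v_heads for _ in range(group)]
    return [attend(q, k, v) for q, k, v in zip(q_heads, k_expanded, v_expanded)]


def attention_with_shared_kv(q_heads, k_heads, v_heads):
    """Keep KV at num_kv_heads and look up the shared head for each query head."""
    group = len(q_heads) // len(k_heads)
    return [attend(q, k_heads[h // group], v_heads[h // group]) for h, q in enumerate(q_heads)]


def random_heads(num_heads, seq_len, head_dim, rng):
    """Build [num_heads][seq_len][head_dim] test data."""
    return [[[rng.random() for _ in range(head_dim)] for _ in range(seq_len)] for _ in range(num_heads)]


if __name__ == "__main__":
    rng = random.Random(0)
    q = random_heads(4, 3, 8, rng)  # 4 query heads
    k = random_heads(2, 5, 8, rng)  # 2 KV heads, each shared by 2 query heads
    v = random_heads(2, 5, 8, rng)
    assert attention_with_expanded_kv(q, k, v) == attention_with_shared_kv(q, k, v)
```

Both variants produce the same output, so a kernel that handles the broadcast internally can take the smaller `num_kv_heads`-shaped tensors without the caller materializing the expanded copies.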
306cc7e to 04618b3
Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
What does this PR do?
Fixes # (issue)
Before submitting