🚨 [FA4] Initial support#42435

Merged
vasqu merged 42 commits into huggingface:main from vasqu:fa4-support
Mar 13, 2026

Conversation

@vasqu
Contributor

@vasqu vasqu commented Nov 26, 2025

🚨 Breaking change

  • FA2 is only supported from version 2.3.3 onward

This is because anything older is more than two years old (we deprecate torch versions on a two-year cycle, for example) and supporting it carries a fairly high maintenance burden.

Related issues and PRs

Fixes #42405
Closes #42404, as it carries a lot of unnecessary logic and tests alongside it

Testing

Sanity Testing

RUN_SLOW=1 pytest tests/models/llama/test_modeling_llama.py -k flash
  • Passes all flash attention 4 tests

First quick numbers (Hopper)

# No attention mask (base fa)
# RUN_SLOW=1 pytest -s tests/generation/test_flash_attention_parity.py
Latency:
    With FA2: 381.5204345703125
    With FA3: 362.461669921875
    With FA4: 373.788427734375

# With attention mask (varlen fa)
Latency:
    With FA2: 509.337646484375
    With FA3: 476.020654296875
    With FA4: 476.72578125

NOTE: FA4 is optimized for Blackwell; these are quick numbers on Hopper --> on varlen it is faster than FA2 and roughly on par with FA3, while on non-varlen it lands between FA2 and FA3
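For reproducing latency comparisons like the ones above, a minimal CUDA-event timing harness along these lines can be used (a sketch, not the actual parity test; `model` and `inputs` stand for whatever you benchmark, and the warmup/iteration counts are assumptions):

```python
import torch

def time_generate_ms(model, inputs, warmup=3, iters=10, max_new_tokens=64):
    """Average generation latency in milliseconds via CUDA events (sketch)."""
    for _ in range(warmup):  # warm up kernels and compile caches first
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    end.record()
    torch.cuda.synchronize()  # elapsed_time is only valid after a sync
    return start.elapsed_time(end) / iters
```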

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor Author

vasqu commented Nov 28, 2025

FA4 support cc @stas00 if you wanna play around with this PR. It's pretty much ready, just not convinced by the numbers but I also don't have quick access to a blackwell GPU (at least today :D)

Contributor

@sfc-gh-sbekman sfc-gh-sbekman left a comment


Thank you for working on this, Anton. Going to try it out.

To make it easier to try your PR please add to the OP how to install FA4, since it's non-trivial to find.

git clone https://github.com/Dao-AILab/flash-attention/
cd flash-attention
cd flash_attn/cute
uv build --wheel . -v --no-build-isolation --out-dir flash-attention/wheels
uv pip install flash-attention/wheels/flash_attn_cute*.whl --prerelease=allow
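To verify the wheel is actually picked up, a quick availability check can help (a sketch; the `flash_attn.cute` import path is inferred from the wheel layout above):

```python
import importlib.util

def fa4_available() -> bool:
    # FA4 lives in the CuTe-DSL subpackage flash_attn.cute
    try:
        return importlib.util.find_spec("flash_attn.cute") is not None
    except ModuleNotFoundError:
        # flash_attn itself is not installed
        return False

print(fa4_available())
```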

@sfc-gh-sbekman
Contributor

sfc-gh-sbekman commented Dec 1, 2025

OK, gave it a test ride using your PR and the above comment's install of FA4 on B200.

I did a quick test with Llama-8b and the integration worked smoothly, but the tflops performance is much worse than FA2 - 2-5x slower. Not sure if it's an issue with the integration, the FA4 code, or the pytorch version - most likely upstream, since the integration is just a wrapper.

I tried pt-2.9.1-cu130 and pt-nightly-cu130 - same outcome

edit: sdpa in pt-nightly, which supposedly backported FA4, is about 3x faster than standalone FA4 on the same llama-8b - since they should both be using the same code, perhaps there is an issue with the integration?
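For context on the TFLOPs comparison, self-attention FLOPs can be estimated with the standard 4·s²·h·d count per layer; a sketch with assumed Llama-8B-ish shapes (32 layers, 32 heads, head_dim 128 - assumptions for illustration, not the measured config):

```python
def attn_flops(batch, seqlen, heads, head_dim, layers, causal=True):
    # QK^T and PV each cost 2*s^2*d MACs per head -> 4*s^2*d FLOPs
    flops = 4 * batch * seqlen**2 * heads * head_dim * layers
    return flops // 2 if causal else flops  # causal masking skips half the tiles

tf = attn_flops(1, 4096, 32, 128, 32) / 1e12
print(f"{tf:.2f} TFLOPs forward attention")  # 4.40 TFLOPs
```

Dividing that by the measured step time gives the achieved TFLOPs figure being compared between FA2 and FA4.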

@vasqu
Contributor Author

vasqu commented Dec 1, 2025

Thanks for checking this out and all the pointers @sfc-gh-sbekman ❤️

To make it easier to try your PR please add to the OP how to install FA4, since it's non-trivial to find.

For sure, I'll add some docs for FA4 before release. Maybe also FA3 in a different PR.

I did a quick test with Llama-8b and the integration worked smoothly but the tflops performance is much worse than FA2 - 2-5x slower. Not sure if it's an issue with integration or the FA4 code or the pytorch version - most likely the upstream since the integration is just a wrapper

Shoot, so it wasn't a GPU arch issue... This is weird

sdpa in pt-nightly, which supposedly backported FA4, is about 3x faster than FA4 on its own using the same llama-8b - since they both should be using the same code, perhaps there is an issue with the integration?

Do you have a code snippet? There are so many edge cases with sdpa that maybe it's not even entering the FA backend path? This can be checked quickly by restricting SDPA to that backend with its context manager:

import torch

with torch.nn.attention.sdpa_kernel([torch.nn.attention.SDPBackend.FLASH_ATTENTION]):
    pass  # do your thing here; unsupported inputs now raise instead of silently falling back

I'm also unsure how FA4 is integrated in SDPA? Do we need to use a flag there? I remember that cudnn backend needed special treatment

@stas00
Contributor

stas00 commented Dec 2, 2025

Shoot, so it wasn't an GPU arch issue... This is weird

Did you mean that you too have observed a similar slowdown?

Do you have a code snippet?

I was just using https://github.com/snowflakedb/ArcticTraining/ normal SFT training recipe where I tried different attention mechanisms. Just normal fwd/bwd/step - nothing special added.

I'm also unsure how FA4 is integrated in SDPA? Do we need to use a flag there? I remember that cudnn backend needed special treatment

They copied/adapted the FA4 kernels see: #42435 - you'd need pt nightly for that to work.

@vasqu
Contributor Author

vasqu commented Dec 2, 2025

Did you mean that you too have observed a similar slowdown?

I just did some quick numbers on inference - see the test noted in the PR description. I used an H100 there and, as you can see, it's slower (though not by the same magnitude as in your samples - I'd say it's a mixture of model size / context size).

I was just using https://github.com/snowflakedb/ArcticTraining/ normal SFT training recipe where I tried different attention mechanisms. Just normal fwd/bwd/step - nothing special added.

Gotcha, I will try to separate our implementation from the base fn of the FA library to see whether our wrappers cause this or whether some perf regression happened somewhere else.

They copied/adapted the FA4 kernels see: #42435 - you'd need pt nightly for that to work.

Wrong link? My assumption / hunch was that maybe

  1. Even nightly might not use FA4 by default and sticks to FA2, i.e. it might need some extra flags to enable that specific backend. But that's just my feeling, I need to look into it.
  2. Our implementation has some issues where attention masks are created even when they are not needed (full (causal) attention). If a mask is passed to SDPA, the FA backend can never be entered per their restrictions. So I thought that maybe we hit this case (SDPA with xformers being faster than FA4 - xformers is not so bad on short contexts <2k).
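Point 2 can be checked cheaply: an all-ones padding mask carries no information, so for full causal attention it can be dropped in favor of SDPA's `is_causal=True` fast path, which keeps the FA backend eligible. A minimal sketch of that decision (hypothetical helper for illustration, not the transformers code):

```python
def needs_explicit_mask(padding_mask, is_causal):
    # padding_mask: per-token keep flags (1 = attend), one row per sequence, or None
    if padding_mask is None:
        return False
    # an all-ones mask means no padding: causal attention can use is_causal=True
    # instead, which is a precondition for SDPA's flash backend
    if is_causal and all(all(row) for row in padding_mask):
        return False
    return True

print(needs_explicit_mask([[1, 1, 1], [1, 1, 1]], is_causal=True))   # False
print(needs_explicit_mask([[1, 1, 0], [1, 1, 1]], is_causal=True))   # True
```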

@stas00
Contributor

stas00 commented Dec 2, 2025

My apologies, here is the correct link pytorch/pytorch#167348

@sfc-gh-sbekman
Contributor

sfc-gh-sbekman commented Dec 12, 2025

Some useful updates from talking to Tri:

  • FA4 is supposed to replace FA2 and FA3 and would work with A/H/B archs
  • FA4 varlen support is planned in a few weeks time

@edixiong

edixiong commented Feb 5, 2026

Hi @sfc-gh-sbekman @vasqu Thanks for contributing! I think the varlen is supported. Do you mind testing FA4 again?

@sfc-gh-sbekman
Contributor

Just tried with the main version of FA - bwd doesn't work:

[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 325, in backward
[rank3]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/autograd/__init__.py", line 364, in backward
[rank3]:     _engine_run_backward(
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank3]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply
[rank3]:     return user_fn(self, *args)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/interface.py", line 1385, in backward
[rank3]:     dq, dk, dv = _flash_attn_bwd(
[rank3]:                  ^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/interface.py", line 1059, in _flash_attn_bwd
[rank3]:     _flash_attn_bwd.compile_cache[compile_key] = cute.compile(
[rank3]:                                                  ^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/cute_dsl_utils.py", line 118, in cute_compile_patched
[rank3]:     output = cute_compile_og(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 802, in __call__
[rank3]:     ).launch(
[rank3]:   ^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 1213, in kernel
[rank3]:     if warp_idx >= self.compute_warp_ids[0] and warp_idx <= self.compute_warp_ids[-1]:
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 1215, in then_block_16
[rank3]:     self.compute_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 2062, in compute_loop
[rank3]:     while work_tile.is_valid_tile:
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 2317, in if_region_4
[rank3]:     if process_tile:
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 2339, in then_block_5
[rank3]:     consumer_state_dKV = self.epilogue_dK_or_dV_tma(
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 2877, in epilogue_dK_or_dV_tma
[rank3]:     tdKVtdKV_t2r = self.split_wg(tdKVtdKV_t2r_p, wg_idx, num_wg)[None, None, 0, 0]
[rank3]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py", line 1834, in split_wg
[rank3]:     t = cute.logical_divide(
[rank3]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/nvidia_cutlass_dsl/python_packages/cutlass/_mlir/dialects/_cute_ops_gen.py", line 1805, in __init__
[rank3]:     super().__init__(self.build_generic(attributes=attributes, operands=operands, successors=_ods_successors, regions=regions, loc=loc, ip=ip))
[rank3]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: ValueError: Operation creation failed
loc("t = cute.logical_divide("("/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/flash_bwd_sm100.py":1834:16)): error: failed to perform a valid division of '!cute.layout<"(((32,32),1),1,1,1):(((1,65536),0),0,0,0)">' by #cute.tile<"[1024:1;1:0;1:0;0:1]">

@sfc-gh-sbekman
Contributor

sfc-gh-sbekman commented Feb 24, 2026

Tried again with today's FA4 and a new error is reported on H200 w/ varlen.

[rank6]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/autograd/function.py", line 311, in apply
[rank6]:     return user_fn(self, *args)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/interface.py", line 1395, in backward
[rank6]:     dq, dk, dv = _flash_attn_bwd(
[rank6]:                  ^^^^^^^^^^^^^^^^
[rank6]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/flash_attn/cute/interface.py", line 617, in _flash_attn_bwd
[rank6]:     assert not is_varlen, "varlen backward is not yet supported on sm90"
[rank6]:            ^^^^^^^^^^^^^
[rank6]: AssertionError: varlen backward is not yet supported on sm90

Non-varlen works, but I haven't measured the performance yet.
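Until varlen backward lands for Hopper upstream, user code can guard that training path with a capability check (a sketch; the sm90 limit comes from the assertion in the traceback above, and "sm100 = Blackwell" is an assumption about where it is supported):

```python
import torch

def fa4_varlen_bwd_supported() -> bool:
    # varlen backward currently asserts out on sm90 (Hopper, e.g. H100/H200);
    # Blackwell is sm100, i.e. compute capability major >= 10
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10
```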

Collaborator

@ArthurZucker ArthurZucker left a comment


Ty!

@vasqu vasqu changed the title [FA4] Initial support 🚨 [FA4] Initial support Mar 13, 2026
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma3, gpt_oss, sam3

@vasqu vasqu added this pull request to the merge queue Mar 13, 2026
Merged via the queue into huggingface:main with commit 65db6fc Mar 13, 2026
28 checks passed
@vasqu vasqu deleted the fa4-support branch March 13, 2026 19:32
@imstevenpmwork
Contributor

imstevenpmwork commented Mar 27, 2026

Hey @vasqu 👋 Just a quick heads-up that the API changes in this PR (shipped in the 5.4.0 release) introduced a breaking change for us.

Over at lerobot, we were relying on is_flash_attn_greater_or_equal_2_10, which looks like it was removed during the import_utils.py refactor. Just wanted to flag this to save some debugging time in case anyone else is suddenly staring at a red CI!

Cheers!

Second thought:
Maybe we can add it to the release notes. Users should move from is_flash_attn_greater_or_equal_2_10 to is_flash_attn_greater_or_equal("2.10")
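For downstream code that wants to keep working across transformers versions without importing either helper, the check can also be reimplemented locally (a sketch mirroring what the helper does, not the transformers source; packaging is assumed to be installed, as it is a transformers dependency):

```python
from importlib.metadata import PackageNotFoundError, version as installed_version

from packaging.version import parse

def is_flash_attn_greater_or_equal(min_version: str) -> bool:
    # compare the installed flash-attn wheel version against a PEP 440 string
    try:
        return parse(installed_version("flash_attn")) >= parse(min_version)
    except PackageNotFoundError:
        return False

print(is_flash_attn_greater_or_equal("2.10"))
```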

@ArthurZucker
Collaborator

ArthurZucker commented Mar 27, 2026

@vasqu let's add BC unless we had a deprecation cycle

@vasqu
Contributor Author

vasqu commented Mar 27, 2026

Check out #45061, mb


Development

Successfully merging this pull request may close these issues.

Integrate FA4 (Flash Attention for Blackwell) into HF Transformers

9 participants