[Cute,Flex,Fwd] Allow vectorized score_mod definitions by reubenconducts · Pull Request #2236 · Dao-AILab/flash-attention

reubenconducts · 2026-02-05T16:48:58Z

[Reupload of #2215 with Sm90] This PR is a score_mod "power user" update that lets the user specify the vectorization width for a given score_mod. It does so in two ways:

1. Set score_mod.__vec_size__, which the kernel reads instead of applying the current default (vec_size = 2 if no aux_tensors are present, otherwise 1).
2. Set buf.__assumed_align__ and buf.__leading_dim__ on any of the aux_tensors, which allows vectorized loads from that tensor inside the score_mod.

These options are not exposed in the API; they must be set per score_mod and per aux tensor, which is why this is a "power user" feature.
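As a rough illustration of how these attributes are meant to be used, here is a minimal plain-Python sketch. It is not the kernel's code: `resolve_vec_size`, `FakeTensor`, and the placeholder `kv_bias_score_mod` are hypothetical names, and only the attribute names and the default rule (2 without aux_tensors, otherwise 1) come from the description above.

```python
from typing import Callable, Optional, Sequence


class FakeTensor:
    """Hypothetical stand-in for an aux tensor (real code would pass a torch.Tensor)."""


def resolve_vec_size(score_mod: Callable, aux_tensors: Optional[Sequence] = None) -> int:
    """Hypothetical helper: prefer score_mod.__vec_size__, else fall back to the default rule."""
    default = 2 if not aux_tensors else 1
    return getattr(score_mod, "__vec_size__", default)


def kv_bias_score_mod(*args):
    # Placeholder body; the real signature and body are defined by the library and the user.
    return args[0]


bias = FakeTensor()
bias.__assumed_align__ = 16  # assumed alignment hint for vectorized loads
bias.__leading_dim__ = 0     # which dimension is contiguous

print(resolve_vec_size(kv_bias_score_mod, [bias]))  # 1 (default with aux tensors)

kv_bias_score_mod.__vec_size__ = 8
print(resolve_vec_size(kv_bias_score_mod, [bias]))  # 8 (explicit override)
```

Because the hints live on the callable and on the tensors themselves, nothing in the public call signature has to change.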

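For a kv bias load score_mod, we see up to roughly 2.8x speedup (0.905 ms at vec_size 1 down to 0.321 ms at vec_size 64 with bias.__assumed_align__ == 4 in the runs below):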

bias.__assumed_align__ == None
### headdim = 128, causal = True, seqlen_q = 8192, seqlen = 8192, batch_size = 1, nheads = 16, varlen = True ###
FA Python fwd with vec_size 1: 0.905ms, 303.7 TFLOPS
FA Python fwd with vec_size 2: 0.789ms, 348.3 TFLOPS
FA Python fwd with vec_size 4: 0.515ms, 533.9 TFLOPS
FA Python fwd with vec_size 8: 0.422ms, 651.0 TFLOPS
FA Python fwd with vec_size 16: 0.414ms, 663.4 TFLOPS
FA Python fwd with vec_size 32: 0.403ms, 682.3 TFLOPS
FA Python fwd with vec_size 64: 0.415ms, 661.8 TFLOPS
FA Python fwd with vec_size 128: 0.387ms, 709.6 TFLOPS

bias.__assumed_align__ == 4
### headdim = 128, causal = True, seqlen_q = 8192, seqlen = 8192, batch_size = 1, nheads = 16, varlen = True ###
FA Python fwd with vec_size 1: 0.903ms, 304.5 TFLOPS
FA Python fwd with vec_size 2: 0.752ms, 365.6 TFLOPS
FA Python fwd with vec_size 4: 0.458ms, 599.5 TFLOPS
FA Python fwd with vec_size 8: 0.366ms, 751.7 TFLOPS
FA Python fwd with vec_size 16: 0.338ms, 814.4 TFLOPS
FA Python fwd with vec_size 32: 0.330ms, 832.4 TFLOPS
FA Python fwd with vec_size 64: 0.321ms, 855.4 TFLOPS
FA Python fwd with vec_size 128: 0.336ms, 818.4 TFLOPS

bias.__assumed_align__ == 8
### headdim = 128, causal = True, seqlen_q = 8192, seqlen = 8192, batch_size = 1, nheads = 16, varlen = True ###
FA Python fwd with vec_size 1: 0.904ms, 304.1 TFLOPS
FA Python fwd with vec_size 2: 0.749ms, 366.8 TFLOPS
FA Python fwd with vec_size 4: 0.462ms, 594.7 TFLOPS
FA Python fwd with vec_size 8: 0.351ms, 783.2 TFLOPS
FA Python fwd with vec_size 16: 0.328ms, 838.3 TFLOPS
FA Python fwd with vec_size 32: 0.332ms, 827.1 TFLOPS
FA Python fwd with vec_size 64: 0.331ms, 830.1 TFLOPS
FA Python fwd with vec_size 128: 0.912ms, 301.3 TFLOPS

bias.__assumed_align__ == 16
### headdim = 128, causal = True, seqlen_q = 8192, seqlen = 8192, batch_size = 1, nheads = 16, varlen = True ###
FA Python fwd with vec_size 1: 0.904ms, 304.1 TFLOPS
FA Python fwd with vec_size 2: 0.749ms, 366.8 TFLOPS
FA Python fwd with vec_size 4: 0.462ms, 594.7 TFLOPS
FA Python fwd with vec_size 8: 0.351ms, 783.2 TFLOPS
FA Python fwd with vec_size 16: 0.328ms, 838.3 TFLOPS
FA Python fwd with vec_size 32: 0.332ms, 827.1 TFLOPS
FA Python fwd with vec_size 64: 0.331ms, 830.1 TFLOPS
FA Python fwd with vec_size 128: 0.912ms, 301.3 TFLOPS
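
The TFLOPS column can be reproduced from the timings. The sketch below assumes the common FLOP-count convention for forward attention (two matmuls at 2 FLOPs per multiply-add, halved for causal masking) and treats the varlen batch as one full-length sequence; the function names are illustrative, not taken from the benchmark script.

```python
def attention_fwd_flops(batch, nheads, seqlen_q, seqlen_k, headdim, causal):
    # QK^T and PV are two matmuls, each 2 FLOPs per multiply-add -> factor 4.
    flops = 4 * batch * nheads * seqlen_q * seqlen_k * headdim
    # Causal masking skips roughly half of the score matrix.
    return flops // 2 if causal else flops


def tflops(flops, time_ms):
    return flops / (time_ms * 1e-3) / 1e12


flops = attention_fwd_flops(
    batch=1, nheads=16, seqlen_q=8192, seqlen_k=8192, headdim=128, causal=True
)

baseline_ms = 0.905  # vec_size 1, no alignment hint
best_ms = 0.321      # vec_size 64, __assumed_align__ == 4

print(f"{tflops(flops, baseline_ms):.1f} TFLOPS")  # 303.7 TFLOPS
print(f"{baseline_ms / best_ms:.2f}x speedup")     # 2.82x speedup
```

The other rows match to within the rounding of the printed millisecond values.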

Tests check bitwise equality with unvectorized versions of score mods.

Of course, writing a performant score_mod now involves some added complexity, but that complexity is confined to the score_mod definition itself (plus the three attributes mentioned above).

Still TODO, reserved for later PRs:

* Extend this to the backward pass
* Vectorize mask_mod application

cc: @drisspg @v0i0

drisspg · 2026-02-05T20:38:58Z

        )
-
+    if aux_tensors is not None:    
+        aux_tensor_metadata = get_aux_tensor_metadata(aux_tensors)


SUPER DUPER nit; aux_tensor_metadata = get_aux_tensor_metadata(aux_tensors) if aux_tensors else None

all real estate feels quite precious in this file
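
One subtlety of the suggested one-liner: `if aux_tensors` tests truthiness, while the diff tests `is not None`, so the two differ for an empty list. A minimal sketch, where `get_aux_tensor_metadata` is a hypothetical stub (the real helper is defined in the PR) and the default of `None` in the explicit form is an assumption:

```python
def get_aux_tensor_metadata(aux_tensors):
    # Hypothetical stub: collect the per-tensor hints, None when unset.
    return tuple(
        (getattr(t, "__assumed_align__", None), getattr(t, "__leading_dim__", None))
        for t in aux_tensors
    )


def metadata_explicit(aux_tensors):
    # Form used in the diff (assuming the variable defaults to None).
    aux_tensor_metadata = None
    if aux_tensors is not None:
        aux_tensor_metadata = get_aux_tensor_metadata(aux_tensors)
    return aux_tensor_metadata


def metadata_one_liner(aux_tensors):
    # Form suggested in the review comment.
    return get_aux_tensor_metadata(aux_tensors) if aux_tensors else None


print(metadata_explicit(None), metadata_one_liner(None))  # None None
print(metadata_explicit([]), metadata_one_liner([]))      # () None
```

If an empty aux_tensors list is never passed, the two forms are interchangeable.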

drisspg · 2026-02-05T20:39:43Z

+    custom score_mod callables.
+    """
+    assumed_align: int = getattr(t, "__assumed_align__", None)
+    leading_dim: int = getattr(t, "__leading_dim__", None)


I think in the future it would be nice to find these programmatically instead of making them user-facing (potentially)
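
A rough sketch of what deriving these values programmatically could look like, written in plain Python so it runs without a GPU. Both helpers are hypothetical, and measuring alignment in bytes is an assumption; with PyTorch the inputs would come from `tensor.data_ptr()` and `tensor.stride()`.

```python
def infer_assumed_align(address: int, max_align: int = 16) -> int:
    """Largest power-of-two byte alignment (capped at max_align) that divides address."""
    align = 1
    while align < max_align and address % (align * 2) == 0:
        align *= 2
    return align


def infer_leading_dim(strides):
    """Index of the first dimension with unit stride, or None if there is none."""
    for dim, stride in enumerate(strides):
        if stride == 1:
            return dim
    return None


print(infer_assumed_align(0x1000))   # 16
print(infer_assumed_align(0x1004))   # 4
print(infer_leading_dim((8192, 1)))  # 1
```

One caveat with this approach is that a view into a larger buffer can have a weaker alignment than its base allocation, so the check would have to run on the exact tensor passed to the kernel.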

drisspg · 2026-02-05T20:41:29Z

+        * cute.full_like(score, 0.125 * 0.6931471805599453 * 1.4426950408889634)
+    )
+    diff0 = q_idx[0] - kv_idx[0]
+    abs_diff = cute.make_rmem_tensor(kv_idx.shape, diff0.dtype)


should we write a note somewhere that vec_width for fwd score-mod is always encoded in kv_idx shape?
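
To make the convention concrete, here is a plain-Python emulation rather than CuTe DSL code. Fragments are modeled as lists, the vector width is read from the length of `kv_idx`, and all lanes are assumed to share `q_idx[0]`; the function names are illustrative. The final comparison mirrors the PR's test strategy of checking the vectorized result against the unvectorized one.

```python
def rel_bias_scalar(score, q_idx, kv_idx):
    # Unvectorized reference: one score, one (q, kv) index pair.
    return score + float(abs(q_idx - kv_idx))


def rel_bias_vectorized(scores, q_idx, kv_idx):
    vec = len(kv_idx)  # the vector width is carried by the kv_idx fragment
    assert len(scores) == vec
    q0 = q_idx[0]      # assumption: every lane shares the same query index
    return [scores[i] + float(abs(q0 - kv_idx[i])) for i in range(vec)]


scores = [0.5, -1.25, 3.0, 0.0]
q_idx = [10]
kv_idx = [8, 9, 10, 11]

vectorized = rel_bias_vectorized(scores, q_idx, kv_idx)
reference = [rel_bias_scalar(s, q_idx[0], k) for s, k in zip(scores, kv_idx)]
print(vectorized == reference)  # True
```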

drisspg · 2026-02-05T20:42:03Z

+    batch_bias = aux_tensors[0]
+    dtype = batch_bias.element_type
+    b_idx0 = b_idx[0]
+    bias_frag = cute.make_rmem_tensor(1, dtype)


and to triple check: this is actually not vectorized, right?

drisspg · 2026-02-05T20:43:25Z

+    bias_frag = cute.make_rmem_tensor(1, dtype)
+    bias_frag[0] = batch_bias[b_idx0]
+    bias_val = (bias_frag.load()).to(cutlass.Float32)
+    return tSrS_ssa + bias_val


or maybe it is and this + is doing broadcasting? if so should we also have some doc on this pattern for aux_tensor vectorization?

drisspg · 2026-02-05T20:49:44Z

    if hasattr(func, "__cute_hash__"):
        return func.__cute_hash__

+    # __vec_size__ is attr of @cute.jitted mod


Hmm, I think that since set_hash is True, when we change the vec size we are going to early return from line 40 and not actually produce a new kernel, right? Can you check in your tests with the for loops whether this is the case? Or are we producing new Python funcs that don't have __cute_hash__ set?
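
The concern can be reproduced with a small plain-Python sketch. Both functions below are hypothetical and are not the PR's hash callable: the first caches a key on the function and returns it early, the second always folds the current `__vec_size__` into the key.

```python
def cache_key_stale(func):
    # Early return: once cached, later attribute changes are ignored.
    if hasattr(func, "__cute_hash__"):
        return func.__cute_hash__
    key = (func.__qualname__, getattr(func, "__vec_size__", None))
    func.__cute_hash__ = key
    return key


def cache_key_vec_aware(func):
    # Always combine the base key with the current vector width.
    base = getattr(func, "__cute_hash__", func.__qualname__)
    return (base, getattr(func, "__vec_size__", None))


def score_mod_a(score):
    return score


def score_mod_b(score):
    return score


score_mod_a.__vec_size__ = 2
first = cache_key_stale(score_mod_a)
score_mod_a.__vec_size__ = 4
second = cache_key_stale(score_mod_a)
print(first == second)  # True: the cached key hides the new vec size

score_mod_b.__vec_size__ = 2
key_2 = cache_key_vec_aware(score_mod_b)
score_mod_b.__vec_size__ = 4
key_4 = cache_key_vec_aware(score_mod_b)
print(key_2 == key_4)  # False: a new key, so a new kernel would be compiled
```

Creating a fresh Python function for each vec size also avoids the stale key, which is the second possibility raised in the comment.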

drisspg self-requested a review February 5, 2026 17:01

drisspg reviewed Feb 5, 2026

reubenconducts added 9 commits February 9, 2026 22:02

* clean up and add more vectorized tests (fce9870)
* remove commented out change (cf14ef1)
* fix typo (85566ec)
* add aux tensor alignment to compile key (6d9ef84)
* add varlen score mod vec tests (01d228a)
* uncomment test configs (782f9bd)
* sm90 fwd (11447a4)
* update hash callable (ab02dd3)
* format hash callable (04fca59)

reubenconducts force-pushed the rstern/vec-mod branch from fed436a to 04fca59 Compare February 9, 2026 22:14

drisspg reviewed Feb 9, 2026

drisspg approved these changes Feb 9, 2026

shorten vec size tests (eaa4a49)

drisspg merged commit c4d8b06 into Dao-AILab:main Feb 11, 2026

reubenconducts mentioned this pull request Feb 17, 2026

[Cute,Flex,Sm100] vectorized mask_mod #2261

Merged

drisspg mentioned this pull request Apr 1, 2026

Add compress_factor for compressed causal attention #2418

Open

drisspg merged 10 commits into Dao-AILab:main from reubenconducts:rstern/vec-mod
