
Conversation

@jagrit06
Member

@jagrit06 jagrit06 commented Mar 4, 2025

Proposed changes

  • Add support for fused additive, boolean, causal masks
  • Update fast::scaled_dot_product_attention to take a variant for the mask (see the usage sketch below)
  • Add tests for masking
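
A minimal usage sketch of the new mask variants (assuming the Python binding mx.fast.scaled_dot_product_attention mirrors the C++ fast::scaled_dot_product_attention named above; the shapes and sizes below are illustrative only):

  import mlx.core as mx

  B, n_qh, n_kvh, L, D = 1, 32, 8, 1024, 64
  q = mx.random.normal((B, n_qh, L, D), dtype=mx.float16)
  k = mx.random.normal((B, n_kvh, L, D), dtype=mx.float16)
  v = mx.random.normal((B, n_kvh, L, D), dtype=mx.float16)
  scale = D ** -0.5

  # Causal mask passed as a string, handled directly by the fused kernel
  out_causal = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask="causal")

  # Boolean mask: True entries are kept, False entries are masked out
  idx = mx.arange(L)
  bool_mask = idx[:, None] >= idx[None, :]
  out_bool = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=bool_mask)

  # Additive mask: added to the attention scores before the softmax
  add_mask = mx.zeros((L, L), dtype=mx.float16)
  out_add = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=add_mask)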

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@jagrit06 jagrit06 changed the title from "[WIP] Support fused masking in Attention" to "Support fused masking in Attention" Mar 19, 2025
@jagrit06 jagrit06 marked this pull request as ready for review March 19, 2025 20:31
@jagrit06
Member Author

Adding a whole slew of numbers here:

  1,    32,    32,   64,   32,    32, 0, float16,     None,  0.023,  0.013, +82.21%
  1,    32,    32,   64,   32,    32, 0, float16,     bool,  0.027,  0.013, +104.46%
  1,    32,    32,   64,   32,    32, 0, float16,   causal,  0.026,  0.012, +121.84%
  1,    64,    64,   64,   32,    32, 0, float16,     None,  0.022,  0.013, +74.93%
  1,    64,    64,   64,   32,    32, 0, float16,     bool,  0.025,  0.013, +88.46%
  1,    64,    64,   64,   32,    32, 0, float16,   causal,  0.025,  0.012, +113.23%
  1,   128,   128,   64,   32,    32, 0, float16,     None,  0.025,  0.015, +66.83%
  1,   128,   128,   64,   32,    32, 0, float16,     bool,  0.029,  0.016, +78.47%
  1,   128,   128,   64,   32,    32, 0, float16,   causal,  0.031,  0.015, +105.67%
  1,   256,   256,   64,   32,    32, 0, float16,     None,  0.036,  0.023, +56.15%
  1,   256,   256,   64,   32,    32, 0, float16,     bool,  0.047,  0.025, +85.18%
  1,   256,   256,   64,   32,    32, 0, float16,   causal,  0.046,  0.021, +114.84%
  1,   512,   512,   64,   32,    32, 0, float16,     None,  0.080,  0.057, +40.95%
  1,   512,   512,   64,   32,    32, 0, float16,     bool,  0.105,  0.064, +63.60%
  1,   512,   512,   64,   32,    32, 0, float16,   causal,  0.097,  0.044, +119.39%
  1,  1024,  1024,   64,   32,     8, 0, float16,     None,  0.226,  0.171, +32.34%
  1,  1024,  1024,   64,   32,     8, 0, float16,     bool,  0.317,  0.195, +62.47%
  1,  1024,  1024,   64,   32,     8, 0, float16,   causal,  0.274,  0.115, +138.78%
  1,  2048,  2048,   64,   32,     8, 0, float16,     None,  0.798,  0.594, +34.42%
  1,  2048,  2048,   64,   32,     8, 0, float16,     bool,  1.173,  0.687, +70.64%
  1,  2048,  2048,   64,   32,     8, 0, float16,   causal,  1.033,  0.376, +174.37%
  1,  4096,  4096,   64,   32,     8, 0, float16,     None,  2.963,  2.245, +31.99%
  1,  4096,  4096,   64,   32,     8, 0, float16,     bool,  4.439,  2.603, +70.51%
  1,  4096,  4096,   64,   32,     8, 0, float16,   causal,  3.918,  1.325, +195.76%
  1,  1024,  1024,   80,   32,     8, 0, float16,     None,  0.309,  0.217, +42.42%
  1,  1024,  1024,   80,   32,     8, 0, float16,     bool,  0.400,  0.250, +59.73%
  1,  1024,  1024,   80,   32,     8, 0, float16,   causal,  0.360,  0.138, +160.75%
  1,  2048,  2048,   80,   32,     8, 0, float16,     None,  1.096,  0.759, +44.52%
  1,  2048,  2048,   80,   32,     8, 0, float16,     bool,  1.471,  0.879, +67.41%
  1,  2048,  2048,   80,   32,     8, 0, float16,   causal,  1.332,  0.459, +189.97%
  1,  4096,  4096,   80,   32,     8, 0, float16,     None,  4.169,  2.884, +44.55%
  1,  4096,  4096,   80,   32,     8, 0, float16,     bool,  5.646,  3.356, +68.27%
  1,  4096,  4096,   80,   32,     8, 0, float16,   causal,  5.123,  1.617, +216.77%
  1,  1024,  1024,  128,   32,     8, 0, float16,     None,  0.343,  0.332, +3.25%
  1,  1024,  1024,  128,   32,     8, 0, float16,     bool,  0.435,  0.350, +24.13%
  1,  1024,  1024,  128,   32,     8, 0, float16,   causal,  0.392,  0.202, +94.12%
  1,  2048,  2048,  128,   32,     8, 0, float16,     None,  1.229,  1.182, +3.97%
  1,  2048,  2048,  128,   32,     8, 0, float16,     bool,  1.607,  1.238, +29.78%
  1,  2048,  2048,  128,   32,     8, 0, float16,   causal,  1.462,  0.673, +117.22%
  1,  4096,  4096,  128,   32,     8, 0, float16,     None,  4.698,  4.541, +3.45%
  1,  4096,  4096,  128,   32,     8, 0, float16,     bool,  6.174,  4.737, +30.33%
  1,  4096,  4096,  128,   32,     8, 0, float16,   causal,  5.653,  2.431, +132.52%

@jagrit06
Member Author

The key highlights are in the longer sequences, for example at head dim 128:

  1,  2048,  2048,  128,   32,     8, 0, float16,     None,  1.229,  1.182, +3.97%
  1,  2048,  2048,  128,   32,     8, 0, float16,     bool,  1.607,  1.238, +29.78%
  1,  2048,  2048,  128,   32,     8, 0, float16,   causal,  1.462,  0.673, +117.22%
  1,  4096,  4096,  128,   32,     8, 0, float16,     None,  4.698,  4.541, +3.45%
  1,  4096,  4096,  128,   32,     8, 0, float16,     bool,  6.174,  4.737, +30.33%
  1,  4096,  4096,  128,   32,     8, 0, float16,   causal,  5.653,  2.431, +132.52%

The causal masked version takes around 60% of the time taken by the unmasked version.

@awni
Member

awni commented Mar 19, 2025

Do you mind sharing labels for those columns? The numbers look amazing 🚀 but I'm trying to understand the nuances a bit more.

Member

@angeloskath angeloskath left a comment

🚀🚀🚀

Looks great and the results are sweet!

@jagrit06
Member Author

Do you mind sharing labels for those columns? The numbers look amazing 🚀 but I'm trying to understand the nuances a bit more.

Yes, I thought they had copied over:

  B,   qsl,   ksl, hdim, n_qh, n_kvh, t,   dtype,     mask, t_unfs, t_fuse, diff%

So the table would be

  B,   qsl,   ksl, hdim, n_qh, n_kvh, t,   dtype,     mask, t_unfs, t_fuse, diff%
  1,  2048,  2048,  128,   32,     8, 0, float16,     None,  1.229,  1.182, +3.97%
  1,  2048,  2048,  128,   32,     8, 0, float16,     bool,  1.607,  1.238, +29.78%
  1,  2048,  2048,  128,   32,     8, 0, float16,   causal,  1.462,  0.673, +117.22%
  1,  4096,  4096,  128,   32,     8, 0, float16,     None,  4.698,  4.541, +3.45%
  1,  4096,  4096,  128,   32,     8, 0, float16,     bool,  6.174,  4.737, +30.33%
  1,  4096,  4096,  128,   32,     8, 0, float16,   causal,  5.653,  2.431, +132.52%
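
(Reading the column abbreviations: B is the batch size, qsl/ksl are the query/key sequence lengths, hdim is the head dim, n_qh/n_kvh are the query and key/value head counts, and t_unfs/t_fuse are the unfused and fused timings. diff% works out to (t_unfs - t_fuse) / t_fuse * 100; for the 2048 causal row, (1.462 - 0.673) / 0.673 gives roughly +117.2%, matching the table.)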

@angeloskath
Member

Hm, the test failure is weird. Can you check whether it is a numerical tolerance issue and maybe set a fixed seed? After that, can't wait for you to merge :-)

@jagrit06
Member Author

Hm, the test failure is weird. Can you check whether it is a numerical tolerance issue and maybe set a fixed seed? After that, can't wait for you to merge :-)

It goes away after re-runs. I probably just need to add a fixed random seed to the tests to fix it up.
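
A minimal sketch of that kind of fix (assuming the Python tests use mlx.core.random; the test class name here is made up):

  import unittest
  import mlx.core as mx

  class TestSDPAMask(unittest.TestCase):
      def setUp(self):
          # Fix the random seed so tolerance checks are reproducible across runs
          mx.random.seed(0)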

@jagrit06 jagrit06 merged commit 9adcd1a into main Mar 20, 2025
5 checks passed
@jagrit06 jagrit06 deleted the attn-mask branch March 20, 2025 18:01
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request Oct 30, 2025
* Update API to allow mask='causal' in fast::sdpa

* Add fallback

* Update steel::AttnParams

* Fix typo

* WIP, basic causal

* Update tests

* Update benchmarking

* Update masking loop limits

* Add bool masking and update tests

* Update additive mask

* Update benchmarks

* Update benchmarks

* Update tests

* Update for bfloat error

* Update early exit

* Add random seed to tests
