
Add attention variants and backend guide #2

Merged

sunway513 merged 1 commit into main from docs/attention-variants-guide on Feb 7, 2026

Conversation

@sunway513 (Owner)

Summary

  • Add comprehensive user-facing documentation for all attention variants in AITER
  • Cover MHA (Flash Attention), Paged Attention (decode + prefill), MLA (Multi-head Latent Attention), Unified Attention, and specialized variants (Lean, HSTU, Sparse, Chunked)
  • Include backend support matrices (ASM vs CK vs Triton), data type coverage, KV cache quantization options, and fused operation catalog

Highlights

  • Quick reference table helping users pick the right attention variant for their use case
  • Decision tree for backend selection (training vs inference, model type, GPU arch)
  • Data type matrices per variant and backend (BF16, FP16, FP8, INT8)
  • KV cache quantization guide with precision levels and memory savings
  • Practical API examples for MHA, PA decode/prefill, and MLA (an illustrative call is sketched after this list)
  • GPU architecture support summary (MI300X vs MI350 vs other)
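
For flavor, a minimal MHA call of the kind the guide's API examples cover. This is a sketch only: it assumes AITER mirrors the flash-attn interface, and both the `aiter` import path and the `flash_attn_func` signature here are assumptions, not verified against the guide.

```python
# Illustrative MHA sketch, assuming aiter mirrors the flash-attn interface.
# `flash_attn_func` and its import path are assumptions, not verified here.
import torch
from aiter import flash_attn_func  # assumed import path

B, S, H, D = 2, 1024, 16, 128  # batch, sequence length, heads, head dim
q = torch.randn(B, S, H, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, S, H, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, S, H, D, dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # -> (B, S, H, D)
```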

Test plan

  • Review report accuracy against current source code
  • Verify all referenced API functions and source files exist

🤖 Generated with Claude Code

Document all attention variants (MHA, PA, MLA, Unified, Sparse, etc.)
with backend support matrices, data type coverage, decision trees for
choosing the right variant, and practical API examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sunway513 sunway513 merged commit f175623 into main Feb 7, 2026
@sunway513 sunway513 deleted the docs/attention-variants-guide branch February 22, 2026 03:52
@sunway513 sunway513 restored the docs/attention-variants-guide branch February 22, 2026 03:54
sunway513 pushed a commit that referenced this pull request Mar 22, 2026
Apply spatial stream-K style work allocation to leanAttention.
sunway513 added a commit that referenced this pull request Apr 30, 2026
Wrapper-level safety guard for the padded-softmax bug raised by Copilot
inline comment #2 on PR ROCm#2969. Padded K/V tokens produce QK^T = 0 but
exp(0) = 1 still contributes to the softmax denominator and silently
scales the output for non-causal attention. Causal mode masks padded
positions so it is unaffected.
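
The effect is easy to reproduce outside the kernel. The sketch below uses plain PyTorch softmax attention (not the flydsl kernel) to show how zero-padded K/V rows inflate the softmax denominator in non-causal mode:

```python
import torch

torch.manual_seed(0)
S_real, n_pad, d = 8, 8, 16  # 50% padding, the worst case measured below

q = torch.randn(1, S_real, d)
k = torch.randn(1, S_real, d)
v = torch.randn(1, S_real, d)

# Reference: non-causal attention over the real tokens only.
ref = torch.softmax(q @ k.transpose(-1, -2) / d**0.5, dim=-1) @ v

# Zero-pad K/V: each padded position scores QK^T = 0, so exp(0) = 1
# enters the softmax denominator and silently scales every output row.
k_pad = torch.cat([k, torch.zeros(1, n_pad, d)], dim=1)
v_pad = torch.cat([v, torch.zeros(1, n_pad, d)], dim=1)
out = torch.softmax(q @ k_pad.transpose(-1, -2) / d**0.5, dim=-1) @ v_pad

rel_err = ((out - ref).abs().max() / ref.abs().max()).item()
print(f"max rel err with 50% zero padding: {rel_err:.3f}")  # large, not noise
```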

Empirical RCA at aiter-forge-baselines/2969_padded_softmax_rca.md:
  - Wan2.1 production (S_real=32760, S_pad=32768, ratio=0.024%):
    cos_min 0.999992, max_abs 0.0008 — safe, indistinguishable from
    bf16 noise floor.
  - 50% padding worst case: rel_err 37.3%, max_abs 0.281 — silent
    output scaling, would corrupt downstream.

Implements option (d) from the RCA decision doc (signed off by Peng):
hybrid threshold. Non-causal calls with n_pad/seq_len_pad > 0.005 are
rejected with a ValueError that points the caller at the three valid
remediations (causal=True, pre-pad to multiple of 128, or use a
masking-aware kernel).
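
A minimal sketch of what such a hybrid-threshold guard can look like; the function name, signature, and error wording below are illustrative, not the actual aiter wrapper:

```python
PAD_RATIO_THRESHOLD = 0.005  # 0.5%: bf16 precision floor plus margin (see below)

def check_padding_safety(seq_len_real: int, seq_len_pad: int, causal: bool) -> None:
    """Hypothetical guard: reject non-causal calls with unsafe zero padding."""
    if causal:
        return  # causal masking already hides padded positions
    ratio = (seq_len_pad - seq_len_real) / seq_len_pad
    if ratio > PAD_RATIO_THRESHOLD:
        raise ValueError(
            f"padding ratio {ratio:.3%} exceeds the 0.5% safety threshold for "
            "non-causal attention; pass causal=True, pre-pad the sequence to a "
            "multiple of 128, or use a masking-aware kernel"
        )
```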

Threshold rationale: 0.5% is the bf16 mantissa precision floor (~0.4%,
7 mantissa bits) plus 1 bit of margin. Production Wan2.1 (0.024%)
clears it by 20x, so the hot path stays open while the silent-disaster
worst case is closed.
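
For reference, the numbers behind that rationale, using only the figures quoted above:

```python
bf16_floor = 2**-8               # 7 stored mantissa bits -> ~0.39% relative precision
wan21 = (32768 - 32760) / 32768  # production Wan2.1 padding ratio
print(f"floor={bf16_floor:.3%}  wan2.1={wan21:.3%}  clearance={0.005 / wan21:.1f}x")
# floor=0.391%  wan2.1=0.024%  clearance=20.5x
```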

Tests added (op_tests/flydsl_tests/test_flydsl_fmha.py):
  - test_flydsl_fmha_rejects_excessive_padding: B=1, S_real=129
    (S_pad=256, 49.6% pad), causal=False — must raise ValueError with
    "0.5% safety threshold" substring.
  - test_flydsl_fmha_allows_tight_padding: Wan2.1 case S_real=32760,
    causal=False — must succeed and match SDPA reference (cos_min
    >= 0.9999). Regression guard for the production hot path.
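
Against the hypothetical `check_padding_safety` guard sketched above (not the actual flydsl test harness), the two cases reduce to something like:

```python
import pytest

def test_rejects_excessive_padding():
    # B=1, S_real=129 pads to S_pad=256 (49.6% padding), non-causal: must raise.
    with pytest.raises(ValueError, match="0.5% safety threshold"):
        check_padding_safety(seq_len_real=129, seq_len_pad=256, causal=False)

def test_allows_tight_padding():
    # Wan2.1 hot path: S_real=32760, S_pad=32768 (0.024% padding): must pass.
    check_padding_safety(seq_len_real=32760, seq_len_pad=32768, causal=False)
```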

Validation on R9600D (gfx1201) inside wan-best container,
HIP_VISIBLE_DEVICES=4: 10 passed, 2 skipped (multi-GPU only).
black --check + ruff check both clean on touched files.

Kernel file aiter/ops/flydsl/kernels/flash_attn_func_gfx1201.py is
intentionally untouched — refactor is in a parallel branch.