Add MLA variants and backend guide #1

Merged: sunway513 merged 2 commits into main from docs/mla-kernel-support-report on Feb 7, 2026

sunway513 (Owner) commented Feb 7, 2026

Summary

  • Add comprehensive user-facing documentation for all MLA (Multi-head Latent Attention) variants in AITER
  • Cover Standard Decode, Persistent Decode, Standard Prefill, Persistent Prefill, Sparse MLA (Top-K), and Fused Operations (BMM+RoPE+Cache)
  • Include backend support matrices (ASM vs Triton), data type coverage, GQA ratio support, KV cache layouts, and RoPE handling

Highlights

  • Quick reference table helping users pick the right MLA variant for their use case
  • Decision tree for backend selection (prefill vs decode, persistent vs standard, sparse vs dense)
  • Data type matrices per variant and backend (BF16, FP8, FP4/MXFP4)
  • GQA ratio support table including ASM persistent mode's simulated ratios (32-112)
  • KV cache layout guide covering standard and 3-buffer FP8 layouts
  • Practical API examples for decode, persistent decode, prefill, sparse MLA, and fused cache operations
  • GPU architecture support summary (MI300X vs MI350 vs portable Triton)
  • Performance tuning guide with split-K auto-tuning details and key parameters

Test plan

  • Review report accuracy against current source code
  • Verify all referenced API functions, source files, and kernel configurations exist

🤖 Generated with Claude Code

sunway513 and others added 2 commits February 7, 2026 11:26
Document the current state of MLA kernel support across Triton and ASM
backends, covering precision, fusion levels, execution modes, GQA support,
KV cache layouts, and recommended areas for future development.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Restructure the MLA documentation to match the format of the attention
and MOE guides: quick reference table, per-variant sections with backend
matrices, practical API examples, decision tree, data type and GQA
matrices, fused operations catalog, GPU architecture summary, and full
source/test file references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunway513 changed the title from "Add MLA kernel support report: Triton vs ASM comparison" to "Add MLA variants and backend guide" on Feb 7, 2026
sunway513 merged commit f7908d6 into main on Feb 7, 2026
11 of 15 checks passed