
Add all remaining operator guides and complete README docs hub #4

Merged: sunway513 merged 6 commits into main from docs/gemm-quant-guides-and-readme on Feb 7, 2026

Conversation

sunway513 (Owner) commented on Feb 7, 2026

Summary

  • Add 6 new operator guides covering every operator in AITER
  • Update README.md with a complete documentation hub linking all 9 guides
  • Every operator in the Supported Operators table now links to its relevant guide

New Documentation (this PR)

GEMM Variants & Tuning Guide

  • A8W8, A16W16, A4W4, batched, DeepGEMM, Triton FFN fusions
  • Complete tuning system docs (CSV format, env vars, model configs)
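To make the A8W8 variant concrete, here is a minimal NumPy sketch of the reference math only, not the AITER kernels: activations carry per-token (per-row) int8 scales, weights carry per-channel (per-column) scales, the matmul accumulates in int32, and the result is dequantized by the outer product of the two scale vectors. The helper name `quant_per_row` is hypothetical.

```python
import numpy as np

def quant_per_row(x, axis):
    # Hypothetical helper: symmetric int8 quantization, one scale per slice.
    amax = np.abs(x).max(axis=axis, keepdims=True)
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def gemm_a8w8(a_q, b_q, a_scale, b_scale):
    # Accumulate in int32, then dequantize with the outer product of scales.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * b_scale)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16)).astype(np.float32)
b = rng.standard_normal((16, 8)).astype(np.float32)
a_q, a_s = quant_per_row(a, axis=1)   # per-token scales, shape (4, 1)
b_q, b_s = quant_per_row(b, axis=0)   # per-channel scales, shape (1, 8)
out = gemm_a8w8(a_q, b_q, a_s, b_s)   # approximates a @ b
```

Because each scale factors out of exactly one row or column, the dequantization is exact with respect to the quantized inputs; the only error is the int8 rounding itself.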

Quantization & Precision Guide

  • QuantType enum, per-tensor/token/block strategies
  • Fused quantization ops (FP8 + MXFP4), SmoothQuant, KV cache quant
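The per-tensor/token/block distinction is just the granularity at which scales are computed. A hedged NumPy sketch (int8 used as a stand-in for the FP8/INT8 paths; `int8_scales` and `roundtrip` are illustrative helpers, not AITER APIs):

```python
import numpy as np

def int8_scales(x, axis=None):
    # One symmetric int8 scale per slice along `axis` (None = per-tensor).
    amax = np.abs(x).max(axis=axis, keepdims=axis is not None)
    return np.maximum(amax, 1e-8) / 127.0

def roundtrip(x, s):
    # Quantize then dequantize, to measure the rounding error of a scheme.
    return np.clip(np.round(x / s), -127, 127) * s

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 128)).astype(np.float32)

s_tensor = int8_scales(x)            # scalar: one scale for the whole tensor
s_token = int8_scales(x, axis=1)     # (4, 1): one scale per row/token
# Per-block: one scale per contiguous group of 32 elements of the hidden dim.
s_block = int8_scales(x.reshape(4, 4, 32), axis=2)   # (4, 4, 1)

err_tensor = np.abs(roundtrip(x, s_tensor) - x).mean()
err_token = np.abs(roundtrip(x, s_token) - x).mean()
err_block = np.abs(
    roundtrip(x.reshape(4, 4, 32), s_block).reshape(4, 128) - x
).mean()
```

Finer granularity tracks local dynamic range, so block-scale error is typically below token-scale error, which is below tensor-scale error; the trade-off is more scale metadata to store and apply.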

Normalization Guide

  • RMSNorm, LayerNorm, GroupNorm with all fused variants
  • Add + SmoothQuant + Dynamic Quant fusions, backend dispatch logic
  • Fused QK norm + RoPE + cache + quant mega-kernels
  • Distributed RS + RMSNorm + Quant + AG fusion
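For reference, RMSNorm and the "add + norm" fusion reduce to a few lines of math. A NumPy sketch under the usual definitions (this is the unfused reference, not the fused kernels):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize each row by its root-mean-square, then apply a learned scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    # "Add + RMSNorm" fusion: the residual add and the normalization share
    # one pass over the data; returns (normalized output, updated residual).
    h = x + residual
    return rms_norm(h, weight, eps), h

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = rms_norm(x, np.ones(4))
```

The fused form matters because the updated residual `h` is needed by the next layer anyway; computing it inside the norm kernel saves a full read-write of the hidden states.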

RoPE (Rotary Position Embedding) Guide

  • SBHD, THD, 2D, 3D tensor formats
  • NeoX & GPT-J rotation styles, partial RoPE (nope_first)
  • 8 scaling methods (Linear, NTK, YaRN, Phi-3, DeepSeek, LLaMA3, MRoPE, DualChunk)
  • Autograd classes for training, fused operations
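The NeoX rotation style pairs element i of the first half of the head dim with element i of the second half (GPT-J instead interleaves adjacent even/odd pairs). A minimal NumPy sketch of the NeoX variant, assuming the standard inverse-frequency formulation:

```python
import numpy as np

def rope_neox(x, positions, base=10000.0):
    # NeoX-style RoPE: element i of the first half-dim is rotated against
    # element i of the second half-dim by a position-dependent angle.
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)   # (d/2,)
    theta = positions[:, None] * inv_freq[None, :]      # (seq, d/2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 8))                          # (seq, head_dim)
out = rope_neox(x, np.arange(4, dtype=np.float64))
```

Each pair undergoes a pure 2D rotation, so position 0 is the identity and vector norms are preserved; the scaling methods listed above (NTK, YaRN, etc.) all amount to different ways of remapping `positions` or `inv_freq` before this rotation.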

KV-Cache Management Guide

  • Paged, flash, ASM, and MLA cache layouts
  • Quantized cache (FP8, INT8, FP4) with per-token and per-block scales
  • Fused RoPE + cache write, fused BMM + RoPE + cache
  • Block swap/copy for beam search and speculative decoding
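Paged layouts address the cache through a block table that maps a sequence's logical block index to a physical block in a shared pool, so sequences can grow without contiguous allocation. An illustrative NumPy sketch (function names are hypothetical, not the AITER API):

```python
import numpy as np

def write_to_paged_cache(cache, block_table, token_pos, vec):
    # cache: (num_blocks, block_size, head_dim) shared physical pool.
    # block_table maps a sequence's logical block index -> physical block id.
    block_size = cache.shape[1]
    cache[block_table[token_pos // block_size], token_pos % block_size] = vec

def read_from_paged_cache(cache, block_table, token_pos):
    block_size = cache.shape[1]
    return cache[block_table[token_pos // block_size], token_pos % block_size]

cache = np.zeros((8, 4, 2), dtype=np.float32)   # 8 physical blocks, 4 slots each
block_table = np.array([5, 2, 7])               # logical blocks 0..2 of one sequence
write_to_paged_cache(cache, block_table, 6, np.array([1.0, 2.0]))
```

Here token position 6 lands in logical block 1, physical block 2, slot 2. Block swap/copy for beam search is then just copying or remapping rows of the block table and pool.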

Elementwise & Activation Guide

  • SiLU/GELU/sigmoid/tanh activations
  • SwiGLU/GeGLU gating (the standard LLM FFN pattern)
  • Fused activation + quantize (FP8 group, MXFP4 block-scale)
  • Binary arithmetic with broadcasting, fused mul-add
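SwiGLU gating is the pattern down(silu(x·W_gate) ⊙ (x·W_up)). A small NumPy sketch of the unfused math (AITER's fused kernels compute the same thing in one pass; weight names here are illustrative):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: the activated gate projection multiplies the up projection
    # elementwise before the down projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 16))
w_gate = rng.standard_normal((16, 64))
w_up = rng.standard_normal((16, 64))
w_down = rng.standard_normal((64, 16))
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

GeGLU is the same structure with GELU in place of SiLU; the fused activation + quantize ops additionally emit FP8/MXFP4 outputs for the following GEMM.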

README Updates

  • Documentation table expanded from 5 to 9 guides
  • All 13 operators in the table now link to their guide

Test plan

  • Verify all doc links in README resolve correctly
  • Review each guide for accuracy against source code
  • Confirm formatting renders properly on GitHub

🤖 Generated with Claude Code

sunway513 and others added 2 commits February 7, 2026 16:10
New operator guides covering GEMM variants (A8W8, A16W16, A4W4, batched,
DeepGEMM, Triton FFN, tuning system) and Quantization strategies (QuantType,
per-tensor/token/block, fused ops, FP8/MXFP4/INT4, SmoothQuant).

README now features a Documentation section linking all five operator guides
and the operator table links each op to its relevant guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

New operator guides:
- Normalization: RMSNorm, LayerNorm, GroupNorm with all fused variants
  (add, SmoothQuant, dynamic quant), distributed fusion, backend dispatch
- RoPE: SBHD/THD/2D/3D formats, NeoX & GPT-J styles, 8 scaling methods,
  autograd classes, fused QK norm + RoPE + cache + quant
- KV-Cache: Paged/flash/ASM/MLA layouts, quantized cache (FP8/INT8/FP4),
  fused RoPE + cache write, block swap/copy management
- Elementwise & Activations: SiLU/GELU/sigmoid/tanh, SwiGLU gating,
  fused activation + quantize (FP8/MXFP4), binary arithmetic

README now links all 9 operator guides and every operator in the table
has a link to its relevant guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunway513 changed the title from "Add GEMM & Quantization guides, update README docs hub" to "Add all remaining operator guides and complete README docs hub" on Feb 7, 2026
sunway513 and others added 4 commits February 7, 2026 16:35
Merge duplicate rows (MHA+PA, RMSNorm+LayerNorm, Elementwise+Sigmoid)
so each operator row maps to exactly one guide. Promote Communication
into the operator table and eliminate the separate Documentation section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move operator table above the guide description note, separate test
instruction into its own line with generic example, and promote
Additional Resources to a proper section header.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document GPU_ARCHS, PREBUILD_KERNELS levels (0-3), and MAX_JOBS.
Reorganize Installation into subsections: Development Mode (JIT),
Precompiled Kernels, and Triton Communication Support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain what it is (multi-GPU reduce-scatter/all-gather via Iris),
mark as optional, and link to the full guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunway513 merged commit f60774b into main on Feb 7, 2026
11 of 15 checks passed