Merged
Changes from all commits (48 commits)
09db456
feat(models): add DeepSeek-V4 PR1 skeleton with bit-exact reference p…
valarLip Apr 24, 2026
880c410
feat(quant_v4): add FP4 e2m1 -> BF16 dequant for V4 expert weights
valarLip Apr 24, 2026
5838ee1
refactor(deepseek_v4): swap BF16 projections to ATOM TP linear classes
valarLip Apr 24, 2026
a5cddc9
feat(deepseek_v4): wire QuantizationConfig + implement load_weights()…
valarLip Apr 24, 2026
82a7c92
refactor(deepseek_v4): switch forward to ATOM 2D flat-token convention
valarLip Apr 25, 2026
7070db8
feat(deepseek_v4): swap MoE to FusedMoE for 384-expert TP/EP loading
valarLip Apr 25, 2026
62b78a3
fix(deepseek_v4): V4QuantConfig now matches FusedMoE's bare 'experts'…
valarLip Apr 25, 2026
4492693
fix(deepseek_v4): correct FusedMoE expert weight + scale + bias dispatch
valarLip Apr 25, 2026
3c37b76
feat(deepseek_v4): wire hash routing for first 3 layers via custom_ro…
valarLip Apr 25, 2026
70b9159
feat(deepseek_v4): full Block.forward (attn + FusedMoE) end-to-end on…
valarLip Apr 25, 2026
2ff14f7
feat(deepseek_v4): standard ATOM loader (load_model) now handles V4 c…
valarLip Apr 25, 2026
9861dad
fix(deepseek_v4): wo_a FP8 dequant via process_weights_after_loading …
valarLip Apr 25, 2026
cdbff35
feat(deepseek_v4): end-to-end inference with triton MoE and swiglu_limit
valarLip Apr 25, 2026
af17eb8
fix(deepseek_v4): apply swiglu_limit to shared_experts (upstream a1fd…
valarLip Apr 28, 2026
8ed9269
refactor(deepseek_v4): wire positions tensor through forward chain; s…
valarLip Apr 28, 2026
1e77a70
refactor: delegate ATOM KV cache subsystem to attention builders
valarLip Apr 28, 2026
90d806e
style: black format block_manager.py
valarLip Apr 28, 2026
0599729
merge: pull per_req_cache abstraction (PR #659) into V4 branch for PR…
valarLip Apr 28, 2026
84e0a05
feat(deepseek_v4): per_req_cache abstraction (pre2a + pre2c-A)
valarLip Apr 29, 2026
047ee11
feat(deepseek_v4): classical KV cache via block_table (pre2c-B)
valarLip Apr 29, 2026
aebe2ff
feat(deepseek_v4): multi-sequence forward dispatch (PR3-main)
valarLip Apr 29, 2026
b436e1e
fix(deepseek_v4): correct ue8m0 input quant + MoE routing scale
valarLip Apr 30, 2026
3870756
feat(debug_helper): generic env-gated dump / compare / ref-patch + V4…
valarLip Apr 30, 2026
1bce8c4
fix(weight-loading): bidirectional coverage check + V4 hash-layer bias
valarLip Apr 30, 2026
a709564
feat(deepseek_v4): pos%(2*ratio) ring buffer for Compressor state cache
valarLip Apr 30, 2026
61c3e82
feat(deepseek_v4): fused_compress_attn kernel + start_pos-free interface
valarLip May 1, 2026
b786769
feat(v4): replace weight-free RMSNorm with fused Triton, ~1.6% TTFT i…
ZhangLirong-amd May 1, 2026
1890dc0
feat(deepseek_v4): use triton sparse attn kernel and move attn kernel…
junhaha666 May 1, 2026
9703f73
fix(sparse_attn_v4): BLOCK_H=16 for ROCm MFMA lowering
valarLip May 1, 2026
352338d
feat(deepseek_v4): SGLang-style packed plan tensors for batched compr…
valarLip May 1, 2026
c6b34f8
feat(deepseek_v4): batch state-cache reset/write/topk (Phase 1+2a+2c)
valarLip May 2, 2026
2249aae
feat(deepseek_v4): hoist Indexer Compressor out of dispatch loop (Pha…
valarLip May 2, 2026
caed0c7
feat(deepseek_v4): use fp8_mqa_logits in Indexer score+topk (Phase 2b…
valarLip May 2, 2026
0eb32a9
feat(deepseek_v4): Phase 3 hoist per-fwd metadata + comprehensive cle…
valarLip May 2, 2026
14d65c8
feat(deepseek_v4): CG-A pre-allocate metadata buffers (CUDAGraph prep)
valarLip May 2, 2026
6a0ebb9
feat(deepseek_v4): CG-B CUDAGraph capture infrastructure
valarLip May 2, 2026
5412d9d
refactor(deepseek_v4): linear fusions, MoE cleanup, shape contracts, …
valarLip May 3, 2026
8ab1367
feat(deepseek_v4): FP8 CSA Indexer cache (-44% pool VRAM)
valarLip May 3, 2026
00834f6
feat(deepseek_v4): CG-friendly indexer Phase A — preshuffle + decode→…
valarLip May 3, 2026
55be12a
feat(deepseek_v4): adopt aiter top_k_per_row in indexer prefill+decode
valarLip May 3, 2026
cb7f84f
feat(deepseek_v4): CUDAGraph-friendly sparse decode via unified KV pool
valarLip May 4, 2026
28868b4
Merge origin/main into feat/deepseek-v4-pr1-skeleton
valarLip May 4, 2026
e6d93f4
fix(deepseek_v4): drop unused state_slot_per_seq local (ruff F841)
valarLip May 4, 2026
273cfac
feat(deepseek_v4): fuse Phase C CSA translate+pack into one triton ke…
valarLip May 5, 2026
3eb9861
refactor(deepseek_v4): two-source paged prefill + cleanup pass
valarLip May 5, 2026
34dde37
feat(deepseek_v4): fuse inverse RoPE into single Triton kernel 60us->…
ZhangLirong-amd May 6, 2026
3d38bdf
feat(deepseek_v4): 1. use rope_rotate_activation instead of rotary_em…
junhaha666 May 6, 2026
c3ec204
feat(deepseek_v4): per-PR accuracy + nightly bench CI + diagnostics
valarLip May 6, 2026
149 changes: 149 additions & 0 deletions .claude/skills/atom-patterns.md
@@ -0,0 +1,149 @@
---
name: atom-patterns
description: Coding patterns and architecture index for the ATOM LLM inference engine
version: 1.1.0
source: local-git-analysis
---

# ATOM Patterns

## Code Architecture

```
atom/
├── config.py # Config, QuantizationConfig, HF config loading
├── entrypoints/ # Server entry (openai_server.py)
├── examples/ # simple_inference.py (offline smoke test)
├── model_engine/ # Core engine pipeline
│ ├── llm_engine.py # Top-level engine
│ ├── engine_core.py # Per-DP-rank loop
│ ├── scheduler.py # Batch scheduling
│ └── model_runner.py # Forward pass, CUDAGraph, KV cache binding
├── model_loader/
│ └── loader.py # Weight loading (safetensors, FP8/FP4, WeightsMapper)
├── model_ops/ # AITER kernel wrappers
│ ├── linear.py # LinearBase, ColumnParallel, RowParallel
│ ├── moe.py # FusedMoE, Mxfp4MoEMethod, weight_loader
│ ├── fused_moe_triton.py # Triton matmul_ogs MoE path
│ ├── attention_mla.py # MLA attention (DeepSeek)
│ ├── attention_mha.py # Standard MHA attention
│ └── paged_attention.py # Paged attention backend
├── models/ # Model implementations
│ ├── deepseek_v2.py # DeepSeek V3/V3.2/GLM-5 (shared)
│ ├── deepseek_v4.py # DeepSeek V4 (HC, sparse attn, FP4 MoE)
│ ├── deepseek_mtp.py # DeepSeek MTP (speculative)
│ ├── llama.py # Llama family
│ └── qwen3*.py # Qwen3 variants
├── spec_decode/
│ └── eagle.py # MTP proposer (speculative decoding)
├── plugin/ # vLLM/SGLang plugin adapters
└── utils/
├── envs.py # All ATOM_* env var definitions
└── forward_context.py # Module-level forward context
```

## Model Implementation Pattern

### Adding a New Model

Every model class follows this contract:

```python
class NewModelForCausalLM(nn.Module):
    # Weight loading config (class-level)
    packed_modules_mapping = { ... }
    weights_mapping = { ... }

    def __init__(self, config: Config, prefix: str = ""):
        ...

    def forward(self, input_ids, positions, intermediate_tensors=None, inputs_embeds=None):
        return hidden_states  # or logits

    def compute_logits(self, hidden_states):
        return self.lm_head(hidden_states)
```

Registration in `model_runner.py`:
```python
support_model_arch_dict = {
    "NewModelForCausalLM": ("new_model", "NewModelForCausalLM"),
}
```

### Model Reuse Relationships

- `deepseek_v2.py` ← DeepSeek V3, V3.2, GLM-5
- `deepseek_v4.py` ← DeepSeek V4 (standalone, uses HC + sparse attn)
- `deepseek_mtp.py` ← DeepSeek MTP models
- `qwen3_5_mtp.py` ← Qwen3.5 MTP (hybrid GDN + full attention)

### TP Parallel Linear Pattern

- `ColumnParallelLinear`: shards output dim, no all-reduce needed
- `RowParallelLinear`: shards input dim, all-reduce on output (`reduce_results=True`)
- `ReplicatedLinear`: full copy on each rank (gates, small projections)

MoE pattern: FusedMoE + shared_experts both use `reduce_results=False`, parent does one all-reduce.
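
A minimal sketch of that single-all-reduce layout, assuming a plain `torch.distributed` process group; the `MoELayer` wrapper and its constructor are hypothetical, and only the `reduce_results=False` convention comes from the pattern above:

```python
# Sketch only: wrapper class is hypothetical; the reduce_results=False
# contract for FusedMoE / shared_experts is from the pattern above.
import torch
import torch.nn as nn
import torch.distributed as dist


class MoELayer(nn.Module):
    def __init__(self, experts: nn.Module, shared_experts: nn.Module):
        super().__init__()
        # Both submodules are built with reduce_results=False, so each
        # returns a rank-local partial sum instead of all-reducing itself.
        self.experts = experts
        self.shared_experts = shared_experts

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        out = self.experts(hidden_states) + self.shared_experts(hidden_states)
        # One all-reduce covering both paths: halves the communication
        # versus letting each submodule reduce independently.
        if dist.is_initialized() and dist.get_world_size() > 1:
            dist.all_reduce(out)
        return out
```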

## Workflows

### Adding a Model (file co-change pattern)

1. `atom/models/new_model.py` — Model implementation
2. `atom/model_engine/model_runner.py` — Register in `support_model_arch_dict`
3. `atom/config.py` — Add to `_CONFIG_REGISTRY` if config schema differs
4. `.github/benchmark/models_accuracy.json` — CI accuracy test entry
5. `recipes/` — Usage recipe

### Bug Fix Workflow

1. Identify bug via activation dump / per-layer comparison
2. Fix in model file
3. `grep` same pattern across codebase (fix-then-sweep)
4. Verify with `simple_inference.py` smoke test
5. Run `lm_eval` for accuracy regression

### FP8/FP4 Weight Loading

- Checkpoint weights: `weight` (FP8/FP4 packed) + `weight.scale` (E8M0 block scale)
- ATOM renames `.scale` → `.weight_scale_inv` → `.weight_scale` (auto-rename in loader; see the sketch after this list)
- `process_weights_after_loading()` hook: shuffle weights for CK kernel layout
- FP4 expert weights: `Mxfp4MoEMethod.create_weights()` + `mxf4_merged_weight_loader()`
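
A hedged sketch of that rename chain, assuming it can be expressed as pure key-string rewriting; the helper name `rename_scale_key` is hypothetical, and the real logic lives in `atom/model_loader/loader.py`:

```python
# Illustrative only: the real auto-rename lives in atom/model_loader/loader.py.
def rename_scale_key(ckpt_key: str) -> str:
    # Checkpoints ship "<param>.weight.scale"; ATOM modules register the
    # parameter as "<param>.weight_scale", via the "weight_scale_inv" name.
    if ckpt_key.endswith(".weight.scale"):
        ckpt_key = ckpt_key[: -len(".weight.scale")] + ".weight_scale_inv"
    if ckpt_key.endswith(".weight_scale_inv"):
        ckpt_key = ckpt_key[: -len(".weight_scale_inv")] + ".weight_scale"
    return ckpt_key


assert (
    rename_scale_key("model.layers.0.mlp.down_proj.weight.scale")
    == "model.layers.0.mlp.down_proj.weight_scale"
)
```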

### Debug Instrumentation Rules

- **NEVER modify `@support_torch_compile` decorated models** (breaks Dynamo)
- Put debug code in `forward()` (has `@torch.inference_mode()`), NOT in `run_model()`
- Gate debug prints with env vars (e.g., `ATOM_V4_DIAG=1`); see the sketch after this list
- Use `--level 0 --enforce-eager` to disable both torch.compile and CUDAGraph
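
A minimal sketch of those rules in practice; the stand-in block below is hypothetical, and only the `ATOM_V4_DIAG` gate and the forward()-only placement come from this doc:

```python
# Env-gated dump placed in forward(); DiagBlock is a hypothetical stand-in.
import os

import torch
import torch.nn as nn

ATOM_V4_DIAG = os.getenv("ATOM_V4_DIAG", "0") == "1"


class DiagBlock(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    @torch.inference_mode()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.proj(x)
        if ATOM_V4_DIAG:
            # Dump stays off the hot path unless explicitly enabled,
            # and lives in forward(), never in run_model().
            torch.save(out.float().cpu(), "/tmp/v4_diag_block_out.pt")
        return out
```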

## Testing Patterns

- Test location: `tests/` directory at repo root
- Framework: pytest
- No GPU needed: tests mock AITER and `torch.cuda` (see the sketch after this list)
- Naming: `test_<module>.py` (e.g., `test_scheduler.py`, `test_block_manager.py`)
- Smoke test: `python -m atom.examples.simple_inference --model <path> --kv_cache_dtype fp8`
- Accuracy: `lm_eval` with gsm8k (CI threshold != actual baseline)
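
A hypothetical pytest sketch of that GPU-free setup; the fixture and test names are illustrative, not copied from `tests/`:

```python
# Stub the aiter module and torch.cuda before exercising engine code.
import sys
import types

import pytest
import torch


@pytest.fixture
def cpu_only(monkeypatch):
    # Insert a fake aiter so imports succeed without ROCm kernels installed.
    monkeypatch.setitem(sys.modules, "aiter", types.ModuleType("aiter"))
    # Make CUDA queries safe on CPU-only CI runners.
    monkeypatch.setattr(torch.cuda, "is_available", lambda: False)


def test_runs_without_gpu(cpu_only):
    import aiter  # resolves to the stub installed by the fixture

    assert isinstance(aiter, types.ModuleType)
    assert not torch.cuda.is_available()
```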

## Environment Variables

All defined in `atom/utils/envs.py` as lazy lambdas:

```python
"ATOM_USE_TRITON_GEMM": lambda: os.getenv("ATOM_USE_TRITON_GEMM", "0") == "1",
```
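
A sketch of how such a lazy registry is commonly exposed via a module-level `__getattr__` (PEP 562, vLLM-style); this is an assumption about the surrounding module shape, not a copy of `envs.py`:

```python
# Assumed module shape around the lazy-lambda registry; envs.py may differ.
import os

environment_variables = {
    "ATOM_USE_TRITON_GEMM": lambda: os.getenv("ATOM_USE_TRITON_GEMM", "0") == "1",
    "ATOM_V4_DIAG": lambda: os.getenv("ATOM_V4_DIAG", "0") == "1",
}


def __getattr__(name: str):
    # Evaluated on every attribute access, so os.environ changes are seen
    # without re-importing the module.
    if name in environment_variables:
        return environment_variables[name]()
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```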

Key vars:
- `AITER_LOG_LEVEL=WARNING` — suppress kernel log flooding
- `ATOM_USE_TRITON_MOE=1` — triton MoE for V4
- `ATOM_V4_TORCH_MOE=1` — torch fallback MoE for V4
- `ATOM_V4_DIAG=1` — V4 diagnostic prints

## CI/CD

- Accuracy tests: `.github/benchmark/models_accuracy.json` (model matrix)
- Benchmark: `.github/benchmark/models.json`
- Dashboard: `.github/dashboard/index.html` (gh-pages)
- Docker: `docker login --password-stdin`, `checkout@v6`, `upload-artifact@v7`