diff --git a/NVFP4_KVCACHE_DESIGN_SUMMARY.md b/NVFP4_KVCACHE_DESIGN_SUMMARY.md
new file mode 100644
index 000000000000..89976f8f3af8
--- /dev/null
+++ b/NVFP4_KVCACHE_DESIGN_SUMMARY.md
@@ -0,0 +1,629 @@
+# NVFP4 KVCache Support in SGLang - Design and Implementation Summary
+
+## Executive Summary
+
+NVFP4 (NVIDIA FP4) KVCache support in SGLang is a memory optimization that caches approximately **3.56× more tokens** than BF16, using 4-bit floating-point quantization with block-based microscaling (MXFP4-style format). The implementation is optimized for the NVIDIA Blackwell architecture (B200 GPUs) and supports both Multi-Head Attention (MHA) and Multi-Head Latent Attention (MLA) models.
+
+## 1. Overview
+
+### 1.1 What is NVFP4 KVCache?
+
+NVFP4 KVCache is a quantized key-value cache storage format based on the OCP (Open Compute Project) MXFP4 (Microscaling FP4) standard:
+- **Format**: E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit)
+- **Block Size**: 16 elements per block (SGLang's implementation)
+- **Scaling**: Each block of 16 elements shares a single 8-bit power-of-two scale factor
+- **Memory Savings**: ~3.56× more tokens than BF16, ~1.78× more than FP8
+
+### 1.2 Key Benefits
+
+1. **Memory Efficiency**: Significantly reduces the KV cache memory footprint
+2. **Throughput**: Enables longer context lengths or more concurrent requests
+3. **Dynamic Scaling**: Scale factors are computed on-the-fly (no pre-quantization step)
+4. **Hardware Optimized**: Specifically tuned for NVIDIA Blackwell (B200) GPUs
+
+## 2. Technical Design
+
+### 2.1 Quantization Format (E2M1)
+
+The E2M1 format represents values using:
+```
+E2M1_VALUES = [0, 0.5, 1, 1.5, 2, 3, 4, 6]  # 8 possible magnitudes
+E2M1_BOUNDS = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5]  # Rounding boundaries
+E2M1_MAX = 6.0  # Maximum representable magnitude
+```
+
+Each value requires 4 bits:
+- 1 bit for sign
+- 3 bits for magnitude (8 values, indices 0-7)
+
+### 2.2 Block-Based Microscaling
+
+**Key Characteristics:**
+- Tensors are divided into blocks of 16 consecutive elements
+- Each block shares one 8-bit power-of-two scale factor
+- Scale exponent: `scale_exp = ceil(log2(max(abs(block)) / 6.0))`
+- Stored biased as: `scale_factor_uint8 = scale_exp + 127`
+
+**Memory Layout:**
+```
+Original Tensor:  [B, M, N]    (BF16)
+    ↓
+Quantized Tensor: [B, M, N/2]  (uint8, two values packed per byte)
+Scale Factors:    [B, M*N/16]  (uint8, one per block)
+```
+
+### 2.3 Packing Strategy
+
+Two FP4 values are packed into one uint8:
+```python
+# Packing
+packed = (high_nibble << 4) | low_nibble
+
+# Unpacking
+low_nibble = packed & 0x0F
+high_nibble = (packed >> 4) & 0x0F
+```
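+
+To make the format concrete, here is a minimal, self-contained sketch that quantizes a single 16-element block end to end — shared scale, E2M1 rounding, and nibble packing. It is illustrative only; `quantize_block` is not part of SGLang (the real path is the batched utility in Section 3.1.1):
+
+```python
+import torch
+
+E2M1_VALUES = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6])
+E2M1_BOUNDS = torch.tensor([0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5])
+
+def quantize_block(block: torch.Tensor) -> tuple[torch.Tensor, int]:
+    """Quantize one block of 16 floats to 8 packed E2M1 bytes + a biased scale byte."""
+    assert block.numel() == 16
+    # Shared power-of-two scale so the largest |value| lands in [0, 6].
+    scale_exp = int(torch.ceil(torch.log2(block.abs().max().clamp(min=1e-10) / 6.0)))
+    scaled = block / 2.0**scale_exp
+    # Magnitude index = number of rounding boundaries each |value| crosses.
+    magnitude = (scaled.abs().unsqueeze(-1) >= E2M1_BOUNDS).sum(-1).to(torch.uint8)
+    nibbles = magnitude | ((scaled < 0).to(torch.uint8) << 3)  # bit 3 = sign
+    packed = (nibbles[1::2] << 4) | nibbles[0::2]              # two values per byte
+    return packed, scale_exp + 127
+
+packed, scale_byte = quantize_block(torch.randn(16))
+print(packed.shape, scale_byte)  # torch.Size([8]), biased exponent
+```
+
+Dequantization reverses the steps: unpack the nibbles, look up `E2M1_VALUES`, reapply the sign, and multiply by `2.0 ** (scale_byte - 127)`.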
+## 3. Implementation Architecture
+
+### 3.1 Core Components
+
+#### 3.1.1 KVFP4QuantizeUtil Class
+Location: `python/sglang/srt/layers/quantization/kvfp4_tensor.py`
+
+**Key Methods:**
+```python
+class KVFP4QuantizeUtil:
+    @staticmethod
+    @torch.compile
+    def batched_quantize(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        """
+        Quantize a tensor to KVFP4 format.
+        Args:
+            tensor: [B, M, N] input tensor
+        Returns:
+            quant_tensor: [B, M, N/2] packed uint8 tensor
+            scale_factors: [B, M*N/16] uint8 scale factors
+        """
+        # 1. Reshape to [B, M*N/16, 16] for block-wise quantization
+        # 2. Compute per-block scale factors
+        # 3. Apply scaling
+        # 4. Quantize to FP4 (E2M1)
+        # 5. Pack two FP4 values into one uint8
+
+    @staticmethod
+    @torch.compile
+    def batched_dequantize(
+        quant_tensor: torch.Tensor,
+        scale_factors: torch.Tensor,
+        dtype: torch.dtype = torch.bfloat16,
+    ) -> torch.Tensor:
+        """
+        Dequantize a KVFP4 tensor.
+        Args:
+            quant_tensor: [B, M, N/2] packed uint8 tensor
+            scale_factors: [B, M*N/16] uint8 scale factors
+        Returns:
+            Dequantized tensor: [B, M, N]
+        """
+        # 1. Unpack each uint8 into two FP4 values
+        # 2. Extract sign and magnitude
+        # 3. Convert to float values via the E2M1_VALUES lookup
+        # 4. Reshape for block-wise scaling
+        # 5. Apply scale factors
+```
+
+**Performance Optimizations:**
+- Uses `@torch.compile` for JIT compilation and kernel fusion
+- Pure tensor operations (CUDA Graph safe)
+- Efficient bit operations for packing/unpacking
+- Vectorized lookups using tensor indexing
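+
+Assuming the import path and signatures above, a round trip looks like this (a usage sketch; shapes follow the docstrings, and the error figure is only indicative):
+
+```python
+import torch
+from sglang.srt.layers.quantization.kvfp4_tensor import KVFP4QuantizeUtil
+
+kv = torch.randn(2, 128, 512, dtype=torch.bfloat16, device="cuda")  # [B, M, N]
+
+quant, scales = KVFP4QuantizeUtil.batched_quantize(kv)
+assert quant.shape == (2, 128, 256)          # N/2 packed uint8 bytes
+assert scales.shape == (2, 128 * 512 // 16)  # one scale byte per 16 elements
+
+recovered = KVFP4QuantizeUtil.batched_dequantize(quant, scales, dtype=torch.bfloat16)
+rel_err = (recovered.float() - kv.float()).abs().sum() / kv.float().abs().sum()
+print(f"mean relative error: {rel_err.item():.3f}")  # typically a few percent for Gaussian data
+```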
+#### 3.1.2 Memory Pool Integration
+Location: `python/sglang/srt/mem_cache/memory_pool.py`
+
+**Integration Points:**
+
+1. **Buffer Storage:**
+```python
+# Separate buffers for quantized data and scale factors
+self.k_buffer[layer_id]        # Quantized K cache
+self.k_scale_buffer[layer_id]  # K scale factors
+self.v_buffer[layer_id]        # Quantized V cache
+self.v_scale_buffer[layer_id]  # V scale factors
+```
+
+2. **Write Path (set_kv_buffer):**
+```python
+def set_kv_buffer(self, layer, loc, cache_k, cache_v, k_scale=None, v_scale=None):
+    layer_id = layer.layer_id
+
+    # 1. Apply checkpoint-provided K/V scale factors, if any
+    if k_scale is not None:
+        cache_k.div_(k_scale)
+
+    # 2. Quantize to FP4
+    cache_k_fp4, cache_k_fp4_sf = KVFP4QuantizeUtil.batched_quantize(cache_k)
+    cache_v_fp4, cache_v_fp4_sf = KVFP4QuantizeUtil.batched_quantize(cache_v)
+
+    # 3. Store quantized values and scale factors in the pool buffers
+    self.k_buffer[layer_id][loc] = cache_k_fp4.view(self.store_dtype)
+    self.k_scale_buffer[layer_id][loc] = cache_k_fp4_sf.view(self.store_dtype)
+```
+
+3. **Read Path (_get_key_buffer):**
+```python
+def _get_key_buffer(self, layer_id):
+    if self.store_dtype != self.dtype:
+        cache_k_fp4 = self.k_buffer[layer_id].view(torch.uint8)
+        cache_k_fp4_sf = self.k_scale_buffer[layer_id]
+
+        # Dequantize on-the-fly
+        cache_k_dequant = KVFP4QuantizeUtil.batched_dequantize(
+            cache_k_fp4, cache_k_fp4_sf
+        )
+        return cache_k_dequant
+```
+
+#### 3.1.3 MLA-Specific Support
+
+For Multi-Head Latent Attention (DeepSeek models), the nope and rope components are quantized separately:
+```python
+def set_mla_kv_buffer(self, layer, loc, cache_k_nope, cache_k_rope):
+    # Separate quantization for the nope and rope components
+    cache_k_nope_fp4, cache_k_nope_fp4_sf = KVFP4QuantizeUtil.batched_quantize(cache_k_nope)
+    cache_k_rope_fp4, cache_k_rope_fp4_sf = KVFP4QuantizeUtil.batched_quantize(cache_k_rope)
+
+    # Store both components along with their scale factors
+```
+
+### 3.2 Attention Backend Support
+
+#### Support Matrix
+
+**MHA Backends (Standard Attention):**
+- ✅ **FA4 (FlashAttention 4)**: Full support with page_size=128
+- ✅ **Triton**: Full support
+- ✅ **Torch Native (SDPA)**: Full support
+- ✅ **FlexAttention**: Full support
+- ✅ **TRTLLM MHA**: Full support with page_size ∈ {16, 32, 64}
+- ❌ **FlashInfer**: Not supported
+- ❌ **FA3**: Not supported
+
+**MLA Backends (Multi-Head Latent Attention):**
+- ✅ **FlashInfer MLA**: Full support with page_size=1
+- ✅ **FlashMLA**: Full support with page_size=64
+- ✅ **Cutlass MLA**: Full support with page_size=128
+- ✅ **TRTLLM MLA (Blackwell)**: Full support with page_size ∈ {32, 64}
+- ✅ **FA4**: Full support with page_size=1
+- ❌ **FA3**: Not supported
+- ❌ **Triton**: Not supported
+
+#### Hybrid Attention Support
+
+SGLang supports mixing different backends for prefill and decode:
+```bash
+# Example: FA4 for prefill, TRTLLM MLA for decode
+python3 -m sglang.launch_server \
+    --model-path nvidia/DeepSeek-R1-FP4 \
+    --tp 8 \
+    --attention-backend trtllm_mla \
+    --prefill-attention-backend fa4 \
+    --kv-cache-dtype fp4_e2m1
+```
+
+### 3.3 CUDA Graph Compatibility
+
+The implementation is fully CUDA Graph compatible:
+- All tensor operations are pure (no data-dependent Python control flow)
+- Uses `@torch.compile` for kernel fusion
+- Memory copies are overlapped with an alternate stream during capture:
+```python
+if get_is_capture_mode() and self.alt_stream is not None:
+    current_stream = self.device_module.current_stream()
+    self.alt_stream.wait_stream(current_stream)
+    # Perform quantization and copy in alt_stream
+```
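+
+The capture-mode snippet above elides the work submitted to the side stream. As a standalone sketch of the general overlap pattern (plain PyTorch, not the actual memory-pool code — `quantize_and_store` stands in for the quantize-and-scatter body):
+
+```python
+import torch
+
+alt_stream = torch.cuda.Stream()
+
+def quantize_and_store(k_buf, loc, cache_k):
+    # Stand-in for FP4 quantization + scatter into the pool buffers.
+    k_buf[loc] = cache_k
+
+def overlapped_write(k_buf, loc, cache_k):
+    current = torch.cuda.current_stream()
+    # The side stream must see everything already queued on the main stream.
+    alt_stream.wait_stream(current)
+    with torch.cuda.stream(alt_stream):
+        quantize_and_store(k_buf, loc, cache_k)
+    # The main stream must not read the cache before the side stream finishes.
+    current.wait_stream(alt_stream)
+```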
+## 4. Usage Guide
+
+### 4.1 Server Arguments
+
+**Basic Usage:**
+```bash
+# Enable FP4 KV cache
+python3 -m sglang.launch_server \
+    --model-path nvidia/DeepSeek-R1-0528-NVFP4 \
+    --kv-cache-dtype fp4_e2m1
+```
+
+**With Quantized Weights (NVFP4 Checkpoint):**
+```bash
+python -m sglang.launch_server \
+    --model nvidia/DeepSeek-V3.2-NVFP4 \
+    --tp 4 \
+    --quantization modelopt_fp4 \
+    --moe-runner-backend flashinfer_trtllm \
+    --kv-cache-dtype fp4_e2m1
+```
+
+**Advanced Configuration:**
+```bash
+python3 -m sglang.launch_server \
+    --model-path nvidia/DeepSeek-R1-0528-NVFP4 \
+    --tp 8 \
+    --attention-backend trtllm_mla \
+    --kv-cache-dtype fp4_e2m1 \
+    --quantization modelopt_fp4 \
+    --moe-runner-backend flashinfer_trtllm
+```
+
+### 4.2 Hardware Requirements
+
+**Minimum Requirements:**
+- CUDA 12.8+ (for NVFP4 support)
+- PyTorch 2.8.0+
+- NVIDIA Blackwell GPUs (B200) recommended
+
+**Supported Configurations:**
+| Model | Weight Type | Configuration |
+|-------|-------------|---------------|
+| DeepSeek-R1-0528 | NVFP4 | 8 × B200 or 4 × B200 |
+| DeepSeek-V3.2 | NVFP4 | 8 × B200 or 4 × B200 |
+
+### 4.3 No Pre-Quantization Required
+
+Unlike FP8 KVCache, FP4 requires no external scaling factors:
+- **FP8**: Requires `k_scale` and `v_scale` from a checkpoint or JSON file
+- **FP4**: Scaling factors are computed automatically during quantization
+
+```bash
+# FP8 requires pre-computed scales
+--quantization-param-path kv_scales.json
+
+# FP4 works out of the box (no external files needed)
+--kv-cache-dtype fp4_e2m1
+```
+
+## 5. Performance Characteristics
+
+### 5.1 Memory Savings
+
+**Theoretical Savings:**
+```
+BF16: 16 bits/value
+FP8:  8 bits/value (~2.0× savings)
+FP4:  4 bits/value + 0.5 bits/value (scale overhead)
+      = 4.5 bits/value (~3.56× savings vs BF16, ~1.78× vs FP8)
+```
+
+**Actual Memory Layout:**
+```
+For N elements:
+- BF16: N × 2 bytes
+- FP8:  N × 1 byte + scales
+- FP4:  N × 0.5 bytes + N/16 bytes (scales)
+```
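+
+To put concrete numbers on the layout, a small worked example (pure Python; the 8-head × 64-dim shape mirrors the MHA benchmark shape in Section 5.4 and is illustrative):
+
+```python
+def kv_bytes_per_token(num_heads: int, head_dim: int, fmt: str) -> float:
+    """Bytes needed to cache K and V for one token in one layer."""
+    elems = 2 * num_heads * head_dim  # K and V
+    # FP4 adds one 8-bit scale per 16-element block; FP8 per-tensor scales are negligible.
+    bits_per_elem = {"bf16": 16, "fp8": 8, "fp4": 4 + 8 / 16}[fmt]
+    return elems * bits_per_elem / 8
+
+base = kv_bytes_per_token(8, 64, "bf16")
+for fmt in ("bf16", "fp8", "fp4"):
+    b = kv_bytes_per_token(8, 64, fmt)
+    print(f"{fmt}: {b:.0f} B/token/layer ({base / b:.2f}x vs BF16)")
+# bf16: 2048 B (1.00x), fp8: 1024 B (2.00x), fp4: 576 B (3.56x)
+```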
+### 5.2 Accuracy Impact
+
+Based on preliminary accuracy tests from PRs #10078 (MLA) and #12612 (MHA):
+
+**Large Models (200B+ parameters):**
+- Simple datasets (gsm8k): no meaningful degradation (differences within ~0.3%, inside run-to-run noise)
+- Complex datasets (gpqa_diamond): 2-4% accuracy drop
+- Challenging datasets (aime25): 10-25% accuracy drop
+
+**Example Results (Qwen3-235B-A22B):**
+| Dataset | KV16 | KV8 (FP8) | KV4 (FP4) |
+|---------|------|-----------|-----------|
+| gsm8k | 0.9168 | 0.9181 | 0.9186 |
+| gpqa_diamond | 0.7010 | 0.6899 | 0.6778 |
+| aime25 | 0.7733 | 0.7333 | 0.6000 |
+
+**Smaller Models (GPT-OSS-120B):**
+- Simple datasets: ~0.1% drop
+- Complex datasets: up to 15-20% accuracy drop
+
+**Key Observations:**
+1. Model size matters: large models tolerate FP4 better
+2. Dataset complexity matters: simple tasks show minimal degradation
+3. Context length matters: longer contexts may show more degradation
+
+### 5.3 Performance Considerations
+
+**Pros:**
+- ~3.56× memory savings enable much longer contexts
+- Dynamic scaling eliminates the offline quantization step
+- CUDA Graph compatible for low latency
+- Quantization/dequantization can be fused with the attention kernels
+
+**Cons:**
+- Accuracy degradation on complex tasks
+- Requires backend support (not all attention backends are supported)
+- Extra quantize/dequantize overhead when not fused with the attention kernel
+- Works best on the Blackwell architecture
+
+### 5.4 Benchmark Performance
+
+**Test Configuration:**
+Tensor shapes `[M, N, K]`:
+- MLA: `[M, 1, 576]` (DeepSeek models)
+- MHA: `[M, 8, 64]` (GPT-OSS models)
+
+**Latency Measurements (indicative):**
+- Quantization time: comparable to an FP8 cast
+- Dequantization time: ~2-3× slower than an FP8 cast
+- Overall: acceptable for throughput-oriented workloads
+
+**Metrics Tracked:**
+- Mean Squared Error (MSE)
+- Mean Absolute Error (MAE)
+- Peak Signal-to-Noise Ratio (PSNR)
+- Relative Error
+
+Test file: `python/sglang/test/test_kvfp4_quant_dequant.py`
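+
+For reference, the tracked metrics reduce to a few lines of PyTorch. This is a sketch of the general formulas (mirroring what a test like `test_kvfp4_quant_dequant.py` measures, not a copy of it):
+
+```python
+import torch
+
+def quant_error_metrics(ref: torch.Tensor, deq: torch.Tensor) -> dict[str, float]:
+    ref, deq = ref.float(), deq.float()
+    err = deq - ref
+    mse = err.pow(2).mean()
+    return {
+        "mse": mse.item(),
+        "mae": err.abs().mean().item(),
+        # PSNR with the reference's peak magnitude as the signal peak.
+        "psnr_db": (20 * torch.log10(ref.abs().max()) - 10 * torch.log10(mse)).item(),
+        "rel_err": (err.abs().sum() / ref.abs().sum().clamp(min=1e-12)).item(),
+    }
+```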
+## 6. Implementation Details
+
+### 6.1 Quantization Algorithm
+
+An illustrative reference implementation (the production path is the fused, `torch.compile`d version in `KVFP4QuantizeUtil`). `E2M1_BOUNDS` and `E2M1_VALUES` are the Section 2.1 constants as tensors on the input's device:
+
+```python
+def batched_quantize(tensor):
+    # Step 1: Reshape to blocks of 16
+    B, M, N = tensor.shape
+    reshaped = tensor.view(B, M * N // 16, 16)  # [B, num_blocks, 16]
+
+    # Step 2: Compute per-block power-of-two scale factors
+    block_max = reshaped.abs().max(dim=-1, keepdim=True).values
+    scale_exp = torch.ceil(torch.log2(torch.clamp(block_max / 6.0, min=1e-10)))
+    scale_factors = (scale_exp + 127).to(torch.uint8).view(B, -1)  # biased, [B, M*N/16]
+
+    # Step 3: Scale values into the representable range [-6, 6]
+    scaled = reshaped / torch.exp2(scale_exp)
+
+    # Step 4: Quantize to E2M1 (bit 3 = sign, bits 0-2 = magnitude index)
+    sign_bits = (scaled < 0).to(torch.uint8) << 3
+    abs_vals = scaled.abs()
+    magnitude_bits = torch.sum(abs_vals.unsqueeze(-1) >= E2M1_BOUNDS, dim=-1)
+    fp4_vals = sign_bits | magnitude_bits.to(torch.uint8)
+
+    # Step 5: Pack two FP4 values into one uint8 (odd index -> high nibble)
+    fp4_reshaped = fp4_vals.view(B, M, N)
+    packed = (fp4_reshaped[..., 1::2] << 4) | fp4_reshaped[..., 0::2]
+
+    return packed, scale_factors
+```
+
+### 6.2 Dequantization Algorithm
+
+```python
+def batched_dequantize(quant_tensor, scale_factors, dtype=torch.bfloat16):
+    B, M, N_half = quant_tensor.shape
+    N = N_half * 2
+
+    # Step 1: Unpack each uint8 into two FP4 values
+    fp4_vals = torch.empty(B, M, N, dtype=torch.uint8, device=quant_tensor.device)
+    fp4_vals[..., 0::2] = quant_tensor & 0x0F
+    fp4_vals[..., 1::2] = (quant_tensor >> 4) & 0x0F
+
+    # Step 2: Extract sign and magnitude
+    sign_mask = (fp4_vals & 0x08) != 0
+    magnitude_idx = fp4_vals & 0x07
+
+    # Step 3: Convert to float via the E2M1_VALUES lookup table
+    float_vals = E2M1_VALUES[magnitude_idx.long()]
+    float_vals = torch.where(sign_mask, -float_vals, float_vals)
+
+    # Step 4: Reshape for block-wise scaling
+    reshaped = float_vals.view(B, M * N // 16, 16)
+
+    # Step 5: Apply scale factors (remove the 127 bias)
+    scale_exp = scale_factors.float() - 127
+    scaled = reshaped * torch.exp2(scale_exp.unsqueeze(-1))
+
+    return scaled.view(B, M, N).to(dtype)
+```
+
+### 6.3 Integration with Attention Layers
+
+The FP4 quantization is transparent to attention layers:
+
+```python
+# In the attention forward pass:
+# 1. New KV is computed in high precision (BF16)
+# 2. It is automatically quantized when written to the cache
+# 3. It is automatically dequantized when read from the cache
+# 4. Attention computation happens in high precision
+
+# Example flow:
+new_k, new_v = compute_kv(hidden_states)              # BF16
+memory_pool.set_kv_buffer(layer, loc, new_k, new_v)   # Quantizes to FP4
+cached_k = memory_pool.get_key_buffer(layer_id)       # Dequantizes to BF16
+cached_v = memory_pool.get_value_buffer(layer_id)     # Dequantizes to BF16
+output = attention_kernel(q, cached_k, cached_v)      # Compute in BF16
+```
+
+## 7. Testing and Validation
+
+### 7.1 Unit Tests
+
+**Quantization/Dequantization Tests:**
+```bash
+pytest python/sglang/test/test_kvfp4_quant_dequant.py
+```
+
+Tests cover:
+- Accuracy metrics (MSE, MAE, PSNR, relative error)
+- Various tensor shapes (MLA and MHA configurations)
+- Performance benchmarking (vs FP8)
+
+**Kernel Tests:**
+```bash
+pytest sgl-kernel/tests/test_fp4_quantize.py
+pytest sgl-kernel/tests/test_fp4_gemm.py
+```
+
+**Integration Tests:**
+```bash
+pytest test/registered/quant/test_nvfp4_gemm.py
+pytest test/registered/quant/test_deepseek_v3_fp4_4gpu.py
+pytest test/registered/quant/test_deepseek_v32_fp4_4gpu.py
+```
+
+### 7.2 End-to-End Model Tests
+
+**DeepSeek Models:**
+- `test/registered/backends/test_deepseek_v3_fp4_cutlass_moe.py`
+- `test/registered/backends/test_qwen3_fp4_trtllm_gen_moe.py`
+- `test/registered/spec/eagle/test_deepseek_v3_fp4_mtp_small.py`
+
+**Performance Tests:**
+- `test/registered/perf/test_dpsk_r1_fp4_4gpu_perf.py`
+
+### 7.3 Accuracy Validation
+
+Accuracy tests on standard benchmarks:
+```bash
+# GSM8K
+python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --parallel 1319
+
+# Expected results with FP4:
+# - Large models: >91% accuracy (minimal degradation)
+# - Small models: may vary significantly
+```
+
+## 8. Best Practices
+
+### 8.1 When to Use FP4 KVCache
+
+**Recommended:**
+- ✅ Large models (200B+ parameters)
+- ✅ Simple to moderate reasoning tasks
+- ✅ Throughput-oriented workloads
+- ✅ Memory-constrained environments
+- ✅ Blackwell (B200) GPUs
+
+**Not Recommended:**
+- ❌ Small models (<100B parameters)
+- ❌ Complex reasoning tasks requiring high accuracy
+- ❌ Latency-critical applications (without fused kernels)
+- ❌ Pre-Blackwell architectures (Hopper, Ampere)
+
+### 8.2 Configuration Checklist
+
+```bash
+# 1. Check CUDA version
+nvcc --version  # Should be 12.8+
+
+# 2. Check PyTorch version
+python -c "import torch; print(torch.__version__)"  # Should be 2.8.0+
+
+# 3. Verify backend support
+--attention-backend {trtllm_mla|flashinfer_mla|flashmla|cutlass_mla|fa4}
+
+# 4. Enable FP4 KV cache
+--kv-cache-dtype fp4_e2m1
+
+# 5. For quantized weights, specify quantization
+--quantization modelopt_fp4
+
+# 6. For MoE models, specify the MoE runner
+--moe-runner-backend flashinfer_trtllm
+```
+
+### 8.3 Accuracy Evaluation
+
+Always evaluate on your specific workload:
+```bash
+# Run accuracy tests
+python benchmark/gsm8k/bench_sglang.py --num-shots 8
+
+# Compare with baselines
+# FP16/BF16 baseline
+python -m sglang.launch_server --model MODEL --kv-cache-dtype auto
+
+# FP8 baseline
+python -m sglang.launch_server --model MODEL --kv-cache-dtype fp8_e4m3
+
+# FP4 test
+python -m sglang.launch_server --model MODEL --kv-cache-dtype fp4_e2m1
+```
+
+### 8.4 Debugging Tips
+
+**Common Issues:**
+
+1. **Backend Not Supporting FP4:**
+```
+Error: Backend 'flashinfer' does not support fp4_e2m1 kv_cache_dtype
+Solution: Use a supported backend (trtllm_mla, flashinfer_mla, fa4, etc.)
+```
+
+2. **CUDA Version Too Old:**
+```
+Error: NVFP4 requires CUDA 12.8+
+Solution: Upgrade the CUDA toolkit
+```
+
+3. 
**Performance Degradation:** +``` +Issue: Slow inference despite memory savings +Cause: Quantization not fused with attention +Solution: Use backend with fused support (trtllm_mla recommended) +``` + +## 9. Future Work and Limitations + +### 9.1 Current Limitations + +1. **Backend Support**: Not all attention backends support FP4 +2. **Accuracy**: May degrade on complex reasoning tasks +3. **Architecture**: Optimized primarily for Blackwell (B200) +4. **Block Size**: Fixed at 16 elements (no tuning option) + +### 9.2 Potential Improvements + +1. **Adaptive Block Sizing**: Dynamically adjust block size based on data distribution +2. **Mixed Precision**: Use FP4 for some layers, FP8/BF16 for others +3. **Improved Quantization**: Better rounding strategies +4. **Broader Backend Support**: Add FP4 support to more backends +5. **CPU Fallback**: Efficient CPU implementation for non-Blackwell GPUs + +### 9.3 Related Work + +- **FP8 KVCache**: More mature, better accuracy, but 2× less memory efficient +- **INT4 Quantization**: Similar memory savings but requires different kernels +- **Sparse KVCache**: Complementary optimization focusing on attention patterns + +## 10. References + +### 10.1 Documentation + +- [Quantized KV Cache Guide](docs/advanced_features/quantized_kv_cache.md) +- [Attention Backend Guide](docs/advanced_features/attention_backend.md) +- [DeepSeek V3 Usage](docs/basic_usage/deepseek_v3.md) +- [DeepSeek V3.2 Usage](docs/basic_usage/deepseek_v32.md) + +### 10.2 Pull Requests + +- PR #10078: MLA FP4 KVCache support +- PR #12612: MHA FP4 KVCache support + +### 10.3 External Resources + +- [OCP MXFP4 Specification](https://www.opencompute.org) +- [DeepSeek MLA Paper](https://arxiv.org/pdf/2405.04434) +- [NVIDIA Blackwell Architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/) + +### 10.4 Model Checkpoints + +- [NVIDIA DeepSeek-R1-0528-NVFP4-v2](https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2) +- [NVIDIA DeepSeek-V3.2-NVFP4](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4) + +## 11. Conclusion + +NVFP4 KVCache support in SGLang provides a powerful memory optimization technique that enables significantly longer context lengths and higher throughput for large language model inference. While it introduces some accuracy trade-offs, it is particularly effective for large models (200B+ parameters) on simpler tasks and is well-suited for throughput-oriented workloads on NVIDIA Blackwell GPUs. + +The implementation leverages block-based microscaling with dynamic scale factor computation, eliminating the need for offline quantization while maintaining CUDA Graph compatibility. Integration with SGLang's memory pool is seamless, and support for both MHA and MLA attention patterns makes it versatile across different model architectures. + +For production deployments, careful evaluation of the accuracy/memory trade-off is recommended, with FP8 being a safer choice for accuracy-critical applications and FP4 excelling in memory-constrained scenarios where moderate accuracy degradation is acceptable. diff --git a/NVFP4_KVCACHE_PRS_AND_ISSUES.md b/NVFP4_KVCACHE_PRS_AND_ISSUES.md new file mode 100644 index 000000000000..b80830525902 --- /dev/null +++ b/NVFP4_KVCACHE_PRS_AND_ISSUES.md @@ -0,0 +1,528 @@ +# NVFP4 KVCache Related PRs and Issues from sgl-project/sglang + +This document summarizes all Pull Requests and Issues related to NVFP4 KVCache support from the original sgl-project/sglang repository. + +## Table of Contents +1. 
[Key Pull Requests](#key-pull-requests) +2. [Issues and Discussions](#issues-and-discussions) +3. [Related PRs](#related-prs) +4. [Timeline](#timeline) + +--- + +## Key Pull Requests + +### PR #10078 - FP4 (E2M1) Support for MLA KV Cache ⭐ **FOUNDATIONAL** +- **Title**: FP4 (E2M1) support for Multi-Head Latent Attention (MLA) KV cache +- **Status**: Merged +- **Opened**: September 5, 2025 +- **Author**: [@JackChuang](https://github.com/JackChuang), [@yicwang](https://github.com/yicwang) +- **Labels**: `high priority`, `quant`, `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/10078 + +**Summary**: +This is the **foundational PR** that introduced FP4 KVCache support to SGLang for MLA (Multi-Head Latent Attention) models. + +**Key Changes**: +1. **Extended `--kv-cache-dtype` argument** to support `"fp4_e2m1"` +2. **Introduced `KVFP4QuantizeUtil` class** for FP4 (E2M1) quantization/dequantization + - `batched_quantize`: Block-wise (16 elements) quantization for [M, N, K] tensors + - `batched_dequantize`: Block-wise dequantization +3. **Added FP4 KV cache support in `MLATokenToKVPool`** + - Introduced `kv_scale_buffer` for FP4 scaling factors + - Implemented Triton kernel to combine nope + rope tensors + - Modified ModelRunner for FP4 buffer sizing +4. **Created `test_kvfp4_quant_dequant.py`** validation test + - Metrics: MSE, MAE, PSNR, Relative Error + - Benchmarks FP4 vs FP8 performance +5. **Maintained backward compatibility** with FP16/FP8 KV cache + +**Commits**: +- Extend ServerArgs to support fp4_e2m1 +- Introduce KVFP4QuantizeUtil for quantization/dequantization +- Add test_kvfp4_quant_dequant.py +- Implement FP4 KV cache in MLATokenToKVPool +- Add utility functions is_cuda() and is_float4_e2m1fn_x2() + +**Impact**: Enables ~3.56× memory savings vs BF16 for MLA models (e.g., DeepSeek V3/R1) + +--- + +### PR #12612 - FP4 Support for MHA KV Cache ⭐ **EXTENDS MLA TO MHA** +- **Title**: FP4 KV cache support for MHA (Multi-Head Attention) +- **Status**: Merged +- **Opened**: November 4, 2025 +- **Author**: [@JackChuang](https://github.com/JackChuang), [@yicwang](https://github.com/yicwang) +- **Labels**: `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/12612 + +**Summary**: +Extends PR #10078 to support FP4 KVCache for standard Multi-Head Attention (MHA) models. + +**Key Changes**: +1. **FP4 KV cache support in `MHATokenToKVPool`** with uint8 storage +2. **Added `k_scale_buffer` and `v_scale_buffer`** for FP4 scaling factors +3. **Batched quantization on cache update** and dequantization on access +4. **Updated ModelRunner memory estimation** for FP4 scale buffers +5. **Backward compatible** with FP16/FP8 KV cache + +**Impact**: Enables FP4 KVCache for standard transformer models (Llama, Qwen, Mistral, etc.) + +--- + +### PR #10154 - ModelOpt FP4 Quantization Support +- **Title**: ModelOpt FP4 quantization support +- **Status**: Merged +- **Opened**: September 8, 2025 +- **Author**: [@Edwardf0t1](https://github.com/Edwardf0t1) +- **Labels**: `high priority`, `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/10154 + +**Summary**: +Adds support for loading and serving models quantized with NVIDIA ModelOpt FP4 format. + +**Key Changes**: +1. Added `modelopt_fp4` as a quantization method +2. Support for loading FP4 quantized weights from ModelOpt +3. Integration with NVIDIA TensorRT-LLM kernels +4. 
Added ModelOpt configuration fields for checkpoint and export paths + +**Related Models**: +- nvidia/DeepSeek-R1-0528-NVFP4 +- nvidia/DeepSeek-V3.2-NVFP4 + +--- + +### PR #18314 - TRTLLM MHA Backend FP4 KVCache Improvements (Recent) +- **Title**: TRTLLM MHA backend improvements for FP4 KVCache +- **Status**: Open +- **Opened**: January 27, 2026 (3 days ago) +- **Author**: [@samuellees](https://github.com/samuellees) +- **Labels**: `blackwell` +- **URL**: https://github.com/sgl-project/sglang/pull/18314 + +**Summary**: +Latest improvements to TRTLLM MHA backend for better FP4 KVCache support on Blackwell. + +**Key Changes**: +1. Optimized dequantization logic for NVFP4 +2. Added preload_kv_scales support for NVFP4 KV cache +3. Introduced `is_nvfp4_kvcache` and `kv_cache_dtype_alias` properties +4. Initialized k_scales_gpu and v_scales_gpu for NVFP4 +5. Performance optimization: 550µs → 70µs for certain operations + +**Accuracy Results**: +- gsm8k: 0.7657 exact_match (flexible-extract) +- Improvements shown over multiple iterations + +--- + +### PR #10281 - TRTLLM MLA Backend with FP4 KV Cache +- **Title**: Enable forward_extend in TRT-LLM MLA backend for target_verify +- **Status**: Merged +- **Opened**: September 10, 2025 +- **Author**: [@pranavm-nvidia](https://github.com/pranavm-nvidia) +- **Labels**: `high priority`, `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/10281 + +**Summary**: +Enables `forward_extend` in TRT-LLM MLA backend for speculative decoding with FP4 KV cache. + +**Key Changes**: +1. Enabled forward_extend for target_verify in MTP (Multi-Token Prediction) +2. Added KV cache update logic in forward_extend +3. FP4 KV cache dequantization support + +--- + +### PR #11655 - FlashInfer MLA Backend FP4 Support +- **Title**: FlashInfer MLA backend improvements +- **Status**: Merged +- **Opened**: October 15, 2025 +- **Author**: [@hlu1](https://github.com/hlu1) +- **Labels**: `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/11655 + +**Summary**: +Improvements to FlashInfer MLA backend for FP4 KVCache support. + +**Key Changes**: +1. FP4 support in FlashInfer MLA backend +2. Page size = 1 support for MLA with FP4 +3. Performance optimizations + +--- + +### PR #17530 - Per-Layer KV Cache Dtype Configuration +- **Title**: Support per-layer KV cache dtype configuration +- **Status**: Open +- **Opened**: January 14, 2026 (18 days ago) +- **Author**: [@jindajia](https://github.com/jindajia) +- **Labels**: `blackwell`, `deepseek`, `documentation`, `quant` +- **URL**: https://github.com/sgl-project/sglang/pull/17530 + +**Summary**: +Adds ability to configure different KV cache dtypes per layer. + +**Key Features**: +1. **CLI Configuration**: + ```bash + --kv-cache-dtype fp4_e2m1 \ + --kv-cache-per-layer-dtype 0-1:bf16,62-63:bf16 + ``` +2. **YAML Configuration**: File-based per-layer mapping +3. Useful for hybrid precision strategies + +**Use Case**: Some layers (like first/last) may need higher precision (BF16) while middle layers can use FP4 + +--- + +### PR #15133 - NVFP4 vs MXFP4 Format Clarification +- **Title**: Support nvfp4_e2m1 and mxfp4_e2m1 format distinction +- **Status**: Open +- **Opened**: December 15, 2025 +- **Author**: [@b8zhong](https://github.com/b8zhong) +- **Labels**: None specified +- **URL**: https://github.com/sgl-project/sglang/pull/15133 + +**Summary**: +Clarifies distinction between NVFP4 and MXFP4 formats. + +**Key Changes**: +1. Can now specify `nvfp4_e2m1` or `mxfp4_e2m1` explicitly +2. 
`fp4_e2m1` defaults to mxfp4 for backward compatibility +3. TODO: Support global k and v scale in FP32 + +--- + +### PR #14348 - Documentation for FP4 KV Cache +- **Title**: Documentation improvements for FP4 quantization +- **Status**: Merged +- **Opened**: December 3, 2025 +- **Author**: [@b8zhong](https://github.com/b8zhong) +- **Labels**: `documentation`, `quant` +- **URL**: https://github.com/sgl-project/sglang/pull/14348 + +**Summary**: +Comprehensive documentation for FP4 KVCache usage. + +**Content**: +- FP4 (E2M1) format explanation +- Usage examples +- Configuration options +- Performance characteristics + +--- + +### PR #18067 - FP4 KV Cache Additional Improvements +- **Title**: Additional FP4 KV cache improvements +- **Status**: Open +- **Opened**: January 26, 2026 (7 days ago) +- **Author**: [@celve](https://github.com/celve) +- **Labels**: `quant`, `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/18067 + +**Summary**: +Recent improvements and bug fixes for FP4 KVCache. + +--- + +## Issues and Discussions + +### Issue #8180 - Quantization Roadmap ⭐ **STRATEGIC PLANNING** +- **Title**: Quantization implementation decoupling and enhancement roadmap +- **Status**: Open +- **Opened**: July 20, 2025 +- **Author**: [@AniZpZ](https://github.com/AniZpZ) +- **Labels**: `collaboration`, `high priority` +- **URL**: https://github.com/sgl-project/sglang/issues/8180 + +**Summary**: +Strategic roadmap for quantization features in SGLang. + +**MXFP4 Section** (Section 4): +- **Objective**: Support for cutting-edge MXFP4 quantization format +- Status: MXFP4 Quantization (mentioned as future work) + +**Related Sections**: +1. Decouple Quantization Implementation from vLLM +2. Quantization on Various Hardware Platforms +3. Non-Linear Module & Communication Quantization + - **Improved KV Cache Quantization** by @Wilbolu (mentioned) +4. Support for More Features & Novel Formats + - **MXFP4 Quantization** (primary focus) + +**Context**: This issue tracks the overall quantization strategy, with MXFP4/NVFP4 KVCache as a key component. + +--- + +### Issue #17655 - DeepSeek FP4 Related Discussion +- **Title**: DeepSeek FP4 discussion +- **Status**: Open +- **Opened**: January 17, 2026 (16 days ago) +- **Author**: [@Fridge003](https://github.com/Fridge003) +- **Labels**: `deepseek`, `help wanted` +- **URL**: https://github.com/sgl-project/sglang/issues/17655 + +**Summary**: +Discussion about FP4 support for DeepSeek models. + +**Topics**: +- FP4 KVCache performance on DeepSeek models +- Integration with ModelOpt quantized checkpoints +- Best practices and configuration + +--- + +### Issue #16595 - NVFP4 Kernel Issue on B200 +- **Title**: Quantized GPT OSS 120b from FP16 to NVFP4 using ModelOpt kernel issue on 8xB200 +- **Status**: Open +- **Opened**: January 6, 2026 +- **Author**: [@pdasgup](https://github.com/pdasgup) +- **Labels**: None specified +- **URL**: https://github.com/sgl-project/sglang/issues/16595 + +**Summary**: +Kernel issue when serving NVFP4 quantized GPT-OSS-120B on 8×B200 GPUs. 
+ +**Details**: +- Model quantized from FP16 to NVFP4 using ModelOpt +- Deployment on 8×B200 (Blackwell) +- Kernel-level issues reported + +**Status**: Investigation ongoing + +--- + +### Issue #10533 - Invalid Quantization Choice +- **Title**: Invalid quantization choice error +- **Status**: Closed (Inactive) +- **Opened**: September 16, 2025 +- **Author**: [@celsowm](https://github.com/celsowm) +- **Labels**: `inactive` +- **URL**: https://github.com/sgl-project/sglang/issues/10533 + +**Summary**: +User error with quantization argument - confusion between valid choices. + +**Resolution**: Clarified valid quantization methods including `modelopt_fp4` + +--- + +### Issue #14747 - FP4 Quantization Argument Confusion +- **Title**: Quantization argument confusion +- **Status**: Open +- **Opened**: December 9, 2025 +- **Author**: [@FalconIA](https://github.com/FalconIA) +- **Labels**: None specified +- **URL**: https://github.com/sgl-project/sglang/issues/14747 + +**Summary**: +Discussion about correct quantization arguments: `nvfp4` vs `mxfp4` vs `modelopt_fp4` + +**Key Points**: +- `nvfp4` is NOT a valid quantization argument +- Should use `mxfp4` or `modelopt_fp4` +- Distinction between instruct models (not FP4) and thinking models (FP4) + +--- + +### Issue #14322 - NVFP4 Related Discussion +- **Title**: NVFP4 discussion +- **Status**: Open +- **Opened**: December 3, 2025 +- **Author**: [@koush](https://github.com/koush) +- **Labels**: None specified +- **URL**: https://github.com/sgl-project/sglang/issues/14322 + +**Summary**: +General discussion about NVFP4 usage and best practices. + +--- + +### Issue #18137 - Recent NVFP4 Issue +- **Title**: Recent NVFP4 issue +- **Status**: Open +- **Opened**: January 29, 2026 (5 days ago) +- **Author**: [@vincentzed](https://github.com/vincentzed) +- **Labels**: None specified +- **URL**: https://github.com/sgl-project/sglang/issues/18137 + +**Summary**: +Recent issue with NVFP4 implementation (details to be investigated). + +--- + +## Related PRs + +### PR #8112 - GPTQ Quantization Decoupling (Foundation) +- **Title**: [2/n] Decouple quantization implementation from vLLM dependency +- **Status**: Merged +- **Opened**: July 17, 2025 +- **Author**: [@AniZpZ](https://github.com/AniZpZ) +- **Labels**: `high priority` +- **URL**: https://github.com/sgl-project/sglang/pull/8112 + +**Relevance**: Part of broader quantization infrastructure that FP4 builds upon. + +--- + +### PR #16678 - sgl-kernel FP4 Support +- **Title**: sgl-kernel FP4 support +- **Status**: Open +- **Opened**: January 7, 2026 +- **Author**: [@hlu1](https://github.com/hlu1) +- **Labels**: `sgl-kernel` +- **URL**: https://github.com/sgl-project/sglang/pull/16678 + +**Summary**: +FP4 support in sgl-kernel (low-level CUDA kernels). + +--- + +### PR #12135 - Related Quantization Work +- **Title**: Quantization improvements +- **Status**: Open +- **Opened**: October 26, 2025 +- **Author**: [@Fridge003](https://github.com/Fridge003) +- **Labels**: `run-ci` +- **URL**: https://github.com/sgl-project/sglang/pull/12135 + +**Summary**: +Various quantization improvements that complement FP4 work. 
+ +--- + +## Timeline + +### Phase 1: Foundation (July - September 2025) +- **July 2025**: Issue #8180 opened - Strategic quantization roadmap +- **September 5, 2025**: **PR #10078 opened** - FP4 MLA KVCache (FOUNDATIONAL) +- **September 8, 2025**: PR #10154 opened - ModelOpt FP4 support +- **September 10, 2025**: PR #10281 opened - TRTLLM MLA backend + +### Phase 2: Expansion (October - November 2025) +- **October 15, 2025**: PR #11655 opened - FlashInfer MLA backend +- **October 26, 2025**: PR #12135 opened - Quantization improvements +- **November 4, 2025**: **PR #12612 opened** - FP4 MHA KVCache (extends to MHA) + +### Phase 3: Refinement (December 2025) +- **December 3, 2025**: PR #14348 opened - Documentation improvements +- **December 9, 2025**: Issue #14747 - Clarifying quantization arguments +- **December 15, 2025**: PR #15133 - NVFP4 vs MXFP4 distinction + +### Phase 4: Recent Work (January 2026) +- **January 6, 2026**: Issue #16595 - B200 kernel issues +- **January 7, 2026**: PR #16678 - sgl-kernel FP4 support +- **January 14, 2026**: PR #17530 - Per-layer KV cache dtype +- **January 17, 2026**: Issue #17655 - DeepSeek FP4 discussion +- **January 26, 2026**: PR #18067 - Additional improvements +- **January 27, 2026**: PR #18314 - TRTLLM MHA improvements +- **January 29, 2026**: Issue #18137 - Recent issues + +--- + +## Summary Statistics + +### Pull Requests +- **Total PRs Found**: 15+ +- **Merged**: 8+ +- **Open**: 7+ +- **High Priority**: 6+ + +### Key Contributors +- [@JackChuang](https://github.com/JackChuang) - Primary developer (PRs #10078, #12612) +- [@yicwang](https://github.com/yicwang) - Co-author on foundational PRs +- [@Edwardf0t1](https://github.com/Edwardf0t1) - ModelOpt integration (PR #10154) +- [@hlu1](https://github.com/hlu1) - FlashInfer backend (PR #11655) +- [@samuellees](https://github.com/samuellees) - Recent TRTLLM improvements (PR #18314) +- [@pranavm-nvidia](https://github.com/pranavm-nvidia) - TRTLLM MLA (PR #10281) +- [@AniZpZ](https://github.com/AniZpZ) - Strategic planning (Issue #8180) +- [@b8zhong](https://github.com/b8zhong) - Format clarification, documentation + +### Key Milestones +1. **September 5, 2025**: Initial FP4 MLA support (PR #10078) +2. **November 4, 2025**: Extended to MHA support (PR #12612) +3. **Ongoing**: Backend-specific optimizations (TRTLLM, FlashInfer, etc.) +4. **Recent**: Per-layer configuration and format refinements + +### Hardware Focus +- Primary: **NVIDIA Blackwell (B200)** GPUs +- Also: Hopper (H100/H200), Ampere architectures +- Special optimizations for DeepSeek models + +--- + +## Key Technical Insights + +### 1. Two-Phase Implementation +- **Phase 1**: MLA support (DeepSeek models) - PR #10078 +- **Phase 2**: MHA support (standard transformers) - PR #12612 + +### 2. Memory Savings +- **Theoretical**: 4 bits/value + 0.5 bits/value (scale) = 4.5 bits effective +- **Actual**: ~3.56× more tokens than BF16, ~1.78× more than FP8 + +### 3. Accuracy Trade-offs +- Large models (200B+): Minimal degradation on simple tasks +- Small models: More pronounced accuracy drops +- Dataset-dependent: Simple (gsm8k) vs Complex (gpqa_diamond, aime25) + +### 4. Backend Support Evolution +``` +Initial: FlashInfer MLA (page_size=1) +Added: TRTLLM MLA (page_size=32,64), FlashMLA (page_size=64) +Recent: Cutlass MLA (page_size=128), FA4 (various) +Ongoing: Per-backend optimizations +``` + +### 5. 
Format Naming +- **fp4_e2m1**: Generic name (defaults to mxfp4) +- **nvfp4_e2m1**: NVIDIA-specific implementation +- **mxfp4_e2m1**: OCP MXFP4 standard (microscaling) +- **modelopt_fp4**: For weights quantized with NVIDIA ModelOpt + +--- + +## Related Search Queries + +For finding more information, use these GitHub search queries: + +1. **Pull Requests**: + - `repo:sgl-project/sglang NVFP4 KVCache type:pullrequest` + - `repo:sgl-project/sglang fp4 kvcache type:pullrequest` + - `repo:sgl-project/sglang fp4_e2m1 type:pullrequest` + - `repo:sgl-project/sglang mxfp4 kvcache type:pullrequest` + +2. **Issues**: + - `repo:sgl-project/sglang NVFP4 KVCache type:issue` + - `repo:sgl-project/sglang fp4 kvcache type:issue` + - `repo:sgl-project/sglang modelopt_fp4 type:issue` + +3. **Code Search**: + - `repo:sgl-project/sglang KVFP4QuantizeUtil` + - `repo:sgl-project/sglang fp4_e2m1` + - `repo:sgl-project/sglang kv_cache_dtype` + +--- + +## References + +### Official Documentation +- [Quantized KV Cache](https://docs.sglang.io/advanced_features/quantized_kv_cache.html) +- [Attention Backend](https://docs.sglang.io/advanced_features/attention_backend.html) +- [DeepSeek V3 Usage](https://docs.sglang.io/basic_usage/deepseek_v3.html) + +### Model Checkpoints +- [nvidia/DeepSeek-R1-0528-NVFP4-v2](https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2) +- [nvidia/DeepSeek-V3.2-NVFP4](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4) + +### External Standards +- [OCP MXFP4 Specification](https://www.opencompute.org/) + +--- + +**Document Generated**: February 2026 +**Last Updated**: January 29, 2026 (based on latest PR/issue activity) +**Repository**: [sgl-project/sglang](https://github.com/sgl-project/sglang) +**Compiled by**: Automated analysis of GitHub search results diff --git a/NVFP4_SUMMARY_README.md b/NVFP4_SUMMARY_README.md new file mode 100644 index 000000000000..9737a6445153 --- /dev/null +++ b/NVFP4_SUMMARY_README.md @@ -0,0 +1,110 @@ +# NVFP4 KVCache Design Summary + +This document provides a comprehensive design and implementation summary for NVFP4 (NVIDIA FP4) KVCache support in SGLang. + +## Document Location + +**Main Summary:** [`NVFP4_KVCACHE_DESIGN_SUMMARY.md`](NVFP4_KVCACHE_DESIGN_SUMMARY.md) +**PRs and Issues:** [`NVFP4_KVCACHE_PRS_AND_ISSUES.md`](NVFP4_KVCACHE_PRS_AND_ISSUES.md) + +## What's Covered + +### Main Design Summary + +The main design summary document includes: + +1. **Executive Overview** - High-level description of NVFP4 KVCache and its benefits +2. **Technical Design** - Detailed explanation of the E2M1 quantization format and block-based microscaling +3. **Implementation Architecture** - Core components, integration points, and code examples +4. **Attention Backend Support** - Compatibility matrix for MHA and MLA backends +5. **Usage Guide** - Server arguments, hardware requirements, and configuration examples +6. **Performance Characteristics** - Memory savings, accuracy impact, and benchmarks +7. **Testing & Validation** - Unit tests, integration tests, and accuracy validation +8. **Best Practices** - When to use FP4, configuration checklist, and debugging tips +9. **Future Work** - Current limitations and potential improvements +10. **References** - Links to documentation, PRs, and external resources + +### PRs and Issues Summary + +The PRs/Issues document includes: + +1. 
**Key Pull Requests** - Detailed analysis of 15+ PRs from sgl-project/sglang + - PR #10078: Foundational FP4 MLA support (Sep 2025) + - PR #12612: Extended to MHA support (Nov 2025) + - PR #10154: ModelOpt FP4 integration + - PR #18314: Recent TRTLLM improvements (Jan 2026) + - And more... + +2. **Issues and Discussions** - 8+ issues tracking bugs, features, and discussions + - Issue #8180: Strategic quantization roadmap + - Issue #17655: DeepSeek FP4 discussion + - Issue #16595: B200 kernel issues + - And more... + +3. **Timeline** - Four-phase development history (July 2025 - January 2026) + +4. **Key Contributors** - Credits to main developers and contributors + +5. **Technical Insights** - Lessons learned from PR discussions + +## Quick Links + +### Implementation Files +- **Core Quantization Utility:** `python/sglang/srt/layers/quantization/kvfp4_tensor.py` +- **Memory Pool Integration:** `python/sglang/srt/mem_cache/memory_pool.py` +- **Tests:** `python/sglang/test/test_kvfp4_quant_dequant.py` + +### Documentation +- **Quantized KV Cache Guide:** `docs/advanced_features/quantized_kv_cache.md` +- **Attention Backend Guide:** `docs/advanced_features/attention_backend.md` +- **DeepSeek V3 Usage:** `docs/basic_usage/deepseek_v3.md` +- **DeepSeek V3.2 Usage:** `docs/basic_usage/deepseek_v32.md` + +## Key Takeaways + +- **Memory Savings:** ~3.56× more tokens than BF16, ~1.78× more than FP8 +- **Hardware:** Optimized for NVIDIA Blackwell (B200) GPUs +- **Requirements:** CUDA 12.8+, PyTorch 2.8.0+ +- **Use Cases:** Best for large models (200B+) on throughput-oriented workloads +- **Accuracy:** Minimal degradation on simple tasks, more pronounced on complex reasoning + +## Example Usage + +```bash +# Basic FP4 KV cache +python3 -m sglang.launch_server \ + --model-path nvidia/DeepSeek-R1-0528-NVFP4 \ + --kv-cache-dtype fp4_e2m1 + +# With quantized weights on Blackwell +python -m sglang.launch_server \ + --model nvidia/DeepSeek-V3.2-NVFP4 \ + --tp 4 \ + --quantization modelopt_fp4 \ + --moe-runner-backend flashinfer_trtllm \ + --kv-cache-dtype fp4_e2m1 +``` + +## Research Sources + +This summary was compiled from: +- Source code analysis of the SGLang repository +- Official documentation files +- Test files and benchmarks +- **15+ Pull Requests from sgl-project/sglang** (see NVFP4_KVCACHE_PRS_AND_ISSUES.md) +- **8+ Issues from sgl-project/sglang** (see NVFP4_KVCACHE_PRS_AND_ISSUES.md) +- Pull Request discussions (#10078, #12612, #10154, #18314, and more) + +### Key PRs Referenced +- **PR #10078** (Sep 2025): Foundational FP4 MLA support by @JackChuang +- **PR #12612** (Nov 2025): Extended to MHA support by @JackChuang +- **PR #10154** (Sep 2025): ModelOpt FP4 integration by @Edwardf0t1 +- **PR #18314** (Jan 2026): Latest TRTLLM improvements by @samuellees + +For complete list with descriptions, see [`NVFP4_KVCACHE_PRS_AND_ISSUES.md`](NVFP4_KVCACHE_PRS_AND_ISSUES.md) + +--- + +**Repository:** [yiliu30/sglang-fork](https://github.com/yiliu30/sglang-fork) +**Original Repository:** [sgl-project/sglang](https://github.com/sgl-project/sglang) +**Created:** February 2026