**File:** `benchmarks/multi_turn_tq/BENCHMARK_REPORT.md` (new file, +126 lines)
# Multi-Turn KV Cache Compression Benchmark Report

## Summary

This report compares KV cache compression strategies on MiniMax-M2.7 with TP=2, simulating 192GB of GPU memory.

**TurboQuant 4-bit achieves 85.7% cache hit rate vs FP8's 27.5%**, resulting in:

- **4.6× faster TTFT** than BF16 baseline (8.0s vs 37s)
- **2.4× faster TTFT** than FP8 (8.0s vs 19.5s)
- **1.3× faster total duration** than FP8 (492s vs 639s)

## Configuration

| Parameter | Value |
|-----------|-------|
| Model | MiniMax-M2.7 |
| TP Size | 2 |
| GPU Memory Util | 0.6 |
| Clients | 40 |
| Rounds | 8 |
| Common Prefix | 2,000 tokens |
| Per-Client Prefix | 32,000 tokens |
| Input/Round | 2,000 tokens |
| Output/Round | 200 tokens |

## Results

### Overall Metrics

| Metric | BF16 | FP8 | TQ 4-bit |
|--------|------|-----|----------|
| TTFT Mean | 36,986 ms | 19,497 ms | **8,006 ms** |
| TTFT P90 | 63,394 ms | 43,452 ms | **10,580 ms** |
| Cache Hit Rate | 6.6% | 27.5% | **85.7%** |
| Throughput | 12,154 tok/s | 16,159 tok/s | **21,038 tok/s** |
| Total Duration | 849s | 639s | **492s** |

### Per-Round Cache Hit Rate

| Round | BF16 | FP8 | TQ 4-bit |
|-------|------|-----|----------|
| 0 | 5.8% | 5.8% | 5.8% |
| 1 | 10.0% | 73.5% | **93.7%** |
| 2 | 7.5% | 48.4% | **94.1%** |
| 3 | 10.7% | 31.3% | **94.4%** |
| 4 | 7.0% | 22.4% | **94.7%** |
| 5 | 4.5% | 18.5% | **95.0%** |
| 6 | 4.3% | 15.6% | **95.2%** |
| 7 | 4.1% | 13.2% | **95.4%** |

### Per-Round TTFT (ms)

| Round | BF16 | FP8 | TQ 4-bit |
|-------|------|-----|----------|
| 0 | 27,723 | 21,111 | 29,844 |
| 1 | 28,104 | 3,676 | **4,289** |
| 2 | 31,519 | 11,413 | **4,382** |
| 3 | 33,229 | 16,977 | **4,701** |
| 4 | 37,693 | 20,896 | **4,903** |
| 5 | 42,181 | 23,893 | **5,002** |
| 6 | 45,708 | 26,919 | **5,295** |
| 7 | 49,731 | 31,086 | **5,628** |

## Reproduction Steps

```bash
cd benchmarks/multi_turn_tq

# BF16 Baseline
HIP_VISIBLE_DEVICES=4,5 \
GPU_MEMORY_UTIL=0.6 \
MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 \
SUB_QUESTION_TOKENS=2000 \
ATTENTION_BACKEND=ROCM_AITER_FA \
./run_benchmark.sh \
--kv-cache-dtype auto \
--tag fix_baseline \
--num-clients 40 \
--num-rounds 8 \
--common-prefix 2000 \
--prefix-tokens 32000 \
--port 6789

# FP8
HIP_VISIBLE_DEVICES=2,3 \
GPU_MEMORY_UTIL=0.6 \
MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 \
SUB_QUESTION_TOKENS=2000 \
ATTENTION_BACKEND=ROCM_AITER_FA \
./run_benchmark.sh \
--kv-cache-dtype fp8_e4m3 \
--tag fix_fp8 \
--num-clients 40 \
--num-rounds 8 \
--common-prefix 2000 \
--prefix-tokens 32000 \
--port 6791

# TurboQuant 4-bit
HIP_VISIBLE_DEVICES=6,7 \
GPU_MEMORY_UTIL=0.6 \
MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 \
SUB_QUESTION_TOKENS=2000 \
VLLM_TQ_DECODE_V3=1 \
./run_benchmark.sh \
--kv-cache-dtype turboquant_4bit_nc \
--tag fix3_tq4bit \
--num-clients 40 \
--num-rounds 8 \
--common-prefix 2000 \
--prefix-tokens 32000 \
--port 6790

# Compare results
python compare_results.py results/multiturn/results_fix_baseline_*.json results/multiturn/results_fix_fp8_*.json results/multiturn/results_fix3_tq4bit_*.json
```

## Result Files

- `results/multiturn/results_fix_baseline_20260429_022339.json`
- `results/multiturn/results_fix_fp8_20260429_022010.json`
- `results/multiturn/results_fix3_tq4bit_20260429_202437.json`
**File:** `benchmarks/multi_turn_tq/SKILL_MULTITURN_BENCHMARK.md` (new file, +232 lines)
---
name: multiturn-benchmark
description: >
Benchmark KV cache compression strategies (BF16 vs FP8 vs TurboQuant 4-bit) on vLLM
  with multi-turn workloads on MiniMax-M2.7 (MI300X/MI355X). Finds scenarios where compression
  outperforms the baseline by creating memory pressure. Key result: TurboQuant achieves an 85%
  cache hit rate vs FP8's 27% under memory pressure, and 2.4× faster TTFT than FP8.
Usage: /multiturn-benchmark [baseline|fp8|tq4bit] [--num-clients N] [--num-rounds N]
allowed-tools: Bash, Read, Grep, Glob
---

# Multi-Turn KV Cache Compression Benchmarking

**Tags**: vllm, kv-cache, compression, turboquant, fp8, multi-turn, benchmark, prefix-caching
**Model**: MiniMax-M2.7
**Hardware**: MI300X/MI355X (ROCm)

## Overview

This skill covers benchmarking KV cache compression strategies (BF16, FP8, TurboQuant 4-bit) on vLLM with multi-turn workloads. The goal is to find scenarios where compression techniques outperform baseline by creating memory pressure.

## Key Concepts

### When KV Cache Compression Shines

- Compression benefits appear when **memory is the bottleneck**
- Need: `total_tokens > GPU_capacity / compression_ratio`
- Under light load, baseline wins (no memory pressure, compression has overhead)
- Under heavy load, compression wins (avoids cache eviction)

### Cache Hit Rate

- `cache_hit_rate = cached_tokens / prompt_tokens`
- High rate (90%+) = KV cache reused, fast TTFT
- Low rate (<20%) = cache evicted, must recompute, slow TTFT
- Per-round rate should increase in later rounds (history cached)
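In code, the formula above is a one-liner (a minimal sketch; in the real benchmark the `cached_tokens` value comes from the server's prompt-token details):

```python
def cache_hit_rate(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of the prompt served from the prefix cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0

# Illustrative values: a 34k-token prompt with most of the history cached
# vs. only the common prefix cached.
print(f"{cache_hit_rate(32_000, 34_000):.1%}")  # → 94.1% (cache reused, fast TTFT)
print(f"{cache_hit_rate(2_000, 34_000):.1%}")   # → 5.9%  (evicted, must recompute)
```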

### Memory Capacity Calculation

```
KV cache per token = 2 (K+V) × num_kv_heads × head_dim × dtype_bytes × num_layers

For MiniMax-M2.7 with TP=2:
- Layers: 62, KV heads: 8 (4 per GPU), Head dim: 128
- BF16: 127 KB/token
- FP8 (2×): 63.5 KB/token
- 4-bit (4×): 31.75 KB/token

Capacity at 0.6 util on 288GB GPU (simulating 192GB):
- BF16: ~756k tokens
- FP8: ~1.5M tokens
- 4-bit: ~3M tokens
```
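The arithmetic above can be sketched in Python. Figures are per GPU (4 of the 8 KV heads live on each device under TP=2, decimal KB/GB); the ~96 GB KV budget is an assumption about what remains for cache after weights at 0.6 util on a 288GB device:

```python
def kv_bytes_per_token(kv_heads=4, head_dim=128, layers=62, dtype_bytes=2.0):
    # 2 accounts for the separate K and V planes per layer
    return 2 * kv_heads * head_dim * dtype_bytes * layers

bf16 = kv_bytes_per_token(dtype_bytes=2.0)   # 126,976 B ≈ 127 KB/token
fp8  = kv_bytes_per_token(dtype_bytes=1.0)   # ≈ 63.5 KB/token
tq4  = kv_bytes_per_token(dtype_bytes=0.5)   # ≈ 31.75 KB/token

kv_budget = 96e9  # assumed per-GPU bytes left for KV cache
for name, per_tok in [("BF16", bf16), ("FP8", fp8), ("TQ 4-bit", tq4)]:
    print(name, round(kv_budget / per_tok / 1e3), "k tokens")
# → BF16 756, FP8 1512, TQ 4-bit 3024 (k tokens), matching the capacities above
```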

## Directory Structure

```
benchmarks/multi_turn_tq/
├── run_benchmark.sh # Main wrapper script
├── bench_multiturn_enhanced.py # Python benchmark with per-round metrics
├── compare_results.py # Compare multiple result files
├── results/multiturn/ # JSON result files
├── logs/ # Server and benchmark logs
├── BENCHMARK_REPORT.md # Latest results report
└── SKILL_MULTITURN_BENCHMARK.md # This file
```

## Key Scripts

### run_benchmark.sh

Launches vLLM server and runs benchmark. Key environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| `GPU_MEMORY_UTIL` | Fraction of GPU memory to use | 0.9 |
| `MAX_MODEL_LEN` | Max context length | 8192 |
| `OUTPUT_TOKENS` | Max output tokens per round | 100 |
| `SUB_QUESTION_TOKENS` | Input tokens per follow-up round | 200 |
| `ATTENTION_BACKEND` | Attention backend (empty for auto) | - |
| `KV_SKIP_LAYERS` | Layers to skip for KV quantization | - |
| `VLLM_TQ_DECODE_V3` | Enable TurboQuant decode v3 | - |

Command line args: `--kv-cache-dtype`, `--tag`, `--num-clients`, `--num-rounds`, `--common-prefix`, `--prefix-tokens`, `--port`, `--skip-server`

### bench_multiturn_enhanced.py

Python benchmark that:

- Runs multi-turn conversations with round barrier mode
- Captures actual `cached_tokens` from server (requires `--enable-prompt-tokens-details`)
- Records per-round TTFT and cache hit rate
- Uses actual model responses in history (critical for prefix caching!)

## Important Fixes Applied

### 1. History Must Use Actual Response

**Bug**: Original code used placeholder `"[Response for round N]"` instead of actual model output.
**Impact**: Prefix caching fails because history doesn't match cached KV.
**Fix**: Store `result.generated_text` and use it in history.
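A minimal sketch of the bug and fix (function and variable names here are hypothetical, not the actual benchmark code):

```python
# Prefix caching only hits when the history re-sent in round N+1 is
# byte-identical to the prompt the server already holds KV for.
history = []

def build_prompt(history, question):
    return "\n".join(history + [f"User: {question}"])

def on_round_done(question, generated_text, round_idx, use_placeholder=False):
    history.append(f"User: {question}")
    if use_placeholder:
        # Bug: the server never generated this text, so the next
        # round's prompt diverges from the cached KV and misses.
        history.append(f"Assistant: [Response for round {round_idx}]")
    else:
        # Fix: replay the exact text the server generated.
        history.append(f"Assistant: {generated_text}")
```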

### 2. TurboQuant Server Flags

```bash
VLLM_TQ_DECODE_V3=1 # Enable decode v3
KV_SKIP_LAYERS="" # Pass --kv-cache-dtype-skip-layers ""
# Do NOT set ATTENTION_BACKEND - let TQ auto-select
```

### 3. TurboQuant Memory Overhead

Known bug: TQ uses ~22GB more than expected. Workaround: increase `GPU_MEMORY_UTIL` by ~0.08 for TQ runs.

## Parameter Tuning Guide

### To Create Memory Pressure

Increase total tokens until baseline starts evicting:

```
total_tokens = num_clients × tokens_per_client
tokens_per_client = common_prefix + unique_prefix + rounds × (output + input)
```
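Plugging this report's configuration into the formula is a quick sanity check on why FP8 evicts while TQ 4-bit does not:

```python
num_clients, rounds = 40, 8
common_prefix, unique_prefix = 2_000, 32_000
input_per_round, output_per_round = 2_000, 200

tokens_per_client = common_prefix + unique_prefix + rounds * (output_per_round + input_per_round)
total_tokens = num_clients * tokens_per_client
print(total_tokens)  # → 2064000: above BF16 (~756k) and FP8 (~1.5M) capacity, below TQ 4-bit (~3M)
```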

### To See Gradual Degradation

Find settings where:

- Round 0-2: All fit
- Round 3-5: FP8 starts evicting, TQ still fits
- Round 6+: FP8 heavily evicting, TQ starts evicting

### Example Configurations

**Light load (no pressure)**:

- 16 clients, 8k prefix, 5 rounds → All methods similar

**Medium load (BF16 evicts)**:

- 40 clients, 16k prefix, 10 rounds → BF16 degrades, FP8/TQ OK

**Heavy load (FP8 evicts)**:

- 40 clients, 32k prefix, 8 rounds, 2k input/round → FP8 degrades, TQ wins

## Reproduction Commands

### Quick 3-Way Comparison

```bash
cd benchmarks/multi_turn_tq

# Run all 3 in parallel on different GPU pairs
HIP_VISIBLE_DEVICES=4,5 GPU_MEMORY_UTIL=0.6 MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 SUB_QUESTION_TOKENS=2000 ATTENTION_BACKEND=ROCM_AITER_FA \
./run_benchmark.sh --kv-cache-dtype auto --tag test_baseline \
--num-clients 40 --num-rounds 8 --common-prefix 2000 --prefix-tokens 32000 --port 6789 &

HIP_VISIBLE_DEVICES=2,3 GPU_MEMORY_UTIL=0.6 MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 SUB_QUESTION_TOKENS=2000 ATTENTION_BACKEND=ROCM_AITER_FA \
./run_benchmark.sh --kv-cache-dtype fp8_e4m3 --tag test_fp8 \
--num-clients 40 --num-rounds 8 --common-prefix 2000 --prefix-tokens 32000 --port 6791 &

HIP_VISIBLE_DEVICES=6,7 GPU_MEMORY_UTIL=0.68 MAX_MODEL_LEN=80000 \
OUTPUT_TOKENS=200 SUB_QUESTION_TOKENS=2000 VLLM_TQ_DECODE_V3=1 KV_SKIP_LAYERS="" \
./run_benchmark.sh --kv-cache-dtype turboquant_4bit_nc --tag test_tq4bit \
--num-clients 40 --num-rounds 8 --common-prefix 2000 --prefix-tokens 32000 --port 6790 &

wait

# Compare
python compare_results.py results/multiturn/results_test_*.json
```

### Simulating Different GPU Sizes

```bash
# 288GB GPU at 0.6 util ≈ 173GB (simulates 192GB)
# 288GB GPU at 0.5 util ≈ 144GB (simulates 160GB)
# 288GB GPU at 0.4 util ≈ 115GB (simulates 128GB)
```
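These figures follow from `usable = 288 GB × util`, mapped to the GPU size whose default 0.9 util yields the same usable memory:

```python
physical = 288  # GB on the actual device
for util in (0.6, 0.5, 0.4):
    usable = physical * util
    print(f"{util}: {usable:.0f} GB usable ≈ {usable / 0.9:.0f} GB GPU at 0.9 util")
# → 0.6: 173 GB usable ≈ 192 GB GPU at 0.9 util
# → 0.5: 144 GB usable ≈ 160 GB GPU at 0.9 util
# → 0.4: 115 GB usable ≈ 128 GB GPU at 0.9 util
```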

## Interpreting Results

### Good Result Pattern

```
Per-Round Cache Hit Rate:
Round | BF16 | FP8 | TQ
0 | 5.8% | 5.8% | 5.8% <- All same (cold start)
1 | 10% | 73% | 94% <- TQ >> FP8 >> BF16
2 | 7% | 48% | 94% <- FP8 degrading, TQ stable
...
7 | 4% | 13% | 95% <- TQ maintains high hit rate
```

### Warning Signs

- Cache hit rate same across all methods → not enough memory pressure
- TQ cache hit dropping → exceeded even TQ capacity
- FP8 similar to TQ → workload too light for 4-bit advantage

## Troubleshooting

### Server Fails to Start

- Check logs in `logs/server_*.log`
- Verify GPU memory available: `rocm-smi --showmeminfo vram`
- Kill stale processes: `pkill -9 -f "vllm serve"`

### Low Cache Hit Despite Multi-Turn

- Verify history uses actual model response (not placeholder)
- Check `--enable-prompt-tokens-details` flag on server
- Ensure `--enable-prefix-caching` is set

### OOM Errors

- Reduce `--num-clients` or `--prefix-tokens`
- Lower `GPU_MEMORY_UTIL`
- Check for zombie GPU processes: `rocm-smi --showpids`

## Files Reference

| File | Purpose |
|------|---------|
| `results_*.json` | Raw benchmark results with per-round metrics |
| `all_results.jsonl` | Appended results for tracking |
| `server_*.log` | vLLM server output, includes cache stats |
| `benchmark_*.log` | Benchmark script output |