docs/source/overview.md: 20 changes (10 additions, 10 deletions)
@@ -33,18 +33,18 @@ TensorRT LLM supports the latest and most popular LLM [architectures](https://nv
 TensorRT LLM strives to support the most popular models on **Day 0**.
 
 ### 🚀 **Advanced Optimization & Production Features**
-- **In-Flight Batching & Paged Attention**: {ref}`inflight-batching` eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
-- **Multi-GPU Multi-Node Inference**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
-- **Advanced Quantization**:
+- **[In-Flight Batching & Paged Attention](./features/paged-attention-ifb-scheduler.md)**: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
+- **[Multi-GPU Multi-Node Inference](./features/parallel-strategy.md)**: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
+- **[Advanced Quantization](./features/quantization.md)**:
   - **FP4 Quantization**: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
   - **FP8 Quantization**: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
-- **Speculative Decoding**: Multiple algorithms including EAGLE, MTP and NGram
-- **KV Cache Management**: Paged KV cache with intelligent block reuse and memory optimization
-- **Chunked Prefill**: Efficient handling of long sequences by splitting context into manageable chunks
-- **LoRA Support**: Multi-adapter support with HuggingFace and NeMo formats, efficient fine-tuning and adaptation
-- **Checkpoint Loading**: Flexible model loading from various formats (HuggingFace, NeMo, custom)
-- **Guided Decoding**: Advanced sampling with stop words, bad words, and custom constraints
-- **Disaggregated Serving (Beta)**: Separate context and generation phases across different GPUs for optimal resource utilization
+- **[Speculative Decoding](./features/speculative-decoding.md)**: Multiple algorithms including EAGLE, MTP and NGram
+- **[KV Cache Management](./features/kvcache.md)**: Paged KV cache with intelligent block reuse and memory optimization
+- **[Chunked Prefill](./features/paged-attention-ifb-scheduler.md)**: Efficient handling of long sequences by splitting context into manageable chunks
+- **[LoRA Support](./features/lora.md)**: Multi-adapter support with HuggingFace and NeMo formats, efficient fine-tuning and adaptation
+- **[Checkpoint Loading](./features/checkpoint-loading.md)**: Flexible model loading from various formats (HuggingFace, NeMo, custom)
+- **[Guided Decoding](./features/guided-decoding.md)**: Advanced sampling with stop words, bad words, and custom constraints
+- **[Disaggregated Serving (Beta)](./features/disagg-serving.md)**: Separate context and generation phases across different GPUs for optimal resource utilization
 
 ### 🔧 **Latest GPU Architecture Support**
 
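For orientation on the features this hunk links to: in-flight batching, the paged KV cache, and parallelism are all exercised through TensorRT LLM's high-level `LLM` API. Below is a minimal sketch assuming a released `tensorrt_llm` wheel; the model ID and sampling values are illustrative placeholders, not taken from this PR.

```python
# Minimal sketch of the LLM API behind the feature docs linked above.
# Assumption: the model ID below is a hypothetical example checkpoint.
from tensorrt_llm import LLM, SamplingParams

def main() -> None:
    # In-flight batching and the paged KV cache are managed by the runtime;
    # setting tensor_parallel_size > 1 enables multi-GPU tensor parallelism.
    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative placeholder
        tensor_parallel_size=1,
    )
    sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["What is in-flight batching?"], sampling_params)
    for output in outputs:
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```

The same entry point accepts quantized checkpoints and the speculative-decoding and guided-decoding options described in the linked feature pages; see those pages for the exact configuration knobs.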