# AI Infra Learning Sessions

This repository organizes the materials, recordings, and schedules for a series of AI-infra learning sessions.

| Topic | Date | Pre-study Materials | Recording | Docs | Feedback & Review Questions |
| --- | --- | --- | --- | --- | --- |
| vLLM Quickstart | 2025-05-11 | Doc: vLLM | AI INFRA Learning 01 - LLM Landscape Overview / vLLM Quickstart | 01-vllm-quickstart | |
| PagedAttention | 2025-05-25 | Blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention<br>Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention<br>Video: Fast LLM Serving with vLLM and PagedAttention | AI INFRA Learning 02 - vLLM PagedAttention Paper Deep Dive | 02-pagedattention | 02-PagedAttention Feedback |
| Prefix Caching | 2025-06-08 | Doc: Automatic Prefix Caching<br>Design Doc: Automatic Prefix Caching<br>Paper: SGLang: Efficient Execution of Structured Language Model Programs | AI INFRA Learning 03 - Prefix Caching Explained | 03-prefix-caching | |
| Speculative Decoding | 2025-06-22 | Doc: Speculative Decoding<br>Blog: How Speculative Decoding Boosts vLLM Performance by up to 2.8x<br>Video: Hacker's Guide to Speculative Decoding in VLLM<br>Video: Speculative Decoding in vLLM<br>Paper: Accelerating Large Language Model Decoding with Speculative Sampling<br>Paper: Fast Inference from Transformers via Speculative Decoding | AI INFRA Learning 04 - Speculative Decoding Implementation Approaches | 04-speculative-decoding | |
| Chunked-Prefills | 2025-07-13 | Doc: vLLM Chunked Prefill<br>Paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills<br>Paper: DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference<br>Paper: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | AI INFRA Learning 05 - Chunked-Prefills | 05-chunked-prefills | 05-Chunked-Prefills Feedback & Review Questions |
| Disaggregating Prefill and Decoding | 2025-09-21 | Doc: Disaggregated Prefilling<br>Doc: vLLM Production Stack Disaggregated Prefill<br>Paper: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving<br>Paper: Splitwise: Efficient generative LLM inference using phase splitting<br>Video: vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM | AI INFRA Learning 06 - Prefill/Decode (PD) Disaggregation Architecture Deep Dive | 06-disaggregating-prefill-and-decoding | 06-PD Disaggregation Feedback |
| LoRA Adapters | | Doc: LoRA Adapters<br>Paper: LoRA: Low-Rank Adaptation of Large Language Models | | | |
| Quantization | | | | | |
| Distributed Inference and Serving | | Doc: Distributed Inference and Serving | | | |
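
As a companion to session 01 (vLLM Quickstart), here is a minimal offline-inference sketch using vLLM's Python API. The model name is an arbitrary example, and defaults vary across vLLM releases, so treat this as a starting point rather than a reference configuration.

```python
# Minimal vLLM offline-inference sketch (companion to session 01).
# Assumes vLLM is installed (pip install vllm) and a GPU is available;
# the model name below is an arbitrary example.
from vllm import LLM, SamplingParams

prompts = ["The key idea behind PagedAttention is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```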
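
Several sessions above revolve around engine features that can be toggled directly on the `LLM` constructor. A sketch of how those flags fit together is below; flag names and defaults have shifted across vLLM versions, so check the docs for the release you run.

```python
# Illustrative engine configuration touching on sessions 03 (prefix caching),
# 05 (chunked prefills), quantization, and distributed inference.
# Flag availability and defaults vary by vLLM version.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # arbitrary example model
    enable_prefix_caching=True,          # session 03: reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,         # session 05: split long prefills into scheduler-friendly chunks
    tensor_parallel_size=2,              # distributed serving: shard weights across 2 GPUs
    # quantization="awq",                # quantization session: requires an AWQ-quantized checkpoint
)
```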
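
For the LoRA Adapters session, a sketch of attaching an adapter at inference time via vLLM's `LoRARequest` follows; the adapter name and local path are hypothetical placeholders.

```python
# Sketch of serving a LoRA adapter with vLLM (LoRA Adapters session).
# The adapter name and local path are hypothetical placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Summarize LoRA in one sentence."],
    SamplingParams(max_tokens=48),
    lora_request=LoRARequest("example_adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```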

## Discussion Group (please note your purpose when requesting to join)

## WeChat Official Account

