Conversation

@mgoin (Contributor) commented Jun 7, 2024

vLLM is a high-throughput and memory-efficient open-source serving engine for LLMs.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: FP8, GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
  • Optimized CUDA kernels
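
For context, here is a minimal offline-inference sketch using vLLM's Python API (`LLM` and `SamplingParams`); the model name is only an illustrative placeholder:

```python
# Minimal sketch of offline inference with vLLM.
# The model name below is just an illustrative example.
from vllm import LLM, SamplingParams

prompts = ["Explain PagedAttention in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model and allocates the paged KV cache on the GPU.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Incoming requests are continuously batched under the hood.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```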

Transparent logo: vllm-logo-text-light (attached image)

@coyotte508 requested a review from ngxson as a code owner November 14, 2024 22:22
@julien-c (Member) commented Feb 5, 2025

This was done in #693, closing.

@julien-c closed this Feb 5, 2025