1 change: 1 addition & 0 deletions docs/source/deployment/frameworks/index.md
@@ -10,6 +10,7 @@ chatbox
dify
dstack
helm
litellm
lobe-chat
lws
modal
75 changes: 75 additions & 0 deletions docs/source/deployment/frameworks/litellm.md
@@ -0,0 +1,75 @@
(deployment-litellm)=

# LiteLLM

[LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format (Bedrock, Hugging Face, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.).

LiteLLM manages:

- Translating inputs to a provider's `completion`, `embedding`, and `image_generation` endpoints
- [Consistent output](https://docs.litellm.ai/docs/completion/output): text responses are always available at `['choices'][0]['message']['content']`
- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) via the [Router](https://docs.litellm.ai/docs/routing); see the sketch below
- Setting budgets and rate limits per project, API key, and model with the [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/simple_proxy)

LiteLLM supports all models served by vLLM.
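
As a brief illustration of the Router bullet above, here is a minimal sketch that load-balances two vLLM deployments behind a single alias. The host names and the `my-qwen` alias are assumptions for the example, not values from this guide:

```python
from litellm import Router

# Two vLLM servers registered under the same alias; LiteLLM retries/falls back
# between them. Replace the hosts with your own deployments.
router = Router(
    model_list=[
        {
            "model_name": "my-qwen",  # alias used by clients
            "litellm_params": {
                "model": "hosted_vllm/qwen/Qwen1.5-0.5B-Chat",
                "api_base": "http://vllm-host-1:8000/v1",
            },
        },
        {
            "model_name": "my-qwen",  # same alias -> second deployment
            "litellm_params": {
                "model": "hosted_vllm/qwen/Qwen1.5-0.5B-Chat",
                "api_base": "http://vllm-host-2:8000/v1",
            },
        },
    ]
)

response = router.completion(
    model="my-qwen",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response)
```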

## Prerequisites

- Set up the vLLM and LiteLLM environment:

```console
pip install vllm litellm
```

## Deploy

### Chat completion

- Start the vLLM server with a supported chat completion model, e.g.:

```console
vllm serve qwen/Qwen1.5-0.5B-Chat
```

- Call it with LiteLLM:

```python
import litellm

messages = [{"role": "user", "content": "Hello, how are you?"}]

# The `hosted_vllm` prefix is required; it tells LiteLLM to route the request to a vLLM server.
response = litellm.completion(
    model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat",  # pass the model name you served with vLLM
    messages=messages,
    api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
    temperature=0.2,
    max_tokens=80,
)

print(response)
```
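
If you want to stream tokens instead, `litellm.completion` also accepts `stream=True` and yields OpenAI-style chunks. A minimal sketch, assuming the same placeholder server address as above:

```python
import litellm

# Same placeholder api_base as above; replace with your vLLM server address.
response = litellm.completion(
    model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
    stream=True,
)

for chunk in response:
    # Each chunk follows the OpenAI streaming format; `content` may be None.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```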

### Embeddings

- Start the vLLM server with a supported embedding model, e.g.:

```console
vllm serve BAAI/bge-base-en-v1.5
```

- Call it with LiteLLM:

```python
import os

from litellm import embedding

os.environ["HOSTED_VLLM_API_BASE"] = "http://{your-vllm-server-host}:{your-vllm-server-port}/v1"

# The `hosted_vllm` prefix tells LiteLLM to route the request to a vLLM server;
# the rest is the model name you passed to `vllm serve`.
response = embedding(model="hosted_vllm/BAAI/bge-base-en-v1.5", input=["Hello world"])

print(response)
```
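
The returned object mirrors the OpenAI embeddings format, so the vector itself should be retrievable from the `data` field. A small sketch, assuming the `response` object from above (the exact accessor may vary between LiteLLM versions):

```python
# Assumes `response` is the embedding response returned above.
vector = response["data"][0]["embedding"]
print(f"Embedding dimension: {len(vector)}")
```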

For details, see the tutorial [Using vLLM in LiteLLM](https://docs.litellm.ai/docs/providers/vllm).