
Commit c32b6fd

Derrick/bt 12247 add llama 3.2 11b (#358)
Added Llama 3.2 Vision Instruct with vLLM
1 parent b283c55 commit c32b6fd

File tree: 5 files changed, +404 -0 lines changed
@@ -0,0 +1,71 @@
# Llama 3.2 11B Vision Instruct vLLM Truss

This is a [Truss](https://truss.baseten.co/) for Llama 3.2 11B Vision Instruct with vLLM. Llama 3.2 11B Vision Instruct is a multimodal (text + vision) LLM. This README will walk you through how to deploy this Truss on Baseten to get your own instance of it.

## Deployment

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples/
cd truss-examples/llama/llama-3_2-11b-vision-instruct
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
3. Apply for access to the Llama 3.2 11B Vision Instruct model on Hugging Face [here](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).
4. Retrieve your Hugging Face token from the [settings](https://huggingface.co/settings/tokens) page.
5. Set your Hugging Face token as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_token`. Note that you will *not* be able to successfully deploy this model without doing this.

With `llama-3_2-11b-vision-instruct` as your working directory, you can deploy the model with:

```sh
truss push --publish --trusted
```

Paste your Baseten API key if prompted.

For more information, see the [Truss documentation](https://truss.baseten.co).

### Notes

Due to a limitation in vLLM, at most one image can be passed per request; passing more than one will cause a memory error. You can track the issue [here](https://github.com/vllm-project/vllm/issues/8826).
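If you want to guard against that limit on the client side, you could count the image parts in a payload before sending it. The snippet below is a minimal, hypothetical helper (not part of this Truss); it assumes the OpenAI-style `messages` payload shape used in the examples in this README.

```python
# Hypothetical client-side check: vLLM currently supports only one image per prompt,
# so reject payloads with more than one image before they reach the server.
def count_images(payload: dict) -> int:
    images = 0
    for message in payload.get("messages", []):
        content = message.get("content", [])
        if isinstance(content, list):
            images += sum(1 for part in content if part.get("type") == "image_url")
    return images


def validate_payload(payload: dict) -> None:
    if count_images(payload) > 1:
        raise ValueError("This deployment accepts at most one image per request.")
```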

## Example usage

```sh
truss predict -d '{"messages": [{"role": "user", "content": "Tell me about yourself"}]}'
```

Here is another example of invoking your model via the REST API, this time with image input:

```sh
curl -X POST "https://app.baseten.co/model_versions/YOUR_MODEL_VERSION_ID/predict" \
     -H "Content-Type: application/json" \
     -H 'Authorization: Api-Key {YOUR_API_KEY}' \
     -d '{
           "messages": [
             {
               "role": "user",
               "content": [
                 {
                   "type": "text",
                   "text": "What type of animal is this? Answer in French only"
                 },
                 {
                   "type": "image_url",
                   "image_url": {
                     "url": "https://vetmed.illinois.edu/wp-content/uploads/2021/04/pc-keller-hedgehog.jpg"
                   }
                 }
               ]
             }
           ],
           "stream": true,
           "max_tokens": 64,
           "temperature": 0.2
         }'
```
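The same request can be made from Python. This is a minimal sketch assuming the `requests` library and the same model-version URL and API key placeholders as the curl example above; because `stream` is true, the response body is read incrementally.

```python
import requests

# Placeholders, as in the curl example above.
MODEL_URL = "https://app.baseten.co/model_versions/YOUR_MODEL_VERSION_ID/predict"
API_KEY = "YOUR_API_KEY"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What type of animal is this? Answer in French only"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://vetmed.illinois.edu/wp-content/uploads/2021/04/pc-keller-hedgehog.jpg"
                    },
                },
            ],
        }
    ],
    "stream": True,
    "max_tokens": 64,
    "temperature": 0.2,
}

# Stream the response and print chunks as they arrive.
with requests.post(
    MODEL_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json=payload,
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```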
@@ -0,0 +1,45 @@
model_name: "Llama 3.2 11B Vision Instruct VLLM openai compatible"
python_version: py311
model_metadata:
  example_model_input: {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Describe this image in one sentence."
          },
          {
            type: "image_url",
            image_url: {
              url: "https://picsum.photos/id/237/200/300"
            }
          }
        ]
      }
    ],
    stream: true,
    max_tokens: 512,
    temperature: 0.5
  }
  repo_id: meta-llama/Llama-3.2-11B-Vision-Instruct
  openai_compatible: true
  vllm_config:
    tensor_parallel_size: 1
    enforce_eager: true
    max_num_seqs: 16
    limit_mm_per_prompt: {image: 1}
  tags:
    - text-generation
    - multimodal
requirements:
  - vllm==0.6.2
  - uvloop>=0.18.0
resources:
  accelerator: A100
  use_gpu: true
runtime:
  predict_concurrency: 128
secrets:
  hf_access_token: null
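For reference, the fields under `vllm_config` correspond to standard vLLM engine options. The sketch below shows roughly how such a dictionary could be applied when constructing an async engine directly; it is illustrative only, and the actual wiring lives in the Truss server code, which is not part of this diff.

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Illustrative only: map the config.yaml values onto vLLM engine arguments.
vllm_config = {
    "tensor_parallel_size": 1,
    "enforce_eager": True,
    "max_num_seqs": 16,
    "limit_mm_per_prompt": {"image": 1},
}

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    **vllm_config,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```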

llama/llama-3_2-11b-vision-instruct/model/__init__.py

Whitespace-only changes.
@@ -0,0 +1,57 @@
import asyncio
import logging
import os
import threading

import httpx

logger = logging.getLogger(__name__)

DEFAULT_HEALTH_CHECK_INTERVAL = 5  # seconds


async def monitor_vllm_server_health(vllm_server_url, health_check_interval):
    assert vllm_server_url is not None, "vllm_server_url must not be None"
    try:
        async with httpx.AsyncClient() as client:
            while True:
                response = await client.get(f"{vllm_server_url}/health")
                if response.status_code != 200:
                    raise RuntimeError("vLLM is unhealthy")
                await asyncio.sleep(health_check_interval)
    except Exception as e:
        logging.error(
            f"vLLM has gone into an unhealthy state due to error: {e}, restarting service now..."
        )
        os._exit(1)


async def monitor_vllm_engine_health(vllm_engine, health_check_interval):
    assert vllm_engine is not None, "vllm_engine must not be None"
    try:
        while True:
            await vllm_engine.check_health()
            await asyncio.sleep(health_check_interval)
    except Exception as e:
        logging.error(
            f"vLLM has gone into an unhealthy state due to error: {e}, restarting service now..."
        )
        os._exit(1)


def run_background_vllm_health_check(
    use_openai_compatible_server=False,
    health_check_interval=DEFAULT_HEALTH_CHECK_INTERVAL,
    vllm_engine=None,
    vllm_server_url=None,
):
    logger.info("Starting background health check loop")
    loop = asyncio.new_event_loop()
    if use_openai_compatible_server:
        loop.create_task(
            monitor_vllm_server_health(vllm_server_url, health_check_interval)
        )
    else:
        loop.create_task(monitor_vllm_engine_health(vllm_engine, health_check_interval))
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
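This helper is meant to be started once when the model server comes up. A hypothetical call site in a Truss model's `load()` method might look like the following; the class, import path, and server URL below are illustrative, not part of this commit.

```python
from model import helpers  # hypothetical import path for the health-check module above


class Model:
    """Illustrative Truss model showing where the health check could be started."""

    def __init__(self, **kwargs):
        self._vllm_server_url = "http://localhost:8000"  # assumed local vLLM server address

    def load(self):
        # ... start the vLLM OpenAI-compatible server here ...
        helpers.run_background_vllm_health_check(
            use_openai_compatible_server=True,
            vllm_server_url=self._vllm_server_url,
        )
```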
