
Commit 694ea34

Merge pull request #33 from vast-ai/update-serverless-docs-lucas
Update Serverless docs for deprecations and new SDK
2 parents 87b6de6 + 598e96a commit 694ea34

Showing 11 changed files with 310 additions and 328 deletions.

documentation/serverless/create-endpoints-and-workergroups.mdx

Lines changed: 3 additions & 17 deletions
@@ -106,11 +106,10 @@ curl --location 'https://console.vast.ai/api/v0/endptjobs/' \
 
 - `api_key`(string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an Authorization: Bearer.
 - `endpoint_name`(string): The name of the Endpoint that the Workergroup will be created under.
-
+OR
+- 'endpoint_id' (integer): The id of the Endpoint that the Workergroup will be created under.
 AND one of the following:
 
-
-
 - `template_hash` (string): The hexadecimal string that identifies a particular template.
 
 OR
@@ -125,24 +124,15 @@ OR
 - `launch_args` (string): A command-line style string containing additional parameters for instance creation that will be parsed and applied when the autoscaler creates new workers. This allows you to customize instance configuration beyond what's specified in templates.
 
 
-
 **Optional** (Default values will be assigned if not specified):
 
-- `min_load`(integer): A minimum baseline load (measured in tokens/second for LLMs) that the serverless engine will assume your Endpoint needs to handle, regardless of actual measured traffic. Default value is 1.0.
-- `target_util` (float): A ratio that determines how much spare capacity (headroom) the serverless engine maintains. Default value is 0.9.
-- `cold_mult`(float): A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. Default value is 3.0.
-- `test_workers` (integer): The number of different physical machines that a Workergroup should test during its initial "exploration" phase to gather performance data before transitioning to normal demand-based scaling. Default value is 3.
 - `gpu_ram` (integer): The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs. Default value is 24.
 
 ```json JSON icon="js"
 {
   "api_key": "YOUR_VAST_API_KEY",
   "endpoint_name": "YOUR_ENDPOINT_NAME",
   "template_hash": "YOUR_TEMPLATE_HASH",
-  "min_load": 1.0,
-  "target_util": 0.9,
-  "cold_mult": 3.0,
-  "test_workers": 3,
   "gpu_ram": 24
 }
 ```
@@ -185,10 +175,6 @@ curl --location 'https://console.vast.ai/api/v0/workergroups/' \
 --data '{
   "endpoint_name": "MY_ENDPOINT",
   "template_hash": "MY_TEMPLATE_HASH",
-  "min_load": 10,
-  "target_util": 0.9,
-  "cold_mult": 3.0,
-  "test_workers": 3,
   "gpu_ram": 24
 }'
 ```
@@ -197,7 +183,7 @@ curl --location 'https://console.vast.ai/api/v0/workergroups/' \
 
 
 ```none Terminal
-vastai create workergroup --endpoint_name "MY_ENDPOINT" --template_hash "MY_TEMPLATE_HASH" --min_load 10 --target_util 0.9 --cold_mult 3.0 --test_workers 3 --gpu_ram 24
+vastai create workergroup --endpoint_name "MY_ENDPOINT" --template_hash "MY_TEMPLATE_HASH" --gpu_ram 24
 ```
 
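For reference, the Workergroup-creation call documented in this file can also be made from Python. The sketch below uses the new `endpoint_id` field together with the `requests` library; the id value is a placeholder, and the API key is read from `VAST_API_KEY` and sent as the Authorization: Bearer header that the docs above allow.

```python
# Minimal sketch: create a Workergroup under an Endpoint identified by id.
# The endpoint_id value is a placeholder; template_hash and gpu_ram follow the
# parameters documented in this file.
import os
import requests

payload = {
    "endpoint_id": 12345,                    # hypothetical Endpoint id
    "template_hash": "YOUR_TEMPLATE_HASH",
    "gpu_ram": 24,
}

resp = requests.post(
    "https://console.vast.ai/api/v0/workergroups/",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['VAST_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```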

documentation/serverless/debugging.mdx

Lines changed: 0 additions & 1 deletion
@@ -50,7 +50,6 @@ echo "${WORKSPACE_DIR:-/workspace}/pyworker.log"
 
 To handle high load on instances:
 
-- **Set **`test_workers`** high**: Create more instances initially for Worker Groups with anticipated high load.
 - **Adjust **`cold_workers`: Keep enough workers around to prevent them from being destroyed during low initial load.
 - **Increase **`cold_mult`: Quickly create instances by predicting higher future load based on current high load. Adjust back down once enough instances are created.
 - **Check **`max_workers`: Ensure this parameter is set high enough to create the necessary number of workers.

documentation/serverless/getting-started-with-serverless.mdx

Lines changed: 27 additions & 156 deletions
@@ -150,9 +150,6 @@ Before we start, there are a few things you will need:
 
 For our simple setup, we can enter the following values:
 
-- Cold Multiplier = 3
-- Minimum Load = 1
-- Target Utilization = 0.9
 - Workergroup Name = 'Workergroup'
 - Select Endpoint = 'vLLM-Qwen3-8B'
 
@@ -167,12 +164,12 @@ Before we start, there are a few things you will need:
 Run the following command to create your Workergroup:
 
 ```sh CLI Command
-vastai create workergroup --endpoint_name "vLLM-DeepSeek" --template_hash "$TEMPLATE_HASH" --test_workers 5
+vastai create workergroup --endpoint_name "vLLM-DeepSeek" --template_hash "$TEMPLATE_HASH" --gpu_ram 24
 ```
 
 `endpoint_name`: The name of the Endpoint.
 `template_hash`: The hash code of our custom vLLM (Serverless) template.
-`test_workers`: The minimum number of workers to create while initializing the Workergroup. This allows the Workergroup to get performance estimates before serving the Endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold").
+`gpu_ram`: The amount of memory (in GB) that you expect your template to load onto the GPU (i.e. model weights).
 
 <Warning>
 You will need to replace "$TEMPLATE\_HASH" with the template hash copied from step 1.
@@ -217,17 +214,12 @@ We have now successfully created a vLLM + Qwen3-8B Serverless Engine! It is read
 
 # Using the Serverless Engine
 
-To fully understand this section, it is recommended to read the [PyWorker Overview](/documentation/serverless/overview). The overview shows how all the pieces related to the serverless engine work together.
-
-The Vast vLLM (Serverless) template we used in the last section already has a client (client.py) written for it. To use this client, we must run commands in a terminal, since there is no UI available for this section. The client, along with all other files the GPU worker is cloning during initialization, can be found in the [Vast.ai Github repo](https://github.com/vast-ai/pyworker/tree/main). For this section, simply clone the entire repo using:
-
-`git clone https://github.com/vast-ai/pyworker.git`
-
-As the User, we want all the files under 'User' to be in our file system. The GPU workers that the system initializes will have the files and entities under 'GPU Worker'.
+To make requests to your endpoint, first install the `vastai_sdk` from pip.
+```sh Bash
+pip install vastai_sdk
+```
 
-<Frame caption="Files and Entities for the user and GPU worker">
-![Files and Entities for the user and GPU worker](/images/getting-started-serverless-8.webp)
-</Frame>
+Make sure you have configured the `VAST_API_KEY` environment variable with your Serverless API key.
 
 ## API Keys
 
@@ -260,155 +252,34 @@ The `show endpoints` command will return a JSON blob like this:
 }
 ```
 
-<Accordion title="Install the TLS certificate \[Optional]">
-## Install the TLS certificate \[Optional]
-
-All of Vast.ai's pre-made Serverless templates use SSL by default. If you want to disable it, you can add `-e USE_SSL=false` to the Docker options in your copy of the template. The Serverless Engine will automatically adjust the instance URL to enable or disable SSL as needed.
-
-1. Download Vast AI's certificate from [here](https://console.vast.ai/static/jvastai_root.cer).
-2. In the Python environment where you're running the client script, execute the following command: `python3 -m certifi`
-3. The command in step 2 will print the path to a file where certificates are stored. Append Vast AI's certificate to that file using the following command: `cat jvastai_root.cer >> PATH/TO/CERT/STORE`
-4. You may need to run the above command with `sudo` if you are not running Python in a virtual environment.
-
-<Note>
-This process only adds Vast AI's TLS certificate as a trusted certificate for Python clients.
+## Usage
 
-For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.
-</Note>
+Create a Python script to send a request to your endpoint:
+```python icon="python" Python
+from vastai import Serverless
+import asyncio
 
-</Accordion>
+async def main():
+    async with Serverless() as client:
+        endpoint = await client.get_endpoint(name="my-endpoint")
 
-<Steps>
-<Step title="Running client.py">
-In client.py, we are first sending a POST request to the [`/route/`](/documentation/serverless/route) endpoint. This sends a request to the serverless engine asking for a ready worker, with a payload that looks like:
-
-    ```javascript Javascript icon="js"
-    route_payload = {
-        "endpoint": "endpoint_name",
-        "api_key": "your_serverless_api_key",
-        "cost": COST,
-    }
-    ```
-
-The `cost` input here tells the serverless engine how much workload to expect for this request, and is *not* related to credits on a Vast.ai account. The engine will reply with a valid worker address, where client.py then calls the `/v1/completions` endpoint with the authentication data returned by the serverless engine and the user's model input text as the payload.
-
-    ```json JSON
-    {
-        "auth_data": {
-            "cost": 256.0,
-            "endpoint": "endpoint_name",
-            "reqnum": "req_num",
-            "signature": "signature",
-            "url": "worker_address"
-        },
-        "payload": {
-            "input": {
                 "model": "Qwen/Qwen3-8B",
-                "prompt": "The capital of USA is",
-                "temperature": 0.7,
-                "max_tokens": 256,
-                "top_k": 20,
-                "top_p": 0.4,
-                "stream": false}
+                "prompt" : "Who are you?",
+                "max_tokens" : 100,
+                "temperature" : 0.7
             }
         }
-    }
-    ```
-
-The worker hosting the Qwen3-8B model will return the model results to the client, and print them to the user.
-
-To quickly run a basic test of the serverless engine with vLLM, navigate to the `pyworker` directory and run:
-
-```none CLI Command
-pip install -r requirements.txt && \
-python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B" --completion
-```
-
-<Warning>
-client.py is configured to work with a vast.ai API key, not a Serverless API key. Make sure to set the `API_KEY` variable in your environment, or replace it by pasting in your actual key. You only need to install the requirements.txt file on the first run.
-</Warning>
+
+        response = await endpoint.request("/v1/completions", payload)
+        print(response["response"]["choices"][0]["text"])
 
-This should result in a "Ready" worker with the Qwen3-8B model printing a Completion Demo to your terminal window. If we enter the same command without --completion, you will see all of the test modes vLLM has. Because we are testing with Qwen3-8B, all test modes will provide a response (not all LLMs are equipped to use tools).
-
-```bash CLI Command
-python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"
-
-Please specify exactly one test mode:
---completion : Test completions endpoint
---chat : Test chat completions endpoint (non-streaming)
---chat-stream : Test chat completions endpoint with streaming
---tools : Test function calling with ls tool (non-streaming)
---interactive : Start interactive streaming chat session
-```
-</Step>
-
-<Step title="Monitoring Groups">
-There are several endpoints we can use to monitor the status of the serverless engine. To fetch all [Endpoint logs](/documentation/serverless/logs), run the following cURL command:
-
-```bash Bash
-curl https://run.vast.ai/get_endpoint_logs/ \
-  -X POST \
-  -d '{"endpoint" : "vLLM-Qwen3-8B", "api_key" : "$YOUR_SERVERLESS_API_KEY"}' \
-  -H 'Content-Type: application/json'
-```
-
-Similarily, to fetch all [Workergroup logs](/documentation/serverless/logs), execute:
-
-```bash Bash
-curl https://run.vast.ai/get_workergroup_logs/ \
-  -X POST \
-  -d '{"id" : WORKERGROUP_ID, "api_key" : "$YOUR_SERVERLESS_API_KEY"}' \
-  -H 'Content-Type: application/json'
-```
-
-All Endpoints and Workergroups continuously track their performance over time, which is sent to the serverless engine as metrics. To see Workergroup metrics, run the following:
-
-```bash Bash
-curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "start_date": 1749672382.157,
-    "end_date": 1749680792.188,
-    "step": 500,
-    "type": "autogroup",
-    "metrics": [
-      "capacity",
-      "curload",
-      "nworkers",
-      "nrdy_workers_",
-      "reliable",
-      "reqrate",
-      "totreqs",
-      "perf",
-      "nrdy_soon_workers_",
-      "model_disk_usage",
-      "reqs_working"
-    ],
-    "resource_id": '"${workergroup_id}"'
-  }'
-
-```
-
-These metrics are displayed in a Workergroup's UI page.
-</Step>
-
-<Step title="Load Testing">
-In the Github repo that we cloned earlier, there is a load testing script called `workers/openai/test_load.py`. The *-n* flag indicates the total number of requests to send to the serverless engine, and the *-rps* flag indicates the rate (requests/second). The script will print out statistics that show metrics like:
-
-- Total requests currently being generated
-- Number of successful generations
-- Number of errors
-- Total number of workers used during the test
-
-To run this script, make sure the python packages from `requirements.txt` are installed, and execute the following command:
-
-```sh SH
-python3 -m workers.openai.test_load -n 100 -rps 1 -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"
-```
-</Step>
-</Steps>
+if __name__ == "__main__":
+    asyncio.run(main())
+```
 
-This is everything you need to start, test, and monitor a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be setup in a similar fashion.
+This is everything you need to start a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be setup in a similar fashion.
 
 
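Because the SDK client shown in the new docs is asynchronous, several prompts can be sent concurrently with `asyncio.gather`. The sketch below reuses only the calls that appear in the diff above (`Serverless`, `get_endpoint`, `endpoint.request`, and the same payload and response shapes); the endpoint name and prompts are placeholders.

```python
# Sketch: fan out several completion requests through one Serverless endpoint.
import asyncio
from vastai import Serverless

async def main():
    async with Serverless() as client:
        endpoint = await client.get_endpoint(name="my-endpoint")

        prompts = ["Who are you?", "What is vLLM?", "Write a haiku about GPUs."]
        tasks = [
            endpoint.request(
                "/v1/completions",
                {"input": {"model": "Qwen/Qwen3-8B", "prompt": p, "max_tokens": 100}},
            )
            for p in prompts
        ]
        for response in await asyncio.gather(*tasks):
            print(response["response"]["choices"][0]["text"])

if __name__ == "__main__":
    asyncio.run(main())
```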

documentation/serverless/inside-a-serverless-gpu.mdx

Lines changed: 2 additions & 1 deletion
@@ -60,7 +60,8 @@ The authentication information returned by [https://run.vast.ai/route/ ](/docume
   "cost": 256,
   "endpoint": "Your-Endpoint-Name",
   "reqnum": 1234567890,
-  "url": "http://worker-ip-address:port"
+  "url": "http://worker-ip-address:port",
+  "request_idx": 10203040
 },
 "payload": {
   "inputs": "What is the answer to the universe?",

documentation/serverless/logs.mdx

Lines changed: 14 additions & 0 deletions
@@ -29,6 +29,20 @@ For both types of groups, there are four levels of logs with decreasing levels o
 Each log level has a fixed size, and once it is full, the log is wiped and overwritten with new log messages. It is good practice to check these regularly while debugging.
 </Warning>
 
+# Using the CLI
+
+You can use the vastai CLI to quickly check endpoint and worker group logs at different log levels.
+
+## Endpoint logs
+```cli CLI Command
+vastai get endpt-logs <endpoint_id> --level (0-3)
+```
+
+## Workergroup logs
+```cli CLI Command
+vastai get wrkgrp-logs <worker_group_id> --level (0-3)
+```
+
 # POST https\://run.vast.ai/get\_endpoint\_logs/
 
 ## Inputs
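The same logs can also be pulled from Python with a plain HTTP POST. A minimal sketch, assuming `VAST_API_KEY` holds your Serverless API key and `my-endpoint` is the Endpoint name; the payload fields mirror the `/get_endpoint_logs/` inputs described in this file.

```python
# Sketch: fetch Endpoint logs from the REST endpoint documented above.
import os
import requests

resp = requests.post(
    "https://run.vast.ai/get_endpoint_logs/",
    json={"endpoint": "my-endpoint", "api_key": os.environ["VAST_API_KEY"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)
```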

documentation/serverless/overview.mdx

Lines changed: 3 additions & 1 deletion
@@ -54,7 +54,8 @@ The Vast PyWorker wraps the backend code of the model instance you are running.
   "cost": 256,
   "endpoint": "Your-TGI-Endpoint-Name",
   "reqnum": 1234567890,
-  "url": "http://worker-ip-address:port"
+  "url": "http://worker-ip-address:port",
+  "request_idx": 10203040
 },
 "payload": {
   "inputs": "What is the answer to the universe?",
@@ -90,6 +91,7 @@ If you are building a custom PyWorker for your own use case, to be able to integ
 
 - Send a message to the serverless system when the backend server is ready (e.g., after model installation).
 - Periodically send performance metrics to the serverless system to optimize usage and performance.
+- Periodically send completed request indices to the serverless system to track request lifetimes.
 - Report any errors to the serverless system.
 
 For example implementations, reference the [Vast PyWorker repository](https://github.com/vast-ai/pyworker/).
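To make the reporting duties listed above concrete, here is a deliberately generic skeleton of a periodic reporting loop. It is not the PyWorker API: the URL, payload fields, and the `completed_request_idxs` list are hypothetical placeholders; see the repository linked above for the real implementation.

```python
# Hypothetical skeleton only: the URL and payload fields are placeholders, not
# the real PyWorker interfaces.
import asyncio
import aiohttp

REPORT_URL = "https://example.invalid/report"   # placeholder reporting address
completed_request_idxs: list[int] = []          # filled by your request handler

async def report_loop(interval_s: float = 10.0) -> None:
    async with aiohttp.ClientSession() as session:
        while True:
            payload = {
                "perf": 0.0,                                      # measured throughput
                "completed_idxs": list(completed_request_idxs),   # lets the engine close out requests
            }
            async with session.post(REPORT_URL, json=payload) as resp:
                resp.raise_for_status()
            completed_request_idxs.clear()
            await asyncio.sleep(interval_s)
```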

documentation/serverless/route.mdx

Lines changed: 5 additions & 2 deletions
@@ -20,6 +20,7 @@ description: Learn how to use the /route/ endpoint to retrieve a GPU instance ad
 }} />
 
 The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance address within your Endpoint.
+Request lifetimes are tracked with a `request_idx`. If you wish to retry a request that failed without incurring additional load, you may use the `request_idx` to do so.
 
 # POST https\://run.vast.ai/route/
 
@@ -28,12 +29,13 @@ The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance
 - `endpoint`(string): Name of the Endpoint.
 - `api_key`(string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an Authorization: Bearer.
 - `cost`(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker. The serverless engine uses the cost as an estimate of the request's workload, and can scale GPU instances to ensure the Endpoint has the proper compute capacity.
-
+- `request_idx`(int): A unique request index that tracks the lifetime of a single request. You don't need it for the first request, but you must pass one in to retry a request.
 ```json JSON icon="js"
 {
   "endpoint": "YOUR_ENDPOINT_NAME",
   "api_key": "YOUR_VAST_API_KEY",
-  "cost": 242.0
+  "cost": 242.0,
+  "request_idx": 2421 # Only if retrying
 }
 ```
 
@@ -46,6 +48,7 @@ The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance
 - `signature`(string): The signature is a cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
 - `endpoint`(string): Same as the input parameter.
 - `cost`(float): Same as the input parameter.
+- `request_idx`(int): If it's a new request, check this field to get your request_idx. Use this in calls to route if you wish to "retry" this request (in case of failure).
 - `__request_id`(string): The \_\_request\_id is a unique string identifier generated by the server for each individual API request it receives. This ID is created at the start of processing the request and included in the response, allowing for distinct tracking and logging of every transaction.
 
 ```json JSON icon="js"
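Putting the new fields together, a retry flow looks roughly like this. It is a sketch, not an official client: it only uses the `/route/` inputs and outputs documented in this file, and the actual call to the worker address is left out.

```python
# Sketch: request a worker from /route/, keeping request_idx so a failed call
# can be retried without being counted as additional load.
import os
import requests

def route(request_idx=None):
    body = {
        "endpoint": "YOUR_ENDPOINT_NAME",
        "api_key": os.environ["VAST_API_KEY"],
        "cost": 242.0,
    }
    if request_idx is not None:
        body["request_idx"] = request_idx  # marks this as a retry of the same request
    resp = requests.post("https://run.vast.ai/route/", json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()

first = route()
print("worker address:", first["url"])
print("request_idx to reuse on retry:", first["request_idx"])

# If the call to first["url"] later fails, route again with the same index:
# retry = route(request_idx=first["request_idx"])
```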

documentation/serverless/serverless-parameters.mdx

Lines changed: 0 additions & 4 deletions
@@ -69,10 +69,6 @@ If not specified during endpoint creation, the default value is 0.9.
 
 # Workergroup Parameters
 
-The following parameters can be specified specifically for a Workergroup and override Endpoint parameters. The Endpoint parameters will continue to apply for other Workergroups contained in it, unless specifically set.
-
-- cold\_workers
-
 The parameters below are specific to only Workergroups, not Endpoints.
 
 ## gpu\_ram
