- `api_key` (string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an `Authorization: Bearer` token.
- `endpoint_name` (string): The name of the Endpoint that the Workergroup will be created under.

OR

- `endpoint_id` (integer): The id of the Endpoint that the Workergroup will be created under.

AND one of the following:

- `template_hash` (string): The hexadecimal string that identifies a particular template.

OR

- `launch_args` (string): A command-line style string containing additional parameters for instance creation that will be parsed and applied when the autoscaler creates new workers. This allows you to customize instance configuration beyond what's specified in templates.

**Optional** (Default values will be assigned if not specified):

- `min_load` (integer): A minimum baseline load (measured in tokens/second for LLMs) that the serverless engine will assume your Endpoint needs to handle, regardless of actual measured traffic. Default value is 1.0.
- `target_util` (float): A ratio that determines how much spare capacity (headroom) the serverless engine maintains. Default value is 0.9.
- `cold_mult` (float): A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. Default value is 3.0.
- `test_workers` (integer): The number of different physical machines that a Workergroup should test during its initial "exploration" phase to gather performance data before transitioning to normal demand-based scaling. Default value is 3.
- `gpu_ram` (integer): The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs. Default value is 24.
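
As a point of reference, here is a sketch of a request body that combines the parameters above. The values are illustrative, and the endpoint name and template hash are placeholders you would replace with your own:

```json JSON
{
  "api_key": "YOUR_VAST_API_KEY",
  "endpoint_name": "my-endpoint",
  "template_hash": "YOUR_TEMPLATE_HASH",
  "min_load": 1.0,
  "target_util": 0.9,
  "cold_mult": 3.0,
  "test_workers": 3,
  "gpu_ram": 24
}
```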
- **Set `test_workers` high**: Create more instances initially for Worker Groups with anticipated high load.
- **Adjust `cold_workers`**: Keep enough workers around to prevent them from being destroyed during low initial load.
- **Increase `cold_mult`**: Quickly create instances by predicting higher future load based on current high load. Adjust back down once enough instances are created.
- **Check `max_workers`**: Ensure this parameter is set high enough to create the necessary number of workers.
- `template_hash`: The hash code of our custom vLLM (Serverless) template.
- `test_workers`: The minimum number of workers to create while initializing the Workergroup. This allows the Workergroup to get performance estimates before serving the Endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold").
- `gpu_ram`: The amount of memory (in GB) that you expect your template to load onto the GPU (i.e. model weights).
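
One possible way to create the Workergroup with these values is through the CLI. This is only a sketch: the flag names are assumed to mirror the parameters above, so check `vastai create workergroup --help` for the exact options supported by your CLI version.

```cli CLI Command
vastai create workergroup --endpoint_name "my-vllm-endpoint" --template_hash "$TEMPLATE_HASH" --test_workers 3 --gpu_ram 24
```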
<Warning>
You will need to replace "$TEMPLATE\_HASH" with the template hash copied from step 1.
</Warning>

We have now successfully created a vLLM + Qwen3-8B Serverless Engine! It is now ready to use.
# Using the Serverless Engine
To fully understand this section, it is recommended to read the [PyWorker Overview](/documentation/serverless/overview). The overview shows how all the pieces related to the serverless engine work together.

The Vast vLLM (Serverless) template we used in the last section already has a client (client.py) written for it. To use this client, we must run commands in a terminal, since there is no UI available for this section. The client, along with all other files the GPU worker clones during initialization, can be found in the [Vast.ai Github repo](https://github.com/vast-ai/pyworker/tree/main). For this section, clone the entire repo, for example:
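```sh Bash
# Standard clone of the PyWorker repository linked above.
git clone https://github.com/vast-ai/pyworker.git
cd pyworker
```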
In the diagram below, we (the User) want all of the files under 'User' to be in our local file system, while the GPU workers that the system initializes will have the files and entities under 'GPU Worker'.

<Frame caption="Files and Entities for the user and GPU worker">
  
</Frame>

To make requests to your endpoint, first install the `vastai_sdk` from pip.
```sh Bash
pip install vastai_sdk
```
Make sure you have configured the `VAST_API_KEY` environment variable with your Serverless API key.
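For example, in a shell session (the value below is a placeholder for your actual key):

```sh Bash
export VAST_API_KEY="YOUR_SERVERLESS_API_KEY"
```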
## API Keys
The `show endpoints` command returns a JSON blob describing each of your Endpoints.
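If you are using the Vast CLI, that is (assuming the CLI is already configured with your account's API key):

```cli CLI Command
vastai show endpoints
```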
<Accordion title="Install the TLS certificate \[Optional]">

## Install the TLS certificate \[Optional]

All of Vast.ai's pre-made Serverless templates use SSL by default. If you want to disable it, you can add `-e USE_SSL=false` to the Docker options in your copy of the template. The Serverless Engine will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast AI's certificate from [here](https://console.vast.ai/static/jvastai_root.cer).
2. In the Python environment where you're running the client script, execute the following command: `python3 -m certifi`
3. The command in step 2 will print the path to a file where certificates are stored. Append Vast AI's certificate to that file using the following command: `cat jvastai_root.cer >> PATH/TO/CERT/STORE`
4. You may need to run the above command with `sudo` if you are not running Python in a virtual environment. A combined sketch of these steps is shown below.
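
Putting those steps together, one possible shell sequence looks like the following (a sketch, assuming `curl` is available and `certifi` is installed in the active Python environment):

```sh Bash
# 1. Download Vast AI's certificate.
curl -O https://console.vast.ai/static/jvastai_root.cer

# 2. Locate the certificate store used by this Python environment.
CERT_STORE=$(python3 -m certifi)

# 3. Append Vast AI's certificate to that store (may need sudo outside a virtualenv).
cat jvastai_root.cer >> "$CERT_STORE"
```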
<Note>
This process only adds Vast AI's TLS certificate as a trusted certificate for Python clients. For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.

</Note>

</Accordion>

## Usage

Create a Python script to send a request to your endpoint. The vLLM (Serverless) template's bundled client, client.py, demonstrates the full request flow.
In client.py, we first send a POST request to the [`/route/`](/documentation/serverless/route) endpoint, asking the serverless engine for a ready worker with a payload that looks like:
```javascript Javascript icon="js"
route_payload = {
    "endpoint": "endpoint_name",
    "api_key": "your_serverless_api_key",
    "cost": COST,
}
```

The `cost` input here tells the serverless engine how much workload to expect for this request, and is *not* related to credits on a Vast.ai account. The engine will reply with a valid worker address, where client.py then calls the `/v1/completions` endpoint with the authentication data returned by the serverless engine and the user's model input text as the payload.
```json JSON
{
    "auth_data": {
        "cost": 256.0,
        "endpoint": "endpoint_name",
        "reqnum": "req_num",
        "signature": "signature",
        "url": "worker_address"
    },
    "payload": {
        "input": {
            "model": "Qwen/Qwen3-8B",
            "prompt": "The capital of USA is",
            "temperature": 0.7,
            "max_tokens": 256,
            "top_k": 20,
            "top_p": 0.4,
            "stream": false
        }
    }
}
```

The `input` object is the only part you construct yourself. A minimal payload looks like:

```python Python
payload = {
    "input": {
        "model": "Qwen/Qwen3-8B",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
}
```
The worker hosting the Qwen3-8B model will return the model results to the client, and print them to the user. 
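
To make that flow concrete, here is a minimal sketch of the same round trip written with plain `requests` instead of the bundled client. It only illustrates the mechanics described above; the endpoint name, API key, and cost are placeholders, and the exact response fields should be checked against the [`/route/`](/documentation/serverless/route) reference.

```python Python
import os
import requests

ROUTE_URL = "https://run.vast.ai/route/"

# Step 1: ask the serverless engine for a ready worker.
route_payload = {
    "endpoint": "my-endpoint",              # your Endpoint name (placeholder)
    "api_key": os.environ["VAST_API_KEY"],  # your Serverless API key
    "cost": 256.0,                          # expected workload for this request
}
route = requests.post(ROUTE_URL, json=route_payload, timeout=30)
route.raise_for_status()
auth_data = route.json()  # expected to include url, signature, cost, endpoint, reqnum

# Step 2: forward the signed auth data plus your model input to the worker.
worker_payload = {
    "auth_data": {k: auth_data[k] for k in ("cost", "endpoint", "reqnum", "signature", "url")},
    "payload": {
        "input": {
            "model": "Qwen/Qwen3-8B",
            "prompt": "Who are you?",
            "max_tokens": 100,
            "temperature": 0.7,
        }
    },
}
# If you have not installed Vast's TLS certificate (see the optional section above),
# you may need to adjust certificate verification for the worker URL.
completion = requests.post(f"{auth_data['url']}/v1/completions", json=worker_payload, timeout=120)
completion.raise_for_status()
print(completion.json())
```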
To quickly run a basic test of the serverless engine with vLLM, navigate to the `pyworker` directory and run the vLLM test command with the `--completion` flag.

client.py is configured to work with a vast.ai API key, not a Serverless API key. Make sure to set the `API_KEY` variable in your environment, or replace it by pasting in your actual key. You only need to install the dependencies in `requirements.txt` on the first run.

This should result in a "Ready" worker with the Qwen3-8B model printing a Completion Demo to your terminal window. If we enter the same command without `--completion`, you will see all of the test modes vLLM has. Because we are testing with Qwen3-8B, all test modes will provide a response (not all LLMs are equipped to use tools).
There are several endpoints we can use to monitor the status of the serverless engine. One of the most useful is the [Endpoint logs](/documentation/serverless/logs).
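
For example, using the Vast CLI (substitute your own Endpoint id; log levels range from 0 to 3, as covered on the logs page):

```cli CLI Command
vastai get endpt-logs <endpoint_id> --level 2
```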

All Endpoints and Workergroups continuously track their performance over time, and these metrics are reported to the serverless engine. To see Workergroup metrics, run the following:
```bash Bash
curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": 1749672382.157,
    "end_date": 1749680792.188,
    "step": 500,
    "type": "autogroup",
    "metrics": [
      "capacity",
      "curload",
      "nworkers",
      "nrdy_workers_",
      "reliable",
      "reqrate",
      "totreqs",
      "perf",
      "nrdy_soon_workers_",
      "model_disk_usage",
      "reqs_working"
    ],
    "resource_id": '"${workergroup_id}"'
  }'
```
These metrics are displayed in a Workergroup's UI page.
</Step>
<Step title="Load Testing">

In the Github repo that we cloned earlier, there is a load testing script called `workers/openai/test_load.py`. The `-n` flag indicates the total number of requests to send to the serverless engine, and the `-rps` flag indicates the rate (requests/second). The script will print out statistics that show metrics like:
- Total requests currently being generated
- Number of successful generations
- Number of errors
- Total number of workers used during the test

To run this script, make sure the Python packages from `requirements.txt` are installed, then invoke it with your chosen `-n` and `-rps` values, as sketched below.
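
A minimal sketch of one invocation (the request count and rate here are illustrative; the script may also need your endpoint details, so check its `--help` output first):

```sh Bash
python workers/openai/test_load.py -n 100 -rps 1
```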
</Step>

This is everything you need to start, test, and monitor a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be set up in a similar fashion.
documentation/serverless/logs.mdx
<Warning>
Each log level has a fixed size, and once it is full, the log is wiped and overwritten with new log messages. It is good practice to check these regularly while debugging.
</Warning>
# Using the CLI
You can use the vastai CLI to quickly check Endpoint and Workergroup logs at different log levels.
## Endpoint logs
```cli CLI Command
vastai get endpt-logs <endpoint_id> --level (0-3)
```
## Workergroup logs
```cli CLI Command
vastai get wrkgrp-logs <worker_group_id> --level (0-3)
```
documentation/serverless/route.mdx
The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance address within your Endpoint.
Request lifetimes are tracked with a `request_idx`. If you wish to retry a request that failed without incurring additional load, you may use the `request_idx` to do so.
# POST https\://run.vast.ai/route/
- `endpoint` (string): Name of the Endpoint.
- `api_key` (string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an `Authorization: Bearer` token.
- `cost` (float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker. The serverless engine uses the cost as an estimate of the request's workload, and can scale GPU instances to ensure the Endpoint has the proper compute capacity.
- `request_idx` (int): A unique request index that tracks the lifetime of a single request. You don't need it for the first request, but you must pass one in to retry a request.
```json JSON icon="js"
{
    "endpoint": "YOUR_ENDPOINT_NAME",
    "api_key": "YOUR_VAST_API_KEY",
    "cost": 242.0,
    "request_idx": 2421  # Only if retrying
}
```
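
For reference, the same request as a cURL call (values are placeholders; `request_idx` is omitted because this is a first attempt rather than a retry):

```bash Bash
curl -X POST "https://run.vast.ai/route/" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "YOUR_ENDPOINT_NAME",
    "api_key": "YOUR_VAST_API_KEY",
    "cost": 242.0
  }'
```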

The response includes the following fields:
- `signature` (string): The signature is a cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
- `endpoint` (string): Same as the input parameter.
- `cost` (float): Same as the input parameter.
- `request_idx` (int): If it's a new request, check this field to get your `request_idx`. Use this in subsequent calls to `/route/` if you wish to retry this request (in case of failure).
- `__request_id` (string): The `__request_id` is a unique string identifier generated by the server for each individual API request it receives. This ID is created at the start of processing the request and included in the response, allowing for distinct tracking and logging of every transaction.
documentation/serverless/serverless-parameters.mdx
# Workergroup Parameters
The following parameters can be specified specifically for a Workergroup and override Endpoint parameters. The Endpoint parameters will continue to apply for other Workergroups contained in it, unless specifically set:

- `cold_workers`

The parameters below are specific only to Workergroups, not Endpoints.