Merged
16 changes: 1 addition & 15 deletions documentation/serverless/create-endpoints-and-workergroups.mdx
@@ -109,8 +109,6 @@ curl --location 'https://console.vast.ai/api/v0/endptjobs/' \

AND one of the following:



- `template_hash` (string): The hexadecimal string that identifies a particular template. 

OR
@@ -128,21 +126,13 @@ OR

**Optional** (Default values will be assigned if not specified):

- `min_load` (integer): A minimum baseline load (measured in tokens/second for LLMs) that the serverless engine will assume your Endpoint needs to handle, regardless of actual measured traffic. Default value is 1.0.
- `target_util` (float): A ratio that determines how much spare capacity (headroom) the serverless engine maintains. Default value is 0.9.
- `cold_mult` (float): A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. Default value is 3.0.
- `test_workers` (integer): The number of different physical machines that a Workergroup should test during its initial "exploration" phase to gather performance data before transitioning to normal demand-based scaling. Default value is 3.
- `gpu_ram` (integer): The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs. Default value is 24.
**Contributor:** We should just remove this one too

**Contributor:** Including the lines below too

**Contributor (Author):** `gpu_ram` actually is used on the per-workergroup level, no need to remove it. Unless I'm misreading your comment

**Contributor:** Correct, but it looks weird to include that in the CLI commands. It's almost always in the template. I was just pointing it out to simplify.

**Contributor (Author):** Check me if I'm wrong, but either we get the `gpu_ram` field for autogroups from the CLI here, or it defaults to 8. Check `create_autojobs(request)` in `client.py` and `create__workergroup(args)` in `vast.py`. I can't find any example of sourcing this parameter from the template.



```json JSON icon="js"
{
  "api_key": "YOUR_VAST_API_KEY",
  "endpoint_name": "YOUR_ENDPOINT_NAME",
  "template_hash": "YOUR_TEMPLATE_HASH",
  "min_load": 1.0,
  "target_util": 0.9,
  "cold_mult": 3.0,
  "test_workers": 3,
  "gpu_ram": 24
}
```
@@ -185,10 +175,6 @@ curl --location 'https://console.vast.ai/api/v0/workergroups/' \
--data '{
"endpoint_name": "MY_ENDPOINT",
"template_hash": "MY_TEMPLATE_HASH",
"min_load": 10,
"target_util": 0.9,
"cold_mult": 3.0,
"test_workers": 3,
"gpu_ram": 24
}'
```
@@ -197,7 +183,7 @@ curl --location 'https://console.vast.ai/api/v0/workergroups/' \


```none Terminal
vastai create workergroup --endpoint_name "MY_ENDPOINT" --template_hash "MY_TEMPLATE_HASH" --min_load 10 --target_util 0.9 --cold_mult 3.0 --test_workers 3 --gpu_ram 24
vastai create workergroup --endpoint_name "MY_ENDPOINT" --template_hash "MY_TEMPLATE_HASH" --gpu_ram 24
```


1 change: 0 additions & 1 deletion documentation/serverless/debugging.mdx
@@ -50,7 +50,6 @@ echo "${WORKSPACE_DIR:-/workspace}/pyworker.log"

To handle high load on instances:

- **Set `test_workers` high**: Create more instances initially for Worker Groups with anticipated high load.
- **Adjust `cold_workers`**: Keep enough workers around to prevent them from being destroyed during low initial load.
- **Increase `cold_mult`**: Quickly create instances by predicting higher future load based on current high load. Adjust back down once enough instances are created.
- **Check `max_workers`**: Ensure this parameter is set high enough to create the necessary number of workers.
183 changes: 27 additions & 156 deletions documentation/serverless/getting-started-with-serverless.mdx
@@ -150,9 +150,6 @@ Before we start, there are a few things you will need:

For our simple setup, we can enter the following values:

- Cold Multiplier = 3
- Minimum Load = 1
- Target Utilization = 0.9
- Workergroup Name = 'Workergroup'
- Select Endpoint = 'vLLM-Qwen3-8B'

@@ -167,12 +164,12 @@ Before we start, there are a few things you will need:
Run the following command to create your Workergroup:

```sh CLI Command
vastai create workergroup --endpoint_name "vLLM-DeepSeek" --template_hash "$TEMPLATE_HASH" --test_workers 5
vastai create workergroup --endpoint_name "vLLM-DeepSeek" --template_hash "$TEMPLATE_HASH" --gpu_ram 24
```

`endpoint_name`: The name of the Endpoint.
`template_hash`: The hash code of our custom vLLM (Serverless) template.
`test_workers`: The minimum number of workers to create while initializing the Workergroup. This allows the Workergroup to get performance estimates before serving the Endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold").
`gpu_ram`: The amount of memory (in GB) that you expect your template to load onto the GPU (i.e. model weights).

<Warning>
You will need to replace "$TEMPLATE\_HASH" with the template hash copied from step 1.
@@ -217,17 +214,12 @@ We have now successfully created a vLLM + Qwen3-8B Serverless Engine! It is read

# Using the Serverless Engine

To fully understand this section, it is recommended to read the [PyWorker Overview](/documentation/serverless/overview). The overview shows how all the pieces related to the serverless engine work together.

The Vast vLLM (Serverless) template we used in the last section already has a client (client.py) written for it. To use this client, we must run commands in a terminal, since there is no UI available for this section. The client, along with all other files the GPU worker is cloning during initialization, can be found in the [Vast.ai Github repo](https://github.com/vast-ai/pyworker/tree/main). For this section, simply clone the entire repo using:

`git clone https://github.com/vast-ai/pyworker.git`

As the User, we want all the files under 'User' to be in our file system. The GPU workers that the system initializes will have the files and entities under 'GPU Worker'.
To make requests to your endpoint, first install the `vastai_sdk` from pip.
```sh Bash
pip install vastai_sdk
```

<Frame caption="Files and Entities for the user and GPU worker">
![Files and Entities for the user and GPU worker](/images/getting-started-serverless-8.webp)
</Frame>
Make sure you have configured the `VAST_API_KEY` environment variable with your Serverless API key.
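
For example, in a typical shell you might export it like this (the key value below is a placeholder):

```sh Bash
# Replace with your actual Serverless API key
export VAST_API_KEY="YOUR_SERVERLESS_API_KEY"
```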

## API Keys

@@ -260,155 +252,34 @@ The `show endpoints` command will return a JSON blob like this:
}
```

<Accordion title="Install the TLS certificate \[Optional]">
## Install the TLS certificate \[Optional]

All of Vast.ai's pre-made Serverless templates use SSL by default. If you want to disable it, you can add `-e USE_SSL=false` to the Docker options in your copy of the template. The Serverless Engine will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast AI's certificate from [here](https://console.vast.ai/static/jvastai_root.cer).
2. In the Python environment where you're running the client script, execute the following command: `python3 -m certifi`
3. The command in step 2 will print the path to a file where certificates are stored. Append Vast AI's certificate to that file using the following command: `cat jvastai_root.cer >> PATH/TO/CERT/STORE`
4. You may need to run the above command with `sudo` if you are not running Python in a virtual environment.
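
Taken together, steps 2 and 3 amount to something like the following (a sketch; it assumes the certificate was downloaded as `jvastai_root.cer` into the current directory):

```sh Bash
# Find the certificate store used by this Python environment
CERT_STORE=$(python3 -m certifi)

# Append Vast AI's root certificate to it (prefix with sudo if needed)
cat jvastai_root.cer >> "$CERT_STORE"
```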

<Note>
This process only adds Vast AI's TLS certificate as a trusted certificate for Python clients.&#x20;
## Usage

For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.
</Note>
Create a Python script to send a request to your endpoint:
```python icon="python" Python
from vastai import Serverless
import asyncio

</Accordion>
async def main():
async with Serverless() as client:
endpoint = await client.get_endpoint(name="my-endpoint")

<Steps>
<Step title="Running client.py">
In client.py, we are first sending a POST request to the [`/route/`](/documentation/serverless/route) endpoint. This sends a request to the serverless engine asking for a ready worker, with a payload that looks like:

```javascript Javascript icon="js"
route_payload = {
"endpoint": "endpoint_name",
"api_key": "your_serverless_api_key",
"cost": COST,
}
```

The `cost` input here tells the serverless engine how much workload to expect for this request, and is *not* related to credits on a Vast.ai account. The engine will reply with a valid worker address, where client.py then calls the `/v1/completions` endpoint with the authentication data returned by the serverless engine and the user's model input text as the payload.

```json JSON
{
"auth_data": {
"cost": 256.0,
"endpoint": "endpoint_name",
"reqnum": "req_num",
"signature": "signature",
"url": "worker_address"
},
"payload": {
"input": {
payload = {
"input" : {
"model": "Qwen/Qwen3-8B",
"prompt": "The capital of USA is",
"temperature": 0.7,
"max_tokens": 256,
"top_k": 20,
"top_p": 0.4,
"stream": false}
"prompt" : "Who are you?",
"max_tokens" : 100,
"temperature" : 0.7
}
}
}
```

The worker hosting the Qwen3-8B model will return the model results to the client, which then prints them for the user.

To quickly run a basic test of the serverless engine with vLLM, navigate to the `pyworker` directory and run:

```none CLI Command
pip install -r requirements.txt && \
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B" --completion
```

<Warning>
client.py is configured to work with a vast.ai API key, not a Serverless API key. Make sure to set the `API_KEY` variable in your environment, or replace it by pasting in your actual key. You only need to install the requirements.txt file on the first run.
</Warning>

response = await endpoint.request("/v1/completions", payload)
print(response["response"]["choices"][0]["text"])

This should result in a "Ready" worker with the Qwen3-8B model printing a Completion Demo to your terminal window. If we enter the same command without --completion, you will see all of the test modes vLLM has. Because we are testing with Qwen3-8B, all test modes will provide a response (not all LLMs are equipped to use tools).

```bash CLI Command
python3 -m workers.openai.client -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"

Please specify exactly one test mode:
--completion : Test completions endpoint
--chat : Test chat completions endpoint (non-streaming)
--chat-stream : Test chat completions endpoint with streaming
--tools : Test function calling with ls tool (non-streaming)
--interactive : Start interactive streaming chat session
```
</Step>

<Step title="Monitoring Groups">
There are several endpoints we can use to monitor the status of the serverless engine. To fetch all [Endpoint logs](/documentation/serverless/logs), run the following cURL command:

```bash Bash
curl https://run.vast.ai/get_endpoint_logs/ \
-X POST \
-d '{"endpoint" : "vLLM-Qwen3-8B", "api_key" : "$YOUR_SERVERLESS_API_KEY"}' \
-H 'Content-Type: application/json'
```

Similarly, to fetch all [Workergroup logs](/documentation/serverless/logs), execute:

```bash Bash
curl https://run.vast.ai/get_workergroup_logs/ \
-X POST \
-d '{"id" : WORKERGROUP_ID, "api_key" : "$YOUR_SERVERLESS_API_KEY"}' \
-H 'Content-Type: application/json'
```

All Endpoints and Workergroups continuously track their performance over time, which is sent to the serverless engine as metrics. To see Workergroup metrics, run the following:

```bash Bash
curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
-H "Content-Type: application/json" \
-d '{
"start_date": 1749672382.157,
"end_date": 1749680792.188,
"step": 500,
"type": "autogroup",
"metrics": [
"capacity",
"curload",
"nworkers",
"nrdy_workers_",
"reliable",
"reqrate",
"totreqs",
"perf",
"nrdy_soon_workers_",
"model_disk_usage",
"reqs_working"
],
"resource_id": '"${workergroup_id}"'
}'

```

These metrics are displayed in a Workergroup's UI page.
</Step>

<Step title="Load Testing">
In the GitHub repo that we cloned earlier, there is a load testing script called `workers/openai/test_load.py`. The *-n* flag indicates the total number of requests to send to the serverless engine, and the *-rps* flag indicates the rate (requests/second). The script will print out statistics that show metrics like:

- Total requests currently being generated
- Number of successful generations
- Number of errors
- Total number of workers used during the test

To run this script, make sure the python packages from `requirements.txt` are installed, and execute the following command:

```sh SH
python3 -m workers.openai.test_load -n 100 -rps 1 -k "$YOUR_USER_API_KEY" -e "vLLM-Qwen3-8B" --model "Qwen/Qwen3-8B"
```
</Step>
</Steps>
if __name__ == "__main__":
asyncio.run(main())
```

This is everything you need to start, test, and monitor a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be set up in a similar fashion.
This is everything you need to start a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be set up in a similar fashion.



3 changes: 2 additions & 1 deletion documentation/serverless/inside-a-serverless-gpu.mdx
@@ -60,7 +60,8 @@ The authentication information returned by [https://run.vast.ai/route/ ](/docume
"cost": 256,
"endpoint": "Your-Endpoint-Name",
"reqnum": 1234567890,
"url": "http://worker-ip-address:port"
"url": "http://worker-ip-address:port",
"request_idx": 10203040
},
"payload": {
"inputs": "What is the answer to the universe?",
14 changes: 14 additions & 0 deletions documentation/serverless/logs.mdx
@@ -29,6 +29,20 @@ For both types of groups, there are four levels of logs with decreasing levels o
Each log level has a fixed size, and once it is full, the log is wiped and overwritten with new log messages. It is good practice to check these regularly while debugging.
</Warning>

# Using the CLI

You can use the vastai CLI to quickly check endpoint and worker group logs at different log levels.

## Endpoint logs
```cli CLI Command
vastai get endpt-logs <endpoint_id> --level (0-3)
```

## Workergroup logs
```cli CLI Command
vastai get wrkgrp-logs <worker_group_id> --level (0-3)
```
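
For example, to pull level-1 logs for a hypothetical Workergroup with ID 1234:

```cli CLI Command
vastai get wrkgrp-logs 1234 --level 1
```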

# POST https\://run.vast.ai/get\_endpoint\_logs/

## Inputs
4 changes: 3 additions & 1 deletion documentation/serverless/overview.mdx
@@ -54,7 +54,8 @@ The Vast PyWorker wraps the backend code of the model instance you are running.
"cost": 256,
"endpoint": "Your-TGI-Endpoint-Name",
"reqnum": 1234567890,
"url": "http://worker-ip-address:port"
"url": "http://worker-ip-address:port",
"request_idx": 10203040
},
"payload": {
"inputs": "What is the answer to the universe?",
@@ -90,6 +91,7 @@ If you are building a custom PyWorker for your own use case, to be able to integ

- Send a message to the serverless system when the backend server is ready (e.g., after model installation).
- Periodically send performance metrics to the serverless system to optimize usage and performance.
- Periodically send completed request indices to the serverless system to track request lifetimes.
- Report any errors to the serverless system.
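
As a rough illustration of those responsibilities, a custom PyWorker's main loop might look something like the sketch below. All names here (`notify_ready`, `send_metrics`, `report_completed_requests`, `report_error`) are hypothetical stand-ins for whatever reporting mechanism your implementation uses; see the repository linked below for real implementations.

```python Python
# Hypothetical sketch only; function names and payloads are illustrative
# and not part of the actual PyWorker API.
import time

def run_worker_loop(serverless, backend):
    backend.wait_until_ready()   # e.g. block until the model has finished loading
    serverless.notify_ready()    # tell the serverless system this worker can take traffic

    while True:
        try:
            # Periodic performance metrics
            serverless.send_metrics(backend.current_metrics())
            # Completed request indices, so request lifetimes can be tracked
            serverless.report_completed_requests(backend.completed_request_idxs())
        except Exception as err:
            serverless.report_error(err)  # surface problems to the serverless system
        time.sleep(10)
```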

For example implementations, reference the [Vast PyWorker repository](https://github.com/vast-ai/pyworker/).
7 changes: 5 additions & 2 deletions documentation/serverless/route.mdx
@@ -20,6 +20,7 @@ description: Learn how to use the /route/ endpoint to retrieve a GPU instance ad
}} />

The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance address within your Endpoint.
Request lifetimes are tracked with a `request_idx`. If you wish to retry a request that failed without incurring additional load, you may use the `request_idx` to do so.

# POST https\://run.vast.ai/route/

@@ -28,12 +29,13 @@ The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance
- `endpoint`(string): Name of the Endpoint.
- `api_key`(string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an Authorization: Bearer.
- `cost`(float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker. The serverless engine uses the cost as an estimate of the request's workload, and can scale GPU instances to ensure the Endpoint has the proper compute capacity.
- `request_idx`(int): A unique request index that tracks the lifetime of a single request. You don't need it for the first request, but you must pass one in to retry a request.

```json JSON icon="js"
{
  "endpoint": "YOUR_ENDPOINT_NAME",
  "api_key": "YOUR_VAST_API_KEY",
  "cost": 242.0,
  "request_idx": 2421
}
```
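
For reference, the same request sent with cURL might look like this (a sketch; the payload values are placeholders, and `request_idx` is only included when retrying):

```bash Bash
curl https://run.vast.ai/route/ \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"endpoint": "YOUR_ENDPOINT_NAME", "api_key": "YOUR_VAST_API_KEY", "cost": 242.0}'
```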

@@ -46,6 +48,7 @@ The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance
- `signature`(string): The signature is a cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
- `endpoint`(string): Same as the input parameter.
- `cost`(float): Same as the input parameter.
- `request_idx`(int): If it's a new request, check this field to get your `request_idx`. Use this in calls to `/route/` if you wish to "retry" this request (in case of failure).
- `__request_id`(string): The \_\_request\_id is a unique string identifier generated by the server for each individual API request it receives. This ID is created at the start of processing the request and included in the response, allowing for distinct tracking and logging of every transaction.

```json JSON icon="js"
6 changes: 1 addition & 5 deletions documentation/serverless/serverless-parameters.mdx
@@ -69,15 +69,11 @@ If not specified during endpoint creation, the default value is 0.9.

# Workergroup Parameters

The following parameters can be specified specifically for a Workergroup and override Endpoint parameters. The Endpoint parameters will continue to apply for other Workergroups contained in it, unless specifically set.&#x20;

- cold\_workers

The parameters below are specific to Workergroups only, not to Endpoints.

## gpu\_ram

The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs.
The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs, and is primarily used to detect unusually long model load times.
**Contributor:** ??

**Contributor (Author):** Sure, we can remove this


If not specified during workergroup creation, the default value is 24.
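
For example, it can be set explicitly when creating a Workergroup from the CLI (the values here are illustrative):

```none Terminal
vastai create workergroup --endpoint_name "MY_ENDPOINT" --template_hash "MY_TEMPLATE_HASH" --gpu_ram 24
```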
