- `api_key` (string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an `Authorization: Bearer` token.
- `endpoint_name` (string): The name of the Endpoint that the Workergroup will be created under.

OR

- `endpoint_id` (integer): The id of the Endpoint that the Workergroup will be created under.

AND one of the following:

- `template_hash` (string): The hexadecimal string that identifies a particular template.

OR

- `launch_args` (string): A command-line style string containing additional parameters for instance creation that will be parsed and applied when the autoscaler creates new workers. This allows you to customize instance configuration beyond what's specified in templates.

**Optional** (Default values will be assigned if not specified):

- `min_load` (integer): A minimum baseline load (measured in tokens/second for LLMs) that the serverless engine will assume your Endpoint needs to handle, regardless of actual measured traffic. Default value is 1.0.
- `target_util` (float): A ratio that determines how much spare capacity (headroom) the serverless engine maintains. Default value is 0.9.
- `cold_mult` (float): A multiplier applied to your target capacity for longer-term planning (1+ hours). This parameter controls how much extra capacity the serverless engine will plan for in the future compared to immediate needs. Default value is 3.0.
- `test_workers` (integer): The number of different physical machines that a Workergroup should test during its initial "exploration" phase to gather performance data before transitioning to normal demand-based scaling. Default value is 3.
- `gpu_ram` (integer): The amount of GPU memory (VRAM) in gigabytes that your model or workload requires to run. This parameter tells the serverless engine how much GPU memory your model needs. Default value is 24.
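
As a point of reference, here is a sketch of a request body that combines the parameters above. The values are illustrative, and the endpoint name and template hash are placeholders you would replace with your own:

```json JSON
{
  "api_key": "YOUR_VAST_API_KEY",
  "endpoint_name": "my-endpoint",
  "template_hash": "YOUR_TEMPLATE_HASH",
  "min_load": 1.0,
  "target_util": 0.9,
  "cold_mult": 3.0,
  "test_workers": 3,
  "gpu_ram": 24
}
```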
- **Set `test_workers` high**: Create more instances initially for Worker Groups with anticipated high load.
- **Adjust `cold_workers`**: Keep enough workers around to prevent them from being destroyed during low initial load.
- **Increase `cold_mult`**: Quickly create instances by predicting higher future load based on current high load. Adjust back down once enough instances are created.
- **Check `max_workers`**: Ensure this parameter is set high enough to create the necessary number of workers.
- `template_hash`: The hash code of our custom vLLM (Serverless) template.
- `test_workers`: The minimum number of workers to create while initializing the Workergroup. This allows the Workergroup to get performance estimates before serving the Endpoint, and also creates workers which are fully loaded and "stopped" (aka "cold").
- `gpu_ram`: The amount of memory (in GB) that you expect your template to load onto the GPU (i.e. model weights).
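
One possible way to create the Workergroup with these values is through the CLI. This is only a sketch: the flag names are assumed to mirror the parameters above, so check `vastai create workergroup --help` for the exact options supported by your CLI version.

```cli CLI Command
vastai create workergroup --endpoint_name "my-vllm-endpoint" --template_hash "$TEMPLATE_HASH" --test_workers 3 --gpu_ram 24
```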
<Warning>
You will need to replace "$TEMPLATE\_HASH" with the template hash copied from step 1.
</Warning>

We have now successfully created a vLLM + Qwen3-8B Serverless Engine! It is now ready to use.
# Using the Serverless Engine
To fully understand this section, it is recommended to read the [PyWorker Overview](/documentation/serverless/overview). The overview shows how all the pieces related to the serverless engine work together.

The Vast vLLM (Serverless) template we used in the last section already has a client (client.py) written for it. To use this client, we must run commands in a terminal, since there is no UI available for this section. The client, along with all other files the GPU worker clones during initialization, can be found in the [Vast.ai Github repo](https://github.com/vast-ai/pyworker/tree/main). For this section, clone the entire repo, for example:
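```sh Bash
# Standard clone of the PyWorker repository linked above.
git clone https://github.com/vast-ai/pyworker.git
cd pyworker
```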
In the diagram below, we (the User) want all of the files under 'User' to be in our local file system, while the GPU workers that the system initializes will have the files and entities under 'GPU Worker'.

<Frame caption="Files and Entities for the user and GPU worker">
  
</Frame>

To make requests to your endpoint, first install the `vastai_sdk` from pip.
```sh Bash
pip install vastai_sdk
```
Make sure you have configured the `VAST_API_KEY` environment variable with your Serverless API key.
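For example, in a shell session (the value below is a placeholder for your actual key):

```sh Bash
export VAST_API_KEY="YOUR_SERVERLESS_API_KEY"
```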
## API Keys
The `show endpoints` command returns a JSON blob describing each of your Endpoints.
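If you are using the Vast CLI, that is (assuming the CLI is already configured with your account's API key):

```cli CLI Command
vastai show endpoints
```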
<Accordion title="Install the TLS certificate \[Optional]">

## Install the TLS certificate \[Optional]

All of Vast.ai's pre-made Serverless templates use SSL by default. If you want to disable it, you can add `-e USE_SSL=false` to the Docker options in your copy of the template. The Serverless Engine will automatically adjust the instance URL to enable or disable SSL as needed.

1. Download Vast AI's certificate from [here](https://console.vast.ai/static/jvastai_root.cer).
2. In the Python environment where you're running the client script, execute the following command: `python3 -m certifi`
3. The command in step 2 will print the path to a file where certificates are stored. Append Vast AI's certificate to that file using the following command: `cat jvastai_root.cer >> PATH/TO/CERT/STORE`
4. You may need to run the above command with `sudo` if you are not running Python in a virtual environment. A combined sketch of these steps is shown below.
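
Putting those steps together, one possible shell sequence looks like the following (a sketch, assuming `curl` is available and `certifi` is installed in the active Python environment):

```sh Bash
# 1. Download Vast AI's certificate.
curl -O https://console.vast.ai/static/jvastai_root.cer

# 2. Locate the certificate store used by this Python environment.
CERT_STORE=$(python3 -m certifi)

# 3. Append Vast AI's certificate to that store (may need sudo outside a virtualenv).
cat jvastai_root.cer >> "$CERT_STORE"
```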
<Note>
This process only adds Vast AI's TLS certificate as a trusted certificate for Python clients. For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.

</Note>

</Accordion>

## Usage

Create a Python script to send a request to your endpoint. The vLLM (Serverless) template's bundled client, client.py, demonstrates the full request flow.
In client.py, we first send a POST request to the [`/route/`](/documentation/serverless/route) endpoint, asking the serverless engine for a ready worker with a payload that looks like:
```javascript Javascript icon="js"
route_payload = {
    "endpoint": "endpoint_name",
    "api_key": "your_serverless_api_key",
    "cost": COST,
}
```

The `cost` input here tells the serverless engine how much workload to expect for this request, and is *not* related to credits on a Vast.ai account. The engine will reply with a valid worker address, where client.py then calls the `/v1/completions` endpoint with the authentication data returned by the serverless engine and the user's model input text as the payload.
```json JSON
{
    "auth_data": {
        "cost": 256.0,
        "endpoint": "endpoint_name",
        "reqnum": "req_num",
        "signature": "signature",
        "url": "worker_address"
    },
    "payload": {
        "input": {
            "model": "Qwen/Qwen3-8B",
            "prompt": "The capital of USA is",
            "temperature": 0.7,
            "max_tokens": 256,
            "top_k": 20,
            "top_p": 0.4,
            "stream": false
        }
    }
}
```

The `input` object is the only part you construct yourself. A minimal payload looks like:

```python Python
payload = {
    "input": {
        "model": "Qwen/Qwen3-8B",
        "prompt": "Who are you?",
        "max_tokens": 100,
        "temperature": 0.7
    }
}
```
The worker hosting the Qwen3-8B model will return the model results to the client, and print them to the user. 
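
To make that flow concrete, here is a minimal sketch of the same round trip written with plain `requests` instead of the bundled client. It only illustrates the mechanics described above; the endpoint name, API key, and cost are placeholders, and the exact response fields should be checked against the [`/route/`](/documentation/serverless/route) reference.

```python Python
import os
import requests

ROUTE_URL = "https://run.vast.ai/route/"

# Step 1: ask the serverless engine for a ready worker.
route_payload = {
    "endpoint": "my-endpoint",              # your Endpoint name (placeholder)
    "api_key": os.environ["VAST_API_KEY"],  # your Serverless API key
    "cost": 256.0,                          # expected workload for this request
}
route = requests.post(ROUTE_URL, json=route_payload, timeout=30)
route.raise_for_status()
auth_data = route.json()  # expected to include url, signature, cost, endpoint, reqnum

# Step 2: forward the signed auth data plus your model input to the worker.
worker_payload = {
    "auth_data": {k: auth_data[k] for k in ("cost", "endpoint", "reqnum", "signature", "url")},
    "payload": {
        "input": {
            "model": "Qwen/Qwen3-8B",
            "prompt": "Who are you?",
            "max_tokens": 100,
            "temperature": 0.7,
        }
    },
}
# If you have not installed Vast's TLS certificate (see the optional section above),
# you may need to adjust certificate verification for the worker URL.
completion = requests.post(f"{auth_data['url']}/v1/completions", json=worker_payload, timeout=120)
completion.raise_for_status()
print(completion.json())
```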
To quickly run a basic test of the serverless engine with vLLM, navigate to the `pyworker` directory and run the vLLM test command with the `--completion` flag.

client.py is configured to work with a vast.ai API key, not a Serverless API key. Make sure to set the `API_KEY` variable in your environment, or replace it by pasting in your actual key. You only need to install the dependencies in `requirements.txt` on the first run.

This should result in a "Ready" worker with the Qwen3-8B model printing a Completion Demo to your terminal window. If we enter the same command without `--completion`, you will see all of the test modes vLLM has. Because we are testing with Qwen3-8B, all test modes will provide a response (not all LLMs are equipped to use tools).
There are several endpoints we can use to monitor the status of the serverless engine. One of the most useful is the [Endpoint logs](/documentation/serverless/logs).
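
For example, using the Vast CLI (substitute your own Endpoint id; log levels range from 0 to 3, as covered on the logs page):

```cli CLI Command
vastai get endpt-logs <endpoint_id> --level 2
```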

All Endpoints and Workergroups continuously track their performance over time, and these metrics are reported to the serverless engine. To see Workergroup metrics, run the following:
```bash Bash
curl -X POST "https://console.vast.ai/api/v0/serverless/metrics/" \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": 1749672382.157,
    "end_date": 1749680792.188,
    "step": 500,
    "type": "autogroup",
    "metrics": [
      "capacity",
      "curload",
      "nworkers",
      "nrdy_workers_",
      "reliable",
      "reqrate",
      "totreqs",
      "perf",
      "nrdy_soon_workers_",
      "model_disk_usage",
      "reqs_working"
    ],
    "resource_id": '"${workergroup_id}"'
  }'
```
These metrics are displayed in a Workergroup's UI page.
</Step>
<Step title="Load Testing">

In the Github repo that we cloned earlier, there is a load testing script called `workers/openai/test_load.py`. The `-n` flag indicates the total number of requests to send to the serverless engine, and the `-rps` flag indicates the rate (requests/second). The script will print out statistics that show metrics like:
- Total requests currently being generated
- Number of successful generations
- Number of errors
- Total number of workers used during the test

To run this script, make sure the Python packages from `requirements.txt` are installed, then invoke it with your chosen `-n` and `-rps` values, as sketched below.
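
A minimal sketch of one invocation (the request count and rate here are illustrative; the script may also need your endpoint details, so check its `--help` output first):

```sh Bash
python workers/openai/test_load.py -n 100 -rps 1
```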
</Step>

This is everything you need to start, test, and monitor a vLLM + Qwen3-8B Serverless engine! There are other Vast pre-made [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be set up in a similar fashion.
documentation/serverless/logs.mdx
<Warning>
Each log level has a fixed size, and once it is full, the log is wiped and overwritten with new log messages. It is good practice to check these regularly while debugging.
</Warning>
# Using the CLI
You can use the vastai CLI to quickly check Endpoint and Workergroup logs at different log levels.
## Endpoint logs
```cli CLI Command
vastai get endpt-logs <endpoint_id> --level (0-3)
```
## Workergroup logs
```cli CLI Command
vastai get wrkgrp-logs <worker_group_id> --level (0-3)
```
documentation/serverless/route.mdx
The `/route/` endpoint calls on the serverless engine to retrieve a GPU instance address within your Endpoint.
Request lifetimes are tracked with a `request_idx`. If you wish to retry a request that failed without incurring additional load, you may use the `request_idx` to do so.
# POST https\://run.vast.ai/route/
- `endpoint` (string): Name of the Endpoint.
- `api_key` (string): The Vast API key associated with the account that controls the Endpoint. The key can also be placed in the header as an `Authorization: Bearer` token.
- `cost` (float): The estimated compute resources for the request. The units of this cost are defined by the PyWorker. The serverless engine uses the cost as an estimate of the request's workload, and can scale GPU instances to ensure the Endpoint has the proper compute capacity.
- `request_idx` (int): A unique request index that tracks the lifetime of a single request. You don't need it for the first request, but you must pass one in to retry a request.
```json JSON icon="js"
{
    "endpoint": "YOUR_ENDPOINT_NAME",
    "api_key": "YOUR_VAST_API_KEY",
    "cost": 242.0,
    "request_idx": 2421  # Only if retrying
}
```
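
For reference, the same request as a cURL call (values are placeholders; `request_idx` is omitted because this is a first attempt rather than a retry):

```bash Bash
curl -X POST "https://run.vast.ai/route/" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "YOUR_ENDPOINT_NAME",
    "api_key": "YOUR_VAST_API_KEY",
    "cost": 242.0
  }'
```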

The response includes the following fields:
- `signature` (string): The signature is a cryptographic string that authenticates the url, cost, and reqnum fields in the response, proving they originated from the server. Clients can use this signature, along with the server's public key, to verify that these specific details have not been tampered with.
- `endpoint` (string): Same as the input parameter.
- `cost` (float): Same as the input parameter.
- `request_idx` (int): If it's a new request, check this field to get your `request_idx`. Use this in subsequent calls to `/route/` if you wish to retry this request (in case of failure).
- `__request_id` (string): The `__request_id` is a unique string identifier generated by the server for each individual API request it receives. This ID is created at the start of processing the request and included in the response, allowing for distinct tracking and logging of every transaction.
documentation/serverless/serverless-parameters.mdx
# Workergroup Parameters
The following parameters can be specified specifically for a Workergroup and override Endpoint parameters. The Endpoint parameters will continue to apply for other Workergroups contained in it, unless specifically set:

- `cold_workers`

The parameters below are specific only to Workergroups, not Endpoints.