129 changes: 129 additions & 0 deletions doc/source/serve/advanced-guides/advanced-autoscaling.md
@@ -670,6 +670,135 @@ In your policy, access custom metrics via:
* **`ctx.aggregated_metrics[metric_name]`** — A time-weighted average computed from the raw metric values for each replica.


## External scaling webhook

:::{warning}
This API is in alpha and may change before becoming stable.
:::

Ray Serve exposes a REST API endpoint that you can use to dynamically scale your deployments from outside the Ray cluster. This endpoint gives you flexibility to implement custom scaling logic based on any metrics or signals you choose, such as external monitoring systems, business metrics, or predictive models.

The external scaling webhook provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this webhook allows you to scale based on any external criteria you define.

### Enable external scaler

Before using the external scaling webhook, enable it in your application configuration by setting `external_scaler_enabled: true`:

```{literalinclude} ../doc_code/external_scaler_config.yaml
---
start-after: __external_scaler_config_begin__
end-before: __external_scaler_config_end__
emphasize-lines: 4
language: yaml
---
```

:::{warning}
External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application. If you set `external_scaler_enabled: true`, you **must not** configure `autoscaling_config` on any deployment in that application. Attempting to use both results in an error.
:::

### API endpoint

The external scaling webhook requires authentication using a bearer token. You can obtain this token from the Ray dashboard UI (typically at `http://localhost:8265`) in the Serve section.

Scale a deployment by sending a POST request with the target number of replicas:

```bash
curl -X POST http://localhost:8000/api/v1/applications/{application_name}/deployments/{deployment_name}/scale \
-H "Authorization: Bearer <your_token>" \
-H "Content-Type: application/json" \
-d '{"target_num_replicas": 5}'
```

Replace `{application_name}` and `{deployment_name}` with your application and deployment names, and `<your_token>` with the authentication token from the Ray dashboard.

The request body must conform to the [`ScaleDeploymentRequest`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ScaleDeploymentRequest.html) schema. The `target_num_replicas` field (integer, required) specifies the target number of replicas for the deployment and must be a non-negative integer.
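
The same request from Python looks like the following sketch, using the standard `requests` library; the application name, deployment name, token, and target value are illustrative placeholders:

```python
import requests

# Illustrative placeholders; substitute your own names and token.
APP_NAME = "my-app"
DEPLOYMENT_NAME = "my-deployment"
TOKEN = "YOUR_TOKEN_HERE"  # Copy from the Ray dashboard's Serve section.

resp = requests.post(
    f"http://localhost:8000/api/v1/applications/{APP_NAME}/deployments/{DEPLOYMENT_NAME}/scale",
    headers={"Authorization": f"Bearer {TOKEN}"},
    # The body must conform to ScaleDeploymentRequest:
    # target_num_replicas is a required, non-negative integer.
    json={"target_num_replicas": 5},
    timeout=10,
)
resp.raise_for_status()
```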

### Important considerations

Understanding how the external scaler interacts with your deployments helps you build reliable scaling logic:

#### Idempotent API calls

The scaling API is idempotent. You can safely call it multiple times with the same `target_num_replicas` value without side effects. This makes it safe to run your scaling logic on a schedule or in response to repeated metric updates.
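
For example, a reconciliation loop like the following sketch can safely re-post its desired replica count on every tick, even when the value hasn't changed; the URL, token, interval, and fixed target are illustrative:

```python
import time

import requests

# Illustrative placeholders.
SCALE_URL = "http://localhost:8000/api/v1/applications/my-app/deployments/my-deployment/scale"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN_HERE"}

while True:
    desired = 5  # Recompute from your own metrics or signals here.
    # Idempotent: re-posting the same target_num_replicas has no side effects.
    requests.post(SCALE_URL, headers=HEADERS, json={"target_num_replicas": desired}, timeout=10)
    time.sleep(60)
```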

#### Interaction with `serve deploy`

When you upgrade your service with `serve deploy`, the number of replicas you set through the external scaler API stays intact. This behavior matches what you'd expect from Ray Serve's built-in autoscaler—deployment updates don't reset replica counts.

#### Query current replica count

You can get the current number of replicas for any deployment by sending a GET request to the `/api/v1/applications` endpoint:

```bash
curl -X GET http://localhost:8000/api/v1/applications \
-H "Authorization: Bearer <your_token>"
```

The response follows the [`ServeInstanceDetails`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ServeInstanceDetails.html) schema, which includes an `applications` field containing a dictionary with application names as keys. Each application includes detailed information about all its deployments, including current replica counts. Use this information to make informed scaling decisions. For example, you might scale up gradually by adding a percentage of existing replicas rather than jumping to a fixed number.
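
As a sketch of that gradual approach, the following assumes the response shape used by the example client later in this guide (a `deployments` list whose entries have `name` and `target_num_replicas` fields); the names, token, and 20% step are illustrative:

```python
import math

import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN_HERE"}  # Illustrative token.

details = requests.get(f"{BASE}/api/v1/applications", headers=HEADERS, timeout=10).json()
deployments = details["applications"]["my-app"]["deployments"]
current = next(
    d["target_num_replicas"] for d in deployments if d["name"] == "my-deployment"
)

# Scale up by 20% (at least one replica) instead of jumping to a fixed number.
target = current + max(1, math.ceil(current * 0.2))
requests.post(
    f"{BASE}/api/v1/applications/my-app/deployments/my-deployment/scale",
    headers=HEADERS,
    json={"target_num_replicas": target},
    timeout=10,
)
```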

#### Initial replica count

When you deploy an application for the first time, Ray Serve creates the number of replicas specified in the `num_replicas` field of your deployment configuration. The external scaler can then adjust this count dynamically based on your scaling logic.

### Example: Predictive scaling

This example shows how to implement predictive scaling based on historical patterns or forecasts. You can preemptively scale up before anticipated traffic spikes by running an external script that adjusts replica counts based on time of day.

#### Define the deployment

The following example creates a simple text processing deployment that you can scale externally:

```{literalinclude} ../doc_code/external_scaler_predictive.py
:language: python
:start-after: __serve_example_begin__
:end-before: __serve_example_end__
```

#### Configure external scaling

Create a configuration file with `external_scaler_enabled: true`:

```{literalinclude} ../doc_code/external_scaler_predictive.yaml
:language: yaml
:start-after: __config_begin__
:end-before: __config_end__
```

#### Implement the scaling logic

The following script implements predictive scaling based on time of day and historical traffic patterns:

```{literalinclude} ../doc_code/external_scaler_predictive_client.py
:language: python
:start-after: __client_script_begin__
:end-before: __client_script_end__
```

#### Run the example

Follow these steps to run the complete example:

1. Start the Ray Serve application:

```bash
serve run external_scaler_predictive:app
```

2. Get the authentication token from the Ray dashboard at `http://localhost:8265`. Navigate to the Serve section and copy the token.

3. Edit `external_scaler_predictive_client.py` and update the `AUTH_TOKEN` value with your token from step 2.

4. Run the predictive scaling client in a separate terminal:

```bash
python external_scaler_predictive_client.py
```

The scaling client continuously adjusts the number of replicas based on the time of day:
- Business hours (9 AM - 5 PM): 10 replicas
- Off-peak hours: 3 replicas

### Application level autoscaling

By default, each deployment in Ray Serve autoscales independently. When you have multiple deployments that need to scale in a coordinated way—such as deployments that share backend resources, have dependencies on each other, or need load-aware routing—you can define an **application-level autoscaling policy**. This policy makes scaling decisions for all deployments within an application simultaneously.
10 changes: 10 additions & 0 deletions doc/source/serve/doc_code/external_scaler_config.yaml
@@ -0,0 +1,10 @@
# __external_scaler_config_begin__
applications:
- name: my-app
  import_path: my_module:app
  external_scaler_enabled: true
  deployments:
  - name: my-deployment
    num_replicas: 1
# __external_scaler_config_end__

37 changes: 37 additions & 0 deletions doc/source/serve/doc_code/external_scaler_predictive.py
@@ -0,0 +1,37 @@
# __serve_example_begin__
import time

from starlette.requests import Request

from ray import serve


@serve.deployment(num_replicas=3, external_scaler_enabled=True)
class TextProcessor:
"""A simple text processing deployment that can be scaled externally."""
def __init__(self):
self.request_count = 0

def __call__(self, text: str) -> dict:
# Simulate text processing work
time.sleep(0.1)
self.request_count += 1
return {
"processed_text": text.upper(),
"length": len(text),
"request_count": self.request_count,
}


app = TextProcessor.bind()
# __serve_example_end__

if __name__ == "__main__":
    import requests

    serve.run(app)

    # Test the deployment.
    resp = requests.post("http://localhost:8000/", json="hello world")
    print(f"Response: {resp.json()}")

81 changes: 81 additions & 0 deletions doc/source/serve/doc_code/external_scaler_predictive_client.py
@@ -0,0 +1,81 @@
# __client_script_begin__
import logging
import time
from datetime import datetime
import requests

APPLICATION_NAME = "text-processor-app"
DEPLOYMENT_NAME = "TextProcessor"
AUTH_TOKEN = "YOUR_TOKEN_HERE" # Get from Ray dashboard at http://localhost:8265
SERVE_ENDPOINT = "http://localhost:8000"
SCALING_INTERVAL = 300 # Check every 5 minutes

logger = logging.getLogger(__name__)


def get_current_replicas(app_name: str, deployment_name: str, token: str) -> int:
    """Get current replica count. Returns -1 on error.

    Response schema: https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ServeInstanceDetails.html
    """
    try:
        resp = requests.get(
            f"{SERVE_ENDPOINT}/api/v1/applications",
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        )
        if resp.status_code != 200:
            logger.error(f"Failed to get applications: {resp.status_code}")
            return -1

        apps = resp.json().get("applications", {})
        if app_name not in apps:
            logger.error(f"Application {app_name} not found")
            return -1

        for deployment in apps[app_name].get("deployments", []):
            if deployment["name"] == deployment_name:
                return deployment["target_num_replicas"]

        logger.error(f"Deployment {deployment_name} not found")
        return -1
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {e}")
        return -1


def scale_deployment(app_name: str, deployment_name: str, token: str):
    """Scale deployment based on time of day."""
    hour = datetime.now().hour
    current = get_current_replicas(app_name, deployment_name, token)
    if current == -1:
        # Don't attempt to scale when the current replica count is unknown.
        logger.error("Skipping scaling: failed to get current replica count")
        return

    target = 10 if 9 <= hour < 17 else 3  # Peak hours: 9am-5pm

    delta = target - current
    if delta == 0:
        logger.info(f"Already at target ({current} replicas)")
        return

    action = "Adding" if delta > 0 else "Removing"
    logger.info(f"{action} {abs(delta)} replicas ({current} -> {target})")

    try:
        resp = requests.post(
            f"{SERVE_ENDPOINT}/api/v1/applications/{app_name}/deployments/{deployment_name}/scale",
            headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
            json={"target_num_replicas": target},
            timeout=10,
        )
        if resp.status_code == 200:
            logger.info("Successfully scaled deployment")
        else:
            logger.error(f"Scale failed: {resp.status_code} - {resp.text}")
    except requests.exceptions.RequestException as e:
        logger.error(f"Request failed: {e}")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger.info(f"Starting predictive scaling for {APPLICATION_NAME}/{DEPLOYMENT_NAME}")
    while True:
        scale_deployment(APPLICATION_NAME, DEPLOYMENT_NAME, AUTH_TOKEN)
        time.sleep(SCALING_INTERVAL)
# __client_script_end__
4 changes: 3 additions & 1 deletion doc/source/serve/production-guide/config.md
@@ -40,7 +40,8 @@ applications:
- name: ...
  route_prefix: ...
  import_path: ...
  runtime_env: ...
  external_scaler_enabled: ...
  deployments:
  - name: ...
    num_replicas: ...
@@ -99,6 +100,7 @@ These are the fields per `application`:
- **`route_prefix`**: An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique.
- **`import_path`**: The path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`.
- **`runtime_env`**: Defines the environment that the application runs in. Use this parameter to package application dependencies such as `pip` packages (see {ref}`Runtime Environments <runtime-environments>` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it can't use local zip files or directories. [More details on runtime env](serve-runtime-env).
- **`external_scaler_enabled`**: Enables the external scaling webhook, which lets you scale deployments from outside the Ray cluster using a REST API. When enabled, you can't use built-in autoscaling (`autoscaling_config`) for any deployment in this application. Defaults to `False`. See [External Scaling Webhook](serve-external-scale-webhook) for details.
- **`deployments (optional)`**: A list of deployment options that allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the parameters specified in the code. See how to [configure serve deployment options](serve-configure-deployment).
- **`args`**: Arguments that are passed to the [application builder](serve-app-builder-guide).
