From 1564cdffab8eba622c4438f8855083d109f42126 Mon Sep 17 00:00:00 2001
From: Russell Bryant
Date: Tue, 21 Apr 2026 13:01:55 -0400
Subject: [PATCH 1/2] [Doc] Add missing API endpoints to security documentation

Several endpoints were missing from the API key authentication
limitations section: RLHF weight-manipulation endpoints, generative
scoring, abort_requests, and various /v1 sub-paths. Also corrects the
operational control endpoints section from "always enabled" to
conditional on the "generate" task being supported, and updates the
security implications to reflect the new attack surface.

Co-Authored-By: Claude Opus 4.6 (1M context)
Signed-off-by: Russell Bryant
---
 docs/usage/security.md | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/docs/usage/security.md b/docs/usage/security.md
index 4879ddbf64ef..62a88070185d 100644
--- a/docs/usage/security.md
+++ b/docs/usage/security.md
@@ -138,14 +138,22 @@ When `--api-key` is configured, the following `/v1` endpoints require Bearer tok

 - `/v1/models` - List available models
 - `/v1/chat/completions` - Chat completions
+- `/v1/chat/completions/batch` - Batch chat completions
+- `/v1/chat/completions/render` - Render chat completion requests
 - `/v1/completions` - Text completions
+- `/v1/completions/render` - Render completion requests
 - `/v1/embeddings` - Generate embeddings
 - `/v1/audio/transcriptions` - Audio transcription
 - `/v1/audio/translations` - Audio translation
 - `/v1/messages` - Anthropic-compatible messages API
-- `/v1/responses` - Response management
+- `/v1/messages/count_tokens` - Count tokens for Anthropic messages
+- `/v1/responses` - Create a response
+- `/v1/responses/{response_id}` - Retrieve a response
+- `/v1/responses/{response_id}/cancel` - Cancel a response
 - `/v1/score` - Scoring API
 - `/v1/rerank` - Reranking API
+- `/v1/load_lora_adapter` - Load a LoRA adapter (can alter model behavior)
+- `/v1/unload_lora_adapter` - Unload a LoRA adapter (can alter model behavior)

 ### Unprotected Endpoints (No API Key Required)

@@ -155,16 +163,23 @@ The following endpoints **do not require authentication** even when `--api-key`

 - `/invocations` - SageMaker-compatible endpoint (routes to the same inference functions as `/v1` endpoints)
 - `/inference/v1/generate` - Generate completions
+- `/generative_scoring` - Generative scoring API
 - `/pooling` - Pooling API
 - `/classify` - Classification API
 - `/score` - Scoring API (non-`/v1` variant)
 - `/rerank` - Reranking API (non-`/v1` variant)

-**Operational control endpoints (always enabled):**
+**Operational control endpoints (only when `"generate"` task is supported):**

 - `/pause` - Pause generation (causes denial of service)
 - `/resume` - Resume generation
+- `/is_paused` - Check if generation is paused
 - `/scale_elastic_ep` - Trigger scaling operations
+- `/is_scaling_elastic_ep` - Check if scaling is in progress
+- `/init_weight_transfer_engine` - Initialize weight transfer engine for RLHF
+- `/update_weights` - Update model weights (can alter model behavior)
+- `/get_world_size` - Get distributed world size
+- `/abort_requests` - Abort in-flight requests (only when `--tokens-only` is also set)

 **Utility endpoints:**

@@ -207,9 +222,9 @@ These endpoints are only available when profiling is enabled and should only be

 An attacker who can reach the vLLM HTTP server can:

-1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
-2. **Cause denial of service** by calling `/pause` or `/scale_elastic_ep` without a token
-3. **Access operational controls** to manipulate server state (e.g., pausing generation)
+1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/generative_scoring`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
+2. **Cause denial of service** by calling `/pause`, `/scale_elastic_ep`, or `/abort_requests` without a token
+3. **Access operational controls** to manipulate server state (e.g., pausing generation, updating model weights via `/update_weights`)
 4. **If `--enable-tokenizer-info-endpoint` is set:** Access sensitive tokenizer configuration including chat templates, which may reveal prompt engineering strategies or other implementation details
 5. **If `VLLM_SERVER_DEV_MODE=1` is set:** Execute arbitrary RPC commands via `/collective_rpc`, reset caches, put the engine to sleep, and access detailed server configuration


From de0e1cd50d20b33f0d0ecb303160bf88adc348cc Mon Sep 17 00:00:00 2001
From: Russell Bryant
Date: Tue, 28 Apr 2026 15:39:48 -0400
Subject: [PATCH 2/2] [Doc] Add dynamic LoRA loading security guidance

The LoRA load/unload API endpoints are not enabled by default and
require explicit opt-in via --enable-lora and
VLLM_ALLOW_RUNTIME_LORA_UPDATING=True. Add a dedicated section warning
that dynamic LoRA loading should not be exposed to untrusted clients,
and annotate the endpoint list entries with their activation
requirements.

Co-Authored-By: Claude Opus 4.6 (1M context)
Signed-off-by: Russell Bryant
---
 docs/usage/security.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/docs/usage/security.md b/docs/usage/security.md
index 62a88070185d..300cabbfcc19 100644
--- a/docs/usage/security.md
+++ b/docs/usage/security.md
@@ -152,8 +152,8 @@ When `--api-key` is configured, the following `/v1` endpoints require Bearer tok
 - `/v1/responses/{response_id}/cancel` - Cancel a response
 - `/v1/score` - Scoring API
 - `/v1/rerank` - Reranking API
-- `/v1/load_lora_adapter` - Load a LoRA adapter (can alter model behavior)
-- `/v1/unload_lora_adapter` - Unload a LoRA adapter (can alter model behavior)
+- `/v1/load_lora_adapter` - Load a LoRA adapter (can alter model behavior; only available when `--enable-lora` is set and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`)
+- `/v1/unload_lora_adapter` - Unload a LoRA adapter (can alter model behavior; only available when `--enable-lora` is set and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`)

 ### Unprotected Endpoints (No API Key Required)

@@ -303,6 +303,12 @@ To disable the Python code interpreter specifically, omit `code_interpreter` fro

 **Consider a custom implementation**: The GPT-OSS Python tool is a reference implementation. For production deployments, consider implementing a custom code execution sandbox with stricter isolation guarantees. See the [GPT-OSS documentation](https://github.com/openai/gpt-oss?tab=readme-ov-file#python) for guidance.

+## Dynamic LoRA Loading
+
+vLLM supports dynamically loading and unloading LoRA adapters at runtime via the `/v1/load_lora_adapter` and `/v1/unload_lora_adapter` API endpoints. This functionality is **not enabled by default**; it requires both `--enable-lora` and the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` to be set.
+
+**Warning:** Dynamic LoRA loading is not a secure operation and should not be enabled in deployments exposed to untrusted clients. If you must enable dynamic LoRA loading, restrict access to the `/v1/load_lora_adapter` and `/v1/unload_lora_adapter` endpoints to trusted administrators only, using a reverse proxy or network-level access controls. Do not expose these endpoints to end users. For details on configuring LoRA adapters, see the [LoRA Adapters documentation](../features/lora.md).
+
 ## Reporting Security Vulnerabilities

 If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For more information on how to report security issues and the project's security policy, please see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).
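
The advice in this series to restrict endpoints with a reverse proxy or network-level controls could be sketched as a deny-by-default path allowlist. This is a minimal illustration under stated assumptions, not vLLM code: `EXPOSED_PATHS`, `ADMIN_ONLY_PATHS`, and `is_exposed` are hypothetical names, the choice of which inference paths to expose is an arbitrary example, and a real deployment would express the same rule in its proxy's own configuration.

```python
"""Deny-by-default path filter for a proxy in front of a vLLM server (sketch)."""

# Assumption: the only paths this deployment wants end users to reach.
# All four come from the protected `/v1` list in the patched document.
EXPOSED_PATHS = frozenset({
    "/v1/models",
    "/v1/chat/completions",
    "/v1/completions",
    "/v1/embeddings",
})

# Paths the patched document describes as altering model behavior or
# server state. Listed only to sanity-check the allowlist below.
ADMIN_ONLY_PATHS = frozenset({
    "/v1/load_lora_adapter",
    "/v1/unload_lora_adapter",
    "/pause",
    "/resume",
    "/scale_elastic_ep",
    "/init_weight_transfer_engine",
    "/update_weights",
    "/abort_requests",
})

# Guard against accidentally exposing a state-changing endpoint.
assert not (EXPOSED_PATHS & ADMIN_ONLY_PATHS)


def is_exposed(request_target: str) -> bool:
    """Return True only when the request path is explicitly allowlisted.

    The query string and fragment are dropped before an exact match, so
    unlisted paths such as `/invocations` or `/pooling` are rejected
    without having to be enumerated.
    """
    path = request_target.split("?", 1)[0].split("#", 1)[0]
    return path in EXPOSED_PATHS
```

An allowlist is used rather than a blocklist because the unprotected set grew between revisions of the document (for example `/generative_scoring` and `/update_weights`); a blocklist would silently admit endpoints added later.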