forked from vllm-project/vllm

[Merge] PR-25233 From Original vLLM repo by fake0fan #97 (Closed)
# Disaggregated Encoder

A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process separate from the prefill/decode stage. Deploying the two stages in independent vLLM instances brings three practical benefits:

1. **Independent, fine-grained scaling**
2. **Lower time-to-first-token (TTFT)**
3. **Cross-process reuse and caching of encoder outputs**

Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>

---

## 1 Motivation

### 1. Independent, fine-grained scaling

* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.

### 2. Lower time-to-first-token (TTFT)

* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at the required attention layers, shortening the prefill critical path.

### 3. Cross-process reuse and caching

* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.

---

## 2 Usage Example

The current reference pathway is **SharedStorageConnector**.
The ready-to-run scripts below show the workflow:

1 Encoder instance + 1 PD instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`

1 Encoder instance + 1 Prefill instance + 1 Decode instance:
`examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh`
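In either topology, clients talk to the proxy exactly as they would to a single OpenAI-compatible vLLM server. A minimal sketch of the multimodal request payload the proxy accepts (the model name and proxy port here are the defaults from the example scripts, used as assumptions, not a fixed contract):

```python
# Sketch of an OpenAI-style multimodal chat request as the proxy expects it.
# Model name and URL are the example scripts' defaults, assumed here.
import json

PROXY_URL = "http://localhost:10001/v1/chat/completions"  # default PROXY_PORT


def build_request(
    image_url: str,
    question: str,
    model: str = "Qwen/Qwen2.5-VL-3B-Instruct",
) -> dict:
    """Build a chat-completions payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 64,
    }


payload = build_request("https://example.com/cat.png", "What is in this image?")
body = json.dumps(payload)  # this JSON body is POSTed to PROXY_URL
```

The proxy routes the vision part to an encoder instance and the language part to the PD instance(s), so the client never needs to know the deployment is disaggregated.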
---

## 3 Test Script

Please refer to the directory `tests/v1/ec_connector`.

## 4 Development

Disaggregated encoding is implemented by running two parts:

* **Encoder instance** – a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – run language prefill and decode.
  * PD can run either as a single instance with `disagg_encoder_example.sh` (E->PD) or as disaggregated instances with `disagg_epd_example.sh` (E->P->D).

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under `vllm/distributed/ec_transfer`.
### Key abstractions

* **ECConnector** – interface for retrieving EC caches produced by the encoder.
  * *Scheduler role* – checks cache existence and schedules loads.
  * *Worker role* – loads the embeddings into memory.
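The split between the two roles can be sketched as an interface (class and method names below are simplified assumptions; see `vllm/distributed/ec_transfer` for the real API):

```python
# Simplified sketch of the scheduler-role / worker-role split in an
# ECConnector-style interface. Names are assumptions, not the real API.
from abc import ABC, abstractmethod


class ECConnectorSketch(ABC):
    # --- scheduler role: runs in the scheduler process ---
    @abstractmethod
    def has_cache(self, mm_hash: str) -> bool:
        """Check whether encoder output for this input already exists."""

    # --- worker role: runs in the model-worker process ---
    @abstractmethod
    def load_cache(self, mm_hash: str) -> bytes:
        """Load the cached embeddings into worker memory."""


class InMemoryConnector(ECConnectorSketch):
    """Toy implementation backed by a dict instead of shared storage."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def save_cache(self, mm_hash: str, embedding: bytes) -> None:
        self._store[mm_hash] = embedding

    def has_cache(self, mm_hash: str) -> bool:
        return mm_hash in self._store

    def load_cache(self, mm_hash: str) -> bytes:
        return self._store[mm_hash]


conn = InMemoryConnector()
conn.save_cache("img-123", b"emb")
assert conn.has_cache("img-123") and conn.load_cache("img-123") == b"emb"
```

The point of the split is that the scheduler only needs a cheap existence check to decide scheduling, while the (potentially large) embedding load happens on the worker.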
Here is a figure illustrating the disaggregated encoder flow:



For the PD-disaggregation part, the Prefill instance receives the encoder cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 output token) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens entirely after the Prefill instance's execution step.
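The end-to-end ordering the proxy has to enforce can be sketched with stubs standing in for the HTTP calls to the three instances (function names here are hypothetical, for illustration only):

```python
# Toy sketch of E->P->D orchestration order; each stub stands in for an
# HTTP call the proxy would make. Names are hypothetical.
calls: list[str] = []


def encode(request_id: str) -> None:
    # Encoder instance: compute vision embeddings, write them to the EC cache.
    calls.append("encode")


def prefill(request_id: str) -> str:
    # Prefill instance: load EC cache, run one prefill step, push KV cache.
    calls.append("prefill")
    return "first-token"


def decode(request_id: str, first_token: str) -> str:
    # Decode instance: pull KV cache and generate the remaining tokens.
    calls.append("decode")
    return first_token + "..."


def handle(request_id: str) -> str:
    encode(request_id)              # 1. vision embeddings -> EC cache
    tok = prefill(request_id)       # 2. one prefill step -> KV cache + token
    return decode(request_id, tok)  # 3. remaining generation

handle("req-1")
assert calls == ["encode", "prefill", "decode"]
```

Step 3 can only start once the KV cache produced in step 2 is visible to the decode instance, which is why the KV transfer sits strictly after the prefill step.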
`docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0).

We built the example setup with the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py`, and referred to `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py` to facilitate the KV transfer between P and D.
# Disaggregated Encoder

This example contains scripts that demonstrate the disaggregated encoder (EPD) features of vLLM.

Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for a detailed explanation of the EPD features.

## Files

- `disagg_epd_proxy.py` - Proxy that demonstrates XeYpZd (X encode instances, Y prefill instances, Z decode instances); currently stable for 1e1p1d.
- `disagg_1e1p1d_example.sh` - Set up 1e1p1d and run the VisionArena benchmark.
- `disagg_1e1pd_example.sh` - Set up 1e1pd and run the VisionArena benchmark.

Detailed explanations are commented in the scripts.
`examples/online_serving/disaggregated_encoder/disagg_1e1p1d_example.sh`:
```bash
#!/bin/bash
set -euo pipefail

declare -a PIDS=()

###############################################################################
# Configuration -- override via env before running
###############################################################################
MODEL="${MODEL:-Qwen/Qwen2.5-VL-3B-Instruct}"
LOG_PATH="${LOG_PATH:-./logs}"
mkdir -p "$LOG_PATH"

ENCODE_PORT="${ENCODE_PORT:-19534}"
PREFILL_PORT="${PREFILL_PORT:-19535}"
DECODE_PORT="${DECODE_PORT:-19536}"
PROXY_PORT="${PROXY_PORT:-10001}"

GPU_E="${GPU_E:-2}"
GPU_P="${GPU_P:-2}"
GPU_D="${GPU_D:-3}"

EC_SHARED_STORAGE_PATH="${EC_SHARED_STORAGE_PATH:-/tmp/ec_cache}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-12000}"  # wait_for_server timeout

NUM_PROMPTS="${NUM_PROMPTS:-100}"  # number of prompts to send in benchmark

export UCX_TLS=all
export UCX_NET_DEVICES=all

###############################################################################
# Helpers
###############################################################################
START_TIME=$(date +"%Y%m%d_%H%M%S")
ENC_LOG=$LOG_PATH/encoder_${START_TIME}.log
P_LOG=$LOG_PATH/p_${START_TIME}.log
D_LOG=$LOG_PATH/d_${START_TIME}.log
PROXY_LOG=$LOG_PATH/proxy_${START_TIME}.log

wait_for_server() {
  local port=$1
  timeout "$TIMEOUT_SECONDS" bash -c "
    until curl -s localhost:$port/v1/chat/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

# Cleanup function
cleanup() {
  echo "Stopping everything…"
  trap - INT TERM USR1  # prevent re-entrancy

  # Kill all tracked PIDs
  for pid in "${PIDS[@]}"; do
    if kill -0 "$pid" 2>/dev/null; then
      echo "Killing process $pid"
      kill "$pid" 2>/dev/null
    fi
  done

  # Wait a moment for graceful shutdown
  sleep 2

  # Force kill any remaining processes
  for pid in "${PIDS[@]}"; do
    if kill -0 "$pid" 2>/dev/null; then
      echo "Force killing process $pid"
      kill -9 "$pid" 2>/dev/null
    fi
  done

  # Kill the entire process group as backup
  kill -- -$$ 2>/dev/null

  echo "All processes stopped."
  exit 0
}

trap cleanup INT
trap cleanup USR1
trap cleanup TERM

# Clear previous cache
echo "remove previous ec cache folder"
rm -rf "$EC_SHARED_STORAGE_PATH"

echo "make ec cache folder"
mkdir -p "$EC_SHARED_STORAGE_PATH"

###############################################################################
# Encoder worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_E" vllm serve "$MODEL" \
  --gpu-memory-utilization 0.0 \
  --port "$ENCODE_PORT" \
  --enable-request-id-headers \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-num-batched-tokens 4096 \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
    }
  }' \
  >"${ENC_LOG}" 2>&1 &

PIDS+=($!)

###############################################################################
# Prefill worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_P" \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=5559 \
vllm serve "$MODEL" \
  --gpu-memory-utilization 0.7 \
  --port "$PREFILL_PORT" \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
      "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
    }
  }' \
  --kv-transfer-config '{
    "kv_connector": "NixlConnector",
    "kv_role": "kv_producer"
  }' \
  >"${P_LOG}" 2>&1 &

PIDS+=($!)

###############################################################################
# Decode worker
###############################################################################
CUDA_VISIBLE_DEVICES="$GPU_D" \
UCX_NET_DEVICES=all \
VLLM_NIXL_SIDE_CHANNEL_PORT=6000 \
vllm serve "$MODEL" \
  --gpu-memory-utilization 0.7 \
  --port "$DECODE_PORT" \
  --enable-request-id-headers \
  --max-num-seqs 128 \
  --kv-transfer-config '{
    "kv_connector": "NixlConnector",
    "kv_role": "kv_consumer"
  }' \
  >"${D_LOG}" 2>&1 &

PIDS+=($!)

# Wait for workers
wait_for_server "$ENCODE_PORT"
wait_for_server "$PREFILL_PORT"
wait_for_server "$DECODE_PORT"

###############################################################################
# Proxy
###############################################################################
python disagg_epd_proxy.py \
  --host "0.0.0.0" \
  --port "$PROXY_PORT" \
  --encode-servers-urls "http://localhost:$ENCODE_PORT" \
  --prefill-servers-urls "http://localhost:$PREFILL_PORT" \
  --decode-servers-urls "http://localhost:$DECODE_PORT" \
  >"${PROXY_LOG}" 2>&1 &

PIDS+=($!)

wait_for_server "$PROXY_PORT"
echo "All services are up!"

###############################################################################
# Benchmark (runs in the foreground)
###############################################################################
vllm bench serve \
  --model "$MODEL" \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --seed 0 \
  --num-prompts "$NUM_PROMPTS" \
  --port "$PROXY_PORT"

###############################################################################
# Cleanup
echo "cleanup..."
cleanup
```
Review comment:

> The `vllm bench serve` command runs in the foreground. After it completes, `$!` still contains the process ID of the last background command, which is the proxy server, so a `PIDS+=($!)` placed after the benchmark would add the proxy's PID to the `PIDS` array a second time. That line is redundant and can be removed.