[Fix][MoRI] Align MoRI-IO message format with P2pNcclConnector and vllm-router#39565

Merged
tjtanaa merged 14 commits into
vllm-project:mainfrom
simondanielsson:fix/align-moriio-messages
Apr 22, 2026

@simondanielsson simondanielsson commented Apr 11, 2026

Purpose

Fixes #38692.

This PR aligns the message formats of the MoRI-IO KV Connector with the P2pNcclConnector, making MoRI-IO itself compatible with vllm-router with minimal changes required on the router side.

The changes made are:

  • Embed peer connection information (ZMQ addresses) directly into the request_id.
    • This change eliminates the need for the router to explicitly pass host and port details in kv_transfer_params, aligning the implementation with the P2PNCCL connector's approach.
  • The toy proxy server and the MoRI-IO connector have been updated to support this new registration and address resolution logic.
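As a rough illustration of the idea in the list above, the sketch below packs two ZMQ addresses into a request ID and recovers them on the receiving side. The `___prefill_addr_` / `___decode_addr_` markers, the helper names, and the example addresses are all assumptions made for this example; the exact format used by this PR lives in the connector's common module and may differ.

```python
import re
import uuid

# Hypothetical layout: markers, both addresses, then a 32-char hex suffix.
_REQUEST_ID = re.compile(
    r"^___prefill_addr_(?P<prefill>.+?)"
    r"___decode_addr_(?P<decode>.+)"
    r"_(?P<suffix>[0-9a-f]{32})$"
)


def build_request_id(prefill_addr: str, decode_addr: str) -> str:
    """Embed both peer addresses in a unique request ID."""
    suffix = uuid.uuid4().hex
    return f"___prefill_addr_{prefill_addr}___decode_addr_{decode_addr}_{suffix}"


def parse_request_id(request_id: str) -> tuple[str, str]:
    """Return (prefill_addr, decode_addr) or raise ValueError if absent."""
    match = _REQUEST_ID.match(request_id)
    if match is None:
        raise ValueError(f"request ID carries no peer addresses: {request_id!r}")
    return match.group("prefill"), match.group("decode")


rid = build_request_id("10.0.0.1:6301", "10.0.0.2:6301")
prefill_addr, decode_addr = parse_request_id(rid)
```

With this kind of scheme the router only has to forward the request ID untouched; it never needs connector-specific fields in kv_transfer_params.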

The benefits of this PR are two-fold:

  1. Allows vllm-router to be used with the MoRI-IO connector.
  2. Aligns message logic/format across existing connectors (specifically, brings MoRI-IO closer to the P2pNcclConnector).

This already works with the toy proxy. To make the MoRI connector work with vllm-router, we also need these two PRs on the router side:

Codeveloped with: @mpashkovskii

Test Plan

We compare performance using vllm bench serve and accuracy using GSM8K. Reproducer scripts can be found in this temporary branch: mpashkovskii#4

The example below shows how to run vllm bench serve with 1P1D on 2 nodes using DeepSeek-R1 (DSR1), with MoRIIOConnector and vllm-router:

Build vllm from source on this branch and include the Broadcom NIC drivers, OR simply pull these images, which I already built from this branch using this Dockerfile.

# Built from vllm PR https://github.com/vllm-project/vllm/pull/39565, commit 65ffb26915f8b08f1fa787d2ccbf531bad214e3c
docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev
# or with AINIC: ghcr.io/simondanielsson/vllm-rocm-moriio:ainic 

Also pull the router image, or build it (see instructions here):

# Basic router support + streaming, i.e. both PRs https://github.com/vllm-project/router/pull/138 and https://github.com/vllm-project/router/pull/114
docker pull ghcr.io/simondanielsson/vllm-router:dev-streaming-cn-cjy

Then check out the branch containing the reproducer scripts and run:

# On the prefill node
$ IS_PREFILL=1 PREFILL_IP=<prefill_ip> DECODE_IP=<decode_ip> USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

# On the decode node
$ IS_PREFILL=0 PREFILL_IP=<prefill_ip> DECODE_IP=<decode_ip> USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

This will launch one vllm prefill instance and the vllm router on the prefill node, and a vllm decode instance on the decode node, and run vllm bench serve.

Test Result

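  • vllm-router is more robust under high-concurrency benchmarks: it had 0 failed requests in every run, while the toy proxy failed 2 of 320 requests at concurrency 32 and 18 of 640 at concurrency 64 (2 nodes).
  • No accuracy issues with either router: GSM8K exact_match is about 0.96 for both.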

2 nodes.

1P1D on two 8xMI300X nodes, DeepSeek-R1-0528 TP8EP8, MoRI-IO KV connector, 1k/1k ISL/OSL.

Note: concurrencies 16 and 32 use eager mode; concurrency >=64 uses PIECEWISE compilation mode on the decode instance.

| Concurrency | Router | Failed / Total Req | Throughput (req/s) | TTFT P50 / P99 (ms) | TPOT P50 / P99 (ms) | ITL P50 / P99 (ms) |
|------------:|--------|-------------------:|-------------------:|--------------------:|--------------------:|-------------------:|
| 16 | vllm-router | 0 / 160 | 0.20 | 325.29 / 947.76 | 79.09 / 79.97 | 79.11 / 83.35 |
| 16 | moriio_toy_proxy_server | 0 / 160 | 0.20 | 427.73 / 1033.51 | 79.18 / 80.00 | 79.09 / 83.72 |
| 32 | vllm-router | 0 / 320 | 0.40 | 414.94 / 1688.46 | 79.81 / 81.63 | 80.46 / 85.20 |
| 32 | moriio_toy_proxy_server | 2 / 320 | 0.40 | 408.68 / 1858.97 | 78.88 / 79.68 | 78.57 / 84.66 |
| 64 | vllm-router | 0 / 640 | 3.76 | 282.33 / 2457.77 | 16.15 / 16.55 | 15.97 / 20.15 |
| 64 | moriio_toy_proxy_server | 18 / 640 | 3.67 | 357.74 / 2580.78 | 16.19 / 16.50 | 16.11 / 20.30 |
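
The failure rates implied by the table can be worked out from the toy proxy's failed and total request counts. The snippet below is plain arithmetic on the numbers reported above (roughly 0.6% at concurrency 32 and 2.8% at concurrency 64); the variable names are just for illustration.

```python
# (failed, total) requests for moriio_toy_proxy_server, keyed by concurrency,
# copied from the 2-node results table.
toy_proxy = {16: (0, 160), 32: (2, 320), 64: (18, 640)}

# Fraction of requests that failed at each concurrency level.
failure_rate = {
    concurrency: failed / total
    for concurrency, (failed, total) in toy_proxy.items()
}

for concurrency, rate in failure_rate.items():
    print(f"concurrency {concurrency}: {rate * 100:.3f}% of requests failed")
```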
See full bench results

Concurrency: 16

$ IS_PREFILL=1 PREFILL_IP=10.21.9.47 DECODE_IP=10.21.9.29 USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# and same command on the decode node but with IS_PREFILL=0

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 11:21:01 UTC 2026
======================================================
INFO 04-13 11:21:06 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:59 elapsed, 213345:52:53 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [02:39<00:00,  4.97s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [13:15<00:00,  4.97s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  795.59
Total input tokens:                      159840
Total generated tokens:                  160000
Request throughput (req/s):              0.20
Output token throughput (tok/s):         201.11
Peak output token throughput (tok/s):    224.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          402.02
---------------Time to First Token----------------
Mean TTFT (ms):                          409.85
Median TTFT (ms):                        325.29
P99 TTFT (ms):                           947.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.15
Median TPOT (ms):                        79.09
P99 TPOT (ms):                           79.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.15
Median ITL (ms):                         79.11
P99 ITL (ms):                            83.35
==================================================

======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 11:44:12 UTC 2026
======================================================
INFO 04-13 11:44:17 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 03:00 elapsed, 259228:59:23 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [02:39<00:00,  4.99s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [13:12<00:00,  4.95s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  792.62
Total input tokens:                      159840
Total generated tokens:                  159840
Request throughput (req/s):              0.20
Output token throughput (tok/s):         201.66
Peak output token throughput (tok/s):    224.00
Peak concurrent requests:                31.00
Total token throughput (tok/s):          403.32
---------------Time to First Token----------------
Mean TTFT (ms):                          536.26
Median TTFT (ms):                        427.73
P99 TTFT (ms):                           1033.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.80
Median TPOT (ms):                        79.18
P99 TPOT (ms):                           80.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.80
Median ITL (ms):                         79.09
P99 ITL (ms):                            83.72
==================================================

Concurrency: 32

$ BENCH_MAX_CONCURRENCY=32 IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_BENCH=1   ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 12:57:51 UTC 2026
======================================================
INFO 04-13 12:57:56 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:59 elapsed, 206383:15:01 remaining
Initial test run completed.
Warming up with 64 requests...
100%|██████████| 64/64 [02:42<00:00,  2.54s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 32
100%|██████████| 320/320 [13:21<00:00,  2.50s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     320
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  804.67
Total input tokens:                      319680
Total generated tokens:                  320000
Request throughput (req/s):              0.40
Output token throughput (tok/s):         397.68
Peak output token throughput (tok/s):    448.00
Peak concurrent requests:                58.00
Total token throughput (tok/s):          794.96
---------------Time to First Token----------------
Mean TTFT (ms):                          635.15
Median TTFT (ms):                        414.94
P99 TTFT (ms):                           1688.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.72
Median TPOT (ms):                        79.81
P99 TPOT (ms):                           81.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.72
Median ITL (ms):                         80.46
P99 ITL (ms):                            85.20
==================================================


======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 13:48:29 UTC 2026
======================================================
INFO 04-13 13:48:34 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:58 elapsed, 190041:51:38 remaining
  0%|          | 0/64 [00:00<?, ?it/s]Initial test run completed.
Warming up with 64 requests...
100%|██████████| 64/64 [02:43<00:00,  2.55s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 32
100%|██████████| 320/320 [13:15<00:00,  2.49s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     318
Failed requests:                         2
Maximum request concurrency:             32
Benchmark duration (s):                  795.64
Total input tokens:                      317682
Total generated tokens:                  317682
Request throughput (req/s):              0.40
Output token throughput (tok/s):         399.28
Peak output token throughput (tok/s):    448.00
Peak concurrent requests:                63.00
Total token throughput (tok/s):          798.56
---------------Time to First Token----------------
Mean TTFT (ms):                          768.73
Median TTFT (ms):                        408.68
P99 TTFT (ms):                           1858.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.74
Median TPOT (ms):                        78.88
P99 TPOT (ms):                           79.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.74
Median ITL (ms):                         78.57
P99 ITL (ms):                            84.66
==================================================

Concurrency: 64 + PIECEWISE cudagraphs in decode instance:

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 08:44:48 UTC 2026
======================================================
INFO 04-14 08:44:54 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:21 elapsed, 26404:37:16 remaining
Initial test run completed.
Warming up with 128 requests...
100%|██████████| 128/128 [00:37<00:00,  3.45it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|██████████| 640/640 [02:50<00:00,  3.76it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  170.21
Total input tokens:                      639360
Total generated tokens:                  640000
Request throughput (req/s):              3.76
Output token throughput (tok/s):         3760.14
Peak output token throughput (tok/s):    4138.00
Peak concurrent requests:                102.00
Total token throughput (tok/s):          7516.52
---------------Time to First Token----------------
Mean TTFT (ms):                          586.75
Median TTFT (ms):                        282.33
P99 TTFT (ms):                           2457.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.11
Median TPOT (ms):                        16.15
P99 TPOT (ms):                           16.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.11
Median ITL (ms):                         15.97
P99 ITL (ms):                            20.15
==================================================

======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 08:53:57 UTC 2026
======================================================
INFO 04-14 08:54:02 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:21 elapsed, 21507:17:38 remaining
Initial test run completed.
Warming up with 128 requests...
100%|██████████| 128/128 [00:38<00:00,  3.35it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|██████████| 640/640 [02:49<00:00,  3.78it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     622
Failed requests:                         18
Maximum request concurrency:             64
Benchmark duration (s):                  169.32
Total input tokens:                      621378
Total generated tokens:                  621378
Request throughput (req/s):              3.67
Output token throughput (tok/s):         3669.77
Peak output token throughput (tok/s):    4160.00
Peak concurrent requests:                96.00
Total token throughput (tok/s):          7339.55
---------------Time to First Token----------------
Mean TTFT (ms):                          597.72
Median TTFT (ms):                        357.74
P99 TTFT (ms):                           2580.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.20
Median TPOT (ms):                        16.19
P99 TPOT (ms):                           16.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.20
Median ITL (ms):                         16.11
P99 ITL (ms):                            20.30
==================================================
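
When comparing many runs like the ones above, it can help to turn a "Serving Benchmark Result" block into a dictionary. The parser below is a minimal sketch that assumes the `Label:   value` layout seen in these logs; `parse_bench_result` is a hypothetical helper, not part of vllm.

```python
import re

# A metric line is "<label>: <whitespace> <number>" with nothing after the number.
_METRIC_LINE = re.compile(r"^(?P<key>[^:]+):\s+(?P<value>-?\d+(?:\.\d+)?)$")


def parse_bench_result(text: str) -> dict[str, float]:
    """Collect numeric metrics; separators and progress bars are skipped."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        match = _METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group("key").strip()] = float(match.group("value"))
    return metrics


# Excerpt from the vllm-router run at concurrency 64 shown above.
SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Benchmark duration (s):                  170.21
Request throughput (req/s):              3.76
---------------Time to First Token----------------
Median TTFT (ms):                        282.33
"""

metrics = parse_bench_result(SAMPLE)
# Cross-check: successful requests / duration should match the reported throughput.
recomputed = metrics["Successful requests"] / metrics["Benchmark duration (s)"]
```

The recomputed throughput should agree with the reported `Request throughput (req/s)` up to rounding.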

1 node

1P1D on two MI300X devices, Qwen3-8B, MoRI-IO connector, 1k/1k ISL/OSL.

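| Concurrency | Router | Req Throughput (req/s) | TTFT P50 / P99 (ms) | TPOT P50 / P99 (ms) | ITL P50 / P99 (ms) |
|------------:|--------|-----------------------:|--------------------:|--------------------:|-------------------:|
| 16 | vllm-router | 1.92 | 106.48 / 546.63 | 8.13 / 8.15 | 8.15 / 8.69 |
| 16 | moriio_toy_proxy_server | 1.92 | 112.88 / 513.65 | 8.09 / 8.12 | 8.09 / 8.69 |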
See full bench results

Concurrency: 16

$ MODEL=Qwen/Qwen3-8B PREFILL_GPU=0 DECODE_GPU=1 USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo.sh


...
======================================================
  Router: vllm-router
  Date  : Fri Apr 10 13:58:13 UTC 2026
======================================================
INFO 04-10 13:58:18 [datasets.py:700] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:08 elapsed, 10376:05:17 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [00:18<00:00,  1.71it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [01:17<00:00,  2.07it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  83.43
Total input tokens:                      160000
Total generated tokens:                  160000
Request throughput (req/s):              1.92
Output token throughput (tok/s):         1917.68
Peak output token throughput (tok/s):    2032.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          3835.36
---------------Time to First Token----------------
Mean TTFT (ms):                          168.19
Median TTFT (ms):                        106.48
P99 TTFT (ms):                           546.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.12
Median TPOT (ms):                        8.13
P99 TPOT (ms):                           8.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.12
Median ITL (ms):                         8.15
P99 ITL (ms):                            8.69
==================================================

======================================================
  Router: moriio_toy_proxy_server.py
  Date  : Fri Apr 10 14:01:03 UTC 2026
======================================================
INFO 04-10 14:01:08 [datasets.py:700] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:08 elapsed, 8595:29:20 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [00:19<00:00,  1.68it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [01:17<00:00,  2.07it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     159
Failed requests:                         1
Maximum request concurrency:             16
Benchmark duration (s):                  82.82
Total input tokens:                      159000
Total generated tokens:                  158841
Request throughput (req/s):              1.92
Output token throughput (tok/s):         1917.86
Peak output token throughput (tok/s):    2032.00
Peak concurrent requests:                31.00
Total token throughput (tok/s):          3837.64
---------------Time to First Token----------------
Mean TTFT (ms):                          154.47
Median TTFT (ms):                        112.88
P99 TTFT (ms):                           513.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.09
Median TPOT (ms):                        8.09
P99 TPOT (ms):                           8.12
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.09
Median ITL (ms):                         8.09
P99 ITL (ms):                            8.69
==================================================

GSM8K

# on the prefill node
$ IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# on the decode node
$ IS_PREFILL=0 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
======================================================
  GSM8K evaluation (lm_eval) via MoRIIO PD-disaggregation (2-node)
  Router: vllm-router
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 10:23:13 UTC 2026
======================================================
2026-04-14:10:23:21 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
...
local-completions ({'model': 'deepseek-ai/DeepSeek-R1-0528', 'base_url': 'http://127.0.0.1:8080/v1/completions', 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match||0.9575|±  |0.0056|



# on prefill node
$ IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 GSM8K_PHASE2_ONLY=1  ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# and on the decode
$ USE_GSM8K=1 IS_PREFILL=0 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47  ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
======================================================
  GSM8K evaluation (lm_eval) via MoRIIO PD-disaggregation (2-node)
  Router: moriio_toy_proxy_server.py
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 11:40:34 UTC 2026
======================================================
2026-04-14:11:40:42 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
...
local-completions ({'model': 'deepseek-ai/DeepSeek-R1-0528', 'base_url': 'http://127.0.0.1:10001/v1/completions', 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9591|±  |0.0055|
|     |       |strict-match    |     5|exact_match||0.9538|±  |0.0058|
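
A quick sanity check on the two GSM8K tables: treating the runs as independent, the gap between the routers' scores can be compared against the combined standard error. This is simple arithmetic on the values reported above, not part of the PR's test scripts, and `z_score` is a helper introduced only for this example.

```python
import math


def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Difference of two scores in units of their combined standard error."""
    return (a - b) / math.sqrt(se_a**2 + se_b**2)


# toy proxy vs. vllm-router, values copied from the lm_eval tables above
flexible = z_score(0.9591, 0.0055, 0.9575, 0.0056)
strict = z_score(0.9538, 0.0058, 0.9575, 0.0056)

print(f"flexible-extract z = {flexible:.2f}, strict-match z = {strict:.2f}")
```

Both values are well inside the usual ±1.96 band, which is consistent with the claim that neither router affects accuracy.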

mpashkovskii and others added 5 commits April 11, 2026 10:17
…fault ports

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…rs for router compatibility

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Similar to the P2pNcclConnector, we embed the nonstandard transfer
fields into the zmq_address which is part of the request_id. These
fields include remote_host, remote_port, remote_handshake_port, and
remote_notify_port which were previously required to be sent by the
router. That would require special logic in the router just for this
specific KV connector, so instead we follow the logic in
P2pNcclConnector and put any specific metadata inside the request ID.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
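
To make the commit message above concrete, here is a small sketch of packing those four fields into a single address string and unpacking them again. The `host:port:handshake_port:notify_port` layout and the `PeerInfo` class are assumptions for illustration only, not the format the connector actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerInfo:
    remote_host: str
    remote_port: int
    remote_handshake_port: int
    remote_notify_port: int

    def to_zmq_address(self) -> str:
        """Serialize all four fields into one colon-separated string."""
        return (
            f"{self.remote_host}:{self.remote_port}"
            f":{self.remote_handshake_port}:{self.remote_notify_port}"
        )

    @classmethod
    def from_zmq_address(cls, address: str) -> "PeerInfo":
        """Inverse of to_zmq_address; raises ValueError on a malformed string."""
        # Split from the right so only the last three colons act as separators.
        host, port, handshake_port, notify_port = address.rsplit(":", 3)
        return cls(host, int(port), int(handshake_port), int(notify_port))


peer = PeerInfo("10.0.0.1", 6301, 6302, 6303)
restored = PeerInfo.from_zmq_address(peer.to_zmq_address())
```

Because the string travels inside the request ID, the receiving side can rebuild the peer description without any connector-specific keys from the router.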

mergify Bot commented Apr 11, 2026

Documentation preview: https://vllm--39565.org.readthedocs.build/en/39565/

@mergify mergify Bot added documentation Improvements or additions to documentation kv-connector labels Apr 11, 2026
@simondanielsson

Re-opened from #38813

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors MoRI-IO disaggregated serving to embed ZMQ connection metadata within the request_id, allowing the connector to derive peer information without explicit routing parameters. Changes include updating the toy proxy registration logic and adding parsing utilities in the common module. Feedback highlights potential service instability, specifically noting that unhandled exceptions in the background discovery thread could terminate the process and that malformed ZMQ addresses could cause engine crashes during integer conversion.
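
One way to address both concerns raised in this review is to validate the port before converting it and to keep the discovery loop alive when a single message is bad. The sketch below is illustrative only: `parse_host_port`, `discovery_loop`, and the queue-based message source are hypothetical stand-ins, not the PR's code.

```python
import logging
import queue

logger = logging.getLogger(__name__)


def parse_host_port(address: str) -> tuple[str, int]:
    """Split "host:port", raising ValueError instead of crashing on bad input."""
    host, sep, port_text = address.rpartition(":")
    if not sep or not host or not port_text.isdecimal():
        raise ValueError(f"malformed address: {address!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range in address: {address!r}")
    return host, port


def discovery_loop(messages: "queue.Queue", registry: dict) -> None:
    """Consume registration messages until a None sentinel arrives."""
    while True:
        message = messages.get()
        if message is None:
            return
        try:
            registry[message] = parse_host_port(message)
        except ValueError:
            # Log and continue so one bad peer cannot kill the thread.
            logger.warning("Ignoring malformed registration: %r", message)


pending: "queue.Queue" = queue.Queue()
for item in ("10.0.0.1:6301", "no-port", "host:notaport", "host:99999", None):
    pending.put(item)

peers: dict = {}
discovery_loop(pending, peers)
```

In a real service the loop would run in a background thread; it is called directly here so the example stays deterministic.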

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_common.py Outdated
…to defaults

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
@simondanielsson

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the MoRI-IO disaggregated serving proxy and connector to communicate connection details through the request ID, aligning with vLLM's routing architecture. It introduces ZMQ address parsing utilities and updates the registration protocol to use a more robust format. Feedback identifies a potential KeyError in the toy proxy's WRITE mode due to a missing transfer ID and suggests updating registration logic to handle instance restarts correctly by refreshing existing entries.
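
The restart concern can be illustrated with a small registry that refreshes an entry whenever an instance with the same address registers again, rather than ignoring the duplicate. `Instance` and `InstanceRegistry` are hypothetical names used for this sketch; the toy proxy's real data structures may look different.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Instance:
    address: str
    role: str  # "prefill" or "decode"
    registered_at: float = field(default_factory=time.monotonic)


class InstanceRegistry:
    def __init__(self) -> None:
        self._instances: dict[str, Instance] = {}

    def register(self, address: str, role: str) -> bool:
        """Insert or refresh an entry; return True if an older one was replaced."""
        replaced = address in self._instances
        # Always overwrite so a restarted instance gets fresh state.
        self._instances[address] = Instance(address, role)
        return replaced

    def by_role(self, role: str) -> list[str]:
        """Addresses of all instances currently registered with this role."""
        return [i.address for i in self._instances.values() if i.role == role]


registry = InstanceRegistry()
first = registry.register("10.0.0.1:6301", "prefill")
second = registry.register("10.0.0.1:6301", "prefill")  # simulated restart
```

Keying by address keeps the registry free of duplicates even when the same instance registers repeatedly.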

Comment thread examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py Outdated
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
@simondanielsson

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the MoRI-IO disaggregated serving communication by embedding peer connection information (ZMQ addresses) directly into the request_id. This change eliminates the need for the router to explicitly pass host and port details in kv_transfer_params, aligning the implementation with the P2P-NCCL connector's approach. The toy proxy server and the MoRI-IO connector have been updated to support this new registration and address resolution logic. I have no feedback to provide.

@tjtanaa tjtanaa requested a review from heheda12345 April 22, 2026 02:06

mergify Bot commented Apr 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @simondanielsson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 22, 2026
…ssages

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
req_data_copy["kv_transfer_params"].update(
    {
        "do_remote_decode": True,
        "do_remote_prefill": False,

@simondanielsson simondanielsson Apr 22, 2026

Explanation: These were MoRI-specific fields. For uniformity, we instead embed them into the zmq_address, which is then injected into the request ID, similar to P2pNccl.

@tjtanaa tjtanaa left a comment

LGTM

@tjtanaa tjtanaa added rocm Related to AMD ROCm ready ONLY add when PR is ready to merge/full CI is needed labels Apr 22, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 22, 2026
@tjtanaa tjtanaa merged commit ac58e2a into vllm-project:main Apr 22, 2026
59 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn't current support MoRI kvcache connector

3 participants