[Fix][MoRI] Align MoRI-IO message format with P2pNcclConnector and vllm-router by simondanielsson · Pull Request #39565 · vllm-project/vllm

simondanielsson · 2026-04-11T08:40:29Z

Purpose

This PR aligns the message formats of the MoRI-IO KV Connector with the P2pNcclConnector, making MoRI-IO itself compatible with vllm-router with minimal changes required on the router side.

The changes made are:

Embed peer connection information (ZMQ addresses) directly into the request_id. This eliminates the need for the router to pass host and port details explicitly in kv_transfer_params, matching the P2pNcclConnector's approach.
Update the toy proxy server and the MoRI-IO connector to support the new registration and address resolution logic.
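
A minimal sketch of the idea, assuming a request ID layout modeled on the one used by the P2pNccl example proxy (`___prefill_addr_<addr>___decode_addr_<addr>_<uuid>`). The helper names `build_request_id` and `parse_request_id` and the exact layout are assumptions for illustration; the format used in this PR may differ.

```python
import re
import uuid

# Hypothetical layout (an assumption, not copied from this PR):
#   ___prefill_addr_<host:port>___decode_addr_<host:port>_<32 hex chars>
_REQUEST_ID_RE = re.compile(
    r"^___prefill_addr_(?P<prefill>.+?)"
    r"___decode_addr_(?P<decode>.+?)"
    r"_(?P<uid>[0-9a-f]{32})$"
)


def build_request_id(prefill_zmq: str, decode_zmq: str) -> str:
    """Router side: embed both peers' ZMQ addresses in the request ID."""
    return (
        f"___prefill_addr_{prefill_zmq}"
        f"___decode_addr_{decode_zmq}_{uuid.uuid4().hex}"
    )


def parse_request_id(request_id: str) -> tuple[str, str]:
    """Connector side: recover (prefill_zmq, decode_zmq) from the request ID."""
    match = _REQUEST_ID_RE.match(request_id)
    if match is None:
        raise ValueError(f"request_id does not embed peer addresses: {request_id!r}")
    return match.group("prefill"), match.group("decode")


rid = build_request_id("10.0.0.1:6301", "10.0.0.2:6302")
prefill_addr, decode_addr = parse_request_id(rid)
```

With this shape the router only has to produce a request ID; it needs no connector-specific entries in kv_transfer_params.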

The benefits of this PR are two-fold:

vllm-router can be used together with the MoRI-IO connector.
Message logic and format are more consistent across existing connectors (specifically, closer to the P2pNcclConnector).

This already works with the toy proxy. To make the MoRI-IO connector work with vllm-router, these two router-side PRs are also needed:

[Fix][MoRI] Add MoRI-IO connector support router#138: required after vllm 0.18.0+
feat: support stream response in the process_vllm_two_stage_request_discovered router#114: required for streaming outputs (incl. using vllm bench serve)

Codeveloped with: @mpashkovskii

Test Plan

We compare serving performance using vllm bench serve and accuracy using GSM8K. Reproducer scripts can be found in this temporary branch: mpashkovskii#4

The example below shows how to run vllm bench serve with 1P1D on 2 nodes using DeepSeek-R1 (DSR1), MoRIIOConnector, and vllm-router:

Build vllm from source on this branch and include the Broadcom NIC drivers, or simply pull the images already built from this branch using this Dockerfile.

# Built from vllm PR https://github.com/vllm-project/vllm/pull/39565, commit 65ffb26915f8b08f1fa787d2ccbf531bad214e3c
docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev
# or with AINIC: ghcr.io/simondanielsson/vllm-rocm-moriio:ainic

Also pull the router image, or build it yourself (see instructions here):

# Basic router support + streaming, i.e. both PRs https://github.com/vllm-project/router/pull/138 and https://github.com/vllm-project/router/pull/114
docker pull ghcr.io/simondanielsson/vllm-router:dev-streaming-cn-cjy

Then check out the branch containing the reproducer scripts and run:

# On the prefill node
$ IS_PREFILL=1 PREFILL_IP=<prefill_ip> DECODE_IP=<decode_ip> USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

# On the decode node
$ IS_PREFILL=0 PREFILL_IP=<prefill_ip> DECODE_IP=<decode_ip> USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

This will launch one vllm prefill instance and the vllm router on the prefill node, and a vllm decode instance on the decode node, and run vllm bench serve.

Test Result

vllm-router is more robust under high-concurrency benchmarks.
No accuracy issues with either router.

2 nodes.

1P1D on two 8xMI300X nodes, DeepSeek-R1-0528 TP8EP8, MoRI-IO KV connector, 1k/1k ISL/OSL.

Note: concurrencies 16 and 32 use eager mode; concurrency >= 64 uses PIECEWISE compilation mode on the decode instance.

Concurrency	Router	Failed / Total	Req Throughput (req/s)	TTFT P50 / P99 (ms)	TPOT P50 / P99 (ms)	ITL P50 / P99 (ms)
16	vllm-router	0 / 160	0.20	325.29 / 947.76	79.09 / 79.97	79.11 / 83.35
16	moriio_toy_proxy_server	0 / 160	0.20	427.73 / 1033.51	79.18 / 80.00	79.09 / 83.72
32	vllm-router	0 / 320	0.40	414.94 / 1688.46	79.81 / 81.63	80.46 / 85.20
32	moriio_toy_proxy_server	2 / 320	0.40	408.68 / 1858.97	78.88 / 79.68	78.57 / 84.66
64	vllm-router	0 / 640	3.76	282.33 / 2457.77	16.15 / 16.55	15.97 / 20.15
64	moriio_toy_proxy_server	18 / 640	3.67	357.74 / 2580.78	16.19 / 16.50	16.11 / 20.30
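
To make the robustness comparison concrete, the failure counts in the table above can be turned into failure rates. The snippet below is a small illustrative calculation using only the Failed / Total column; the function name `failure_rate_pct` is introduced here for illustration.

```python
# (concurrency, router) -> (failed, total), copied from the 2-node table above.
results = {
    (16, "vllm-router"): (0, 160),
    (16, "moriio_toy_proxy_server"): (0, 160),
    (32, "vllm-router"): (0, 320),
    (32, "moriio_toy_proxy_server"): (2, 320),
    (64, "vllm-router"): (0, 640),
    (64, "moriio_toy_proxy_server"): (18, 640),
}


def failure_rate_pct(failed: int, total: int) -> float:
    """Share of failed requests, in percent."""
    return 100.0 * failed / total


for (concurrency, router), (failed, total) in results.items():
    rate = failure_rate_pct(failed, total)
    print(f"{concurrency:>3}  {router:<24} {failed:>2}/{total}  {rate:.3f}%")
```

The toy proxy fails 2 of 320 requests (0.625%) at concurrency 32 and 18 of 640 (about 2.8%) at concurrency 64, while vllm-router has no failures at any tested concurrency.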

See full bench results

Concurrency: 16

$ IS_PREFILL=1 PREFILL_IP=10.21.9.47 DECODE_IP=10.21.9.29 USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# and same command on the decode node but with IS_PREFILL=0

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 11:21:01 UTC 2026
======================================================
INFO 04-13 11:21:06 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:59 elapsed, 213345:52:53 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [02:39<00:00,  4.97s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [13:15<00:00,  4.97s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  795.59
Total input tokens:                      159840
Total generated tokens:                  160000
Request throughput (req/s):              0.20
Output token throughput (tok/s):         201.11
Peak output token throughput (tok/s):    224.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          402.02
---------------Time to First Token----------------
Mean TTFT (ms):                          409.85
Median TTFT (ms):                        325.29
P99 TTFT (ms):                           947.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.15
Median TPOT (ms):                        79.09
P99 TPOT (ms):                           79.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.15
Median ITL (ms):                         79.11
P99 ITL (ms):                            83.35
==================================================

======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 11:44:12 UTC 2026
======================================================
INFO 04-13 11:44:17 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 03:00 elapsed, 259228:59:23 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [02:39<00:00,  4.99s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [13:12<00:00,  4.95s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  792.62
Total input tokens:                      159840
Total generated tokens:                  159840
Request throughput (req/s):              0.20
Output token throughput (tok/s):         201.66
Peak output token throughput (tok/s):    224.00
Peak concurrent requests:                31.00
Total token throughput (tok/s):          403.32
---------------Time to First Token----------------
Mean TTFT (ms):                          536.26
Median TTFT (ms):                        427.73
P99 TTFT (ms):                           1033.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.80
Median TPOT (ms):                        79.18
P99 TPOT (ms):                           80.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.80
Median ITL (ms):                         79.09
P99 ITL (ms):                            83.72
==================================================

Concurrency: 32

$ BENCH_MAX_CONCURRENCY=32 IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_BENCH=1   ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 12:57:51 UTC 2026
======================================================
INFO 04-13 12:57:56 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:59 elapsed, 206383:15:01 remaining
Initial test run completed.
Warming up with 64 requests...
100%|██████████| 64/64 [02:42<00:00,  2.54s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 32
100%|██████████| 320/320 [13:21<00:00,  2.50s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     320
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  804.67
Total input tokens:                      319680
Total generated tokens:                  320000
Request throughput (req/s):              0.40
Output token throughput (tok/s):         397.68
Peak output token throughput (tok/s):    448.00
Peak concurrent requests:                58.00
Total token throughput (tok/s):          794.96
---------------Time to First Token----------------
Mean TTFT (ms):                          635.15
Median TTFT (ms):                        414.94
P99 TTFT (ms):                           1688.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          79.72
Median TPOT (ms):                        79.81
P99 TPOT (ms):                           81.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           79.72
Median ITL (ms):                         80.46
P99 ITL (ms):                            85.20
==================================================


======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Mon Apr 13 13:48:29 UTC 2026
======================================================
INFO 04-13 13:48:34 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 02:58 elapsed, 190041:51:38 remaining
  0%|          | 0/64 [00:00<?, ?it/s]Initial test run completed.
Warming up with 64 requests...
100%|██████████| 64/64 [02:43<00:00,  2.55s/it]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 32
100%|██████████| 320/320 [13:15<00:00,  2.49s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     318
Failed requests:                         2
Maximum request concurrency:             32
Benchmark duration (s):                  795.64
Total input tokens:                      317682
Total generated tokens:                  317682
Request throughput (req/s):              0.40
Output token throughput (tok/s):         399.28
Peak output token throughput (tok/s):    448.00
Peak concurrent requests:                63.00
Total token throughput (tok/s):          798.56
---------------Time to First Token----------------
Mean TTFT (ms):                          768.73
Median TTFT (ms):                        408.68
P99 TTFT (ms):                           1858.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          78.74
Median TPOT (ms):                        78.88
P99 TPOT (ms):                           79.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           78.74
Median ITL (ms):                         78.57
P99 ITL (ms):                            84.66
==================================================

Concurrency: 64 + PIECEWISE cudagraphs in decode instance:

======================================================
  Router: vllm-router (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 08:44:48 UTC 2026
======================================================
INFO 04-14 08:44:54 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:21 elapsed, 26404:37:16 remaining
Initial test run completed.
Warming up with 128 requests...
100%|██████████| 128/128 [00:37<00:00,  3.45it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|██████████| 640/640 [02:50<00:00,  3.76it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     640
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  170.21
Total input tokens:                      639360
Total generated tokens:                  640000
Request throughput (req/s):              3.76
Output token throughput (tok/s):         3760.14
Peak output token throughput (tok/s):    4138.00
Peak concurrent requests:                102.00
Total token throughput (tok/s):          7516.52
---------------Time to First Token----------------
Mean TTFT (ms):                          586.75
Median TTFT (ms):                        282.33
P99 TTFT (ms):                           2457.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.11
Median TPOT (ms):                        16.15
P99 TPOT (ms):                           16.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.11
Median ITL (ms):                         15.97
P99 ITL (ms):                            20.15
==================================================

======================================================
  Router: moriio_toy_proxy_server.py (2-node)
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 08:53:57 UTC 2026
======================================================
INFO 04-14 08:54:02 [datasets.py:700] Sampling input_len from [999, 999] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:21 elapsed, 21507:17:38 remaining
Initial test run completed.
Warming up with 128 requests...
100%|██████████| 128/128 [00:38<00:00,  3.35it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|██████████| 640/640 [02:49<00:00,  3.78it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     622
Failed requests:                         18
Maximum request concurrency:             64
Benchmark duration (s):                  169.32
Total input tokens:                      621378
Total generated tokens:                  621378
Request throughput (req/s):              3.67
Output token throughput (tok/s):         3669.77
Peak output token throughput (tok/s):    4160.00
Peak concurrent requests:                96.00
Total token throughput (tok/s):          7339.55
---------------Time to First Token----------------
Mean TTFT (ms):                          597.72
Median TTFT (ms):                        357.74
P99 TTFT (ms):                           2580.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.20
Median TPOT (ms):                        16.19
P99 TPOT (ms):                           16.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.20
Median ITL (ms):                         16.11
P99 ITL (ms):                            20.30
==================================================

1 node

1P1D on two MI300X devices, Qwen3-8B, MoRI-IO connector, 1k/1k ISL/OSL

Concurrency	Router	Req Throughput (req/s)	TTFT P50 / P99 (ms)	TPOT P50 / P99 (ms)	ITL P50 / P99 (ms)
16	vllm-router	1.92	106.48 / 546.63	8.13 / 8.15	8.15 / 8.69
16	moriio_toy_proxy_server	1.92	112.88 / 513.65	8.09 / 8.12	8.09 / 8.69

See full bench results

Concurrency: 16

$ MODEL=Qwen/Qwen3-8B PREFILL_GPU=0 DECODE_GPU=1 USE_BENCH=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo.sh


...
======================================================
  Router: vllm-router
  Date  : Fri Apr 10 13:58:13 UTC 2026
======================================================
INFO 04-10 13:58:18 [datasets.py:700] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:08 elapsed, 10376:05:17 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [00:18<00:00,  1.71it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [01:17<00:00,  2.07it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     160
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  83.43
Total input tokens:                      160000
Total generated tokens:                  160000
Request throughput (req/s):              1.92
Output token throughput (tok/s):         1917.68
Peak output token throughput (tok/s):    2032.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          3835.36
---------------Time to First Token----------------
Mean TTFT (ms):                          168.19
Median TTFT (ms):                        106.48
P99 TTFT (ms):                           546.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.12
Median TPOT (ms):                        8.13
P99 TPOT (ms):                           8.15
---------------Inter-token Latency----------------

Mean ITL (ms):                           8.12
Median ITL (ms):                         8.15
P99 ITL (ms):                            8.69
==================================================

======================================================
  Router: moriio_toy_proxy_server.py
  Date  : Fri Apr 10 14:01:03 UTC 2026
======================================================
INFO 04-10 14:01:08 [datasets.py:700] Sampling input_len from [1000, 1000] and output_len from [1000, 1000]
Starting initial single prompt test run...
Waiting for endpoint to become up in 3000 seconds
 |          | 00:08 elapsed, 8595:29:20 remaining
Initial test run completed.
Warming up with 32 requests...
100%|██████████| 32/32 [00:19<00:00,  1.68it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████| 160/160 [01:17<00:00,  2.07it/s]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests:                     159
Failed requests:                         1
Maximum request concurrency:             16
Benchmark duration (s):                  82.82
Total input tokens:                      159000
Total generated tokens:                  158841
Request throughput (req/s):              1.92
Output token throughput (tok/s):         1917.86
Peak output token throughput (tok/s):    2032.00
Peak concurrent requests:                31.00
Total token throughput (tok/s):          3837.64
---------------Time to First Token----------------
Mean TTFT (ms):                          154.47
Median TTFT (ms):                        112.88
P99 TTFT (ms):                           513.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.09
Median TPOT (ms):                        8.09
P99 TPOT (ms):                           8.12
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.09
Median ITL (ms):                         8.09
P99 ITL (ms):                            8.69
==================================================

GSM8K

# on the prefill node
$ IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# on the decode node
$ IS_PREFILL=0 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
======================================================
  GSM8K evaluation (lm_eval) via MoRIIO PD-disaggregation (2-node)
  Router: vllm-router
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 10:23:13 UTC 2026
======================================================
2026-04-14:10:23:21 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
...
local-completions ({'model': 'deepseek-ai/DeepSeek-R1-0528', 'base_url': 'http://127.0.0.1:8080/v1/completions', 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9575|±  |0.0056|



# on prefill node
$ IS_PREFILL=1 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47 USE_GSM8K=1 GSM8K_PHASE2_ONLY=1  ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
# and on the decode
$ USE_GSM8K=1 IS_PREFILL=0 PREFILL_IP=10.21.9.29 DECODE_IP=10.21.9.47  ./examples/online_serving/disaggregated_serving/moriio_pd_demo/run_pd_demo_2node.sh
======================================================
  GSM8K evaluation (lm_eval) via MoRIIO PD-disaggregation (2-node)
  Router: moriio_toy_proxy_server.py
  Model : deepseek-ai/DeepSeek-R1-0528
  Date  : Tue Apr 14 11:40:34 UTC 2026
======================================================
2026-04-14:11:40:42 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
...
local-completions ({'model': 'deepseek-ai/DeepSeek-R1-0528', 'base_url': 'http://127.0.0.1:10001/v1/completions', 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9591|±  |0.0055|
|     |       |strict-match    |     5|exact_match|↑  |0.9538|±  |0.0058|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


Similar to the P2pNcclConnector, we embed the nonstandard transfer fields into the zmq_address which is part of the request_id. These fields include remote_host, remote_port, remote_handshake_port, and remote_notify_port which were previously required to be sent by the router. That would require special logic in the router just for this specific KV connector, so instead we follow the logic in P2pNcclConnector and put any specific metadata inside the request ID. Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
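
As an illustration of the commit message above, the sketch below packs the four fields into a single address string and unpacks them again, raising on malformed input rather than falling back to defaults. The colon-separated encoding and the `PeerInfo` class are assumptions made for this example, not the encoding used by the connector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerInfo:
    remote_host: str
    remote_port: int
    remote_handshake_port: int
    remote_notify_port: int

    def to_zmq_address(self) -> str:
        # Hypothetical encoding: host followed by the three ports.
        return (
            f"{self.remote_host}:{self.remote_port}"
            f":{self.remote_handshake_port}:{self.remote_notify_port}"
        )

    @classmethod
    def from_zmq_address(cls, zmq_address: str) -> "PeerInfo":
        parts = zmq_address.rsplit(":", 3)
        if len(parts) != 4 or not parts[0]:
            raise ValueError(f"malformed zmq_address: {zmq_address!r}")
        host, *raw_ports = parts
        try:
            ports = [int(p) for p in raw_ports]
        except ValueError as exc:
            # Surface a clear error instead of an unexplained int() failure.
            raise ValueError(f"malformed zmq_address: {zmq_address!r}") from exc
        if not all(0 < p < 65536 for p in ports):
            raise ValueError(f"port out of range in zmq_address: {zmq_address!r}")
        return cls(host, *ports)


peer = PeerInfo("10.0.0.2", 6302, 6402, 6502)
assert PeerInfo.from_zmq_address(peer.to_zmq_address()) == peer
```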


mergify · 2026-04-11T08:41:25Z

Documentation preview: https://vllm--39565.org.readthedocs.build/en/39565/

simondanielsson · 2026-04-11T08:41:27Z

Re-opened from #38813

gemini-code-assist

Code Review

This pull request refactors MoRI-IO disaggregated serving to embed ZMQ connection metadata within the request_id, allowing the connector to derive peer information without explicit routing parameters. Changes include updating the toy proxy registration logic and adding parsing utilities in the common module. Feedback highlights potential service instability, specifically noting that unhandled exceptions in the background discovery thread could terminate the process and that malformed ZMQ addresses could cause engine crashes during integer conversion.
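
One way to address the first point is to guard each iteration of the background loop so that a single bad message is logged and dropped. The sketch below is an assumption about the general shape, not the toy proxy's code: `run_discovery_loop` and `handle` are hypothetical names, and a plain iterable stands in for the ZMQ socket.

```python
import logging
import threading
from collections.abc import Callable, Iterable

logger = logging.getLogger(__name__)


def run_discovery_loop(
    messages: Iterable[bytes], handle: Callable[[bytes], None]
) -> int:
    """Process registration messages; log failures instead of propagating them.

    Returns the number of messages handled successfully.
    """
    handled = 0
    for raw in messages:
        try:
            handle(raw)
            handled += 1
        except Exception:
            # Keep the loop alive: one malformed message must not stop discovery.
            logger.exception("Discarding registration message that failed to process")
    return handled


def _demo_handle(raw: bytes) -> None:
    if not raw.startswith(b"{"):
        raise ValueError("unrecognized message format")


# Run the loop on a daemon thread, as a background listener typically would.
worker = threading.Thread(
    target=run_discovery_loop,
    args=([b'{"type": "register"}', b"garbage"], _demo_handle),
    daemon=True,
)
worker.start()
worker.join()
```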


simondanielsson · 2026-04-11T09:01:00Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the MoRI-IO disaggregated serving proxy and connector to communicate connection details through the request ID, aligning with vLLM's routing architecture. It introduces ZMQ address parsing utilities and updates the registration protocol to use a more robust format. Feedback identifies a potential KeyError in the toy proxy's WRITE mode due to a missing transfer ID and suggests updating registration logic to handle instance restarts correctly by refreshing existing entries.
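
The restart-handling suggestion can be sketched as a registry keyed by HTTP address, where a repeated registration refreshes the existing entry instead of adding a duplicate. The `Instance` and `InstanceRegistry` names and fields below are hypothetical and only illustrate the refresh-on-match behavior.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    http_address: str
    zmq_address: str
    last_seen: float = field(default_factory=time.monotonic)


class InstanceRegistry:
    """Tracks serving instances, keyed by their HTTP address."""

    def __init__(self) -> None:
        self._by_http: dict[str, Instance] = {}

    def register(self, http_address: str, zmq_address: str) -> bool:
        """Insert or refresh an entry; return True if a new instance was added."""
        existing = self._by_http.get(http_address)
        if existing is None:
            self._by_http[http_address] = Instance(http_address, zmq_address)
            return True
        # A restarted instance may come back with different ZMQ details:
        # overwrite them and bump the liveness timestamp.
        existing.zmq_address = zmq_address
        existing.last_seen = time.monotonic()
        return False

    def get(self, http_address: str) -> Optional[Instance]:
        return self._by_http.get(http_address)

    def __len__(self) -> int:
        return len(self._by_http)
```

Keying on the HTTP address means a restart that changes the ZMQ port updates the entry in place, so the router never holds a stale duplicate.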


simondanielsson · 2026-04-11T09:25:32Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the MoRI-IO disaggregated serving communication by embedding peer connection information (ZMQ addresses) directly into the request_id. This change eliminates the need for the router to explicitly pass host and port details in kv_transfer_params, aligning the implementation with the P2P-NCCL connector's approach. The toy proxy server and the MoRI-IO connector have been updated to support this new registration and address resolution logic. I have no feedback to provide.


mergify · 2026-04-22T02:06:48Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @simondanielsson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


simondanielsson · 2026-04-22T07:57:03Z

    req_data_copy["kv_transfer_params"].update(
        {
            "do_remote_decode": True,
            "do_remote_prefill": False,


Explanation: These were MoRI-specific fields. For uniformity, we instead embed them into the zmq_address, which is then injected into the request ID, similar to P2pNccl.
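
A rough sketch of what the proxy-side request preparation might look like once the connector-specific fields are gone. Only `do_remote_decode` and `do_remote_prefill` come from the snippet above; the function name, the `MORI_SPECIFIC_FIELDS` tuple, and the rest of the structure are assumptions for illustration.

```python
import copy

# Fields the router previously had to supply; assumed here to be dropped
# from kv_transfer_params because they now travel inside the request ID.
MORI_SPECIFIC_FIELDS = (
    "remote_host",
    "remote_port",
    "remote_handshake_port",
    "remote_notify_port",
)


def make_prefill_request(req_data: dict) -> dict:
    """Return a copy of the request marked for remote decode (hypothetical helper)."""
    req_data_copy = copy.deepcopy(req_data)
    params = req_data_copy.setdefault("kv_transfer_params", {})
    params.update(
        {
            "do_remote_decode": True,
            "do_remote_prefill": False,
        }
    )
    for name in MORI_SPECIFIC_FIELDS:
        params.pop(name, None)
    return req_data_copy
```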


tjtanaa

LGTM

…lm-router (vllm-project#39565) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: Matvei Pashkovskii <mpashkov@amd.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

…lm-router (vllm-project#39565) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: Matvei Pashkovskii <mpashkov@amd.com> Signed-off-by: Adrian <info@zzit.ch>

…lm-router (vllm-project#39565) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: Matvei Pashkovskii <mpashkov@amd.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>

…lm-router (vllm-project#39565) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: Matvei Pashkovskii <mpashkov@amd.com>

…lm-router (vllm-project#39565) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: Matvei Pashkovskii <mpashkov@amd.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

mpashkovskii and others added 5 commits April 11, 2026 10:17

[Fix] Align MoRIIO registration format with vLLM router and handle de…

0168926

…fault ports Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

[Fix] Enhance MoRIIOConnectorScheduler to return KV transfer paramete…

a0efc3d

…rs for router compatibility Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: set moriio ping interval to 3s from 5s

da139fd

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: add potential '-<seq>-<hex>' suffix to _DECODE_ZMQ_RE regex

1639adf

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson requested review from ApostaC, NickLucche and orozery as code owners April 11, 2026 08:40

mergify Bot added documentation Improvements or additions to documentation kv-connector labels Apr 11, 2026

gemini-code-assist Bot reviewed Apr 11, 2026


Comment thread examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_common.py Outdated

simondanielsson added 2 commits April 11, 2026 10:56

fix: raise error if zmq_address is malformed instead of falling back …

7ef0d71

…to defaults Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: discard registrations with incorrect transfer mode

54f4112

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

gemini-code-assist Bot reviewed Apr 11, 2026


Comment thread examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py

Comment thread examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py Outdated

This was referenced Apr 11, 2026

[Fix][MoRI] Add MoRI-IO connector support vllm-project/router#138

Merged

[Reproducer] Align MoRI-IO message format with P2pNcclConnector and vllm-router mpashkovskii/vllm#4

Draft

simondanielsson added 2 commits April 11, 2026 11:12

fix: refresh existing entries in router upon matching http_address

93ef546

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

fix: pass transfer_id in decode request

50578b9

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson mentioned this pull request Apr 11, 2026

[Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn't current support MoRI kvcache connector #38692

Closed

1 task

fix: log unrecognized message formats in toy proxy

ab58983

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

gemini-code-assist Bot reviewed Apr 11, 2026


simondanielsson mentioned this pull request Apr 13, 2026

feat: support stream response in the process_vllm_two_stage_request_discovered vllm-project/router#114

Merged

Merge remote-tracking branch 'upstream/main' into fix/align-moriio-me…

619af87

…ssages

tjtanaa requested a review from heheda12345 April 22, 2026 02:06

mergify Bot added the needs-rebase label Apr 22, 2026

Merge remote-tracking branch 'upstream/main' into fix/align-moriio-me…

65ffb26

…ssages Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson requested a review from xuechendi as a code owner April 22, 2026 07:09

mergify Bot removed the needs-rebase label Apr 22, 2026

simondanielsson commented Apr 22, 2026


Merge remote-tracking branch 'upstream/main' into fix/align-moriio-me…

0d8518d

…ssages

tjtanaa approved these changes Apr 22, 2026


tjtanaa added rocm Related to AMD ROCm ready ONLY add when PR is ready to merge/full CI is needed labels Apr 22, 2026

github-project-automation Bot added this to AMD Apr 22, 2026

github-project-automation Bot moved this to Todo in AMD Apr 22, 2026

Merge branch 'main' into fix/align-moriio-messages

bb4720c

tjtanaa merged commit ac58e2a into vllm-project:main Apr 22, 2026
59 checks passed

github-project-automation Bot moved this from Todo to Done in AMD Apr 22, 2026

rasmith mentioned this pull request Apr 28, 2026

[CI][AMD][BugFix] Update request URL in test_moriio_connector to match vllm-router compatibility changes #41076

Merged

1 task

simondanielsson mentioned this pull request Apr 30, 2026

[ROCm][MoRI] WRITE mode support (layerwise xfer) vllm-project/router#157

Merged

3 tasks

crazyguitar mentioned this pull request May 24, 2026

[Bugfix][ROCm][P/D][MoRIIO] Read-mode KV-release + best_of_n fixes #43541

Open
