
[ROCm][P/D][MORI][BugFix] Add transfer_id to moriio_connector to restore P/D functionality #34907

Merged
DarkLight1337 merged 20 commits into vllm-project:main from rasmith:ransmith_moriio
Mar 16, 2026

Conversation

@rasmith
Contributor

@rasmith rasmith commented Feb 19, 2026

Purpose

An earlier PR introduced random suffixes to completion IDs to better differentiate services in logs. That PR updated the other KV connector implementations, but its modifications could not be applied to the MORI KV connector.

As a result, the MORI KV connector hangs indefinitely, because it waits for a completion ID that will never arrive.

A PR was opened to fix the problem, but its author has stopped working on it, apparently in favor of a second PR. However, that second PR modifies the scheduler, which is undesirable.

The first PR introduced a transfer ID, which seems the most natural solution and stays consistent with the other KV connector implementations. However, one difficulty in getting the MORI KV connector to work is that the decode worker needs to receive the completion ID and must avoid race conditions when associating a transfer ID with a completion ID.

This PR stashes the transfer_id_to_request_id relationship in the MoRIIOConnectorMetadata, which the scheduler stores in the SchedulerOutput. The MoRIIOConnectorMetadata in the SchedulerOutput is then received by start_load_kv in the MoRIIOConnectorWorker, which finally allows both the scheduler and worker processes to maintain the same transfer-ID/completion-ID pairings.
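The flow described above can be sketched roughly as follows. This is a simplified stand-in for the actual vLLM classes: the field and method names follow the PR description, but everything else (the Worker class, the example IDs) is illustrative only.

```python
from dataclasses import dataclass, field

TransferId = int
ReqId = str

@dataclass
class MoRIIOConnectorMetadata:
    # Scheduler-side snapshot of the pairing, shipped to the worker via
    # SchedulerOutput so both processes agree without racing each other.
    transfer_id_to_request_id: dict[TransferId, ReqId] = field(default_factory=dict)

class Worker:
    """Hypothetical stand-in for MoRIIOConnectorWorker."""

    def __init__(self) -> None:
        self.transfer_id_to_request_id: dict[TransferId, ReqId] = {}

    def start_load_kv(self, metadata: MoRIIOConnectorMetadata) -> None:
        # Merge the scheduler's mapping so a completion notification that
        # carries only a transfer ID can be resolved to its request ID.
        self.transfer_id_to_request_id.update(metadata.transfer_id_to_request_id)

# Example: the scheduler records a pairing; the worker receives it later.
meta = MoRIIOConnectorMetadata()
meta.transfer_id_to_request_id[7] = "cmpl-abc123"
worker = Worker()
worker.start_load_kv(meta)
print(worker.transfer_id_to_request_id[7])  # cmpl-abc123
```

Because the mapping travels inside the connector metadata, no extra side channel between scheduler and worker is needed.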

Test Plan

I tested 1P1D on a single machine and with two machines to check that functionality had been restored. I used the following justfile:

# Setting this allows creating a symlink to Justfile from another dir
set working-directory := "/vllm-upstream"

# Needed for the proxy server
vllm-directory := "/vllm-upstream"

TP_SIZE := "1"
PREFILL_GPUS := "3"
DECODE_GPUS := "4"

MEMORY_UTIL := "0.99"

ISL := "1024"
OSL := "5"
RATIO := "0"
PORT := "10001"
CONCURRENCY := "1"
PROMPTS := "1"

DEFAULT_READ_MODE := "0"
PROXY_IP := "10.7.79.60"

port PORT: 
  @python port_allocator.py {{PORT}}


mori_serve_prefill READ_MODE=DEFAULT_READ_MODE:
  CUDA_VISIBLE_DEVICES={{PREFILL_GPUS}} \
  VLLM_MORIIO_CONNECTOR_READ_MODE={{READ_MODE}} \
  VLLM_USE_V1=1 \
  VLLM_ROCM_USE_AITER=1 \
  vllm serve {{MODEL}}        \
   -tp {{TP_SIZE}}  \
   --port 20005     \
   --no-enable-prefix-caching \
   --max-num-batched-tokens 4096         \
   --distributed-executor-backend mp         \
   --gpu_memory_utilization 0.85         \
   --max-model-len 8200         \
   --enforce-eager \
   --trust-remote-code \
   --compilation-config='{"cudagraph_capture_sizes": [1, 256]}' \
   --kv-transfer-config '{"kv_connector":"MoRIIOConnector","kv_role":"kv_producer","kv_connector_extra_config":{"proxy_ip":"{{PROXY_IP}}","http_port":"20005","proxy_ping_port":"36367"}}' 


mori_serve_decode READ_MODE=DEFAULT_READ_MODE:
  CUDA_VISIBLE_DEVICES={{DECODE_GPUS}} \
  VLLM_MORIIO_CONNECTOR_READ_MODE={{READ_MODE}} \
  VLLM_USE_V1=1 \
  VLLM_ROCM_USE_AITER=1 \
  vllm serve {{MODEL}} \
     -tp {{TP_SIZE}}  \
     --port 40005     \
     --no-enable-prefix-caching \
     --max-num-batched-tokens 4096         \
     --distributed-executor-backend mp         \
     --gpu_memory_utilization 0.85         \
     --max-model-len 8200         \
     --enforce-eager \
     --trust-remote-code \
     --compilation-config='{"cudagraph_capture_sizes": [1, 256]}' \
     --kv-transfer-config '{"kv_connector":"MoRIIOConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"proxy_ip":"{{PROXY_IP}}","http_port":"40005","proxy_ping_port":"36367"}}' 

mori_serve_prefill_solo READ_MODE=DEFAULT_READ_MODE:
  CUDA_VISIBLE_DEVICES={{PREFILL_GPUS}} \
  VLLM_MORIIO_CONNECTOR_READ_MODE={{READ_MODE}} \
  VLLM_USE_V1=1 \
  VLLM_ROCM_USE_AITER=1 \
  vllm serve {{MODEL}}        \
   -tp {{TP_SIZE}}  \
   --port 20005     \
   --no-enable-prefix-caching \
   --max-num-batched-tokens 4096         \
   --distributed-executor-backend mp         \
   --gpu_memory_utilization 0.85         \
   --max-model-len 8200         \
   --enforce-eager \
   --trust-remote-code \
   --compilation-config='{"cudagraph_capture_sizes": [1, 256]}' \
   --kv-transfer-config '{"kv_connector":"MoRIIOConnector","kv_role":"kv_producer","kv_connector_extra_config":{"proxy_ip":"{{PROXY_IP}}","http_port":"20005","proxy_ping_port":"36367", "handshake_port":"6301","notify_port":"61005"}}' 

mori_serve_decode_solo READ_MODE=DEFAULT_READ_MODE:
  CUDA_VISIBLE_DEVICES={{DECODE_GPUS}} \
  VLLM_MORIIO_CONNECTOR_READ_MODE={{READ_MODE}} \
  VLLM_USE_V1=1 \
  VLLM_ROCM_USE_AITER=1 \
  vllm serve {{MODEL}} \
     -tp {{TP_SIZE}}  \
     --port 40005     \
     --no-enable-prefix-caching \
     --max-num-batched-tokens 4096         \
     --distributed-executor-backend mp         \
     --gpu_memory_utilization 0.85         \
     --max-model-len 8200         \
     --enforce-eager \
     --trust-remote-code \
     --compilation-config='{"cudagraph_capture_sizes": [1, 256]}' \
     --kv-transfer-config '{"kv_connector":"MoRIIOConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"proxy_ip":"{{PROXY_IP}}","http_port":"40005","proxy_ping_port":"36367","handshake_port":"8301","notify_port":"62005"}}' 

mori_proxy:
  python examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py

send_request:
  curl -X POST http://localhost:10001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ \
      "model": "{{MODEL}}", \
      "prompt": "Give me a recipe for tomato soup", \
      "max_tokens": 150, \
      "temperature": 0.7 \
    }'

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@mergify

mergify bot commented Feb 19, 2026

Documentation preview: https://vllm--34907.org.readthedocs.build/en/34907/

@mergify mergify bot added the documentation, rocm, bug, and kv-connector labels Feb 19, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 19, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces a transfer_id to the MORI KV connector to maintain consistent request-to-transfer mappings across scheduler and worker processes, addressing issues caused by random suffixes in completion IDs. While the architectural change is appropriate, the current implementation contains several critical issues: NameErrors in log messages and exception handlers, and potential runtime crashes from missing keys in mapping dictionaries, particularly for aborted requests or when operating in non-WRITE modes.

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@mergify

mergify bot commented Feb 19, 2026

Hi @rasmith, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
# Reqs to send and their expiration time
self._reqs_need_send: dict[ReqId, float] = {}
self.paths: dict[str, zmq.Socket] = {}
self.transfer_id_to_request_id: dict[TransferId, ReqId] = {}
Contributor

Thanks a lot, @rasmith! Would it be better to define two mapping functions here?

Contributor Author

Depending on the function name, this could save some horizontal space, but what other benefit could be had?

Contributor

For versatility and maintainability, if it's only used within the class, I think using a dictionary is also fine.

Contributor Author

I'm not sure what you meant, actually, but I added map and unmap functions for request-ID/transfer-ID pairs. If you mean swapping the dictionaries for functions, it's also possible to implement a class with __getitem__ and use that instead. That should be easy enough to do if the need arises, but for the time being I don't see a need for it.

@rasmith
Contributor Author

rasmith commented Feb 25, 2026

@orozery @NickLucche @ApostaC Please take a look?

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@inkcherry
Contributor

inkcherry commented Mar 4, 2026

Hi @rasmith, thank you. I noticed that the author of #32630 mentioned that it affects write mode, but the current tests mainly focus on read testing. I also noticed PR #34415. I'm wondering whether write mode can still function correctly.

@hongxiayang
Collaborator

@rasmith
Note:
This needs the following environment variable:

export VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1

@hongxiayang
Collaborator

@vllmellm Let's help to verify this along with the environment variable to help land this PR.

@hongxiayang hongxiayang moved this from Todo to In Progress in AMD Mar 10, 2026
@rasmith
Contributor Author

rasmith commented Mar 10, 2026

@rasmith Note: This needs the following environment variable:

export VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1

It doesn't need randomization to be disabled.

@tjtanaa
Collaborator

tjtanaa commented Mar 13, 2026

@rasmith please try to fix the doc CI error.

@junkang1991

Verified both RDMA write and read modes on Qwen/Qwen3-235B-A22B-FP8, disaggregated 1P+1D (TP=4 each), single MI300X node (8 GPUs).

Benchmark config: --random-input-len 2000 --random-output-len 1000 --num-prompts 100 --request-rate 8

Both modes complete successfully without needing to set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION:

| Case | Transfer mode | Request ID randomization | Mean TTFT (ms) | Mean ITL (ms) |
|------|---------------|--------------------------|----------------|---------------|
| 1    | RDMA Write    | disabled                 | 749.97         | 32.84         |
| 2    | RDMA Read     | disabled                 | 776.91         | 32.66         |
| 3    | RDMA Write    | enabled                  | 741.34         | 32.74         |
| 4    | RDMA Read     | enabled                  | 752.20         | 32.85         |

@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 13, 2026
@AndreasKaratzas
Collaborator

Added the ready label for tests to start.

@rasmith
Contributor Author

rasmith commented Mar 13, 2026

@inkcherry Please take another look?

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@rasmith
Contributor Author

rasmith commented Mar 13, 2026

@rasmith please try to fix the doc CI error.

Fixed

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
chunfangamd added a commit to SemiAnalysisAI/InferenceX that referenced this pull request Mar 13, 2026
…roxy

Replace NixlConnector with MoRIIOConnector for KV cache transfer and
replace the Rust-based vllm-router with a MoRI-IO-aware Python proxy
that handles both HTTP routing and ZMQ-based RDMA endpoint discovery.

The key architectural change is that the proxy enriches each request's
kv_transfer_params with remote RDMA endpoint info (handshake_port,
notify_port, host, port) before dispatching, enabling concurrent
prefill+decode in WRITE mode — something vllm-router could not do
because it only understands HTTP, not the MoRI-IO registration protocol.

Changes:
- Add moriio_proxy.py: MoRI-IO-aware proxy with ZMQ service discovery,
  request enrichment, and /health endpoint (adapted from vLLM upstream
  moriio_toy_proxy_server.py)
- server.sh: switch --kv-transfer-config from NixlConnector to
  MoRIIOConnector with kv_connector_extra_config (proxy_ip,
  proxy_ping_port, http_port); launch proxy before prefill on NODE_RANK=0;
  set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1 as workaround for v0.17.1
  completion-ID mismatch (upstream fix: vllm-project/vllm#34907)
- setup_deps.sh: replace vllm-router/Rust install with lightweight
  Python deps (quart, aiohttp, msgpack, pyzmq) for the proxy

Benchmark (Job 2853 vs 2818 NixlConnector baseline, ISL/OSL=1024):
  TTFT median:  -37% to -55% across C8–C64 (e.g. 384→241ms @C64)
  TTFT p99:     -63% at C64 (6622→2469ms)
  Throughput:   +8% at C64 (2634→2844 tok/s)
  TPOT:         unchanged (~22ms @C64)
@DarkLight1337 DarkLight1337 merged commit 0024f39 into vllm-project:main Mar 16, 2026
49 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in AMD Mar 16, 2026
@junkang1991

Issue: lm_eval fails against the disaggregated proxy

Issue 1 — endpoint mismatch (400 error)

When running lm-evaluation-harness against the proxy at /v1/chat/completions, requests fail with:

ValueError: Either prompt or prompt_embeds must be provided and non-empty.

The proxy's registered request_address points to /v1/completions. When lm_eval sends to /v1/chat/completions with messages, the proxy forwards it verbatim to /v1/completions which expects prompt instead.

Issue 2 — wrong content-type

The proxy response carries Content-Type: text/html, which lm_eval's async client rejects.


Attempted quick fix

  1. Rewrite the forwarded URL to match the incoming request path:

import re

def rewrite_url(registered_url, path):
    return re.sub(r'/v1/(chat/)?completions$', path, registered_url)

  2. Forward the content-type from the decode response:

response.headers['Content-Type'] = decode_response.headers.get('Content-Type', 'application/json')
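For reference, the URL rewrite above behaves as follows in a self-contained form (the host and port in the example are illustrative, not the actual deployment's):

```python
import re

def rewrite_url(registered_url: str, path: str) -> str:
    # Swap whichever completions suffix was registered for the path
    # the client actually hit, so chat and plain completions both route.
    return re.sub(r'/v1/(chat/)?completions$', path, registered_url)

# Illustrative registered address; the proxy registered /v1/completions,
# but the client called /v1/chat/completions.
print(rewrite_url("http://10.0.0.1:40005/v1/completions", "/v1/chat/completions"))
# http://10.0.0.1:40005/v1/chat/completions
```

The `(chat/)?` group makes the substitution symmetric, so a registered chat endpoint is likewise rewritten back to `/v1/completions` when needed.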

GSM8K (5-shot, 50 samples, RDMA write mode):

lm_eval --model local-chat-completions \
    --model_args model=Qwen/Qwen3-235B-A22B-FP8,base_url=http://localhost:10001/v1/chat/completions,tokenizer=Qwen/Qwen3-235B-A22B-FP8,num_concurrent=1 \
    --tasks gsm8k --num_fewshot 5 --apply_chat_template --batch_size 1 \
    --gen_kwargs '{"max_tokens": 4096}'
| Filter           | exact_match |
|------------------|-------------|
| flexible-extract | 0.88        |
| strict-match     | 0.76        |

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Mar 16, 2026

  1. Rewrite the forwarded URL to match the incoming request path:

@junkang1991 This just disables chat completions, if I understand correctly.

Is the issue you describe a problem with this PR?

@tjtanaa
Collaborator

tjtanaa commented Mar 16, 2026

Issue: lm_eval fails against the disaggregated proxy […]

@rasmith @inkcherry can you help take a look at this problem?
In the bugfix PR, please include the lm-eval score and make sure to test against the different endpoints, e.g. /v1/completions and /v1/chat/completions.

@AndreasKaratzas
Collaborator

Is this some kind of test group that fails?

@tjtanaa
Collaborator

tjtanaa commented Mar 16, 2026

@AndreasKaratzas It seems the NIXL tests are not triggered automatically in this PR. I have manually triggered one of them and will check back on its status to see whether it captures the failure.

@inkcherry
Contributor

inkcherry commented Mar 16, 2026

Thanks @junkang1991, @tjtanaa for your efforts. @knitcapcat-amd let's refine the proxy example based on the feedback to ensure better compatibility across more scenarios. Contributions from @rasmith are also very welcome.

@AndreasKaratzas
Collaborator

it seems in this PR, the NIXL tests are not triggered automatically […]

Indeed, this PR is behind the "V1 others" regression on ROCm too.

@rasmith
Contributor Author

rasmith commented Mar 16, 2026

Issue: lm_eval fails against the disaggregated proxy […]

@rasmith @inkcherry can you guys help to take a look at this problem? […]

@tjtanaa @inkcherry I'll look into it.

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…iio_connector to restore P/D functionality (vllm-project#34907)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…iio_connector to restore P/D functionality (vllm-project#34907)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…iio_connector to restore P/D functionality (vllm-project#34907)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…iio_connector to restore P/D functionality (vllm-project#34907)

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

Labels

bug (Something isn't working), documentation (Improvements or additions to documentation), kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done


7 participants