
[BugFix] SharedMemoryConnector: only use shared memory if message size is over threshold #1643

Closed
NickCao wants to merge 2 commits into vllm-project:main from NickCao:shm-multi-node


Conversation

@NickCao
Contributor

@NickCao NickCao commented Mar 3, 2026

Purpose

This allows #939 to be used across multiple nodes.

Test Plan

Start three gpu nodes:

# on master node 0
vllm serve --omni --port 8091 --stage-id 0 \
  Qwen/Qwen2.5-Omni-3B \
  --omni-master-address "<master ip>" --omni-master-port 8092

# on worker node 1
vllm serve --omni --headless --stage-id 1 \
  Qwen/Qwen2.5-Omni-3B \
  --omni-master-address "<master ip>" --omni-master-port 8092

# on worker node 2
vllm serve --omni --headless --stage-id 2 \
  Qwen/Qwen2.5-Omni-3B \
  --omni-master-address "<master ip>" --omni-master-port 8092

Run test query:

curl http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is inside this image?" },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://http.cat/100"
            }
          }
        ]
      }
    ]
  }'

Test Result

An audio response is successfully generated.



@hsliuustc0106
Collaborator

@wuhang2014 @natureofnature PTAL

 size = len(payload)

-if True:
+if size > self.threshold:
Contributor

@natureofnature natureofnature Mar 4, 2026


On the sender side, if size <= self.threshold, we need to update metadata = {"inline_bytes": payload, "size": size}, but on the receiver side the metadata might NOT be passed to the get function. In that case the receiver tries to read the data from shared memory, which does not exist.
I think you need to fix this to keep the sender and receiver consistent in both the with-metadata and without-metadata paths.
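For context, a minimal sender-side sketch of the branch being discussed (an assumption for illustration, not the actual connector code; _write_shm is a hypothetical helper that copies the payload into a named shared-memory segment and returns its name):

def put(self, key, payload):
    size = len(payload)
    if size > self.threshold:
        # Large payload: copy it into shared memory and reference it by name.
        shm_name = self._write_shm(key, payload)  # hypothetical helper
        return {"shm": shm_name, "size": size}
    # Small payload: carry the bytes inline in the metadata; the receiver can only
    # reconstruct the message if this metadata actually reaches its get() call.
    return {"inline_bytes": payload, "size": size}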

Contributor Author


We should probably just drop the without-metadata path. Other connectors may require the use of metadata, and leaving the choice of whether to pass the metadata up to the caller is not ideal.

Contributor

@natureofnature natureofnature Mar 4, 2026


@R2-Y PTAL, is it possible to remove the non-metadata path for the async chunk function and use metadata in all models and modes?

@wuhang2014
Contributor

First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

@natureofnature
Contributor

natureofnature commented Mar 4, 2026

> First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

No, it only works on a single node. For multi-node deployment, we can currently use the mooncake store connector. @wuhang2014

(Supplementary information: currently, in the Bagel/Qwen3 Omni case, when the KV cache transfer manager and async chunk transfer are used, metadata is not carried, and SharedMemoryConnector does not work even when inline mode is set.)

@NickCao
Contributor Author

NickCao commented Mar 4, 2026

> First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

> No, it only works on a single node. For multi-node deployment, we can currently use the mooncake store connector. @wuhang2014

It does work for multi-node; I've tested this using the stage-based CLI from #939 on a cluster of three EC2 instances. It works by forcing the threshold to sys.maxsize, thus sending all messages inline, without actually using shm.

@natureofnature
Contributor

> First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

> No, it only works on a single node. For multi-node deployment, we can currently use the mooncake store connector. @wuhang2014

> It does work for multi-node; I've tested this using the stage-based CLI from #939 on a cluster of three EC2 instances. It works by forcing the threshold to sys.maxsize, thus sending all messages inline, without actually using shm.

You’re right that multi-node can work when shm_threshold_bytes=sys.maxsize, but in that mode payload transfer is effectively inline over the stage transport (ZMQ queue), not shared memory. So this is a compatibility workaround rather than the intended high-performance path. For multi-node deployments, we recommend a network connector (e.g., MooncakeTransferEngineConnector / MooncakeStoreConnector / YuanrongConnector). @NickCao
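As a rough illustration of the workaround, assuming the threshold is exposed via the shm_threshold_bytes setting named above (the exact configuration surface is not shown in this thread):

import sys

# With the threshold set to the largest possible int, no payload ever exceeds it,
# so every message travels inline with its metadata over the stage transport
# instead of via /dev/shm.
shm_threshold_bytes = sys.maxsize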

@NickCao
Contributor Author

NickCao commented Mar 4, 2026

> First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

> No, it only works on a single node. For multi-node deployment, we can currently use the mooncake store connector. @wuhang2014

> It does work for multi-node; I've tested this using the stage-based CLI from #939 on a cluster of three EC2 instances. It works by forcing the threshold to sys.maxsize, thus sending all messages inline, without actually using shm.

> You’re right that multi-node can work when shm_threshold_bytes=sys.maxsize, but in that mode payload transfer is effectively inline over the stage transport (ZMQ queue), not shared memory. So this is a compatibility workaround rather than the intended high-performance path. For multi-node deployments, we recommend a network connector (e.g., MooncakeTransferEngineConnector / MooncakeStoreConnector / YuanrongConnector). @NickCao

I'm aware that this may degrade performance, but it's still useful for development and testing?

@NickCao
Contributor Author

NickCao commented Mar 4, 2026

And for single-node deployments, this can reduce the overhead for small messages (which I suppose is the reason the threshold exists in the first place).

@natureofnature
Contributor

> And for single-node deployments, this can reduce the overhead for small messages (which I suppose is the reason the threshold exists in the first place).

Yes, that's exactly why it's there.

@natureofnature
Contributor

> First of all, I'm not quite sure whether SharedMemoryConnector works in a multi-node deployment. @NickCao @natureofnature

> No, it only works on a single node. For multi-node deployment, we can currently use the mooncake store connector. @wuhang2014

> It does work for multi-node; I've tested this using the stage-based CLI from #939 on a cluster of three EC2 instances. It works by forcing the threshold to sys.maxsize, thus sending all messages inline, without actually using shm.

> You’re right that multi-node can work when shm_threshold_bytes=sys.maxsize, but in that mode payload transfer is effectively inline over the stage transport (ZMQ queue), not shared memory. So this is a compatibility workaround rather than the intended high-performance path. For multi-node deployments, we recommend a network connector (e.g., MooncakeTransferEngineConnector / MooncakeStoreConnector / YuanrongConnector). @NickCao

> I'm aware that this may degrade performance, but it's still useful for development and testing?

@NickCao Currently, forcing metadata to always be set would require some refactoring of chunk_transfer_adapter and kv_transfer_manager. To merge this PR, I suggest: Step 1: apply the threshold check; in the get function, default metadata to {} and fall back to the SHM path when metadata does not contain "inline_bytes"; add a warning log when inline data is expected but missing, so users are aware of the potential mismatch. Step 2 (maybe a follow-up PR): refactor chunk_transfer_adapter and kv_transfer_manager to propagate put() metadata to get() in all scenarios, ensuring sender and receiver are always consistent. @princepride @R2-Y What are your ideas?

@NickCao
Contributor Author

NickCao commented Mar 4, 2026

> In the get function, default metadata to {} and fall back to the SHM path when metadata does not contain "inline_bytes".

This fallback already happens when metadata is missing.

> Add a warning log when inline data is expected but missing, so users are aware of the potential mismatch.

The only way to detect this is to check whether the expected file exists in /dev/shm?

@natureofnature
Contributor

natureofnature commented Mar 4, 2026

> In the get function, default metadata to {} and fall back to the SHM path when metadata does not contain "inline_bytes".

> This fallback already happens when metadata is missing.

> Add a warning log when inline data is expected but missing, so users are aware of the potential mismatch.

> The only way to detect this is to check whether the expected file exists in /dev/shm?

I mean something like:

def get(self, ..., metadata=None):
    if metadata is None:
        metadata = {}
    if "inline_bytes" in metadata:
        ...  # inline path: the payload is carried in the metadata
    elif "shm" in metadata:
        ...  # read the data from the shared memory segment named in metadata["shm"]
    else:
        ...  # fallback: read from shared memory using the key
        logger.warning("No inline key ...")  # e.g. when the read times out

By the way, in the Bagel/Qwen3 Omni case, when the KV cache transfer manager and async chunk transfer are used, metadata is not carried; that's why I said shared memory does not work for multi-node deployment right now. And perhaps for these models you even need users to set the threshold to a low value to make them work, until metadata propagation is in place.

@NickCao NickCao force-pushed the shm-multi-node branch 2 times, most recently from 0729286 to d1773ff on March 4, 2026 at 17:02
@NickCao
Contributor Author

NickCao commented Mar 4, 2026

Added an error log for this scenario; this is a hard error, not a warning, since it simply does not work under this configuration: multi-node, shm connector, metadata not passed.

A follow-up PR would make the KV cache transfer manager and async chunk transfer compatible with this.
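A rough sketch of the receiver-side behavior described here, assuming the connector falls back to reading a named shared-memory segment when no inline bytes are present (_read_shm is a hypothetical helper, not the actual connector API):

def get(self, key, metadata=None):
    metadata = metadata or {}
    if "inline_bytes" in metadata:
        # Inline path: the payload travelled with the metadata.
        return metadata["inline_bytes"]
    try:
        # Fallback: read the payload from the shared-memory segment for this key.
        return self._read_shm(key)  # hypothetical helper
    except FileNotFoundError:
        # Multi-node + shm connector + no metadata: there is nothing to read locally,
        # so this configuration is reported as a hard error rather than a warning.
        logger.error("No shared memory segment for key %s and no inline bytes provided", key)
        raise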

@R2-Y
Contributor

R2-Y commented Mar 5, 2026

We need to use put_key and get_key in shared memory to control the order in which chunk data are sent and received. It seems difficult to correctly transmit chunk metadata using an inline method; inline will lead to disorder of the received chunks. @NickCao @natureofnature

@NickCao
Contributor Author

NickCao commented Mar 5, 2026

> We need to use put_key and get_key in shared memory to control the order in which chunk data are sent and received. It seems difficult to correctly transmit chunk metadata using an inline method; inline will lead to disorder of the received chunks. @NickCao @natureofnature

Does this also apply to single node? That would mean we should instead drop the inline code path completely.

@R2-Y
Contributor

R2-Y commented Mar 5, 2026

> We need to use put_key and get_key in shared memory to control the order in which chunk data are sent and received. It seems difficult to correctly transmit chunk metadata using an inline method; inline will lead to disorder of the received chunks. @NickCao @natureofnature

> Does this also apply to single node? That would mean we should instead drop the inline code path completely.

Yes, it also applies to single node.

@natureofnature
Contributor

I also have some concerns, which depend on the future architecture: if it fully follows async mode and the orchestrator sends request metadata to multiple stages at a time (async stage execution), inline mode's advantage disappears: the following stages cannot get the payload through the inline path when they only receive the metadata from the orchestrator.
(For example, in the current async chunk mode, Qwen3 Omni stage 2 does not wait for stage 1's result.)

NickCao added 2 commits March 10, 2026 11:10
…e is over threshold

Signed-off-by: Nick Cao <ncao@redhat.com>
…fallback path

Signed-off-by: Nick Cao <ncao@redhat.com>
@NickCao NickCao marked this pull request as draft March 10, 2026 15:12
@NickCao
Contributor Author

NickCao commented Mar 10, 2026

Marking as draft due to concerns about compatibility between inline data and async/chunked KV transfer.

@NickCao NickCao mentioned this pull request Mar 10, 2026
@NickCao NickCao closed this Apr 9, 2026