
[1/2] Add ModelExpress coordination for remote instance weight loading - matching TP#19920

Merged
ishandhanani merged 13 commits into main from ishan/mx
Mar 18, 2026
Conversation

@ishandhanani
Collaborator

@ishandhanani ishandhanani commented Mar 5, 2026

Summary

Add ModelExpress (MX) as a coordination backend for SGLang's remote instance weight loading, enabling persistent metadata discovery for GPU-to-GPU weight transfers via Mooncake TransferEngine.

Related: #12910 -- MX provides the "planner" functionality discussed there (seed discovery, metadata persistence, multi-model coordination) as a standalone Rust gRPC service, with K8s-native backends (CRD, ConfigMap, Redis). In the future, MX will also be able to handle reshaping in order to support going from DEPX to DEPY.

Motivation

SGLang's existing remote instance weight loading ("rfork") supports two backends: NCCL and Mooncake TransferEngine. Both rely on direct HTTP coordination between seed and target -- the target must know the seed's IP and port upfront, and metadata is ephemeral (lost if the seed restarts).

ModelExpress solves this by providing:

  • Persistent metadata storage: tensor addresses and TransferEngine session IDs survive seed restarts
  • K8s-native discovery: supports CRD, ConfigMap, and Redis backends so targets can discover seeds without hardcoded addresses
  • Multi-model support: a single MX server coordinates metadata for multiple models
  • Decoupled lifecycle: seed and target don't need to be started in a specific order -- the target polls MX until the seed is ready

This integration reuses 100% of SGLang's existing TransferEngine infrastructure (register_memory_region(), batch_transfer_sync_read()). The only new code is the coordination layer that replaces direct HTTP with MX gRPC calls.

Architecture

Seed Instance                    ModelExpress Server              Target Instance
─────────────                    ──────────────────              ───────────────
1. Load model from disk          
2. Init TransferEngine           
3. register_memory_region()      
4. Publish metadata ──────────►  Store {session_id,              
                                  tensor descriptors}            
5. Publish ready ─────────────►  Set ready flag                  
                                                                 1. Init TransferEngine
                                                                 2. Create dummy weights
                                                                 3. register_memory_region()
                                 ◄────────────────────────────── 4. Poll wait_for_ready()
                                 Return ready ─────────────────► 
                                 ◄────────────────────────────── 5. get_metadata()
                                 Return {session_id, tensors} ─► 
                                                                 6. batch_transfer_sync_read()
                                                                    (RDMA direct from seed GPU)
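The coordination handshake in the diagram can be simulated end to end with an in-memory stand-in for the MX server. This is a sketch only: the `InMemoryMX` class is hypothetical, and while `wait_for_ready()` and `get_metadata()` mirror the step names in the diagram, the real path uses gRPC calls and Mooncake TransferEngine for the RDMA read.

```python
import time


class InMemoryMX:
    """Hypothetical stand-in for the ModelExpress server: stores per-model
    metadata and a ready flag, mirroring the handshake in the diagram."""

    def __init__(self):
        self._store = {}   # model_name -> {"session_id": ..., "tensors": [...]}
        self._ready = set()

    # Seed side (steps 4-5 in the diagram)
    def publish_metadata(self, model_name, session_id, tensors):
        self._store[model_name] = {"session_id": session_id, "tensors": tensors}

    def publish_ready(self, model_name):
        self._ready.add(model_name)

    # Target side (steps 4-5 in the diagram)
    def wait_for_ready(self, model_name, timeout_s=5.0, poll_s=0.05):
        deadline = time.monotonic() + timeout_s
        while model_name not in self._ready:
            if time.monotonic() > deadline:
                raise TimeoutError(f"seed for {model_name!r} never became ready")
            time.sleep(poll_s)

    def get_metadata(self, model_name):
        return self._store[model_name]


mx = InMemoryMX()

# Seed: after loading weights and register_memory_region(), publish addresses.
mx.publish_metadata(
    "Qwen/Qwen3-14B-FP8",
    session_id="seed-host:12345",
    tensors=[{"name": "model.embed_tokens.weight", "addr": 0x7F0000000000, "nbytes": 1024}],
)
mx.publish_ready("Qwen/Qwen3-14B-FP8")

# Target: poll until the seed is ready, then fetch metadata; the real loader
# would now issue batch_transfer_sync_read() against the returned addresses.
mx.wait_for_ready("Qwen/Qwen3-14B-FP8")
meta = mx.get_metadata("Qwen/Qwen3-14B-FP8")
print(meta["session_id"])  # seed-host:12345
```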

How it builds on existing SGLang infra

  • register_memory_region() (existing): registers model weight GPU memory with TransferEngine for RDMA access. Used by both seed (to expose weights) and target (to provide destination buffers).
  • batch_transfer_sync_read() (existing): performs batched RDMA reads from seed GPU memory to target GPU memory. The core transfer mechanism is unchanged.
  • RemoteInstanceModelLoader (existing): the model loader that creates dummy weights and fills them from a remote source. Extended with a new MODEL_EXPRESS backend dispatch.
  • --remote-instance-weight-loader-start-seed-via-transfer-engine (existing): initializes TransferEngine on the seed side. Reused as-is.

The new code is thin (~150 lines in loader.py, ~60 lines in model_runner.py) and sits between the existing TransferEngine primitives and the new MX coordination layer.

CLI

All MX config is passed via a single --model-express-config JSON arg:

Seed (publishes weights):

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 --port 30000 \
  --load-format auto \
  --remote-instance-weight-loader-start-seed-via-transfer-engine \
  --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8", "source": true}'

Target (loads weights via RDMA):

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 --port 30001 \
  --load-format remote_instance \
  --remote-instance-weight-loader-backend model_express \
  --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8"}'
Key          Description
url          MX gRPC server address (e.g., localhost:8001)
model_name   Model identifier in MX (defaults to --model-path)
source       Seed mode: publish metadata to MX after loading (default: false)
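A minimal sketch of how a single JSON CLI arg can back property accessors for these keys. The `ServerArgsSketch` class and its exact property names are illustrative, not the actual `server_args.py` implementation:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsSketch:
    """Hypothetical reduction of server args to the MX-relevant fields."""
    model_path: str
    model_express_config: Optional[str] = None  # raw JSON string from the CLI

    def _mx(self) -> dict:
        return json.loads(self.model_express_config) if self.model_express_config else {}

    @property
    def model_express_url(self) -> Optional[str]:
        return self._mx().get("url")

    @property
    def model_express_model_name(self) -> str:
        # Falls back to --model-path when model_name is not set, per the table above.
        return self._mx().get("model_name", self.model_path)

    @property
    def model_express_source(self) -> bool:
        # Seed mode is opt-in; default false, per the table above.
        return bool(self._mx().get("source", False))


args = ServerArgsSketch(
    model_path="Qwen/Qwen3-14B-FP8",
    model_express_config='{"url": "localhost:8001", "source": true}',
)
```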

Changes

  • server_args.py: --model-express-config JSON arg with @property accessors for url, model_name, source. model_express added to remote_instance_weight_loader_backend enum.
  • load_config.py: model_express_url and model_express_model_name fields
  • remote_instance_weight_loader_utils.py: MODEL_EXPRESS = "model_express" enum variant
  • loader.py: load_model_from_model_express() -- queries MX for seed metadata, validates per-tensor byte sizes, executes batched RDMA transfer
  • model_runner.py: _publish_model_express_metadata() -- calls register_memory_region() for seed mode, publishes per-tensor metadata with correct per-tensor dtype (handles FP8 mixed-dtype models where weights are FP8, norms are BF16, scales are FP32)

FP8 support

FP8 models have mixed dtypes in memory -- a single model dtype doesn't describe all tensors. The implementation:

  • Seed: derives dtype string from each tensor's actual element_size (1 byte -> float8_e4m3fn, 2 -> bfloat16, 4 -> float32)
  • Target: compares total byte sizes (rather than checking numel and element_size separately) since RDMA is a raw memcpy

Verified with Qwen3-14B-FP8 (483 tensors, ~15 GB).
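The element-size-to-dtype derivation and the byte-size comparison can be sketched in plain Python. The helper names are illustrative; the real code operates on torch tensors and their `element_size()`:

```python
# Seed side: derive a dtype string from each tensor's element size,
# so FP8 weights, BF16 norms, and FP32 scales each get the right label.
ELEMENT_SIZE_TO_DTYPE = {1: "float8_e4m3fn", 2: "bfloat16", 4: "float32"}


def dtype_str_for(element_size: int) -> str:
    return ELEMENT_SIZE_TO_DTYPE[element_size]


# Target side: validate by total byte size, since the RDMA read is a raw
# memcpy and a single model-wide dtype check would reject mixed-dtype FP8 models.
def sizes_match(numel: int, element_size: int, remote_nbytes: int) -> bool:
    return numel * element_size == remote_nbytes
```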

Dependencies

Test plan

  • Qwen3-0.6B (BF16): 226 tensors, seed + target produce identical outputs
  • Qwen3-14B-FP8: 483 tensors, seed + target produce identical outputs
  • Multi-GPU (tp=4, ep=4): Qwen3-235B-A22B on 8x GPU, all ranks publish/read correctly
  • Backward compat: existing NCCL and TransferEngine backends unaffected
  • Multi-node RDMA benchmark (requires IB/NVLink hardware)

Backward compatibility

Fully backward compatible. All changes are additive -- new enum value, new dispatch branch, new CLI arg with None default. Existing NCCL and TransferEngine+HTTP paths are untouched.

Benchmark (single node, 8x H100 NVLink)

Model:              Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
TP=4 EP=4

Seed (disk load):   15.79s
Target (MX+RDMA):    0.41s
Speedup:            38.51x

Transport: NVLink (MC_INTRANODE_NVLINK=1)

Weight loading is 38.5x faster when transferring via NVLink from a seed instance vs loading from disk.
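The reported speedup follows directly from the two timings:

```python
disk_load_s = 15.79   # seed: load from disk
rdma_load_s = 0.41    # target: MX coordination + NVLink transfer
speedup = disk_load_s / rdma_load_s
print(f"{speedup:.2f}x")  # 38.51x
```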

Add MODEL_EXPRESS backend for remote instance weight loading that uses
ModelExpress gRPC server for metadata coordination instead of direct
HTTP between seed and target instances. Supports FP8 and BF16 models
with per-tensor byte-size matching for mixed-dtype transfers.

New CLI args: --model-express-url, --model-express-model-name,
--model-express-source
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ishandhanani ishandhanani changed the title Add ModelExpress coordination for remote instance weight loading [WIP] Add ModelExpress coordination for remote instance weight loading Mar 5, 2026
Replace --model-express-url, --model-express-model-name, --model-express-source
with single --model-express-config JSON arg. Properties provide backwards-compatible
access for all downstream code (model_runner, loader, load_config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ishandhanani ishandhanani changed the title [WIP] Add ModelExpress coordination for remote instance weight loading [1/2] Add ModelExpress coordination for remote instance weight loading - matching TP Mar 5, 2026
Ishan Dhanani added 2 commits March 6, 2026 04:16
Dead code from initial MX integration. We switched to raw byte size
comparison instead of dtype string conversion.
- Remove unused _get_model_dtype_str() method
- Drop lossy element_size_to_dtype reverse mapping from seed publish
  (dtype field was never read on target side)
- Wrap MxClient usage in try/finally to prevent gRPC channel leaks
- Close MxClient before starting RDMA transfers (connection not needed
  during transfer phase)
@ishandhanani ishandhanani marked this pull request as ready for review March 6, 2026 06:12
@ShangmingCai
Collaborator

cc: @tianyuzhou95

@tianyuzhou95
Contributor

cc @amysaq2023

@ishandhanani
Collaborator Author

hi @tianyuzhou95 @amysaq2023 - please feel free to contact me in the sglang slack! Have a lot of neat ideas that I'd love to collaborate on!

Contributor

@AndyDai-nv AndyDai-nv left a comment


LGTM on the modelexpress side! One nit, could you change the model_express... or model-express... in args to modelexpress...?

@amysaq2023
Contributor

Hi @ishandhanani, MX looks great! It's exactly what the remote instance weight loader needed.
I have a few questions regarding how a seed instance is registered with the MX server. When a seed instance publishes its metadata to the MX server, is it registered with a key that includes information about the parallel mechanism being used (such as TP, DP, PP, and nnodes information)? Since different parallel mechanisms affect how weights are partitioned across GPU memory, instances running the same model but with different parallel configurations would require distinct seed instances.

@amysaq2023
Contributor

@ishandhanani Also should we consider the scenario of nnode>1 or dp>1 mentioned in this PR: #17389 ?

@ishandhanani
Collaborator Author

ishandhanani commented Mar 12, 2026

Hi @amysaq2023 - it's great that you think so! We really want to design MX to be compatible with SGLang's needs. Regarding mixed parallelism - I decided to cover that functionality in a separate PR, #19983. This PR just provides the initial server scaffolding for MX itself.

Should we move this discussion there?

@amysaq2023
Contributor

amysaq2023 commented Mar 12, 2026

Sure thing :)

ishandhanani and others added 3 commits March 13, 2026 16:25
Address review nit: remove separator from model_express/model-express
naming to use modelexpress consistently across CLI args, field names,
enum values, and method names.
Document modelexpress as a third R-Fork backend option alongside NCCL
and TransferEngine, including seed/client usage examples and the
--modelexpress-config argument.
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 13, 2026
@AndyDai-nv
Contributor

LGTM, thanks for the update

@ishandhanani
Collaborator Author

/tag-and-rerun-ci

@ishandhanani
Collaborator Author

As stated above - this PR is purely used to scaffold the MX integration with SGL. All imports are safe and gated

@ishandhanani
Collaborator Author

/tag-and-rerun-ci

@ishandhanani ishandhanani merged commit 8f0f36c into main Mar 18, 2026
85 of 93 checks passed
@ishandhanani ishandhanani deleted the ishan/mx branch March 18, 2026 20:38
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
JD-ETH added a commit to JD-ETH/sglang that referenced this pull request Mar 24, 2026
Remove _publish_modelexpress_metadata (from upstream PR sgl-project#19920),
modelexpress_config server arg, ssl_verify/engine_info_bootstrap_url
methods, and enhanced url() method — none of these belong in our
feature branches. Revert glm4.py super().__init__() to original form.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>