
[1/2] Add ModelExpress coordination for remote instance weight loading - matching TP#19920

Merged
ishandhanani merged 13 commits into main from ishan/mx
Mar 18, 2026
Conversation

@ishandhanani
Collaborator

@ishandhanani ishandhanani commented Mar 5, 2026

Summary

Add ModelExpress (MX) as a coordination backend for SGLang's remote instance weight loading, enabling persistent metadata discovery for GPU-to-GPU weight transfers via Mooncake TransferEngine.

Related: #12910 -- MX provides the "planner" functionality discussed there (seed discovery, metadata persistence, multi-model coordination) as a standalone Rust gRPC service, with K8s-native backends (CRD, ConfigMap, Redis). In the future, MX will also be able to handle reshaping in order to support going from DEPX to DEPY.

Motivation

SGLang's existing remote instance weight loading ("rfork") supports two backends: NCCL and Mooncake TransferEngine. Both rely on direct HTTP coordination between seed and target -- the target must know the seed's IP and port upfront, and metadata is ephemeral (lost if the seed restarts).

ModelExpress solves this by providing:

  • Persistent metadata storage: tensor addresses and TransferEngine session IDs survive seed restarts
  • K8s-native discovery: supports CRD, ConfigMap, and Redis backends so targets can discover seeds without hardcoded addresses
  • Multi-model support: a single MX server coordinates metadata for multiple models
  • Decoupled lifecycle: seed and target don't need to be started in a specific order -- the target polls MX until the seed is ready

This integration reuses 100% of SGLang's existing TransferEngine infrastructure (register_memory_region(), batch_transfer_sync_read()). The only new code is the coordination layer that replaces direct HTTP with MX gRPC calls.

Architecture

Seed Instance                    ModelExpress Server              Target Instance
─────────────                    ──────────────────              ───────────────
1. Load model from disk          
2. Init TransferEngine           
3. register_memory_region()      
4. Publish metadata ──────────►  Store {session_id,              
                                  tensor descriptors}            
5. Publish ready ─────────────►  Set ready flag                  
                                                                 1. Init TransferEngine
                                                                 2. Create dummy weights
                                                                 3. register_memory_region()
                                 ◄────────────────────────────── 4. Poll wait_for_ready()
                                 Return ready ─────────────────► 
                                 ◄────────────────────────────── 5. get_metadata()
                                 Return {session_id, tensors} ─► 
                                                                 6. batch_transfer_sync_read()
                                                                    (RDMA direct from seed GPU)
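The coordination handshake in the diagram can be simulated end to end with an in-memory stand-in for the MX server. This is a sketch only: the `InMemoryMX` class is hypothetical, and while `wait_for_ready()` and `get_metadata()` mirror the step names in the diagram, the real path uses gRPC calls and Mooncake TransferEngine for the RDMA read.

```python
import time


class InMemoryMX:
    """Hypothetical stand-in for the ModelExpress server: stores per-model
    metadata and a ready flag, mirroring the handshake in the diagram."""

    def __init__(self):
        self._store = {}   # model_name -> {"session_id": ..., "tensors": [...]}
        self._ready = set()

    # Seed side (steps 4-5 in the diagram)
    def publish_metadata(self, model_name, session_id, tensors):
        self._store[model_name] = {"session_id": session_id, "tensors": tensors}

    def publish_ready(self, model_name):
        self._ready.add(model_name)

    # Target side (steps 4-5 in the diagram)
    def wait_for_ready(self, model_name, timeout_s=5.0, poll_s=0.05):
        deadline = time.monotonic() + timeout_s
        while model_name not in self._ready:
            if time.monotonic() > deadline:
                raise TimeoutError(f"seed for {model_name!r} never became ready")
            time.sleep(poll_s)

    def get_metadata(self, model_name):
        return self._store[model_name]


mx = InMemoryMX()

# Seed: after loading weights and register_memory_region(), publish addresses.
mx.publish_metadata(
    "Qwen/Qwen3-14B-FP8",
    session_id="seed-host:12345",
    tensors=[{"name": "model.embed_tokens.weight", "addr": 0x7F0000000000, "nbytes": 1024}],
)
mx.publish_ready("Qwen/Qwen3-14B-FP8")

# Target: poll until the seed is ready, then fetch metadata; the real loader
# would now issue batch_transfer_sync_read() against the returned addresses.
mx.wait_for_ready("Qwen/Qwen3-14B-FP8")
meta = mx.get_metadata("Qwen/Qwen3-14B-FP8")
print(meta["session_id"])  # seed-host:12345
```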

How it builds on existing SGLang infra

  • register_memory_region() (existing): registers model weight GPU memory with TransferEngine for RDMA access. Used by both seed (to expose weights) and target (to provide destination buffers).
  • batch_transfer_sync_read() (existing): performs batched RDMA reads from seed GPU memory to target GPU memory. The core transfer mechanism is unchanged.
  • RemoteInstanceModelLoader (existing): the model loader that creates dummy weights and fills them from a remote source. Extended with a new MODEL_EXPRESS backend dispatch.
  • --remote-instance-weight-loader-start-seed-via-transfer-engine (existing): initializes TransferEngine on the seed side. Reused as-is.

The new code is thin (~150 lines in loader.py, ~60 lines in model_runner.py) and sits between the existing TransferEngine primitives and the new MX coordination layer.

CLI

All MX config is passed via a single --model-express-config JSON arg:

Seed (publishes weights):

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 --port 30000 \
  --load-format auto \
  --remote-instance-weight-loader-start-seed-via-transfer-engine \
  --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8", "source": true}'

Target (loads weights via RDMA):

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 --port 30001 \
  --load-format remote_instance \
  --remote-instance-weight-loader-backend model_express \
  --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8"}'
Key          Description
url          MX gRPC server address (e.g., localhost:8001)
model_name   Model identifier in MX (defaults to --model-path)
source       Seed mode: publish metadata to MX after loading (default: false)
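A minimal sketch of how a single JSON CLI arg can back property accessors for these keys. The `ServerArgsSketch` class and its exact property names are illustrative, not the actual `server_args.py` implementation:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsSketch:
    """Hypothetical reduction of server args to the MX-relevant fields."""
    model_path: str
    model_express_config: Optional[str] = None  # raw JSON string from the CLI

    def _mx(self) -> dict:
        return json.loads(self.model_express_config) if self.model_express_config else {}

    @property
    def model_express_url(self) -> Optional[str]:
        return self._mx().get("url")

    @property
    def model_express_model_name(self) -> str:
        # Falls back to --model-path when model_name is not set, per the table above.
        return self._mx().get("model_name", self.model_path)

    @property
    def model_express_source(self) -> bool:
        # Seed mode is opt-in; default false, per the table above.
        return bool(self._mx().get("source", False))


args = ServerArgsSketch(
    model_path="Qwen/Qwen3-14B-FP8",
    model_express_config='{"url": "localhost:8001", "source": true}',
)
```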

Changes

  • server_args.py: --model-express-config JSON arg with @property accessors for url, model_name, source. model_express added to remote_instance_weight_loader_backend enum.
  • load_config.py: model_express_url and model_express_model_name fields
  • remote_instance_weight_loader_utils.py: MODEL_EXPRESS = "model_express" enum variant
  • loader.py: load_model_from_model_express() -- queries MX for seed metadata, validates per-tensor byte sizes, executes batched RDMA transfer
  • model_runner.py: _publish_model_express_metadata() -- calls register_memory_region() for seed mode, publishes per-tensor metadata with correct per-tensor dtype (handles FP8 mixed-dtype models where weights are FP8, norms are BF16, scales are FP32)

FP8 support

FP8 models have mixed dtypes in memory -- a single model dtype doesn't describe all tensors. The implementation:

  • Seed: derives dtype string from each tensor's actual element_size (1 byte -> float8_e4m3fn, 2 -> bfloat16, 4 -> float32)
  • Target: compares total byte sizes (rather than checking numel and element_size separately) since RDMA is a raw memcpy

Verified with Qwen3-14B-FP8 (483 tensors, ~15 GB).
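The element-size-to-dtype derivation and the byte-size comparison can be sketched in plain Python. The helper names are illustrative; the real code operates on torch tensors and their `element_size()`:

```python
# Seed side: derive a dtype string from each tensor's element size,
# so FP8 weights, BF16 norms, and FP32 scales each get the right label.
ELEMENT_SIZE_TO_DTYPE = {1: "float8_e4m3fn", 2: "bfloat16", 4: "float32"}


def dtype_str_for(element_size: int) -> str:
    return ELEMENT_SIZE_TO_DTYPE[element_size]


# Target side: validate by total byte size, since the RDMA read is a raw
# memcpy and a single model-wide dtype check would reject mixed-dtype FP8 models.
def sizes_match(numel: int, element_size: int, remote_nbytes: int) -> bool:
    return numel * element_size == remote_nbytes
```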

Dependencies

Test plan

  • Qwen3-0.6B (BF16): 226 tensors, seed + target produce identical outputs
  • Qwen3-14B-FP8: 483 tensors, seed + target produce identical outputs
  • Multi-GPU (tp=4, ep=4): Qwen3-235B-A22B on 8x GPU, all ranks publish/read correctly
  • Backward compat: existing NCCL and TransferEngine backends unaffected
  • Multi-node RDMA benchmark (requires IB/NVLink hardware)

Backward compatibility

Fully backward compatible. All changes are additive -- new enum value, new dispatch branch, new CLI arg with None default. Existing NCCL and TransferEngine+HTTP paths are untouched.

Benchmark (single node, 8x H100 NVLink)

Model:              Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
TP=4 EP=4

Seed (disk load):   15.79s
Target (MX+RDMA):    0.41s
Speedup:            38.51x

Transport: NVLink (MC_INTRANODE_NVLINK=1)

Weight loading is 38.5x faster when transferring via NVLink from a seed instance vs loading from disk.
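The reported speedup follows directly from the two timings:

```python
disk_load_s = 15.79   # seed: load from disk
rdma_load_s = 0.41    # target: MX coordination + NVLink transfer
speedup = disk_load_s / rdma_load_s
print(f"{speedup:.2f}x")  # 38.51x
```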

Add MODEL_EXPRESS backend for remote instance weight loading that uses
ModelExpress gRPC server for metadata coordination instead of direct
HTTP between seed and target instances. Supports FP8 and BF16 models
with per-tensor byte-size matching for mixed-dtype transfers.

New CLI args: --model-express-url, --model-express-model-name,
--model-express-source
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ishandhanani ishandhanani changed the title Add ModelExpress coordination for remote instance weight loading [WIP] Add ModelExpress coordination for remote instance weight loading Mar 5, 2026
Replace --model-express-url, --model-express-model-name, --model-express-source
with single --model-express-config JSON arg. Properties provide backwards-compatible
access for all downstream code (model_runner, loader, load_config).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ishandhanani ishandhanani changed the title [WIP] Add ModelExpress coordination for remote instance weight loading [1/2] Add ModelExpress coordination for remote instance weight loading - matching TP Mar 5, 2026
Ishan Dhanani added 2 commits March 6, 2026 04:16
Dead code from initial MX integration. We switched to raw byte size
comparison instead of dtype string conversion.
- Remove unused _get_model_dtype_str() method
- Drop lossy element_size_to_dtype reverse mapping from seed publish
  (dtype field was never read on target side)
- Wrap MxClient usage in try/finally to prevent gRPC channel leaks
- Close MxClient before starting RDMA transfers (connection not needed
  during transfer phase)
@ishandhanani ishandhanani marked this pull request as ready for review March 6, 2026 06:12
@ShangmingCai
Collaborator

cc: @tianyuzhou95

@tianyuzhou95
Contributor

cc @amysaq2023

@ishandhanani
Collaborator Author

hi @tianyuzhou95 @amysaq2023 - please feel free to contact me in the sglang slack! Have a lot of neat ideas that I'd love to collaborate on!

Contributor

@AndyDai-nv AndyDai-nv left a comment


LGTM on the modelexpress side! One nit, could you change the model_express... or model-express... in args to modelexpress...?

@amysaq2023
Contributor

Hi @ishandhanani, MX looks great! It's exactly what the remote instance weight loader needed.
I have a few questions regarding how a seed instance is registered with the MX server. When a seed instance publishes its metadata to the MX server, is it registered with a key that includes information about the parallel mechanism being used (such as TP, DP, PP, and nnodes information)? Since different parallel mechanisms affect how weights are partitioned across GPU memory, instances running the same model but with different parallel configurations would require distinct seed instances.

@amysaq2023
Contributor

@ishandhanani Also should we consider the scenario of nnode>1 or dp>1 mentioned in this PR: #17389 ?

@ishandhanani
Collaborator Author

ishandhanani commented Mar 12, 2026

Hi @amysaq2023 - it's great that you think so! We really want to design MX to be compatible with SGLang's needs. Regarding mixed parallelism - I decided to cover that functionality in a separate PR, #19983. This PR just provides the initial server scaffolding for MX itself.

Should we move this discussion there?

@amysaq2023
Contributor

amysaq2023 commented Mar 12, 2026

Sure thing :)

ishandhanani and others added 3 commits March 13, 2026 16:25
Address review nit: remove separator from model_express/model-express
naming to use modelexpress consistently across CLI args, field names,
enum values, and method names.
Document modelexpress as a third R-Fork backend option alongside NCCL
and TransferEngine, including seed/client usage examples and the
--modelexpress-config argument.
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 13, 2026
@AndyDai-nv
Contributor

LGTM, thanks for the update

@ishandhanani
Collaborator Author

/tag-and-rerun-ci

@ishandhanani
Collaborator Author

As stated above - this PR is purely used to scaffold the MX integration with SGL. All imports are safe and gated

@ishandhanani
Collaborator Author

/tag-and-rerun-ci

@ishandhanani ishandhanani merged commit 8f0f36c into main Mar 18, 2026
85 of 93 checks passed
@ishandhanani ishandhanani deleted the ishan/mx branch March 18, 2026 20:38
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
JD-ETH added a commit to JD-ETH/sglang that referenced this pull request Mar 24, 2026
Remove _publish_modelexpress_metadata (from upstream PR sgl-project#19920),
modelexpress_config server arg, ssl_verify/engine_info_bootstrap_url
methods, and enhanced url() method — none of these belong in our
feature branches. Revert glm4.py super().__init__() to original form.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
…g - matching TP (sgl-project#19920)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>