[1/2] Add ModelExpress coordination for remote instance weight loading - matching TP#19920
ishandhanani merged 13 commits into main
Conversation
Add a MODEL_EXPRESS backend for remote instance weight loading that uses a ModelExpress gRPC server for metadata coordination instead of direct HTTP between seed and target instances. Supports FP8 and BF16 models with per-tensor byte-size matching for mixed-dtype transfers. New CLI args: `--model-express-url`, `--model-express-model-name`, `--model-express-source`
Replace --model-express-url, --model-express-model-name, --model-express-source with single --model-express-config JSON arg. Properties provide backwards-compatible access for all downstream code (model_runner, loader, load_config). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
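The consolidation described above can be sketched as follows. This is a hypothetical, minimal illustration (the class and field names are assumptions, not SGLang's actual `server_args.py`): a single JSON arg is parsed once, and `@property` accessors keep the old per-field names working for downstream code.

```python
import argparse
import json
from typing import Optional


class ModelExpressConfig:
    """Hypothetical sketch: parse one --model-express-config JSON arg and
    expose the old per-field names as read-only properties."""

    def __init__(self, raw: Optional[str]):
        self._cfg = json.loads(raw) if raw else {}

    @property
    def model_express_url(self):
        return self._cfg.get("url")

    @property
    def model_express_model_name(self):
        return self._cfg.get("model_name")

    @property
    def model_express_source(self) -> bool:
        # Absent field defaults to False (target instances).
        return bool(self._cfg.get("source", False))


parser = argparse.ArgumentParser()
parser.add_argument("--model-express-config", type=str, default=None)
args = parser.parse_args(
    ["--model-express-config",
     '{"url": "localhost:8001", "model_name": "m", "source": true}']
)
cfg = ModelExpressConfig(args.model_express_config)
```

Downstream code can keep reading `cfg.model_express_url` and friends without knowing the three flags were merged into one.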
Dead code from initial MX integration. We switched to raw byte size comparison instead of dtype string conversion.
- Remove unused `_get_model_dtype_str()` method
- Drop lossy `element_size_to_dtype` reverse mapping from seed publish (dtype field was never read on target side)
- Wrap `MxClient` usage in try/finally to prevent gRPC channel leaks
- Close `MxClient` before starting RDMA transfers (connection not needed during transfer phase)
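The client-lifecycle fix in this commit follows a common pattern: fetch metadata up front, then close the coordination client before the transfer phase begins. A minimal sketch, with `MxClient` replaced by a stand-in stub (the real client's method names and return types may differ):

```python
# Stand-in stub for the ModelExpress client; the real API may differ.
class MxClient:
    def __init__(self, url):
        self.url = url
        self.closed = False

    def query_model(self, name):
        # Placeholder metadata: {tensor_name: size_in_bytes}
        return {"tensors": {"w": 1024}}

    def close(self):
        # The real client would close its gRPC channel here.
        self.closed = True


def fetch_seed_metadata(url, model_name):
    client = MxClient(url)
    try:
        # Metadata is all MX is needed for; grab it before transfers start.
        return client.query_model(model_name)
    finally:
        # Always close, even on error -- the gRPC channel is not needed
        # during the RDMA transfer phase, and leaving it open leaks it.
        client.close()


meta = fetch_seed_metadata("localhost:8001", "Qwen/Qwen3-14B-FP8")
```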
cc: @tianyuzhou95
cc @amysaq2023
hi @tianyuzhou95 @amysaq2023 - please feel free to contact me in the sglang slack! Have a lot of neat ideas that I'd love to collaborate on!
AndyDai-nv left a comment
LGTM on the modelexpress side! One nit, could you change the model_express... or model-express... in args to modelexpress...?
Hi @ishandhanani, MX looks great! It's exactly what the remote instance weight loader needed.
@ishandhanani Also, should we consider the scenario of nnode>1 or dp>1 mentioned in this PR: #17389?
Hi @amysaq2023 - it's great that you think so! We really want to design MX to be compatible with SGLang's needs. Regarding mixed parallelism - I decided to cover that functionality in a separate PR here: #19983. This PR just provides the initial server scaffolding for MX itself. Should we move this discussion there?
Sure thing :)
Address review nit: remove separator from model_express/model-express naming to use modelexpress consistently across CLI args, field names, enum values, and method names.
Document modelexpress as a third R-Fork backend option alongside NCCL and TransferEngine, including seed/client usage examples and the --modelexpress-config argument.
LGTM, thanks for the update
/tag-and-rerun-ci
As stated above, this PR is purely used to scaffold the MX integration with SGL. All imports are safe and gated.
/tag-and-rerun-ci |
…g - matching TP (sgl-project#19920) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
Remove _publish_modelexpress_metadata (from upstream PR sgl-project#19920), modelexpress_config server arg, ssl_verify/engine_info_bootstrap_url methods, and enhanced url() method — none of these belong in our feature branches. Revert glm4.py super().__init__() to original form. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Add ModelExpress (MX) as a coordination backend for SGLang's remote instance weight loading, enabling persistent metadata discovery for GPU-to-GPU weight transfers via Mooncake TransferEngine.
Related: #12910 -- MX provides the "planner" functionality discussed there (seed discovery, metadata persistence, multi-model coordination) as a standalone Rust gRPC service, with K8s-native backends (CRD, ConfigMap, Redis). In the future, MX will also be able to handle weight reshaping when moving between parallelism layouts (DEPX -> DEPY).
Motivation
SGLang's existing remote instance weight loading ("rfork") supports two backends: NCCL and Mooncake TransferEngine. Both rely on direct HTTP coordination between seed and target -- the target must know the seed's IP and port upfront, and metadata is ephemeral (lost if the seed restarts).
ModelExpress solves this by providing persistent metadata coordination: seed metadata survives restarts, targets discover seeds by model name rather than by a pre-shared IP and port, and the coordination runs through a standalone gRPC service.
This integration reuses 100% of SGLang's existing TransferEngine infrastructure (`register_memory_region()`, `batch_transfer_sync_read()`). The only new code is the coordination layer that replaces direct HTTP with MX gRPC calls.
Architecture
How it builds on existing SGLang infra
- `register_memory_region()` (existing): registers model weight GPU memory with TransferEngine for RDMA access. Used by both seed (to expose weights) and target (to provide destination buffers).
- `batch_transfer_sync_read()` (existing): performs batched RDMA reads from seed GPU memory to target GPU memory. The core transfer mechanism is unchanged.
- `RemoteInstanceModelLoader` (existing): the model loader that creates dummy weights and fills them from a remote source. Extended with a new `MODEL_EXPRESS` backend dispatch.
- `--remote-instance-weight-loader-start-seed-via-transfer-engine` (existing): initializes TransferEngine on the seed side. Reused as-is.

The new code is thin (~150 lines in loader.py, ~60 lines in model_runner.py) and sits between the existing TransferEngine primitives and the new MX coordination layer.
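The coordination layer's main job before any RDMA read is issued is per-tensor byte-size validation between the locally allocated dummy weights and the seed-published metadata. A minimal sketch of that step, with hypothetical names (not SGLang's actual loader API):

```python
# Illustrative sketch: validate per-tensor byte sizes between the target's
# local allocations and the seed's published metadata, then build the list
# of tensors for the batched RDMA read. Names are hypothetical.

def validate_byte_sizes(local_sizes, seed_metadata):
    """local_sizes / seed_metadata: {tensor_name: size_in_bytes}."""
    mismatched = {
        name: (local_sizes.get(name), remote)
        for name, remote in seed_metadata.items()
        if local_sizes.get(name) != remote
    }
    if mismatched:
        # Refuse to transfer rather than corrupt GPU memory.
        raise ValueError(f"byte-size mismatch for tensors: {mismatched}")
    # One (name, nbytes) entry per tensor in the batched transfer.
    return list(seed_metadata.items())


plan = validate_byte_sizes(
    {"model.embed": 4096, "model.norm": 64},
    {"model.embed": 4096, "model.norm": 64},
)
```

Matching on raw byte sizes (rather than dtype strings) is what lets FP8 mixed-dtype models pass validation without a lossy dtype round-trip.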
CLI
All MX config is passed via a single `--model-express-config` JSON arg.

Seed (publishes weights):

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-14B-FP8 --port 30000 \
    --load-format auto \
    --remote-instance-weight-loader-start-seed-via-transfer-engine \
    --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8", "source": true}'
```

Target (loads weights via RDMA):

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-14B-FP8 --port 30001 \
    --load-format remote_instance \
    --remote-instance-weight-loader-backend model_express \
    --model-express-config '{"url": "localhost:8001", "model_name": "Qwen/Qwen3-14B-FP8"}'
```

| Field | Description |
| --- | --- |
| `url` | ModelExpress server address (e.g. `localhost:8001`) |
| `model_name` | model name to publish/query (defaults to `--model-path`) |
| `source` | set to `true` on the seed instance that publishes weights |

Changes
- `server_args.py`: `--model-express-config` JSON arg with `@property` accessors for `url`, `model_name`, and `source`; `model_express` added to the `remote_instance_weight_loader_backend` enum
- `load_config.py`: `model_express_url` and `model_express_model_name` fields
- `remote_instance_weight_loader_utils.py`: `MODEL_EXPRESS = "model_express"` enum variant
- `loader.py`: `load_model_from_model_express()` -- queries MX for seed metadata, validates per-tensor byte sizes, executes batched RDMA transfer
- `model_runner.py`: `_publish_model_express_metadata()` -- calls `register_memory_region()` for seed mode, publishes per-tensor metadata with correct per-tensor dtype (handles FP8 mixed-dtype models where weights are FP8, norms are BF16, scales are FP32)

FP8 support
FP8 models have mixed dtypes in memory -- a single model dtype doesn't describe all tensors. The implementation therefore records each tensor's byte size and derives the dtype from its `element_size` (1 byte -> float8_e4m3fn, 2 -> bfloat16, 4 -> float32). Verified with Qwen3-14B-FP8 (483 tensors, ~15 GB).
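The element-size mapping described above, sketched as a plain dict (an illustration only; per a later cleanup commit in this PR, the target side compares raw byte sizes rather than reconstructing dtype strings):

```python
# Sketch of the element-size -> dtype mapping for FP8 mixed-dtype models
# (weights FP8, norms BF16, scales FP32).
ELEMENT_SIZE_TO_DTYPE = {
    1: "float8_e4m3fn",
    2: "bfloat16",
    4: "float32",
}


def tensor_nbytes(numel, element_size):
    """Per-tensor byte size used for matching seed and target tensors."""
    return numel * element_size


# A [4096, 4096] FP8 weight occupies 16 MiB:
print(tensor_nbytes(4096 * 4096, 1))  # 16777216
```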
Dependencies
- `modelexpress` Python client (`pip install modelexpress`)
- `mooncake-transfer-engine` for TransferEngine

Test plan
Backward compatibility
Fully backward compatible. All changes are additive -- new enum value, new dispatch branch, new CLI arg with None default. Existing NCCL and TransferEngine+HTTP paths are untouched.
Benchmark (single node, 8x H100 NVLink)
Weight loading is 38.5x faster when transferring via NVLink from a seed instance vs loading from disk.