[PD][MoRI] Align hybrid state transfer with per-component schema#26539

Merged: HaiShaw merged 12 commits into sgl-project:main from maning00:fix/mori-hybrid-state-component-aware on May 29, 2026

@maning00
Contributor

@maning00 maning00 commented May 28, 2026

Motivation

PR #24932 ([PD] Refactor hybrid state transfer) migrated KVArgs from a flat state layout (state_type: str, state_item_lens: List[int], state_dim_per_tensor: List[int]) to a per-component one (state_types: List[StateType], with both *_item_lens and *_dim_per_tensor becoming List[List[int]]). Mooncake and NIXL were migrated to the new schema in the same PR, but MoRI was only partially migrated: the inner loop in _register_local_buffers was updated, while _register_kv_args, send_state, send_metadata, TransferInfo, and KVArgsRegisterInfo were left on the old flat assumption.

For any model with a non-empty state pool (DeepSeek V4, GLM-5, Qwen3.5), this manifests as struct.error: required argument is not an integer at PD bootstrap (#26525), because _register_kv_args calls struct.pack("I", item_len) on what is now a list. A flatten-on-send hack would silence that crash but would still route Mamba state buffers through the SWA/DSA contiguous-page logic on multi-component hybrids, so this change aligns MoRI with the per-component dispatch model that Mooncake and NIXL already use.
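
The failure mode described above can be reproduced with the standard library alone. The sketch below first shows struct.pack("I", ...) rejecting a list, then round-trips a nested List[List[int]] through a simple length-prefixed encoding. The helpers pack_nested_ints and unpack_nested_ints are hypothetical names invented for this illustration; they are not the pack_int_lists / unpack_int_lists helpers used in this PR, and the byte layout here is an assumption, not MoRI's actual wire format.

```python
import struct
from typing import List

# Old flat schema: one integer per state tensor packs without trouble.
flat_item_lens = [4096, 8192]
packed_flat = b"".join(struct.pack("I", n) for n in flat_item_lens)

# New per-component schema: each entry is itself a list, so the same
# call now receives a list where an integer is required.
per_component_item_lens = [[4096, 8192], [1024]]
try:
    struct.pack("I", per_component_item_lens[0])
    pack_error = None
except struct.error as exc:
    pack_error = exc


def pack_nested_ints(values: List[List[int]], fmt: str = "I") -> bytes:
    """Hypothetical layout: outer count, then (inner count, items) per list."""
    out = [struct.pack("<I", len(values))]
    for inner in values:
        out.append(struct.pack("<I", len(inner)))
        out.append(struct.pack(f"<{len(inner)}{fmt}", *inner))
    return b"".join(out)


def unpack_nested_ints(buf: bytes, fmt: str = "I") -> List[List[int]]:
    """Inverse of pack_nested_ints for the same element format."""
    item_size = struct.calcsize(f"<{fmt}")
    (outer,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    result = []
    for _ in range(outer):
        (inner,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        result.append(list(struct.unpack_from(f"<{inner}{fmt}", buf, offset)))
        offset += inner * item_size
    return result
```

Using a signed format ("i") for indices and an unsigned one ("I") for lengths mirrors the split listed under Modifications; the sketch keeps the element format as a parameter for that reason.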

Modifications

  • Wire format: switch state_item_lens / state_dim_per_tensor to pack_int_lists("I") / unpack_int_lists("I"), switch state_indices to pack_int_lists("i") / unpack_int_lists("i"), and add nested-msgpack helpers for List[List[MemoryDesc]].
  • Types: KVArgsRegisterInfo.dst_state_{mem_descs,item_lens,dim_per_tensor} and TransferInfo.dst_state_indices become List[List[...]] / List[np.ndarray].
  • MoriKVManager.state_mem_descs becomes List[List[MemoryDesc]]; _register_local_buffers builds it per-component.
  • send_state iterates state_types[i] and dispatches each component to _send_mamba_state or _send_swa_dsa_state independently (mirrors MooncakeKVManager.maybe_send_extra and NixlKVManager.maybe_send_extra).
  • _send_mamba_state / _send_swa_dsa_state accept a single component's slice instead of indexing into self.kv_args.* directly.
  • _normalize_state_indices_per_component ravels each component's payload to 1-D once at the API boundary, removing the 2-D single-component DSA edge case at the source.
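
The per-component dispatch and index normalization listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in conn.py: the StateType members, the handler signatures, and the flatten_indices helper (a pure-Python stand-in for raveling an ndarray to 1-D) are all hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, List, Sequence, Tuple


class StateType(Enum):
    # Hypothetical stand-in for the StateType carried in KVArgs.state_types.
    MAMBA = "mamba"
    SWA = "swa"
    DSA = "dsa"


def flatten_indices(indices) -> List[int]:
    """Flatten nested index lists to 1-D (stand-in for ndarray.ravel())."""
    flat: List[int] = []
    for item in indices:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_indices(item))
        else:
            flat.append(int(item))
    return flat


Result = Tuple[str, List[int], List[int]]


def send_mamba_state(item_lens: Sequence[int], indices: List[int]) -> Result:
    # Placeholder: a real handler would issue the transfer for this component.
    return ("mamba", list(item_lens), indices)


def send_swa_dsa_state(item_lens: Sequence[int], indices: List[int]) -> Result:
    # Placeholder for the SWA/DSA contiguous-page path.
    return ("swa_dsa", list(item_lens), indices)


HANDLERS: Dict[StateType, Callable[[Sequence[int], List[int]], Result]] = {
    StateType.MAMBA: send_mamba_state,
    StateType.SWA: send_swa_dsa_state,
    StateType.DSA: send_swa_dsa_state,
}


def send_state(state_types, state_item_lens, state_indices) -> List[Result]:
    """Dispatch each component's slice to the handler for its own type."""
    if not (len(state_types) == len(state_item_lens) == len(state_indices)):
        raise ValueError("per-component lists must have the same length")
    results = []
    for state_type, item_lens, indices in zip(
        state_types, state_item_lens, state_indices
    ):
        # Normalize once at the boundary so handlers only ever see 1-D indices.
        results.append(HANDLERS[state_type](item_lens, flatten_indices(indices)))
    return results
```

Because each handler receives only its own component's slice, a Mamba component is never routed through the SWA/DSA path, even when both appear in the same model.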

Accuracy Tests

Cross-machine PD on AMD MI300X with --disaggregation-transfer-backend mori.

Qwen3-8B (pure transformer, validates non-hybrid path / empty state lists):

| Setup | GSM8K (200q) | Errors |
| --- | --- | --- |
| Single-machine, TP=2 + TP=2 | 94.50% | 0 |
| 1P (TP=4) + 1D (TP=4) cross-machine | 94.00% | 0 |

Qwen3.5-122B-A10B (hybrid linear attention, exercises the per-component mamba state transfer path — decode logs show Mamba Cache is allocated with ssm_state 18.02GB / TP rank and Using hybrid linear attention backend for hybrid GDN models, and per-request mamba usage is non-zero):

| Sample | GSM8K | Errors |
| --- | --- | --- |
| 30q | 96.67% | 0 |
| 100q | 98.00% | 0 |
| 300q @ concurrency 32 | 96.67% | 0 |

No state-transfer-related errors in prefill or decode logs across all runs.

Speed Tests and Profiling

sglang.bench_serving --backend sglang-oai-chat against a PD router fronting 1P + 1D over RDMA:

Qwen3-8B (1P TP=4 + 1D TP=4):

| in / out / concurrency | reqs | total throughput |
| --- | --- | --- |
| 1024 / 256 / 32 | 128 / 128 | 21.9k tok/s |

Qwen3.5-122B-A10B (1P TP=8 + 1D TP=8):

| in / out / concurrency | reqs | total throughput | mean E2E |
| --- | --- | --- | --- |
| 1024 / 256 / 64 | 256 / 256 | 12.5k tok/s | 6.02 s |
| 2048 / 512 / 32 | 128 / 128 | 11.0k tok/s | 6.96 s |

Checklist

cc @Duyi-Wang


CI States

Latest PR Test (Base): ⏳ Run #26624086997
Latest PR Test (Extra): ❌ Run #26624086820

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@maning00 changed the title from "[PD][MoRI] Migrate hybrid state transfer to per-component schema" to "[PD][MoRI] Align hybrid state transfer with per-component schema" on May 28, 2026
Comment on lines +914 to +954

    state_type = getattr(self.kv_args, "state_type", "none")

    if state_type == "none":
        raise RuntimeError(
            "PD state transfer failed: state_type is 'none' but state_indices were provided"
        )

    if not peer_info.dst_state_mem_descs:
    state_types = getattr(self.kv_args, "state_types", None) or []

Collaborator

@ShangmingCai ShangmingCai May 28, 2026

I think state_types = self.kv_args.state_types is enough. We have made sure this value will be set.

Contributor Author

Thanks, updated as suggested

Comment thread python/sglang/srt/disaggregation/mori/conn.py
Collaborator

@ShangmingCai ShangmingCai left a comment

Others LGTM

@ShangmingCai
Collaborator

CC: @HaiShaw

@HaiShaw
Collaborator

HaiShaw commented May 28, 2026

/tag-and-rerun-ci

@HaiShaw
Collaborator

HaiShaw commented May 29, 2026

@amd-bot ci-status

@amd-bot

amd-bot commented May 29, 2026

@HaiShaw

CI Status for PR #26539

PR: [PD][MoRI] Align hybrid state transfer with per-component schema
Changed files: python/sglang/srt/disaggregation/mori/conn.py (+157/-79), test/registered/amd/disaggregation/test_mori_transfer_engine_e2e.py (+25/-7)

AMD: 1 failure (0 likely related) | Others: 12 failures (0 related)

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| stage-a-test-1-gpu-small-amd (linux-mi325-1gpu-sglang) | test/registered/attention/test_wave_attention_kernels.py | (collection-time import) | ModuleNotFoundError: No module named 'unittest.mock', caused by stray test/registered/attention/unittest/__init__.py shadowing stdlib unittest; also preceded by HF cache miss + private-registry timeout | 🟢 Unlikely | PR only touches disaggregation/mori/; failing file is an attention kernel test that fails at from unittest import SkipTest in sglang/srt/utils/common.py:74 because Python resolves unittest to a CI-side directory | Log |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| base-b-test-1-gpu-large (3) | test/registered/attention/test_chunk_gated_delta_rule.py | import-time | ImportError: cannot import name 'mock' from 'unittest' (...test/registered/attention/unittest/__init__.py) | 🟢 Unlikely | Same stdlib-unittest shadowing issue; PR unrelated | Log |
| base-b-test-1-gpu-large (6) | test/registered/attention/test_triton_attention_backend.py | import-time | ImportError: cannot import name 'SkipTest' from 'unittest' | 🟢 Unlikely | Same issue | Log |
| base-b-test-1-gpu-large (8) | test/registered/attention/test_deterministic.py | import-time | ImportError: cannot import name 'SkipTest' from 'unittest' | 🟢 Unlikely | Same issue | Log |
| base-b-test-1-gpu-small (5) | test/registered/attention/test_create_kvindices.py | import-time | ImportError: cannot import name 'mock' from 'unittest' | 🟢 Unlikely | Same issue | Log |
| base-b-test-2-gpu-large (2) | test/registered/attention/test_gemma4_swa_triton_oob_regression.py | import-time | ImportError: cannot import name 'SkipTest' from 'unittest' | 🟢 Unlikely | Same issue | Log |
| base-b-test-4-gpu-b200 (1) | test/registered/attention/test_flash_attention_4.py | import-time | ImportError: cannot import name 'SkipTest' from 'unittest' | 🟢 Unlikely | Same issue | Log |
| stage-a-test-1-gpu-xpu | N/A | N/A | Docker build failed: pip install torch==2.11.0+xpu ... exit code 1 (XPU index installation failure) | 🟢 Unlikely | XPU image build infra failure; PR doesn't touch XPU/Docker | Log |
| stage-b-test-1-npu-a2 (0) | test/registered/ascend/basic_function/quant/test_npu_w8a8_quantization.py | test_gsm8k | runtime error in test_utils.py:2194 lambda | 🟢 Unlikely | NPU-only quant test; PR touches no NPU code path | Log |
| multimodal-gen-test-1-npu-a3 | multimodal_gen/test/server/ascend/test_server_1_npu.py | test_diffusion_generation[wan2_1_t2v_1.3b_1_npu] | Performance validation failed | 🟢 Unlikely | NPU diffusion server perf test; unrelated to mori disaggregation | Log |
| multimodal-gen-test-2-npu-a3 | multimodal_gen/test/server/ascend/test_server_* | diffusion perf | Performance validation failed | 🟢 Unlikely | Same NPU diffusion perf cluster | Log |
| multimodal-gen-test-8-npu-a3 | multimodal_gen/test/server/ascend/... | diffusion perf | Performance validation failed | 🟢 Unlikely | Same NPU diffusion perf cluster | Log |
| finish | N/A | N/A | gate fail (downstream of XPU build) | 🟢 Unlikely | Downstream finish gate; not a real test failure | Log |

Details

None of the failures are related to this PR's changes (mori/conn.py + its e2e test).

  • Dominant cluster (8 of 13 failures): a stray test/registered/attention/unittest/ directory on the CI runners is being placed on sys.path and shadowing Python's stdlib unittest package. sglang/srt/utils/common.py:74 (from unittest import SkipTest) then fails to import. The correct local directory is test/registered/attention/unittests/ (with an s); the runner's checkout has somehow created or persisted a unittest/ variant. This is a CI infrastructure issue affecting every PR that hits these attention tests, not a regression caused by this PR.
  • XPU build (stage-a-test-1-gpu-xpu): failed during pip install torch==2.11.0+xpu from download.pytorch.org/whl/xpu — upstream index issue.
  • NPU jobs: diffusion server performance validation and a w8a8 quant gsm8k failure — NPU-only code paths, untouched by this PR.
  • AMD stage-a: same unittest shadowing issue, preceded by a private docker-registry timeout and HF cache miss — none touch mori code.

Verdict: the failures are unrelated to this PR. Safe to ignore from a correctness standpoint; the unittest/ shadowing cluster is a CI-infra problem that needs separate cleanup of the runner workspace.
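
The shadowing mechanism behind the dominant failure cluster can be reproduced in isolation. The sketch below is an illustration only, assuming a throwaway temporary directory rather than the actual CI runner layout: it places an empty unittest/ package ahead of the standard library on the import path of a child interpreter and then attempts the same import that fails in the logs. The function name import_skiptest is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def import_skiptest(shadow: bool) -> subprocess.CompletedProcess:
    """Run `from unittest import SkipTest` in a child interpreter.

    When shadow is True, an empty package named `unittest` is put on
    PYTHONPATH, which is searched before the standard library.
    """
    with tempfile.TemporaryDirectory() as root:
        if shadow:
            pkg = os.path.join(root, "unittest")
            os.makedirs(pkg)
            # An empty __init__.py makes the directory a regular package.
            open(os.path.join(pkg, "__init__.py"), "w").close()
        env = dict(os.environ, PYTHONPATH=root)
        return subprocess.run(
            [sys.executable, "-c", "from unittest import SkipTest"],
            env=env,
            capture_output=True,
            text=True,
        )


shadowed = import_skiptest(shadow=True)
clean = import_skiptest(shadow=False)
```

With the stray package present the child process exits with an ImportError, and without it the import succeeds, which is consistent with the report's conclusion that the failures come from the runner workspace rather than from this PR.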

Generated by amd-bot using Claude Code CLI

Collaborator

@HaiShaw HaiShaw left a comment

LGTM

@HaiShaw HaiShaw merged commit 4d1163e into sgl-project:main May 29, 2026
66 of 104 checks passed
4 participants