
[Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models#22715

Merged
Kangyan-Zhou merged 11 commits into sgl-project:main from junliu-mde:fix/runai-streamer-buffer-reuse
May 7, 2026

Conversation

@junliu-mde
Contributor

junliu-mde commented Apr 13, 2026

Motivation

Fixes #22701.

Found while testing #17948 (Direct model loading from object storage with Runai Model Streamer) on an 8×H200 cluster with Kimi-K2.5 (WNA16 Marlin MoE) and DeepSeek-V3-0324 (BF16 MoE) via az:// paths with TP=8.

Three independent bugs prevent RunAI streamer from working correctly with multi-GPU quantized / multimodal models:

  1. RunaiModelStreamerLoader omits quant_config — quantized models initialize as BF16, causing OOM or wrong inference.
  2. Stale tensor views from distributed buffer reuse — the RunAI distributed streamer reuses GPU staging buffers across batches; two code paths (Kimi-K2.5 list accumulation, DeepSeek MLA cached_a_proj) hold references across batches, resulting in silently corrupted weights (model outputs all !).
  3. get_processor() does not resolve object-storage URIs — unlike get_config() and get_tokenizer(), get_processor() passes raw az:///s3:///gs:// URIs to HuggingFace, crashing all multimodal models loaded from object storage.
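To illustrate bug 2, here is a toy reconstruction of the buffer-reuse hazard, using plain Python lists as a stand-in for torch tensors (this is not the actual RunAI streamer code, just the aliasing pattern it exhibits):

```python
def stream_batches(buffer, batches):
    """Toy streamer: each batch is written into the SAME staging buffer,
    and a live reference to that buffer is yielded (mimicking RunAI's
    distributed buffer reuse across batches)."""
    for batch in batches:
        buffer[:] = batch   # overwrite staging buffer in place
        yield buffer        # caller receives a view, not a copy

staging = [0.0] * 4
batches = [[1.0] * 4, [2.0] * 4]

# Buggy pattern: accumulate references across batches (list accumulation).
stale = list(stream_batches(staging, batches))
assert stale[0] is stale[1]      # both entries alias the staging buffer
assert stale[0] == [2.0] * 4     # batch 1 was silently overwritten

# Fixed pattern: copy (with torch, .clone()) anything retained past its batch.
owned = [list(t) for t in stream_batches(staging, batches)]
assert owned[0] == [1.0] * 4 and owned[1] == [2.0] * 4
```

With real weights the overwrite produces garbage logits rather than an assertion failure, which is why the symptom was "model outputs all `!`" instead of a crash.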

Modifications

Four atomic commits, one per affected file:

| Commit | File | Change |
| --- | --- | --- |
| 7292838 | python/sglang/srt/model_loader/loader.py | Pass quant_config to _initialize_model() in RunaiModelStreamerLoader.load_model(), consistent with every other loader |
| a5f7a2d | python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py | .clone() tensors cached in cached_a_proj so they survive buffer reuse across streamer batches |
| 29590bc | python/sglang/srt/models/kimi_k25.py | Refactor load_weights() from accumulate-then-process to a streaming pattern for language weights; clone only the small vision weights that need batching |
| fef3c44 | python/sglang/srt/utils/hf_transformers_utils.py | Add is_runai_obj_uri() / ObjectStorageModel.get_path() normalization to get_processor(), matching get_config() and get_tokenizer() |
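The first fix amounts to threading the computed quant config into model initialization. A minimal sketch of the shape of that fix, with stub functions standing in for sglang's real loader internals (the actual RunaiModelStreamerLoader.load_model() carries much more state):

```python
def _get_quantization_config(model_config):
    # Stub: real code derives this from the checkpoint's quantization metadata.
    return model_config.get("quantization")   # e.g. "fp8", "wna16_marlin"

def _initialize_model(model_config, quant_config=None):
    # Stub: real code builds quantized linear/MoE modules when quant_config
    # is set, and plain BF16 modules otherwise.
    return {"dtype": "bf16" if quant_config is None else quant_config}

def load_model_fixed(model_config):
    quant_config = _get_quantization_config(model_config)
    # Before the fix, quant_config was simply not passed here, so FP8
    # checkpoints were instantiated as BF16 modules and OOM'd on load.
    return _initialize_model(model_config, quant_config=quant_config)

model = load_model_fixed({"quantization": "fp8"})
assert model["dtype"] == "fp8"   # quantized modules, not BF16
```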

Accuracy Tests

Tested end-to-end on 8×H200, Azure Blob Storage (same-region):

Kimi-K2.5 (WNA16 Marlin, TP=8, az://)

  • Before fix: server starts, but all chat responses are ! (token 0, all-zero logits from stale buffers)
  • After fix: coherent multi-turn chat output, correct reasoning and vision responses

DeepSeek-V3-0324 (FP8, TP=8, az://)

  • Before fix: OOM during _initialize_model() (missing quant_config causes BF16 allocation for FP8 layers)
  • After fix: loads successfully in ~386s (8 GB streamer memory limit), correct inference output



@yhyang201
Collaborator

/tag-and-rerun-ci

@noa-neria
Contributor

noa-neria commented Apr 15, 2026

@yhyang201 LGTM and thanks for fixing these issues!
For robustness, we can make runai model streamer default behavior to yield cloned tensors whenever an internal buffer is reused (either GPU or CPU). Adding owned=True to stream_files to control the clone, letting streaming clients keep the current behavior is desired.

@amacaskill
Contributor

@yhyang201 Can you please resolve the merge conflicts so we can get this merged?

Collaborator

alexnails left a comment


all minor, please fix them and merge conflict and I will approve

```python
):
    # pop 'revision' from kwargs if present.
    revision = kwargs.pop("revision", tokenizer_revision)
    if is_runai_obj_uri(tokenizer_name):
```
Collaborator


Can we make this a helper? It is called many times in this file.

Contributor Author


Thanks for the review. Latest main has split hf_transformers_utils.py into multiple files; I'll rebase accordingly.

junliu-mde force-pushed the fix/runai-streamer-buffer-reuse branch from fef3c44 to 65d82ea on April 30, 2026 at 09:51
- RunAI model loading skipped quant_config during model construction, so quantized paths could instantiate the wrong modules before any weights were streamed.
- The DeepSeek loader keeps q_a_proj and kv_a_proj_with_mqa tensors across iterator steps before fusing them. Clone at cache insertion so distributed RunAI buffer reuse cannot corrupt the fused MLA weights.
- Kimi-K2.5 buffered all language weights before handing them to the DeepSeek loader. Stream language tensors immediately and clone only delayed vision weights when needed so distributed RunAI batches cannot invalidate retained views.
- Share object-storage URI normalization across config, tokenizer, and processor helpers so multimodal processors do not pass raw s3://, gs://, or az:// paths into HuggingFace.
- Add a CPU unit test that verifies RunaiModelStreamerLoader passes the computed quant config into model initialization.
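That unit test could be shaped roughly like this (a hypothetical reconstruction with a Mock standing in for `_initialize_model` and a fake loader class; the real test exercises sglang's actual RunaiModelStreamerLoader):

```python
from unittest import mock

# Stand-in for sglang's _initialize_model, recorded with a Mock so the
# test can inspect what the loader passed into model initialization.
init = mock.Mock(name="_initialize_model")

class FakeLoader:
    """Hypothetical stand-in for RunaiModelStreamerLoader's fixed path."""
    def load_model(self, model_config):
        quant_config = model_config["quantization"]  # the computed quant config
        return init(model_config=model_config, quant_config=quant_config)

FakeLoader().load_model({"quantization": "fp8"})
init.assert_called_once()
assert init.call_args.kwargs["quant_config"] == "fp8"
```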
junliu-mde force-pushed the fix/runai-streamer-buffer-reuse branch from 90d6e85 to e078529 on April 30, 2026 at 10:10
@alexnails
Collaborator

@junliu-mde it seems some changes now on main overlap with yours, creating a conflict. Could you investigate and let me know the next steps for this PR?

@junliu-mde
Contributor Author

> @junliu-mde it seems some changes now on main overlap with yours, creating a conflict. Could you investigate and let me know the next steps for this PR?

Let me check the conflict; thanks for the reminder.

junliu-mde added 3 commits May 5, 2026 15:26
- Resolve the overlapping RunAI streamer fixes now present on main while preserving the PR-specific behavior for quant config propagation, streamed Kimi loading, DeepSeek cached tensor ownership, and object-storage URI normalization.
- Main now covers the same quant-config propagation path in test_runai_model_streamer_loader.py, so keep the PR focused on the remaining object-storage URI helper changes.
@junliu-mde
Contributor Author

@alexnails I checked the conflict and the remaining diff.

The overlap came from #23850 being merged first. That PR already fixed the RunAI weight-loading related pieces: passing quant_config during RunAI model initialization, keeping Kimi weight loading streaming, routing RunAI streamer loads correctly, and cloning retained DeepSeek q/kv-a tensors from RunAI streamer buffers.

After merging main, those parts are now covered by main and are no longer the meaningful remaining diff in this PR.

The part that still remains useful here is the object-storage URI handling for multimodal processors. For az://..., s3://..., and gs://... model paths, get_processor() could still pass the raw object-storage URI directly to HuggingFace and fail. This PR now normalizes those paths through the RunAI metadata cache first, matching the config/tokenizer path behavior, and then loads the processor from the local cached metadata directory.

So the current PR is effectively narrowed to: resolve RunAI object-storage URIs consistently across HF config/tokenizer/processor helpers, especially for multimodal processor loading.
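The normalization pattern described above can be sketched as follows. The helper names `is_runai_obj_uri` and `ObjectStorageModel.get_path` follow this PR's description, but the bodies here are simplified stubs, not sglang's actual implementations:

```python
def is_runai_obj_uri(path: str) -> bool:
    """True for object-storage URIs that HuggingFace cannot open directly."""
    return path.startswith(("s3://", "gs://", "az://"))

class ObjectStorageModel:
    """Stub: the real class streams config/tokenizer/processor metadata
    from the object store into a local cache directory."""
    @staticmethod
    def get_path(uri: str) -> str:
        return "/tmp/runai-cache/" + uri.split("://", 1)[1]

def resolve_processor_path(model_path: str) -> str:
    # The fix: normalize object-storage URIs to the local cached metadata
    # directory BEFORE handing the path to HuggingFace, matching what the
    # config and tokenizer helpers already do.
    if is_runai_obj_uri(model_path):
        return ObjectStorageModel.get_path(model_path)
    return model_path

assert resolve_processor_path("az://bucket/kimi-k2.5") == "/tmp/runai-cache/bucket/kimi-k2.5"
assert resolve_processor_path("meta-llama/Llama-3-8B") == "meta-llama/Llama-3-8B"
```

In the real code path, the resolved local directory is then what gets passed to the HuggingFace processor loader instead of the raw URI.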

Kangyan-Zhou merged commit 65ce996 into sgl-project:main on May 7, 2026
149 of 204 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
…and broken URIs for multimodal models (sgl-project#22715)

Co-authored-by: Alex Nails <alex.nails@radixark.ai>


Successfully merging this pull request may close these issues.

[Bug] RunAI streamer (#17948): corrupted weights, missing quant init, and broken object-storage URIs for multimodal models

6 participants