
[Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models#22715

Merged
Kangyan-Zhou merged 11 commits into sgl-project:main from junliu-mde:fix/runai-streamer-buffer-reuse
May 7, 2026

Conversation

@junliu-mde
Contributor

junliu-mde commented Apr 13, 2026

Motivation

Fixes #22701.

Found while testing #17948 (Direct model loading from object storage with Runai Model Streamer) on an 8×H200 cluster with Kimi-K2.5 (WNA16 Marlin MoE) and DeepSeek-V3-0324 (BF16 MoE) via az:// paths with TP=8.

Three independent bugs prevent RunAI streamer from working correctly with multi-GPU quantized / multimodal models:

  1. RunaiModelStreamerLoader omits quant_config — quantized models initialize as BF16, causing OOM or wrong inference.
  2. Stale tensor views from distributed buffer reuse — the RunAI distributed streamer reuses GPU staging buffers across batches; two code paths (Kimi-K2.5 list accumulation, DeepSeek MLA cached_a_proj) hold references across batches, resulting in silently corrupted weights (model outputs all !).
  3. get_processor() does not resolve object-storage URIs — unlike get_config() and get_tokenizer(), get_processor() passes raw az:///s3:///gs:// URIs to HuggingFace, crashing all multimodal models loaded from object storage.
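To illustrate bug 2, here is a toy reconstruction of the buffer-reuse hazard, using plain Python lists as a stand-in for torch tensors (this is not the actual RunAI streamer code, just the aliasing pattern it exhibits):

```python
def stream_batches(buffer, batches):
    """Toy streamer: each batch is written into the SAME staging buffer,
    and a live reference to that buffer is yielded (mimicking RunAI's
    distributed buffer reuse across batches)."""
    for batch in batches:
        buffer[:] = batch   # overwrite staging buffer in place
        yield buffer        # caller receives a view, not a copy

staging = [0.0] * 4
batches = [[1.0] * 4, [2.0] * 4]

# Buggy pattern: accumulate references across batches (list accumulation).
stale = list(stream_batches(staging, batches))
assert stale[0] is stale[1]      # both entries alias the staging buffer
assert stale[0] == [2.0] * 4     # batch 1 was silently overwritten

# Fixed pattern: copy (with torch, .clone()) anything retained past its batch.
owned = [list(t) for t in stream_batches(staging, batches)]
assert owned[0] == [1.0] * 4 and owned[1] == [2.0] * 4
```

With real weights the overwrite produces garbage logits rather than an assertion failure, which is why the symptom was "model outputs all `!`" instead of a crash.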

Modifications

Four atomic commits, one per affected file:

| Commit | File | Change |
| --- | --- | --- |
| 7292838 | python/sglang/srt/model_loader/loader.py | Pass quant_config to _initialize_model() in RunaiModelStreamerLoader.load_model(), consistent with every other loader |
| a5f7a2d | python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py | .clone() tensors cached in cached_a_proj so they survive buffer reuse across streamer batches |
| 29590bc | python/sglang/srt/models/kimi_k25.py | Refactor load_weights() from accumulate-then-process to a streaming pattern for language weights; clone only the small vision weights that need batching |
| fef3c44 | python/sglang/srt/utils/hf_transformers_utils.py | Add is_runai_obj_uri() / ObjectStorageModel.get_path() normalization to get_processor(), matching get_config() and get_tokenizer() |
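The first fix amounts to threading the computed quant config into model initialization. A minimal sketch of the shape of that fix, with stub functions standing in for sglang's real loader internals (the actual RunaiModelStreamerLoader.load_model() carries much more state):

```python
def _get_quantization_config(model_config):
    # Stub: real code derives this from the checkpoint's quantization metadata.
    return model_config.get("quantization")   # e.g. "fp8", "wna16_marlin"

def _initialize_model(model_config, quant_config=None):
    # Stub: real code builds quantized linear/MoE modules when quant_config
    # is set, and plain BF16 modules otherwise.
    return {"dtype": "bf16" if quant_config is None else quant_config}

def load_model_fixed(model_config):
    quant_config = _get_quantization_config(model_config)
    # Before the fix, quant_config was simply not passed here, so FP8
    # checkpoints were instantiated as BF16 modules and OOM'd on load.
    return _initialize_model(model_config, quant_config=quant_config)

model = load_model_fixed({"quantization": "fp8"})
assert model["dtype"] == "fp8"   # quantized modules, not BF16
```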

Accuracy Tests

Tested end-to-end on 8×H200, Azure Blob Storage (same-region):

Kimi-K2.5 (WNA16 Marlin, TP=8, az://)

  • Before fix: server starts, but all chat responses are ! (token 0, all-zero logits from stale buffers)
  • After fix: coherent multi-turn chat output, correct reasoning and vision responses

DeepSeek-V3-0324 (FP8, TP=8, az://)

  • Before fix: OOM during _initialize_model() (missing quant_config causes BF16 allocation for FP8 layers)
  • After fix: loads successfully in ~386s (8 GB streamer memory limit), correct inference output



@yhyang201
Collaborator

/tag-and-rerun-ci

@noa-neria
Contributor

noa-neria commented Apr 15, 2026

@yhyang201 LGTM and thanks for fixing these issues!
For robustness, we can make runai model streamer default behavior to yield cloned tensors whenever an internal buffer is reused (either GPU or CPU). Adding owned=True to stream_files to control the clone, letting streaming clients keep the current behavior is desired.

@amacaskill
Contributor

@yhyang201 Can you please resolve the merge conflicts so we can get this merged?

Collaborator

alexnails left a comment


all minor, please fix them and merge conflict and I will approve

```python
):
    # pop 'revision' from kwargs if present.
    revision = kwargs.pop("revision", tokenizer_revision)
    if is_runai_obj_uri(tokenizer_name):
```
Collaborator


Can we make this a helper? It is called many times in this file.

Contributor Author


Thanks for the review. Latest main has split hf_transformers_utils.py into multiple files; I'll rebase accordingly.

junliu-mde force-pushed the fix/runai-streamer-buffer-reuse branch from fef3c44 to 65d82ea on April 30, 2026 at 09:51
- RunAI model loading skipped quant_config during model construction, so quantized paths could instantiate the wrong modules before any weights were streamed.
- The DeepSeek loader keeps q_a_proj and kv_a_proj_with_mqa tensors across iterator steps before fusing them. Clone at cache insertion so distributed RunAI buffer reuse cannot corrupt the fused MLA weights.
- Kimi-K2.5 buffered all language weights before handing them to the DeepSeek loader. Stream language tensors immediately and clone only delayed vision weights when needed so distributed RunAI batches cannot invalidate retained views.
- Share object-storage URI normalization across config, tokenizer, and processor helpers so multimodal processors do not pass raw s3://, gs://, or az:// paths into HuggingFace.
- Add a CPU unit test that verifies RunaiModelStreamerLoader passes the computed quant config into model initialization.
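That unit test could be shaped roughly like this (a hypothetical reconstruction with a Mock standing in for `_initialize_model` and a fake loader class; the real test exercises sglang's actual RunaiModelStreamerLoader):

```python
from unittest import mock

# Stand-in for sglang's _initialize_model, recorded with a Mock so the
# test can inspect what the loader passed into model initialization.
init = mock.Mock(name="_initialize_model")

class FakeLoader:
    """Hypothetical stand-in for RunaiModelStreamerLoader's fixed path."""
    def load_model(self, model_config):
        quant_config = model_config["quantization"]  # the computed quant config
        return init(model_config=model_config, quant_config=quant_config)

FakeLoader().load_model({"quantization": "fp8"})
init.assert_called_once()
assert init.call_args.kwargs["quant_config"] == "fp8"
```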
junliu-mde force-pushed the fix/runai-streamer-buffer-reuse branch from 90d6e85 to e078529 on April 30, 2026 at 10:10
@alexnails
Collaborator

@junliu-mde it seems some changes now on main overlap with yours, creating a conflict. Could you investigate and let me know the next steps for this PR?

@junliu-mde
Contributor Author

> @junliu-mde it seems some changes now on main overlap with yours, creating a conflict. Could you investigate and let me know the next steps for this PR?

Let me check the conflict; thanks for the reminder.

junliu-mde added 3 commits May 5, 2026 15:26
- Resolve the overlapping RunAI streamer fixes now present on main while preserving the PR-specific behavior for quant config propagation, streamed Kimi loading, DeepSeek cached tensor ownership, and object-storage URI normalization.
- Main now covers the same quant-config propagation path in test_runai_model_streamer_loader.py, so keep the PR focused on the remaining object-storage URI helper changes.
@junliu-mde
Contributor Author

@alexnails I checked the conflict and the remaining diff.

The overlap came from #23850 being merged first. That PR already fixed the RunAI weight-loading related pieces: passing quant_config during RunAI model initialization, keeping Kimi weight loading streaming, routing RunAI streamer loads correctly, and cloning retained DeepSeek q/kv-a tensors from RunAI streamer buffers.

After merging main, those parts are now covered by main and are no longer the meaningful remaining diff in this PR.

The part that still remains useful here is the object-storage URI handling for multimodal processors. For az://..., s3://..., and gs://... model paths, get_processor() could still pass the raw object-storage URI directly to HuggingFace and fail. This PR now normalizes those paths through the RunAI metadata cache first, matching the config/tokenizer path behavior, and then loads the processor from the local cached metadata directory.

So the current PR is effectively narrowed to: resolve RunAI object-storage URIs consistently across HF config/tokenizer/processor helpers, especially for multimodal processor loading.
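The normalization pattern described above can be sketched as follows. The helper names `is_runai_obj_uri` and `ObjectStorageModel.get_path` follow this PR's description, but the bodies here are simplified stubs, not sglang's actual implementations:

```python
def is_runai_obj_uri(path: str) -> bool:
    """True for object-storage URIs that HuggingFace cannot open directly."""
    return path.startswith(("s3://", "gs://", "az://"))

class ObjectStorageModel:
    """Stub: the real class streams config/tokenizer/processor metadata
    from the object store into a local cache directory."""
    @staticmethod
    def get_path(uri: str) -> str:
        return "/tmp/runai-cache/" + uri.split("://", 1)[1]

def resolve_processor_path(model_path: str) -> str:
    # The fix: normalize object-storage URIs to the local cached metadata
    # directory BEFORE handing the path to HuggingFace, matching what the
    # config and tokenizer helpers already do.
    if is_runai_obj_uri(model_path):
        return ObjectStorageModel.get_path(model_path)
    return model_path

assert resolve_processor_path("az://bucket/kimi-k2.5") == "/tmp/runai-cache/bucket/kimi-k2.5"
assert resolve_processor_path("meta-llama/Llama-3-8B") == "meta-llama/Llama-3-8B"
```

In the real code path, the resolved local directory is then what gets passed to the HuggingFace processor loader instead of the raw URI.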

Kangyan-Zhou merged commit 65ce996 into sgl-project:main on May 7, 2026
149 of 204 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
…and broken URIs for multimodal models (sgl-project#22715)

Co-authored-by: Alex Nails <alex.nails@radixark.ai>


Successfully merging this pull request may close these issues.

[Bug] RunAI streamer (#17948): corrupted weights, missing quant init, and broken object-storage URIs for multimodal models

6 participants