[Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models#22715
Conversation
/tag-and-rerun-ci

@yhyang201 LGTM and thanks for fixing these issues!

@yhyang201 Can you please resolve the merge conflicts so we can get this merged?
alexnails left a comment:

All minor; please fix them and the merge conflict and I will approve.
```python
):
    # pop 'revision' from kwargs if present.
    revision = kwargs.pop("revision", tokenizer_revision)
    if is_runai_obj_uri(tokenizer_name):
```
Can we make this a helper? It is called many times in this file.
Thanks for the review! The latest main has split hf_transformers_utils.py into multiple files, so I'll rebase accordingly.
Force-pushed fef3c44 to 65d82ea
RunAI model loading skipped quant_config during model construction, so quantized paths could instantiate the wrong modules before any weights were streamed.
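A minimal sketch of the shape of this fix: `RunaiModelStreamerLoader.load_model()` and `_initialize_model` are named in the PR, while the helper names and exact signatures below are assumptions for illustration.

```python
import torch

def load_model(self, *, model_config, device_config):
    # Compute the quantization config the same way the other loaders do;
    # previously this value was never forwarded, so FP8/WNA16 checkpoints
    # built plain BF16 modules before any weights were streamed.
    quant_config = _get_quantization_config(model_config)  # assumed helper

    with torch.device(device_config.device):
        # The fix: forward quant_config into model construction so
        # quantized layers are instantiated with the right module types.
        model = _initialize_model(model_config, quant_config=quant_config)

    _stream_weights_with_runai(model, model_config)  # assumed helper
    return model.eval()
```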
The DeepSeek loader keeps q_a_proj and kv_a_proj_with_mqa tensors across iterator steps before fusing them. Clone at cache insertion so distributed RunAI buffer reuse cannot corrupt the fused MLA weights.
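The failure mode is easy to reproduce outside sglang. A self-contained sketch, assuming a streamer that reuses one transfer buffer across batches:

```python
import torch

# A streamer that reuses a single transfer buffer across batches:
buffer = torch.zeros(4)

def stream(values):
    buffer.copy_(torch.tensor(values, dtype=torch.float32))
    return buffer  # returns a view into the reused buffer

cache = {}
cache["q_a_proj"] = stream([1.0, 2.0, 3.0, 4.0])      # BUG: stores a view
safe = stream([1.0, 2.0, 3.0, 4.0]).clone()           # FIX: owns its data

stream([9.0, 9.0, 9.0, 9.0])                          # next batch arrives
print(cache["q_a_proj"])  # tensor([9., 9., 9., 9.]) -- silently corrupted
print(safe)               # tensor([1., 2., 3., 4.]) -- correct
```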
Kimi-K2.5 buffered all language weights before handing them to the DeepSeek loader. Stream language tensors immediately and clone only delayed vision weights when needed so distributed RunAI batches cannot invalidate retained views.
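A hedged sketch of the streaming pattern; the vision-weight prefix and helper names are illustrative, not the PR's exact code:

```python
def load_weights(self, weights):
    delayed_vision = []
    for name, tensor in weights:
        if name.startswith("vision_tower."):  # assumed prefix
            # Vision weights must be batched later, so take ownership now.
            delayed_vision.append((name, tensor.clone()))
        else:
            # Language weights go straight to the DeepSeek loader while the
            # streamer buffer still holds this batch's bytes.
            self.language_model.load_weights([(name, tensor)])
    self._load_vision_weights(delayed_vision)  # assumed helper
```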
Share object-storage URI normalization across config, tokenizer, and processor helpers so multimodal processors do not pass raw s3://, gs://, or az:// paths into HuggingFace.
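A sketch of what the shared helper could look like; `is_runai_obj_uri()` and `ObjectStorageModel.get_path()` are named in the PR's commit table, while the wrapper itself is illustrative:

```python
def _resolve_obj_storage_path(path: str) -> str:
    """Return a local path HuggingFace can open, pulling from s3://, gs://,
    or az:// object storage when needed."""
    if is_runai_obj_uri(path):
        return ObjectStorageModel.get_path(path)
    return path

# get_config(), get_tokenizer(), and get_processor() would all route the
# model path through this helper before calling from_pretrained().
```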
Add a CPU unit test that verifies RunaiModelStreamerLoader passes the computed quant config into model initialization.
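A mock-based sketch of such a test; the loader's constructor and `load_model()` signature are assumptions here, and the PR's actual test layout may differ:

```python
from unittest.mock import MagicMock, patch

def test_quant_config_reaches_initialize_model():
    from sglang.srt.model_loader.loader import RunaiModelStreamerLoader

    # Patch model construction so no real model is built on CPU.
    with patch(
        "sglang.srt.model_loader.loader._initialize_model"
    ) as init_model:
        loader = RunaiModelStreamerLoader(load_config=MagicMock())
        # Real config fixtures would be needed in practice; MagicMocks
        # only sketch the shape of the call.
        loader.load_model(model_config=MagicMock(), device_config=MagicMock())
        # The regression being guarded: quant_config used to be absent here.
        assert "quant_config" in init_model.call_args.kwargs
```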
Force-pushed 90d6e85 to e078529
@junliu-mde it seems some of these changes overlap with yours, as there is now a conflict. Could you investigate and let me know what the next steps are for this PR?

Let me check the conflict, thanks for the reminder.
Resolve the overlapping RunAI streamer fixes now present on main while preserving the PR-specific behavior for quant config propagation, streamed Kimi loading, DeepSeek cached tensor ownership, and object-storage URI normalization.
Main now covers the same quant-config propagation path in test_runai_model_streamer_loader.py, so keep the PR focused on the remaining object-storage URI helper changes.
@alexnails I checked the conflict and the remaining diff. The overlap came from #23850 being merged first. That PR already fixed the RunAI weight-loading pieces: passing quant_config into model initialization, cloning the DeepSeek cached tensors, and streaming the Kimi language weights. After merging main, those parts are covered by main and are no longer the meaningful remaining diff in this PR.

The part that still remains useful here is the object-storage URI handling for multimodal processors: `get_processor()` still passes raw URIs to HuggingFace. So the current PR is effectively narrowed to: resolve RunAI object-storage URIs consistently across HF config/tokenizer/processor helpers, especially for multimodal processor loading.
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
…and broken URIs for multimodal models (sgl-project#22715)

Co-authored-by: Alex Nails <alex.nails@radixark.ai>
Motivation
Fixes #22701.
Found while testing #17948 (Direct model loading from object storage with Runai Model Streamer) on an 8×H200 cluster with Kimi-K2.5 (WNA16 Marlin MoE) and DeepSeek-V3-0324 (BF16 MoE) via `az://` paths with TP=8. Three independent bugs prevent the RunAI streamer from working correctly with multi-GPU quantized / multimodal models:

1. `RunaiModelStreamerLoader` omits `quant_config`, so quantized models initialize as BF16, causing OOM or wrong inference.
2. Tensors cached across streamer batches (`cached_a_proj`) hold references into reused buffers, resulting in silently corrupted weights (the model outputs all `!`).
3. `get_processor()` does not resolve object-storage URIs: unlike `get_config()` and `get_tokenizer()`, it passes raw `az://` / `s3://` / `gs://` URIs to HuggingFace, crashing all multimodal models loaded from object storage.

Modifications
Four atomic commits, one per root cause:
| Commit | File | Change |
| --- | --- | --- |
| 7292838 | python/sglang/srt/model_loader/loader.py | Pass `quant_config` to `_initialize_model()` in `RunaiModelStreamerLoader.load_model()`, consistent with every other loader |
| a5f7a2d | python/sglang/srt/models/deepseek_common/deepseek_weight_loader.py | `.clone()` tensors cached in `cached_a_proj` so they survive buffer reuse across streamer batches |
| 29590bc | python/sglang/srt/models/kimi_k25.py | Rework `load_weights()` from accumulate-then-process to a streaming pattern for language weights; clone only the small vision weights that need batching |
| fef3c44 | python/sglang/srt/utils/hf_transformers_utils.py | Apply `is_runai_obj_uri()` / `ObjectStorageModel.get_path()` normalization in `get_processor()`, matching `get_config()` and `get_tokenizer()` |

Accuracy Tests
Tested end-to-end on 8×H200, Azure Blob Storage (same-region):
- Kimi-K2.5 (WNA16 Marlin, TP=8, `az://`): previously emitted all `!` (token 0, all-zero logits from stale buffers); fixed by the clone/streaming changes.
- DeepSeek-V3-0324 (FP8, TP=8, `az://`): previously hit OOM in `_initialize_model()` (missing `quant_config` causes BF16 allocation for FP8 layers); fixed by passing `quant_config`.

Checklist