[CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models #24279

Merged

Qiaolin-Yu merged 3 commits into sgl-project:main from fortunecookiee:jsheng/skip-mcdse-embedding on May 6, 2026

Conversation

@fortunecookiee
Contributor

fortunecookiee commented May 2, 2026

Motivation

test_embedding_models.py::TestEmbeddingModels::test_prefill_logits flakes whenever its random.choice(MODELS) lands on marco/mcdse-2b-v1, blocking the scheduled main PR Test pipeline.

Original failed run on main:
https://github.com/sgl-project/sglang/actions/runs/25224929325/job/73966043206

Failure signature: cosine similarity diffs of ~0.30 against the HF reference for short prompts (tensor(0.3409), tensor(0.2947)), well above the 1e-5 prefill_tolerance. Retries pass because random.choice(MODELS) lands on a different model on the next attempt.
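For context, a minimal sketch of the failure mode. The model list, tolerance name, and embedding tensors below are illustrative stand-ins, not the real test code: the point is that the model under comparison is drawn at random, so one bad model only fails a fraction of runs.

```python
import random
import torch

# Stand-ins loosely modeled on test_embedding_models.py (assumed structure).
MODELS = [
    "marco/mcdse-2b-v1",  # the offender on the sm_120 runner
    "model-b",
    "model-c",
    "model-d",
]
PREFILL_TOLERANCE = 1e-5

model = random.choice(MODELS)  # 1-in-4 chance of selecting mcdse per run

hf_emb = torch.randn(4, 1536)                        # stand-in: HF reference embeddings
sgl_emb = hf_emb + 1e-7 * torch.randn_like(hf_emb)   # stand-in: SGLang embeddings

# One cosine similarity per prompt, compared against a fixed tolerance.
similarity = torch.nn.functional.cosine_similarity(hf_emb, sgl_emb, dim=-1)
diff = (1 - similarity).max().item()
assert diff < PREFILL_TOLERANCE, f"embeddings are not all close: {diff:.4f}"
```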

Validation: this PR fixes the failure (positive control)

With mcdse removed from the pool, this PR's CI is green on the previously failing runner (see the github-actions 1-gpu-5090 run in the conversation below).

Validation: mcdse is the offender (negative control)

Companion diagnostic PR #24327 inverts this change — pins test_prefill_logits to only marco/mcdse-2b-v1 so random.choice deterministically lands on it. CI on that branch reproduced the original failure exactly:

| | Original CI failure | Diagnostic PR #24327 |
| --- | --- | --- |
| Run | runs/25224929325/job/73966043206 | runs/25293461474/job/74162321156 |
| Job | stage-b-test-1-gpu-small (1) | stage-b-test-1-gpu-small (1) |
| Similarity diff #1 | tensor(0.3409) | tensor(0.3411) |
| Similarity diff #2 | tensor(0.2947) | tensor(0.2944) |
| Assertion | embeddings are not all close → retry() exceed maximum number of retries | identical |
| Duration to failure | ~13 min | 8m16s |

Same partition, same prompts, same diff magnitudes (~1e-3 run-to-run variance is normal), same exception chain. Conclusive.

Investigation summary

I also did a side-by-side trace on A100 (sm_80) + sglang 0.5.10.post1 + fp16 + flashinfer + transformers 5.3.0, feeding byte-identical truncated DEFAULT_PROMPTS to both HF (AutoModel.from_pretrained resolving to Qwen2VLModel, right-padded batch, last-token pooling by attention mask) and SGLang (Engine(is_embedding=True), packed sequences):

| idx | tokens | cosine sim | 1 − cos |
| --- | --- | --- | --- |
| 0 (long, truncated) | 2047 | 0.999998 | 2e-6 |
| 1 (short) | 7 | 0.999998 | 2e-6 |
| 2 (short) | 8 | 0.999998 | 2e-6 |
| 3 (short) | 9 | 0.999998 | 2e-6 |

A100 cannot reproduce the failure — diffs sit four orders of magnitude under the tolerance. Holding everything else constant — same sglang, same model, same dtype, same backend, same transformers version, same byte-for-byte prompts — the only differential between A100 (~1e-6) and 5090 (~0.30) is the GPU architecture and its flashinfer/CUDA kernels. A 0.30 cosine gap is far too large to attribute to fp16 numeric drift alone (sm-arch differences typically produce 1e-3 worst case across 28 layers). The signature points at a structural sm_120 kernel issue surfacing on mcdse's specific attention shape (head_dim=128, GQA 6:1, Qwen2-VL backbone) that the other three models in the suite don't hit.
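For reproducibility, a sketch of the HF side of that trace, under the assumption that the reference path is plain AutoModel hidden states with last-token pooling by attention mask; prompts, dtype placement, and normalization details are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Raw AutoModel path (resolves to Qwen2VLModel for this checkpoint),
# right-padded batch, then last-token pooling located via the mask.
tok = AutoTokenizer.from_pretrained("marco/mcdse-2b-v1")
model = AutoModel.from_pretrained(
    "marco/mcdse-2b-v1", torch_dtype=torch.float16
).eval()

prompts = ["The capital of France is Paris.", "Another short prompt."]
batch = tok(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # [B, T, H]

last = batch["attention_mask"].sum(dim=1) - 1      # index of last real token per row
emb = hidden[torch.arange(hidden.size(0)), last]   # [B, H]
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-norm for cosine comparison
```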

What I confirmed is not the cause

Verified end-to-end on A100 with byte-identical inputs (same AutoModel class, transformers 5.3.0 on both sides):

  • Bidirectional vs causal attention — architectures: ["Qwen2VLForConditionalGeneration"] is causal; both sides run causal.
  • mRoPE for text-only — forward_batch_info.py:798-808 builds [[0..N-1]]·3 for text-only requests; sglang routes through forward_triton (fused mRoPE kernel), which agrees with HF's mRoPE (see the sketch after this list).
  • Tokenizer divergence — token IDs are identical on both sides ([785, 6722, 315, 9625, 374, 12095, 13] for "The capital of France is Paris.").
  • embed_tokens output — exact match (max-abs = 0.0).
  • Per-layer drift on A100 — accumulates 0.002 → 0.062 max-abs at the last token through 28 layers; final cosine still 0.999998. Normal fp16 accumulation, not a structural issue.
  • fp16 vs attention backend — flashinfer and triton give the same answer on A100.
  • Test harness asymmetry — earlier hypothesis. Doesn't apply: even on the harness's raw AutoModel path (the non-sentence-transformers else branch), HF and SGLang agree to ~1e-6 on A100.
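To make the mRoPE point concrete, a tiny sketch of what [[0..N-1]]·3 means for a text-only request (shapes only; the actual construction lives in forward_batch_info.py):

```python
import torch

N = 7  # e.g. len([785, 6722, 315, 9625, 374, 12095, 13])

# Text-only mRoPE: the same [0..N-1] positions replicated across the three
# rotary sections, so the fused kernel degenerates to ordinary 1-D RoPE
# and should match HF exactly.
mrope_positions = torch.arange(N).repeat(3, 1)  # shape [3, N]
```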

What's actually wrong

A flashinfer (or related) kernel issue specific to sm_120 + mcdse's attention shape that produces a structural ~0.30 cosine error in the final embedding. Other models in the same suite either don't share the shape or take different kernel paths and stay near 1e-6 even on the 5090.

Modifications

  • Comment out marco/mcdse-2b-v1 from MODEL_TO_CONFIG in test/registered/prefill_only/test_embedding_models.py (sketched below), matching the precedent for jason9693/Qwen2.5-1.5B-apeach.
  • Inline comment links the failing CI run and notes the suspected sm_120 kernel issue, so this isn't silently re-enabled before the underlying kernel bug is found and fixed.
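The change itself, sketched; the neighboring entry and config values are placeholders, not the real file contents:

```python
# test/registered/prefill_only/test_embedding_models.py (sketch)
MODEL_TO_CONFIG = {
    "some/other-embedding-model": {},  # placeholder entry
    # Disabled: structural ~0.30 cosine diff vs HF on the 5090 (sm_120)
    # runner, suspected flashinfer kernel issue on this attention shape.
    # https://github.com/sgl-project/sglang/actions/runs/25224929325/job/73966043206
    # "marco/mcdse-2b-v1": {},
}
```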

Coverage of the SGLang embedding path is preserved — the remaining three models keep CI green and continue to validate fp16 accuracy. mcdse is the only Qwen2-VL-derived embedding model in the suite, so what we lose by skipping it is signal on the sm_120 + Qwen2-VL kernel path. That's exactly the path the failure points at, but the test as written can't usefully discriminate "real bug" from "flake" because of the random-choice sampling and the cross-runner signal.

Re-enable plan

Block on a 5090 reproduction with --debug-tensor-dump-output-folder: load mcdse on sm_120, dump pre-attention/post-attention hidden states layer by layer, find the first layer where the per-token max-abs deviates from the sm_80 baseline by >1e-2, and trace the divergence into the attention kernel. Once that's fixed in flashinfer (or wherever the bug turns out to live), re-add the entry.
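A sketch of the comparison step of that plan; the dump file layout and naming are assumptions, so adapt to whatever --debug-tensor-dump-output-folder actually writes:

```python
from pathlib import Path

import torch

def first_divergent_layer(baseline_dir: str, suspect_dir: str,
                          threshold: float = 1e-2) -> str | None:
    """Compare per-layer dumps (sm_80 baseline vs sm_120 suspect) and return
    the first dump whose per-token max-abs deviation exceeds the threshold."""
    for ref_path in sorted(Path(baseline_dir).glob("*.pt")):
        ref = torch.load(ref_path).float()
        got = torch.load(Path(suspect_dir) / ref_path.name).float()
        if (ref - got).abs().max().item() > threshold:
            return ref_path.name
    return None
```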

Accuracy Tests

N/A — this PR removes a model from the random-choice pool of an accuracy test; it does not change model code. A100 baseline above shows the path is correct on stable hardware. PR CI confirms green on the actual failing runner.

Speed Tests and Profiling

N/A

Checklist

For the record, the initial root-cause hypothesis from the first version of this description, since ruled out by the A100 trace above (HF and SGLang agree to ~1e-6 even on the raw AutoModel path):

The HF reference path in test/runners.py forks based on whether the model is sentence-transformers formatted. The ST branch passes config_kwargs={"is_causal": True} to mirror SGLang; the non-ST branch (taken by marco/mcdse-2b-v1, a raw Qwen2-VL fine-tune) does not. Under this hypothesis, HF would run Qwen2-VL with bidirectional attention while SGLang's Qwen2-VL embedding path always runs causal, and last-token pooling under bidirectional vs causal attention would yield ~0.30 cosine diffs on short prompts, well above the 1e-5 tolerance, whenever random.choice(MODELS) lands on this model. In practice both sides run causal (see "What I confirmed is not the cause"), so the original re-enable condition of fixing the harness asymmetry is superseded by the kernel-triage plan above.

The other three models in MODEL_TO_CONFIG remain enabled and still agree with HF to ~1e-6, so coverage of the SGLang embedding path is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request temporarily disables the 'marco/mcdse-2b-v1' model in the embedding model tests. This change is due to a discrepancy between the Hugging Face reference implementation, which uses bidirectional attention, and SGLang's causal implementation, resulting in significant cosine differences. I have no feedback to provide as there were no review comments.

@fortunecookiee
Contributor Author

/tag-run-ci-label

@github-actions github-actions Bot added the run-ci label May 3, 2026
@fortunecookiee
Contributor Author

/rerun-failed-ci

@Qiaolin-Yu
Collaborator

/rerun-test test/registered/prefill_only/test_embedding_models.py

@Qiaolin-Yu Qiaolin-Yu self-assigned this May 6, 2026
@github-actions

github-actions Bot commented May 6, 2026

1-gpu-5090 (1 test): View workflow run

cd test/ && python3 registered/prefill_only/test_embedding_models.py

@Qiaolin-Yu
Collaborator

This PR only modifies test_embedding_models.py, which has passed, so it's good to merge.

@Qiaolin-Yu Qiaolin-Yu merged commit bc70488 into sgl-project:main May 6, 2026
127 of 159 checks passed
Fridge003 pushed a commit that referenced this pull request May 6, 2026
…24279)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
…gl-project#24279)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
