
fix: mistral embedding regression fix #21913

Merged
Fridge003 merged 3 commits into main from embedding-regression-fix on Apr 4, 2026

Conversation


@dougyster dougyster commented Apr 2, 2026

Motivation

Embedding models using LlamaTokenizerFast (e.g. intfloat/e5-mistral-7b-instruct) regressed in cosine similarity from ~1.0 to ~0.33 after the transformers v5.3.0 upgrade (#17784). The regression was identified by bisecting between v0.5.9 (good) and v0.5.10rc0 (bad), pinning the first bad commit to d1e95af ("Upgrade transformers==5.3.0 (#17784)").

Root cause

The transformers v5 upgrade introduced _fix_v5_add_bos_eos_token() to restore add_bos_token/add_eos_token flags that v5 strips when a tokenizer.json post-processor is present. The fix correctly handles add_bos_token for generative models like DeepSeek-V3, but incorrectly also restored add_eos_token for fast tokenizers.

For fast tokenizers (PreTrainedTokenizerFast), add_eos_token was never applied via the Python-level attribute in transformers v4 — EOS behavior was defined in tokenizer.json's post-processor. In v5, both sglang and HF vanilla end up with add_eos_token=False. By restoring add_eos_token=True only in sglang (but not in the HF reference), the tokenizers diverge:

  • sglang (with fix): EOS token appended → last-token pooling selects the EOS embedding
  • HF reference: no EOS → last-token pooling selects the last content token embedding
  • Result: cosine similarity ~0.33 instead of ~1.0
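The divergence in the bullets above can be illustrated with synthetic hidden states (a minimal sketch: the random arrays stand in for real model activations, and `last_token_pool` is an illustrative helper, not sglang's actual pooling code):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray) -> np.ndarray:
    # Last-token pooling: the embedding is the hidden state at the final position.
    return hidden_states[-1]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
content = rng.standard_normal((5, 64))  # hidden states for 5 content tokens
eos = rng.standard_normal(64)           # hidden state at an appended EOS position

ref = last_token_pool(content)                    # HF reference: no EOS appended
bad = last_token_pool(np.vstack([content, eos]))  # pre-fix sglang: EOS appended

print(round(cosine(ref, ref), 3))  # 1.0: same position pooled in both runs
print(round(cosine(ref, bad), 3))  # far from 1.0: a different position was pooled
```

The two pooled vectors come from entirely different sequence positions, which is why the observed similarity collapses rather than degrading gradually.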

The model's tokenizer_config.json has "add_eos_token": true because Mistral uses EOS as a generation stop token — but for LlamaTokenizerFast, this was always handled by the post-processor, not the Python attribute. The attribute existed in v4 but was a no-op for fast tokenizers.

Modifications

In _fix_v5_add_bos_eos_token(), skip restoring add_eos_token when the loaded tokenizer is a PreTrainedTokenizerFast instance. This matches v5 vanilla behavior (which also doesn't restore it) and keeps sglang consistent with the HF reference for fast tokenizers. add_bos_token restoration is unaffected.

# fast tokenizers never applied add_eos_token via the Python attribute in v4
# — restoring it diverges from HF vanilla v5 and breaks embedding model accuracy
if attr == "add_eos_token" and isinstance(tokenizer, PreTrainedTokenizerFast):
    config_val = _V4_DEFAULTS["add_eos_token"]  # False
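For context, the guard's decision logic can be sketched as a standalone function (hypothetical helper name and simplified signature; only the `_V4_DEFAULTS` lookup mirrors the snippet above):

```python
# Hypothetical stand-in for sglang's restoration logic; names are illustrative.
_V4_DEFAULTS = {"add_bos_token": True, "add_eos_token": False}

def resolve_restored_value(attr: str, config_val: bool, tokenizer_is_fast: bool) -> bool:
    """Value to restore for a v4 bos/eos flag that transformers v5 stripped."""
    if attr == "add_eos_token" and tokenizer_is_fast:
        # Fast tokenizers handle EOS in tokenizer.json's post-processor;
        # restoring the Python-level flag would append EOS a second way.
        return _V4_DEFAULTS["add_eos_token"]  # False
    return config_val

print(resolve_restored_value("add_eos_token", True, True))   # fast: False
print(resolve_restored_value("add_eos_token", True, False))  # slow: True (config honored)
print(resolve_restored_value("add_bos_token", True, True))   # bos restoration unaffected: True
```

Slow (Python-level) tokenizers keep the config-driven behavior, which is why only the fast-tokenizer path needed the guard.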

Accuracy Tests

Tested on intfloat/e5-mistral-7b-instruct with a full clean reinstall (sglang + sglang-kernel uninstalled and reinstalled from scratch):

| Test set | Similarities (before fix) | Similarities (after fix) |
|----------|---------------------------|--------------------------|
| Set 1    | 0.334, 0.387, 0.431       | 1.000, 1.000, 1.000      |
| Set 2    | 0.334, 0.387, 0.431       | 1.000, 1.000, 1.000      |
| Set 3    | 0.334, 0.387, 0.431       | 1.000, 1.000, 1.000      |

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.



@JustinTong0323 JustinTong0323 left a comment


LGTM. Root cause analysis is clear and the fix is minimal and correct. For fast tokenizers, add_eos_token should remain False to match HF reference behavior. Good bisection work.


@Kangyan-Zhou

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Apr 2, 2026
@dougyster dougyster force-pushed the embedding-regression-fix branch from 091dce5 to 619e344 on April 3, 2026 05:38
@Fridge003 Fridge003 merged commit a94c380 into main Apr 4, 2026
360 of 425 checks passed
@Fridge003 Fridge003 deleted the embedding-regression-fix branch April 4, 2026 07:11
sundar24295s pushed a commit to sundar24295s/sglang that referenced this pull request Apr 4, 2026