[TEMP FOR DOCKER BUILD - WILL DELETE LATER] Add Mistral Small 4 support with patched transformers #20713

Closed
dougyster wants to merge 30 commits into main from mistral4-support

@dougyster

Collaborator

Summary

Changes

  • All SGLang-side Mistral 4 changes (config loading, vision processor, reasoning parser, chat template fallback)
  • Dockerfile: installs dougyster/transformers@mistral-4-patch which includes:
    • HuggingFace transformers main (with Mistral 4 model support from huggingface/transformers#44760, "Add Mistral 4")
    • Tekken tokenizer fix: correct vocab ID offset by num_special_tokens
    • Tekken converter: use full tokenizer_object instead of bare vocab+merges
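
The vocab ID offset mentioned in the Tekken tokenizer fix can be illustrated with a minimal sketch. It assumes a layout in which special tokens occupy IDs `0 .. num_special_tokens - 1` and each regular vocab entry is stored with a zero-based rank; the function name `build_token_to_id` and the sample tokens are hypothetical and are not taken from the patched transformers branch.

```python
from typing import Dict, List


def build_token_to_id(
    special_tokens: List[str],
    vocab_ranks: Dict[str, int],
    num_special_tokens: int,
) -> Dict[str, int]:
    """Map tokens to IDs, shifting regular vocab ranks past the special-token range.

    Assumption: special tokens take IDs 0..num_special_tokens-1 (some slots may
    be reserved and unused), and a regular token's final ID is
    rank + num_special_tokens.
    """
    if len(special_tokens) > num_special_tokens:
        raise ValueError("more special tokens than reserved slots")

    # Special tokens keep their positional IDs.
    token_to_id = {tok: i for i, tok in enumerate(special_tokens)}

    # Regular tokens are offset so they never collide with the special range.
    for tok, rank in vocab_ranks.items():
        token_to_id[tok] = rank + num_special_tokens
    return token_to_id


mapping = build_token_to_id(
    special_tokens=["<unk>", "<s>", "</s>"],
    vocab_ranks={"a": 0, "b": 1},
    num_special_tokens=4,  # one reserved slot is left unused
)
```

Without the offset, the regular token with rank 0 would share ID 0 with the first special token, which is the kind of mismatch the "Verify tokenizer produces correct token IDs" step in the test plan is meant to catch.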

Test plan

  • Build Docker image from this branch
  • Verify Mistral-Small-4-119B-2603 loads and generates correct output with --tp 2
  • Verify tokenizer produces correct token IDs

🤖 Generated with Claude Code

JustinTong0323 and others added 21 commits February 28, 2026 13:57
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
…size

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
…processor

- Use patch_size * spatial_merge_size as the effective patch size in
  PixtralImageProcessor so images resize to multiples of 28 (not 14),
  matching PatchMerger requirements with spatial_merge_size=2
- Remove manual _resize and get_patch_grid_size methods, relying on
  the correctly configured HF image processor instead
- Add multi-image offset splitting for per-image MultimodalDataItem
- Remove unused torch import
- Add --model flag (default "default") to avoid hardcoded model name
- Add --reasoning-effort flag passed as top-level request field
- Support local image paths via base64 data URI encoding
- Pass reasoning_effort and model as explicit parameters instead of
  smuggling through sampling_params dict
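
The effective patch size described in the Pixtral processor commit above can be sketched as follows. The values `patch_size=14` and `spatial_merge_size=2` follow from the "multiples of 28 (not 14)" wording; rounding each side up to the next multiple is an assumption made for illustration, and the helper names are hypothetical rather than part of `PixtralImageProcessor`.

```python
from typing import Tuple


def effective_patch_size(patch_size: int, spatial_merge_size: int) -> int:
    """Patch size seen by the resizer once patches are merged in groups."""
    return patch_size * spatial_merge_size


def round_up_to_multiple(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value (integer arithmetic)."""
    return -(-value // multiple) * multiple


def resized_dims(
    height: int,
    width: int,
    patch_size: int = 14,
    spatial_merge_size: int = 2,
) -> Tuple[int, int]:
    """Round both sides up to a multiple of the effective patch size (assumed policy)."""
    eff = effective_patch_size(patch_size, spatial_merge_size)
    return round_up_to_multiple(height, eff), round_up_to_multiple(width, eff)


h, w = resized_dims(100, 210)
# Each side now yields a patch grid divisible by the merge size.
grid = (h // 14, w // 14)
```

A width of 210 is already a multiple of 14, but it gives 15 patches per row, which cannot be merged in pairs. Resizing to a multiple of 28 avoids that case.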
…riable

The flashinfer trtllm_fp8_per_tensor_scale_moe already defaults activation_type
to Swiglu (3), which matches Mistral-Small-4's silu+gated config. Also replace
unused ncols with _ in pixtral processor.
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@dougyster dougyster changed the title from "Add Mistral Small 4 support with patched transformers" to "[TEMP FOR DOCKER BUILD - WILL DELETE LATER] Add Mistral Small 4 support with patched transformers" on Mar 16, 2026
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
JustinTong0323 and others added 3 commits March 16, 2026 23:48
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
… tokens

Mistral's tokenizer defines [THINK] (id=34) and [/THINK] (id=35) as
special tokens. When skip_special_tokens=True (the default), these
tokens are stripped during decoding, making the reasoning parser unable
to detect thinking boundaries and split reasoning_content from content.

This is an upstream issue in the Mistral checkpoint/tokenizer config:
reasoning markers should not be special tokens (cf. DeepSeek's
<think>/</think>, which are regular tokens and work without workarounds).

As a workaround, disable skip_special_tokens when the Mistral reasoning
parser is active and reasoning_effort is set.
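
The failure mode and the workaround described above can be reproduced with a toy decoder. This is a sketch under stated assumptions: only the IDs 34 and 35 come from the commit message; the other token IDs, the parser name `"mistral"`, and all function names are hypothetical and do not mirror SGLang's actual code.

```python
from typing import List, Optional, Tuple

# IDs 34/35 are from the commit message; 100/101 are made-up regular tokens.
VOCAB = {34: "[THINK]", 35: "[/THINK]", 100: "2+2 is 4.", 101: "The answer is 4."}
SPECIAL_IDS = {34, 35}


def decode(ids: List[int], skip_special_tokens: bool = True) -> str:
    """Toy decoder: drops special tokens when skip_special_tokens is True."""
    return "".join(
        VOCAB[i] for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)
    )


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split text into (reasoning_content, content) using the [THINK] markers."""
    open_tag, close_tag = "[THINK]", "[/THINK]"
    start, end = text.find(open_tag), text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return None, text  # markers missing: nothing to split
    reasoning = text[start + len(open_tag):end]
    content = text[:start] + text[end + len(close_tag):]
    return reasoning, content


def effective_skip_special_tokens(
    requested: bool, reasoning_parser: Optional[str], reasoning_effort: Optional[str]
) -> bool:
    """Workaround: keep special tokens when the Mistral parser must see them."""
    if reasoning_parser == "mistral" and reasoning_effort is not None:
        return False
    return requested


ids = [34, 100, 35, 101]
stripped = split_reasoning(decode(ids, skip_special_tokens=True))
kept = split_reasoning(
    decode(ids, effective_skip_special_tokens(True, "mistral", "high"))
)
```

With the default setting the markers are gone before the parser runs, so the reasoning text leaks into `content`. With the override, the parser sees both markers and can separate the two fields.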
JustinTong0323 and others added 3 commits March 17, 2026 04:08
The EAGLE draft model for Mistral Small 4 (mistralai/Mistral-Small-4-119B-2603-eagle)
uses dense MLA layers without MoE, unlike the Mistral Large 3 EAGLE draft model, which has MoE.
This caused three issues:

1. `adapt_config_dict` in mistral_utils.py did not handle dense EAGLE models
   (moe=null in params.json), falling through to an unsupported architecture.
   Fix: add a branch for `is_eagle and not is_moe` that sets model_type=deepseek_v3
   with all-dense MoE overrides (first_k_dense_replace=num_layers).

2. `_remap_mistral_yarn_args` did not include rope_theta in rope_scaling,
   causing transformers yarn validation to fail.
   Fix: copy rope_theta into the rope_scaling dict.

3. `MistralLarge3ForCausalLMEagle.__init__` set `self.model_cls` but
   `DeepseekV2ForCausalLM.__init__` hardcodes `self.model = DeepseekV2Model`,
   so the EAGLE fc layer was never created. The draft model ran without fusing
   token embeddings with target hidden states, producing garbage draft tokens
   (accept rate 0.25).
   Fix: call super().__init__() then replace self.model with
   MistralLarge3EagleModel which has the fc layer. Accept rate: 0.25 -> 0.83.
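
Fixes 2 and 3 above can be illustrated with a small, self-contained sketch. The classes are stand-ins with hypothetical names and are not the real `DeepseekV2ForCausalLM` or `MistralLarge3EagleModel`; the `has_fc` flag merely marks where the EAGLE fc layer would exist, and the `rope_scaling` values are made up.

```python
from typing import Any, Dict


def remap_yarn_args(rope_scaling: Dict[str, Any], rope_theta: float) -> Dict[str, Any]:
    """Fix 2 (sketch): carry rope_theta inside the rope_scaling dict."""
    remapped = dict(rope_scaling)  # copy so the caller's dict is untouched
    remapped["rope_theta"] = rope_theta
    return remapped


class BaseModel:
    """Stand-in for the base decoder; it has no fc fusion layer."""

    def __init__(self, hidden_size: int) -> None:
        self.hidden_size = hidden_size
        self.has_fc = False


class EagleModel(BaseModel):
    """Stand-in for the EAGLE draft decoder, which adds the fc layer."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__(hidden_size)
        self.has_fc = True


class BaseForCausalLM:
    def __init__(self, hidden_size: int) -> None:
        # The parent hardcodes the model class and never reads self.model_cls.
        self.model = BaseModel(hidden_size)


class BrokenEagleForCausalLM(BaseForCausalLM):
    def __init__(self, hidden_size: int) -> None:
        self.model_cls = EagleModel  # ignored by the parent constructor
        super().__init__(hidden_size)


class FixedEagleForCausalLM(BaseForCausalLM):
    def __init__(self, hidden_size: int) -> None:
        super().__init__(hidden_size)
        # Fix 3 (sketch): replace the hardcoded model after the parent runs.
        self.model = EagleModel(hidden_size)


scaling = remap_yarn_args({"type": "yarn", "factor": 8.0}, rope_theta=10000.0)
broken = BrokenEagleForCausalLM(hidden_size=64)
fixed = FixedEagleForCausalLM(hidden_size=64)
```

The broken variant raises no error, which is why the problem only showed up as a low accept rate: the draft model ran, just without the fc layer.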
@dougyster dougyster closed this Mar 18, 2026