[TEMP FOR DOCKER BUILD - WILL DELETE LATER] Add Mistral Small 4 support with patched transformers#20713
Closed
dougyster wants to merge 30 commits into
Conversation
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
…size Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
…processor

- Use `patch_size * spatial_merge_size` as the effective patch size in `PixtralImageProcessor` so images resize to multiples of 28 (not 14), matching `PatchMerger` requirements with `spatial_merge_size=2`
- Remove the manual `_resize` and `get_patch_grid_size` methods, relying on the correctly configured HF image processor instead
- Add multi-image offset splitting for per-image `MultimodalDataItem`
- Remove unused `torch` import
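The effective patch size arithmetic from the first bullet can be illustrated with a minimal sketch. The helper names below are hypothetical, `patch_size=14` is inferred from the "multiples of 28 (not 14)" remark, and rounding up to the next multiple is an assumption; the actual HF image processor may choose its target size differently.

```python
import math


def effective_patch_size(patch_size: int, spatial_merge_size: int) -> int:
    """Patch size seen by the resizer once spatial merging is accounted for."""
    return patch_size * spatial_merge_size


def snap_to_patch_grid(height: int, width: int,
                       patch_size: int = 14,
                       spatial_merge_size: int = 2) -> tuple[int, int]:
    """Hypothetical helper: round each side UP to a multiple of the effective patch size.

    Rounding up is an assumption made for illustration only.
    """
    step = effective_patch_size(patch_size, spatial_merge_size)
    new_h = max(step, math.ceil(height / step) * step)
    new_w = max(step, math.ceil(width / step) * step)
    return new_h, new_w


if __name__ == "__main__":
    h, w = snap_to_patch_grid(100, 57)
    # Both sides are now divisible by 28, so a 2x2 patch merge has no remainder.
    print(h, w, h % 28, w % 28)
```

With a step of 14 instead of 28, a side such as 98 would pass the resizer but leave an odd number of patches (7), which a 2x2 merge cannot consume evenly.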
- Add `--model` flag (default "default") to avoid a hardcoded model name
- Add `--reasoning-effort` flag, passed as a top-level request field
- Support local image paths via base64 data URI encoding
- Pass `reasoning_effort` and `model` as explicit parameters instead of smuggling them through the `sampling_params` dict
…riable

The flashinfer `trtllm_fp8_per_tensor_scale_moe` already defaults `activation_type` to Swiglu (3), which matches Mistral-Small-4's silu+gated config. Also replace unused `ncols` with `_` in the pixtral processor.
…al with 0% accuracy when thinking
bd7113b to b40d2f9
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
… tokens

Mistral's tokenizer defines `[THINK]` (id=34) and `[/THINK]` (id=35) as special tokens. When `skip_special_tokens=True` (the default), these tokens are stripped during decoding, making the reasoning parser unable to detect thinking boundaries and split `reasoning_content` from `content`.

This is an upstream issue in the Mistral checkpoint/tokenizer config: reasoning markers should not be special tokens (cf. DeepSeek's `<think>`/`</think>`, which are regular tokens and work without workarounds).

As a workaround, disable `skip_special_tokens` when the Mistral reasoning parser is active and `reasoning_effort` is set.
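The failure mode can be reproduced with a toy decoder. This is a minimal sketch, not the project's parser: only the ids 34 and 35 come from the commit message, while the other token ids, the helper names, and the parser name `"mistral"` are assumptions made for illustration.

```python
THINK_OPEN, THINK_CLOSE = "[THINK]", "[/THINK]"

# Toy vocabulary: ids 34/35 are from the commit message, the rest are made up.
VOCAB = {34: THINK_OPEN, 35: THINK_CLOSE, 100: "2+2", 101: "=4", 102: "4"}
SPECIAL_IDS = {34, 35}


def toy_decode(token_ids, skip_special_tokens=True):
    """Join token strings, optionally dropping special tokens like a real tokenizer."""
    return "".join(
        VOCAB[t] for t in token_ids
        if not (skip_special_tokens and t in SPECIAL_IDS)
    )


def split_reasoning(decoded):
    """Return (reasoning_content, content); reasoning is None if no marker is found."""
    start = decoded.find(THINK_OPEN)
    if start == -1:
        return None, decoded
    body_start = start + len(THINK_OPEN)
    end = decoded.find(THINK_CLOSE, body_start)
    if end == -1:  # thinking block never closed: everything after the marker is reasoning
        return decoded[body_start:].strip(), decoded[:start].strip()
    content = decoded[:start] + decoded[end + len(THINK_CLOSE):]
    return decoded[body_start:end].strip(), content.strip()


def effective_skip_special_tokens(requested, reasoning_parser, reasoning_effort):
    """Workaround: keep special tokens when the (assumed) 'mistral' parser is thinking."""
    if reasoning_parser == "mistral" and reasoning_effort is not None:
        return False
    return requested


if __name__ == "__main__":
    ids = [34, 100, 101, 35, 102]
    print(split_reasoning(toy_decode(ids, skip_special_tokens=True)))
    skip = effective_skip_special_tokens(True, "mistral", "high")
    print(split_reasoning(toy_decode(ids, skip_special_tokens=skip)))
```

In the first call the markers are stripped, so reasoning and answer fuse into one string and no `reasoning_content` is recovered; in the second the markers survive decoding and the split works.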
b40d2f9 to 89d23b3
The EAGLE draft model for Mistral Small 4 (mistralai/Mistral-Small-4-119B-2603-eagle) uses dense MLA layers without MoE, unlike the Mistral Large 3 EAGLE, which has MoE. This caused three issues:

1. `adapt_config_dict` in mistral_utils.py did not handle dense EAGLE models (`moe=null` in params.json), falling through to an unsupported architecture. Fix: add a branch for `is_eagle and not is_moe` that sets `model_type=deepseek_v3` with all-dense MoE overrides (`first_k_dense_replace=num_layers`).
2. `_remap_mistral_yarn_args` did not include `rope_theta` in `rope_scaling`, causing transformers yarn validation to fail. Fix: copy `rope_theta` into the `rope_scaling` dict.
3. `MistralLarge3ForCausalLMEagle.__init__` set `self.model_cls`, but `DeepseekV2ForCausalLM.__init__` hardcodes `self.model = DeepseekV2Model`, so the EAGLE fc layer was never created. The draft model ran without fusing token embeddings with target hidden states, producing garbage draft tokens (accept rate 0.25). Fix: call `super().__init__()`, then replace `self.model` with `MistralLarge3EagleModel`, which has the fc layer.

Accept rate: 0.25 -> 0.83.
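The third issue is a general Python pitfall that can be shown with stand-in classes. This is a minimal sketch: the toy classes below only mimic the roles of `DeepseekV2Model`, `MistralLarge3EagleModel`, and the two `ForCausalLM` classes, and the shape recorded for `fc` is an assumption, not the real layer definition.

```python
class ToyBaseModel:
    """Stand-in for the base decoder model (no fc layer)."""

    def __init__(self, hidden: int):
        self.hidden = hidden


class ToyEagleModel(ToyBaseModel):
    """Stand-in for the EAGLE model; adds the fc layer that fuses two inputs."""

    def __init__(self, hidden: int):
        super().__init__(hidden)
        # Placeholder for a projection over [token embedding ; target hidden state].
        # The (2 * hidden -> hidden) shape is an assumption for illustration.
        self.fc = ("linear", 2 * hidden, hidden)


class ToyBaseForCausalLM:
    """Stand-in for the parent class: it hardcodes the model type."""

    def __init__(self, hidden: int):
        self.model = ToyBaseModel(hidden)  # ignores any self.model_cls set by a subclass


class BrokenEagle(ToyBaseForCausalLM):
    def __init__(self, hidden: int):
        self.model_cls = ToyEagleModel  # has no effect: the parent never reads it
        super().__init__(hidden)


class FixedEagle(ToyBaseForCausalLM):
    def __init__(self, hidden: int):
        super().__init__(hidden)
        # Replace the hardcoded base model with the EAGLE variant after parent init.
        self.model = ToyEagleModel(hidden)


if __name__ == "__main__":
    print(hasattr(BrokenEagle(8).model, "fc"))
    print(hasattr(FixedEagle(8).model, "fc"))
```

Replacing `self.model` after `super().__init__()` builds the base model once and discards it, which is wasteful but leaves the parent class untouched; making the parent honor `self.model_cls` would be the alternative design.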
89d23b3 to c0bf47b
Summary

Changes

dougyster/transformers@mistral-4-patch, which includes:
- num_special_tokens
- tokenizer_object instead of bare vocab+merges

Test plan

--tp 2

🤖 Generated with Claude Code