
fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer #30434

Closed

kitaekatt wants to merge 2 commits into vllm-project:main from kitaekatt:fix/gguf-eos-token-extraction


Conversation

@kitaekatt
Contributor

Summary

GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id, which can differ from the GGUF value.

This causes generation to not stop properly for models like Gemma 3, where:

  • GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
  • HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for token ID 1, which never comes, so it keeps emitting <end_of_turn> until max_tokens is reached.

Example output before fix:

Response: Hello! I'm doing well, thank you!<end_of_turn><end_of_turn><end_of_turn>...(repeats until max_tokens)
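
For reference, a minimal sketch of reading this metadata field with the gguf-py package (read_gguf_eos_token_id is a hypothetical name; the helper added in this PR may differ):

from typing import Optional

from gguf import GGUFReader

def read_gguf_eos_token_id(gguf_path: str) -> Optional[int]:
    """Read tokenizer.ggml.eos_token_id from GGUF metadata, or None if absent."""
    reader = GGUFReader(gguf_path)
    field = reader.get_field("tokenizer.ggml.eos_token_id")
    if field is None:
        return None
    # For scalar fields, data[0] indexes the part that holds the value.
    return int(field.parts[field.data[0]][0])

For the Gemma 3 case above, this reads 106 from the GGUF file while the HF tokenizer reports 1.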

Changes

  • Add extract_eos_token_id_from_gguf() in gguf_utils.py to read EOS from GGUF metadata
  • Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers
  • Log when EOS token ID is patched for debugging

Testing

  • Tested with unsloth/gemma-3-12b-it-GGUF model
  • Verified GGUF EOS token ID (106) is extracted correctly
  • Generation now stops properly at <end_of_turn> token

Notes

This is a common issue with GGUF models that use different EOS tokens than the base HuggingFace tokenizer. The fix ensures vLLM uses the authoritative EOS token ID from the GGUF file itself.


@gemini-code-assist (Bot) left a comment


Code Review

This pull request correctly identifies and fixes an issue where the EOS token ID from GGUF metadata was not being used, leading to incorrect generation termination for certain models. The introduction of extract_eos_token_id_from_gguf and the patching logic in HfTokenizer are well-implemented.

However, I've identified a potential bug in the way the GGUF file path is resolved for remote models, which could cause the fix to not work in those scenarios. I've provided a detailed comment and a code suggestion to address this. Once this is fixed, the PR should be in good shape.

Comment thread: vllm/tokenizers/hf.py (Outdated)
Comment on lines +133 to +144
gguf_path = Path(path_or_repo_id) / gguf_file
gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
if gguf_eos_id is not None:
hf_eos_id = tokenizer.eos_token_id
if hf_eos_id != gguf_eos_id:
logger.info(
"Patching tokenizer eos_token_id from %d to %d "
"(using GGUF metadata)",
hf_eos_id,
gguf_eos_id,
)
tokenizer.eos_token_id = gguf_eos_id
Severity: high

The current logic for constructing gguf_path is incorrect for remote GGUF models. When path_or_repo_id is a HuggingFace repository ID, Path(path_or_repo_id) creates a local relative path that doesn't exist, causing extract_eos_token_id_from_gguf to fail silently by returning None.

To correctly handle both local paths and remote repository IDs, you should first check if path_or_repo_id is a local directory. If not, assume it's a repository ID and use hf_hub_download to get the local path of the GGUF file. This ensures the EOS token ID can be extracted correctly for remote models.

Note that this might cause the GGUF file to be downloaded twice (once here and once during model loading), which is a performance consideration for a follow-up improvement. For now, this change ensures correctness.

Please add the following imports at the top of the file:

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import HfHubHTTPError
            # Resolve the GGUF file location: path_or_repo_id may be a local
            # directory or a remote HuggingFace repository ID.
            gguf_path = None
            if Path(path_or_repo_id).is_dir():
                gguf_path = Path(path_or_repo_id) / gguf_file
            else:
                # Remote repo: download the GGUF file into the local HF cache
                # so its metadata can be read.
                try:
                    gguf_path = hf_hub_download(
                        repo_id=str(path_or_repo_id),
                        filename=gguf_file,
                        revision=revision,
                        cache_dir=download_dir,
                    )
                except (HfHubHTTPError, FileNotFoundError) as e:
                    logger.warning("Failed to download GGUF file %s: %s", gguf_file, e)

            if gguf_path and Path(gguf_path).is_file():
                gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
                if gguf_eos_id is not None:
                    hf_eos_id = tokenizer.eos_token_id
                    if hf_eos_id != gguf_eos_id:
                        logger.info(
                            "Patching tokenizer eos_token_id from %d to %d "
                            "(using GGUF metadata)",
                            hf_eos_id,
                            gguf_eos_id,
                        )
                        tokenizer.eos_token_id = gguf_eos_id

@mergify

mergify Bot commented Dec 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@kitaekatt
Contributor Author

Testing performed:

Tested with GGUF models on RTX 5090 (32GB, Blackwell architecture):

  • bartowski/NousResearch_Hermes-4-14B-GGUF
  • tensorblock/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

Ran models through a local benchmark runner for HumanEval and GSM8K. Verified server startup completes reliably without hangs.

Related PRs:

This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility:

@kitaekatt kitaekatt marked this pull request as ready for review January 21, 2026 00:28

@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread: vllm/tokenizers/hf.py (Outdated)
"(using GGUF metadata)",
hf_eos_id,
gguf_eos_id,
)

Logger crashes when HF tokenizer has no EOS token

Medium Severity

The logger.info call uses the %d format specifier for hf_eos_id, but tokenizer.eos_token_id can be None (it is typed as Optional[int] in transformers). When hf_eos_id is None, the comparison hf_eos_id != gguf_eos_id evaluates to True, and the subsequent logging call raises a TypeError because %d cannot format None.
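
A None-safe variant of the quoted block, as a sketch against the same names (logger, hf_eos_id, gguf_eos_id, tokenizer are the existing variables; the final fix may look different):

if hf_eos_id != gguf_eos_id:
    logger.info(
        # %s formats None safely, whereas %d raises TypeError for None
        "Patching tokenizer eos_token_id from %s to %s (using GGUF metadata)",
        hf_eos_id,
        gguf_eos_id,
    )
    tokenizer.eos_token_id = gguf_eos_id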


Contributor Author


This is technically valid but pre-existing behavior (the original eager initialization had the same pattern). For this to actually leak, cached_tokenizer_from_config() would need to fail transiently then succeed on retry, which doesn't happen in practice - tokenizer failures are deterministic (missing files, corrupt model, OOM).

I’d prefer to keep this PR focused on the semaphore fix.

@kitaekatt kitaekatt force-pushed the fix/gguf-eos-token-extraction branch from 753d7d8 to 3b8b49a Compare February 5, 2026 00:18
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id
metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id,
which can differ from the GGUF value.

This causes generation to not stop properly for models like Gemma 3, where:
- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for
token ID 1 which never comes, resulting in repeated EOS tokens until
max_tokens is reached.

Changes:
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read EOS from GGUF
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers

Signed-off-by: Christina Zhu <christina.zhu@hotmail.com>
Signed-off-by: Christina <truffle@gmail.com>
Two bugs prevented EOS token patching from working:

1. AutoTokenizer.from_pretrained() pops gguf_file from kwargs internally,
   consuming it before vLLM could use it. Fixed by saving gguf_file
   before calling AutoTokenizer.from_pretrained().

2. extract_eos_token_id_from_gguf() returned early if path didn't end
   with .gguf extension. HuggingFace Hub stores GGUF files as blob
   hashes without extensions (e.g., ea6d227c...). Removed the extension
   check since the caller validates via check_gguf_file().

Validated: Gemma GGUF models now correctly patch EOS token ID from 1
(HF default <eos>) to 106 (GGUF metadata <end_of_turn>), preventing
infinite generation loops.

Signed-off-by: Christina Hammer <christina@hammer.net>
Signed-off-by: Christina <truffle@gmail.com>
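
As an illustration of the two fixes above, a minimal sketch (load_gguf_tokenizer and looks_like_gguf are hypothetical names, read_gguf_eos_token_id refers to the extraction sketch near the top of this PR, and the actual hf.py/gguf_utils.py changes may differ):

from pathlib import Path

from transformers import AutoTokenizer

def looks_like_gguf(path: str) -> bool:
    # Validate by magic bytes rather than the .gguf extension, since the HF Hub
    # cache stores files under content-hash names with no extension.
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

def load_gguf_tokenizer(path_or_repo_id: str, **kwargs):
    # Capture gguf_file before AutoTokenizer.from_pretrained() consumes it.
    gguf_file = kwargs.get("gguf_file")
    tokenizer = AutoTokenizer.from_pretrained(path_or_repo_id, **kwargs)
    if gguf_file is not None:
        gguf_path = Path(path_or_repo_id) / gguf_file  # local-directory case only
        if gguf_path.is_file() and looks_like_gguf(str(gguf_path)):
            gguf_eos_id = read_gguf_eos_token_id(str(gguf_path))
            if gguf_eos_id is not None and gguf_eos_id != tokenizer.eos_token_id:
                tokenizer.eos_token_id = gguf_eos_id
    return tokenizer
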
@kitaekatt
Contributor Author

Validation Results

| vLLM | transformers | Cherry-picked PRs | HumanEval | IFEval |
| --- | --- | --- | --- | --- |
| HEAD | 5.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | gem2-2b-gguf (42.1%), gemma3-1b (26.8%) | gem2-2b-gguf (65.6%) |
| HEAD | 4.x | #30410, #30411, #30412, #30413, #30424, #30434, #30699, #30702, #31464, #33846 | q3-moe-gguf (83.5%) | q3-moe-gguf (85.4%) |

Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied.

Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary — 3 GGUF models crash without it.

@Isotr0py
Member

Hi @kitaekatt, can you consolidate those cherry-picked PRs into one PR? Because most of them are minor and gemma2-specific.

kitaekatt added a commit to kitaekatt/vllm that referenced this pull request Mar 16, 2026
This PR consolidates four related GGUF bug fixes for Gemma2 and Gemma3
models, plus a style improvement from reviewer feedback.

**1. Add quant_config to embedding layer (PR vllm-project#30424)**
Pass quant_config to VocabParallelEmbedding in Gemma2Model so that
GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod.
Without this, quantized bytes are read as raw floats producing gibberish.

**2. Fix EOS token extraction for HF blob paths (PR vllm-project#30434)**
GGUF files served from HuggingFace Hub use blob paths that don't match
the expected filename pattern. Extract EOS token ID directly from GGUF
metadata as a reliable fallback.

**3. Skip missing parameters during GGUF weight loading (PR vllm-project#30699)**
Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type).
Skip missing parameters gracefully instead of raising KeyError.

**4. Use RMSNorm instead of GemmaRMSNorm for GGUF (PR vllm-project#31464)**
GGUF files store RMSNorm weights with +1 baked in (llama.cpp convention).
GemmaRMSNorm adds 1 in its forward pass, causing double addition.
Select plain RMSNorm at construction time for GGUF models. Applied to
all norm layers in Gemma2 and Gemma3 (including q_norm/k_norm).

**Style: compact rms_norm_kwargs pattern (reviewer feedback)**
Use rms_norm_kwargs dict to avoid repeating constructor arguments,
per hmellor's review on PR vllm-project#31464.

Tested on RTX 5090 (Blackwell, SM 120) with gem2-2b-gguf and gemma3-1b.
Supersedes PRs vllm-project#30424, vllm-project#30434, vllm-project#30699, vllm-project#31464.

Signed-off-by: Christina <truffle@gmail.com>
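
To make fix 4 concrete, an illustrative sketch of the +1 discrepancy (rms_norm and gemma_rms_norm are simplified stand-ins, not vLLM's actual layers):

import torch

def rms_norm(x, weight, eps=1e-6):
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * weight              # plain RMSNorm: weight applied as stored

def gemma_rms_norm(x, weight, eps=1e-6):
    x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * (1.0 + weight)      # GemmaRMSNorm: adds 1 in the forward pass

# GGUF (llama.cpp convention) stores Gemma norm weights with the +1 already
# baked in, so feeding them to GemmaRMSNorm would apply the +1 twice.
w_hf = torch.full((8,), 0.5)       # HF-style weight, without the +1
w_gguf = w_hf + 1.0                # GGUF-style weight, with the +1 baked in
x = torch.randn(2, 8)
assert torch.allclose(gemma_rms_norm(x, w_hf), rms_norm(x, w_gguf))
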
@kitaekatt
Contributor Author

Closing in favor of consolidated PR #37220, as requested by @Isotr0py in #30434. All fixes from this PR are included in the consolidated version.

@kitaekatt kitaekatt closed this Mar 16, 2026