fix(gguf): Use EOS token ID from GGUF metadata instead of HF tokenizer (#30434)
kitaekatt wants to merge 2 commits.
Conversation
Code Review
This pull request correctly identifies and fixes an issue where the EOS token ID from GGUF metadata was not being used, leading to incorrect generation termination for certain models. The introduction of extract_eos_token_id_from_gguf and the patching logic in HfTokenizer are well-implemented.
However, I've identified a potential bug in the way the GGUF file path is resolved for remote models, which could cause the fix to not work in those scenarios. I've provided a detailed comment and a code suggestion to address this. Once this is fixed, the PR should be in good shape.
```python
gguf_path = Path(path_or_repo_id) / gguf_file
gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
if gguf_eos_id is not None:
    hf_eos_id = tokenizer.eos_token_id
    if hf_eos_id != gguf_eos_id:
        logger.info(
            "Patching tokenizer eos_token_id from %d to %d "
            "(using GGUF metadata)",
            hf_eos_id,
            gguf_eos_id,
        )
        tokenizer.eos_token_id = gguf_eos_id
```
The current logic for constructing gguf_path is incorrect for remote GGUF models. When path_or_repo_id is a HuggingFace repository ID, Path(path_or_repo_id) creates a local relative path that doesn't exist, causing extract_eos_token_id_from_gguf to fail silently by returning None.
To correctly handle both local paths and remote repository IDs, you should first check if path_or_repo_id is a local directory. If not, assume it's a repository ID and use hf_hub_download to get the local path of the GGUF file. This ensures the EOS token ID can be extracted correctly for remote models.
Note that this might cause the GGUF file to be downloaded twice (once here and once during model loading), which is a performance consideration for a follow-up improvement. For now, this change ensures correctness.
Please add the following imports at the top of the file:
```python
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import HfHubHTTPError

gguf_path = None
if Path(path_or_repo_id).is_dir():
    gguf_path = Path(path_or_repo_id) / gguf_file
else:
    try:
        gguf_path = hf_hub_download(
            repo_id=str(path_or_repo_id),
            filename=gguf_file,
            revision=revision,
            cache_dir=download_dir,
        )
    except (HfHubHTTPError, FileNotFoundError) as e:
        logger.warning("Failed to download GGUF file %s: %s", gguf_file, e)
if gguf_path and Path(gguf_path).is_file():
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None:
        hf_eos_id = tokenizer.eos_token_id
        if hf_eos_id != gguf_eos_id:
            logger.info(
                "Patching tokenizer eos_token_id from %d to %d "
                "(using GGUF metadata)",
                hf_eos_id,
                gguf_eos_id,
            )
            tokenizer.eos_token_id = gguf_eos_id
```
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 12fc6ba to b42a091.
Force-pushed from b42a091 to 8c43934.
Testing performed: Tested with GGUF models on an RTX 5090 (32GB, Blackwell architecture). Ran models through a local benchmark runner for HumanEval and GSM8K, and verified server startup completes reliably without hangs.

Related PRs: This is part of a series of GGUF pipeline fixes for Blackwell GPU compatibility.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
            "(using GGUF metadata)",
            hf_eos_id,
            gguf_eos_id,
        )
```
Logger crashes when HF tokenizer has no EOS token
Medium Severity
The logger.info call uses %d format specifier for hf_eos_id, but tokenizer.eos_token_id can be None (it's typed as Optional[int] in transformers). When hf_eos_id is None, the comparison hf_eos_id != gguf_eos_id evaluates to True, and the subsequent logging statement raises a TypeError because %d cannot format None.
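A minimal sketch of one way to avoid this crash (illustrative only, not the PR's actual code; the helper name and fake tokenizer are hypothetical): log the possibly-None value with `%s`, which formats `None` safely, where `%d` would raise `TypeError`.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("eos_patch_demo")


def patch_eos(tokenizer, gguf_eos_id: int) -> None:
    """Patch eos_token_id, tolerating a tokenizer whose EOS is unset (None)."""
    hf_eos_id = tokenizer.eos_token_id
    if hf_eos_id != gguf_eos_id:
        # %s formats both ints and None; "%d" % None raises TypeError
        logger.info(
            "Patching tokenizer eos_token_id from %s to %d (using GGUF metadata)",
            hf_eos_id,
            gguf_eos_id,
        )
        tokenizer.eos_token_id = gguf_eos_id
```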
This is technically valid but pre-existing behavior (the original eager initialization had the same pattern). For this to actually leak, cached_tokenizer_from_config() would need to fail transiently then succeed on retry, which doesn't happen in practice - tokenizer failures are deterministic (missing files, corrupt model, OOM).
I’d prefer to keep this PR focused on the semaphore fix.
Force-pushed from 753d7d8 to 3b8b49a.
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id, which can differ from the GGUF value. This causes generation to not stop properly for models like Gemma 3, where:

- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for token ID 1, which never comes, resulting in repeated EOS tokens until max_tokens is reached.

Changes:
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read the EOS token ID from GGUF metadata
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers

Signed-off-by: Christina Zhu <christina.zhu@hotmail.com>
Signed-off-by: Christina <truffle@gmail.com>
Two bugs prevented EOS token patching from working:

1. AutoTokenizer.from_pretrained() pops gguf_file from kwargs internally, consuming it before vLLM could use it. Fixed by saving gguf_file before calling AutoTokenizer.from_pretrained().
2. extract_eos_token_id_from_gguf() returned early if the path didn't end with a .gguf extension. HuggingFace Hub stores GGUF files as blob hashes without extensions (e.g., ea6d227c...). Removed the extension check since the caller validates via check_gguf_file().

Validated: Gemma GGUF models now correctly patch the EOS token ID from 1 (HF default <eos>) to 106 (GGUF metadata <end_of_turn>), preventing infinite generation loops.

Signed-off-by: Christina Hammer <christina@hammer.net>
Signed-off-by: Christina <truffle@gmail.com>
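Since an extension check cannot identify Hub blob-hash files, validation has to look at content instead of filename. A minimal sketch of a magic-byte check in the spirit of check_gguf_file (the helper name here is illustrative, not vLLM's implementation) relies on the fact that every GGUF file begins with the 4-byte magic `b"GGUF"`:

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def looks_like_gguf(path: str) -> bool:
    """Detect a GGUF file by its magic bytes, ignoring the filename.

    HuggingFace Hub may store GGUF files under blob-hash names with no
    extension (e.g. "ea6d227c..."), so an extension check would miss them.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```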
Force-pushed from 3b8b49a to 9c4b9e2.
Validation Results
Tested on RTX 5090 (Blackwell, SM 120) with all listed PRs cherry-picked together; models listed under each benchmark passed that benchmark in the given environment, while the same models crash or fail without these PRs applied. Rebased to current upstream HEAD and re-validated on RTX 5090 (Blackwell, SM 120). Fix confirmed still necessary: 3 GGUF models crash without it.
Hi @kitaekatt, can you consolidate those cherry-picked PRs into one PR? Because most of them are minor and Gemma2-specific.
This PR consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models, plus a style improvement from reviewer feedback.

**1. Add quant_config to embedding layer (PR vllm-project#30424)**
Pass quant_config to VocabParallelEmbedding in Gemma2Model so that GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod. Without this, quantized bytes are read as raw floats, producing gibberish.

**2. Fix EOS token extraction for HF blob paths (PR vllm-project#30434)**
GGUF files served from HuggingFace Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from GGUF metadata as a reliable fallback.

**3. Skip missing parameters during GGUF weight loading (PR vllm-project#30699)**
Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type). Skip missing parameters gracefully instead of raising KeyError.

**4. Use RMSNorm instead of GemmaRMSNorm for GGUF (PR vllm-project#31464)**
GGUF files store RMSNorm weights with +1 baked in (llama.cpp convention). GemmaRMSNorm adds 1 in its forward pass, causing double addition. Select plain RMSNorm at construction time for GGUF models. Applied to all norm layers in Gemma2 and Gemma3 (including q_norm/k_norm).

**Style: compact rms_norm_kwargs pattern (reviewer feedback)**
Use an rms_norm_kwargs dict to avoid repeating constructor arguments, per hmellor's review on PR vllm-project#31464.

Tested on RTX 5090 (Blackwell, SM 120) with gem2-2b-gguf and gemma3-1b.

Supersedes PRs vllm-project#30424, vllm-project#30434, vllm-project#30699, vllm-project#31464.

Signed-off-by: Christina <truffle@gmail.com>
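A minimal numeric sketch of why the double-addition in item 4 corrupts outputs (illustrative only; simplified stand-ins for the real norm layers, assuming GGUF stores the effective scale directly while HF Gemma checkpoints store the scale minus 1):

```python
def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm: scales by the stored weight as-is."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * w for v, w in zip(x, weight)]


def gemma_rms_norm(x, weight, eps=1e-6):
    """GemmaRMSNorm-style: scales by (1 + weight), expecting HF-convention weights."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms * (1.0 + w) for v, w in zip(x, weight)]


x = [1.0, 2.0, 3.0]
true_scale = [0.5, 0.5, 0.5]                # desired effective per-channel scale
gguf_weight = true_scale                    # GGUF bakes the +1 into the stored weight
hf_weight = [s - 1.0 for s in true_scale]   # HF Gemma checkpoints store scale - 1

correct = rms_norm(x, gguf_weight)          # GGUF weight + plain RMSNorm: correct
also_correct = gemma_rms_norm(x, hf_weight) # HF weight + GemmaRMSNorm: correct
double_added = gemma_rms_norm(x, gguf_weight)  # the bug: scale becomes 1 + (scale)
```

With a true scale of 0.5, the buggy path applies 1 + 0.5 = 1.5, tripling every output relative to the correct value.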
Summary
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id, which can differ from the GGUF value. This causes generation to not stop properly for models like Gemma 3, where:

- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for token ID 1, which never comes, resulting in repeated EOS tokens until max_tokens is reached.

Example output before fix:
Changes
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read the EOS token ID from GGUF metadata
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers

Testing
- Tested with the unsloth/gemma-3-12b-it-GGUF model
- Verified generation stops at the <end_of_turn> token

Notes
This is a common issue with GGUF models that use different EOS tokens than the base HuggingFace tokenizer. The fix ensures vLLM uses the authoritative EOS token ID from the GGUF file itself.
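The failure mode described in this PR can be sketched as a toy decode loop (illustrative only; a simplified stand-in for vLLM's stop check, with made-up token IDs for the text tokens):

```python
MAX_TOKENS = 8
END_OF_TURN = 106  # EOS per GGUF metadata
HF_EOS = 1         # EOS per the HF tokenizer config


def generate(stop_token_id: int) -> list[int]:
    """Toy decode loop: the model emits a couple of text tokens, then
    <end_of_turn> (106) repeatedly, as the affected Gemma 3 GGUF models do."""
    model_stream = [5, 7] + [END_OF_TURN] * MAX_TOKENS
    out = []
    for tok in model_stream:
        out.append(tok)
        if tok == stop_token_id or len(out) >= MAX_TOKENS:
            break
    return out


broken = generate(HF_EOS)       # waits for token 1, which never arrives: runs to max_tokens
fixed = generate(END_OF_TURN)   # stops at the first <end_of_turn>
```

Before the fix, generation fills the budget with repeated <end_of_turn> tokens; after it, the loop terminates as soon as the GGUF-specified EOS appears.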