Validate tokenizer and model alignment before training #2074
base: main
Changes from 1 commit
```diff
@@ -468,3 +468,45 @@ def get_moe_model_nparams_and_flops(
     nparams = nparams - nparams_embedding
 
     return nparams, num_flops_per_token
+
+
+def validate_tokenizer_model_alignment(
+    tokenizer: "BaseTokenizer | None",
+    model_args: "BaseModelArgs",
+) -> None:
+    """
+    Validate that tokenizer configuration matches model configuration.
+
+    Args:
+        tokenizer: Tokenizer instance to validate. Can be None.
+        model_args: Model arguments object containing configuration to validate against.
+
+    Raises:
+        ValueError: If tokenizer and model configurations don't match.
+    """
+    if tokenizer is None:
+        return
+
+    # Validate vocab_size
+    if hasattr(model_args, "vocab_size"):
+        tokenizer_vocab_size = tokenizer.get_vocab_size()
+        model_vocab_size = model_args.vocab_size
+        if tokenizer_vocab_size != model_vocab_size:
+            raise ValueError(
+                f"Tokenizer vocab_size ({tokenizer_vocab_size}) does not match "
+                f"model vocab_size ({model_vocab_size}). "
+                f"This mismatch will cause training errors. "
+                f"Please ensure the tokenizer and model configuration are aligned."
+            )
+
+    # Validate eos_id
+    if hasattr(model_args, "eos_id"):
+        tokenizer_eos_id = getattr(tokenizer, "eos_id", None)
+        model_eos_id = model_args.eos_id
+        if tokenizer_eos_id is not None and tokenizer_eos_id != model_eos_id:
+            raise ValueError(
+                f"Tokenizer eos_id ({tokenizer_eos_id}) does not match "
+                f"model eos_id ({model_eos_id}). "
+                f"This mismatch may cause training errors. "
+                f"Please ensure the tokenizer and model configuration are aligned."
+            )
```
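For reference, here is a minimal, hypothetical sketch of how the proposed check could be exercised. `DummyTokenizer` and `DummyModelArgs` are illustrative stand-ins (not the real `BaseTokenizer`/`BaseModelArgs` classes), and `validate_tokenizer_model_alignment` is assumed to be in scope as defined in the diff above.

```python
from dataclasses import dataclass

# Illustrative stand-ins; real code would use torchtitan's BaseTokenizer and
# BaseModelArgs. validate_tokenizer_model_alignment is assumed to be the
# function added in the diff above.


@dataclass
class DummyModelArgs:
    vocab_size: int = 32000
    eos_id: int = 2


class DummyTokenizer:
    eos_id = 2

    def get_vocab_size(self) -> int:
        # Padded tokenizer vocab, larger than the model's embedding table.
        return 32064


try:
    validate_tokenizer_model_alignment(DummyTokenizer(), DummyModelArgs())
except ValueError as e:
    # With the strict-equality check, the mismatch is reported up front
    # instead of surfacing later as an opaque runtime error.
    print(e)
```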
Thanks for the contribution! We also noticed this, but we actually do not require the tokenizer's vocab size to match the model's vocab size. The tokenizer's vocab size defines the input space and is tied to the dataset, while the model's vocab size (the embedding layer size) defines the model's representation space.
In general, it is the user's responsibility to use the correct tokenizer and model embedding layer dimension.
Yes, the mismatch is intentional.
E.g. if you have several phases of training, pretraining -> finetuning, then in pretraining you don't need to use the tokenizer you will use for finetuning, but in modeling you need to create an embedding table large enough that it has the capacity to be trained later with custom finetuning tokenizers.
So the requirement here is model.vocab_size > tokenizer.vocab_size.
Thanks for the review.
You're right, strict equality was definitely an oversight on my part.
(I was biased by my own workflow, where I generate HF configs based on the tokenizer's vocab size, but I realize now that shouldn't be enforced generally here.)
However, as tianyu-l mentioned, I do think it's valuable to verify that model.vocab_size > tokenizer.vocab_size. While the training loop would eventually crash on a mismatch, it typically results in a vague CUDA error, which is hard to debug. A proactive check here would provide a much more informative error message for users.
I've updated the PR to only enforce that the model has sufficient capacity for the tokenizer's vocabulary.
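For illustration only, a minimal sketch of what the relaxed check described above could look like; this is an assumption about the shape of the updated code, not the exact diff in the PR.

```python
def validate_tokenizer_model_alignment(tokenizer, model_args) -> None:
    """Sketch: only require the model's embedding table to be at least as
    large as the tokenizer's vocabulary, rather than exactly equal."""
    if tokenizer is None:
        return

    if hasattr(model_args, "vocab_size"):
        tokenizer_vocab_size = tokenizer.get_vocab_size()
        model_vocab_size = model_args.vocab_size
        if tokenizer_vocab_size > model_vocab_size:
            raise ValueError(
                f"Tokenizer vocab_size ({tokenizer_vocab_size}) exceeds model "
                f"vocab_size ({model_vocab_size}). Token ids would index past "
                f"the embedding table and fail at runtime (typically as a "
                f"vague CUDA error). Increase the model's vocab_size or use a "
                f"smaller tokenizer."
            )
```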