Conversation

@CryptoSalamander

First of all, thank you for creating this amazing project. I discovered a potential improvement regarding tokenizer usage while working with the codebase and have implemented the changes below.

Summary

I have added a validation step to ensure the provided Tokenizer's configuration matches the model_args before training begins.

Motivation & Context

It appears that users can use a Hugging Face-style custom Tokenizer by providing it as an Asset. However, I noticed an issue when using predefined model args (such as Qwen3-8B).

While the training process itself proceeds without any immediate errors, a size mismatch issue occurs later when loading the model via transformers after HF conversion. This happens because the Custom Tokenizer's attributes do not align with the fixed model arguments used during training.

Changes

To prevent this silent failure, I implemented a check that verifies if the vocab_size and eos_id of the Tokenizer provided as an Asset match the current model_args.

  • If there is a mismatch, a ValueError is now raised immediately.
  • This allows users who intend to use a Custom Tokenizer (or those who accidentally input the wrong Tokenizer) to identify configuration issues before starting the training process.
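Concretely, the proposed check looked roughly like the sketch below (the function name validate_tokenizer and the exact error messages are illustrative; the attribute names follow the snippets quoted in the review):

def validate_tokenizer(tokenizer, model_args) -> None:
    # Compare the tokenizer's vocab size against the model's embedding size.
    if hasattr(model_args, "vocab_size"):
        tokenizer_vocab_size = tokenizer.get_vocab_size()
        if tokenizer_vocab_size != model_args.vocab_size:
            raise ValueError(
                f"Tokenizer vocab size ({tokenizer_vocab_size}) does not match "
                f"model_args.vocab_size ({model_args.vocab_size})."
            )
    # Compare the tokenizer's eos_id against the one in model_args.
    if hasattr(model_args, "eos_id"):
        if tokenizer.eos_id != model_args.eos_id:
            raise ValueError(
                f"Tokenizer eos_id ({tokenizer.eos_id}) does not match "
                f"model_args.eos_id ({model_args.eos_id})."
            )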


meta-cla bot commented Nov 21, 2025

Hi @CryptoSalamander!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Nov 21, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label on Nov 21, 2025
if hasattr(model_args, "vocab_size"):
    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size
    if tokenizer_vocab_size != model_vocab_size:
Contributor

Thanks for the contribution! We also noticed this, but we actually don't require the tokenizer's vocab size to be the same as the model's vocab size. The tokenizer's vocab size defines the input space and is tied to the dataset, while the model's vocab size (the embedding layer size) defines the model's representation space.

In general, it's the user's responsibility to use the correct tokenizer and model embedding layer dimension.

Contributor

Yes, the mismatch is intentional.
E.g. if you have several phases of training (pretraining -> finetuning), then during pretraining you don't need the tokenizer you will use for finetuning, but in the model you need to create an embedding table large enough that it has capacity to be trained later with custom finetuning tokenizers.

So the requirement here is model.vocab_size > tokenizer.vocab_size.
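For illustration, a minimal sketch of this point (the sizes below are made up, not taken from this PR):

import torch
import torch.nn as nn

model_vocab_size = 152_064      # illustrative embedding table size from model args
pretrain_vocab_size = 151_000   # illustrative pretraining tokenizer vocab size

emb = nn.Embedding(model_vocab_size, 8)

# Every id produced by the smaller pretraining tokenizer is a valid row index,
# so training works even though the two sizes differ.
ids = torch.tensor([0, 42, pretrain_vocab_size - 1])
print(emb(ids).shape)  # torch.Size([3, 8])

# A later finetuning tokenizer with a larger vocab still fits, as long as its
# vocab size does not exceed model_vocab_size.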

Author

Thanks for the review.

You're right, strict equality was definitely an oversight on my part.
(I was biased by my own workflow, where I generate HF configs based on the tokenizer's vocab size, but I realize now that this shouldn't be enforced generally here.)

However, as tianyu-l mentioned, I do think it's valuable to verify that model.vocab_size > tokenizer.vocab_size.
While the training loop would eventually crash on a mismatch, it typically results in a vague CUDA error, which is hard to debug. A proactive check here would provide a much more informative error message for users.

I've updated the PR to only enforce that model has sufficient capacity for the tokenizer.
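The updated check is roughly the following sketch (the exact error message is illustrative):

if hasattr(model_args, "vocab_size"):
    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size
    # Only require that the embedding table is large enough to index every
    # token id the tokenizer can produce.
    if tokenizer_vocab_size > model_vocab_size:
        raise ValueError(
            f"Tokenizer vocab size ({tokenizer_vocab_size}) exceeds "
            f"model_args.vocab_size ({model_vocab_size}); token ids would index "
            f"past the embedding table."
        )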

)

# Validate eos_id
if hasattr(model_args, "eos_id"):
Contributor

model_args shouldn't have the field eos_id?

Author

I was initially concerned that the default value of eos_id in Qwen3ModelArgs and TransformerModelArgs might differ from the actual tokenizer.eos_id.

However, upon double-checking, it seems tokenizer.eos_id is effectively always used regardless of the model args. Since this validation logic seems redundant, I decided to remove it.

Quick question though: Do you happen to know the purpose of having eos_id in Qwen3ModelArgs and TransformerModelArgs in the first place?

Contributor

I believe it's a mistake. It's never used. Please help remove it.

Author

Thanks for the clarification.
I have removed eos_id from both Qwen3ModelArgs and TransformerModelArgs as requested.
