Conversation

@CryptoSalamander

First of all, thank you for creating this amazing project. I discovered a potential improvement regarding tokenizer usage while working with the codebase and have implemented the changes below.

Summary

I have added a validation step to ensure the provided Tokenizer's configuration matches the model_args before training begins.

Motivation & Context

It appears that users can use a Hugging Face-style custom Tokenizer by providing it as an Asset. However, I noticed an issue when using predefined model args (such as Qwen3-8B).

While the training process itself proceeds without any immediate errors, a size mismatch issue occurs later when loading the model via transformers after HF conversion. This happens because the Custom Tokenizer's attributes do not align with the fixed model arguments used during training.

Changes

To prevent this silent failure, I implemented a check that verifies if the vocab_size and eos_id of the Tokenizer provided as an Asset match the current model_args.

  • If there is a mismatch, a ValueError is now raised immediately.
  • This allows users who intend to use a Custom Tokenizer (or those who accidentally input the wrong Tokenizer) to identify configuration issues before starting the training process.
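Concretely, the proposed check looked roughly like the sketch below (the function name validate_tokenizer and the exact error messages are illustrative; the attribute names follow the snippets quoted in the review):

def validate_tokenizer(tokenizer, model_args) -> None:
    # Compare the tokenizer's vocab size against the model's embedding size.
    if hasattr(model_args, "vocab_size"):
        tokenizer_vocab_size = tokenizer.get_vocab_size()
        if tokenizer_vocab_size != model_args.vocab_size:
            raise ValueError(
                f"Tokenizer vocab size ({tokenizer_vocab_size}) does not match "
                f"model_args.vocab_size ({model_args.vocab_size})."
            )
    # Compare the tokenizer's eos_id against the one in model_args.
    if hasattr(model_args, "eos_id"):
        if tokenizer.eos_id != model_args.eos_id:
            raise ValueError(
                f"Tokenizer eos_id ({tokenizer.eos_id}) does not match "
                f"model_args.eos_id ({model_args.eos_id})."
            )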


meta-cla bot commented Nov 21, 2025

Hi @CryptoSalamander!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!


meta-cla bot commented Nov 21, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

meta-cla bot added the CLA Signed label on Nov 21, 2025
if hasattr(model_args, "vocab_size"):
    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size
    if tokenizer_vocab_size != model_vocab_size:
Contributor

Thanks for the contribution! We also noticed this, but we actually don't require the tokenizer's vocab size to be the same as the model's vocab size. The tokenizer's vocab size defines the input space and is tied to the dataset, while the model's vocab size (the embedding layer size) defines the model's representation space.

In general, it's the user's responsibility to use the correct tokenizer and model embedding layer dimension.

Contributor

Yes, the mismatch is intentional.
E.g. if you have several phases of training (pretraining -> finetuning), then during pretraining you don't need the tokenizer you will use for finetuning, but in the model you need to create an embedding table large enough that it has capacity to be trained later with custom finetuning tokenizers.

So the requirement here is model.vocab_size > tokenizer.vocab_size.
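For illustration, a minimal sketch of this point (the sizes below are made up, not taken from this PR):

import torch
import torch.nn as nn

model_vocab_size = 152_064      # illustrative embedding table size from model args
pretrain_vocab_size = 151_000   # illustrative pretraining tokenizer vocab size

emb = nn.Embedding(model_vocab_size, 8)

# Every id produced by the smaller pretraining tokenizer is a valid row index,
# so training works even though the two sizes differ.
ids = torch.tensor([0, 42, pretrain_vocab_size - 1])
print(emb(ids).shape)  # torch.Size([3, 8])

# A later finetuning tokenizer with a larger vocab still fits, as long as its
# vocab size does not exceed model_vocab_size.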

Author

Thanks for the review.

You're right, strict equality was definitely an oversight on my part.
(I was biased by my own workflow, where I generate HF configs based on the tokenizer's vocab size, but I realize now that this shouldn't be enforced generally here.)

However, as tianyu-l mentioned, I do think it's valuable to verify that model.vocab_size > tokenizer.vocab_size.
While the training loop would eventually crash on a mismatch, it typically results in a vague CUDA error, which is hard to debug. A proactive check here would provide a much more informative error message for users.

I've updated the PR to only enforce that model has sufficient capacity for the tokenizer.
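The updated check is roughly the following sketch (the exact error message is illustrative):

if hasattr(model_args, "vocab_size"):
    tokenizer_vocab_size = tokenizer.get_vocab_size()
    model_vocab_size = model_args.vocab_size
    # Only require that the embedding table is large enough to index every
    # token id the tokenizer can produce.
    if tokenizer_vocab_size > model_vocab_size:
        raise ValueError(
            f"Tokenizer vocab size ({tokenizer_vocab_size}) exceeds "
            f"model_args.vocab_size ({model_vocab_size}); token ids would index "
            f"past the embedding table."
        )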

)

# Validate eos_id
if hasattr(model_args, "eos_id"):
Contributor

model_args shouldn't have the field eos_id?

Author

I was initially concerned that the default value of eos_id in Qwen3ModelArgs and TransformerModelArgs might differ from the actual tokenizer.eos_id.

However, upon double-checking, it seems tokenizer.eos_id is effectively always used regardless of the model args. Since this validation logic seems redundant, I decided to remove it.

Quick question though: Do you happen to know the purpose of having eos_id in Qwen3ModelArgs and TransformerModelArgs in the first place?

Contributor

I believe it's a mistake. It's never used. Please help remove it.

Author

Thanks for the clarification.
I have removed eos_id from both Qwen3ModelArgs and TransformerModelArgs as requested.
