42 changes: 42 additions & 0 deletions torchtitan/models/utils.py
@@ -468,3 +468,45 @@ def get_moe_model_nparams_and_flops(
nparams = nparams - nparams_embedding

return nparams, num_flops_per_token


def validate_tokenizer_model_alignment(
tokenizer: "BaseTokenizer | None",
model_args: "BaseModelArgs",
) -> None:
"""
Validate that tokenizer configuration matches model configuration.

Args:
tokenizer: Tokenizer instance to validate. Can be None.
model_args: Model arguments object containing configuration to validate against.

Raises:
ValueError: If tokenizer and model configurations don't match.
"""
if tokenizer is None:
return

# Validate vocab_size
if hasattr(model_args, "vocab_size"):
tokenizer_vocab_size = tokenizer.get_vocab_size()
model_vocab_size = model_args.vocab_size
if tokenizer_vocab_size != model_vocab_size:
Contributor:

Thanks for the contribution! We also noticed this, and we actually do not require the tokenizer's vocab size to be the same as the model's vocab size. The tokenizer's vocab size defines the input space and is tied to the dataset, while the model's vocab size (the embedding layer size) defines the model's representation space.

In general, it's the user's responsibility to use the correct tokenizer and model embedding layer dimension.

Contributor:

Yes, the mismatch is intentional.
E.g. if you have several phases of training, pretraining -> finetuning, then during pretraining you don't need the tokenizer you use for finetuning, but in modeling you need to create an embedding table large enough that it has capacity to be trained later with custom finetuning tokenizers.

So the requirement here is model.vocab_size > tokenizer.vocab_size.

Author:

Thanks for the review.

You're right, strict equality was definitely an oversight on my part.
(I was biased by my own workflow where I generate HF configs based on the tokenizer's vocab size, but I realize now that shouldn't be enforced generally here.)

However, as tianyu-l mentioned, I do think it's valuable to verify that model.vocab_size > tokenizer.vocab_size.
While the training loop would eventually crash on a mismatch, it typically results in a vague CUDA error, which is hard to debug. A proactive check here would provide a much more informative error message for users.

I've updated the PR to only enforce that the model has sufficient capacity for the tokenizer.

raise ValueError(
f"Tokenizer vocab_size ({tokenizer_vocab_size}) does not match "
f"model vocab_size ({model_vocab_size}). "
f"This mismatch will cause training errors. "
f"Please ensure the tokenizer and model configuration are aligned."
)

# Validate eos_id
if hasattr(model_args, "eos_id"):
Contributor:

model_args shouldn't have the field eos_id?

Author:

I was initially concerned that the default value of eos_id in Qwen3ModelArgs and TransformerModelArgs might differ from the actual tokenizer.eos_id.

However, upon double-checking, it seems tokenizer.eos_id is effectively always used regardless of the model args. Since this validation logic seems redundant, I decided to remove it.

Quick question though: Do you happen to know the purpose of having eos_id in Qwen3ModelArgs and TransformerModelArgs in the first place?

Contributor:

I believe it's a mistake. It's never used. Please help remove it.

Author:

Thanks for the clarification.
I have removed eos_id from both Qwen3ModelArgs and TransformerModelArgs as requested.

tokenizer_eos_id = getattr(tokenizer, "eos_id", None)
model_eos_id = model_args.eos_id
if tokenizer_eos_id is not None and tokenizer_eos_id != model_eos_id:
raise ValueError(
f"Tokenizer eos_id ({tokenizer_eos_id}) does not match "
f"model eos_id ({model_eos_id}). "
f"This mismatch may cause training errors. "
f"Please ensure the tokenizer and model configuration are aligned."
)
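
Following the review discussion above, here is a minimal sketch of how the relaxed check could look, keeping the function name and the tokenizer.get_vocab_size() call from the diff. The exact condition and error message in the updated PR are assumptions, not the final code:

```python
# Hypothetical sketch only -- the updated PR may word this differently.
# The model's embedding table needs at least as many rows as the tokenizer
# has token ids, so the check is ">=" rather than strict equality.
def validate_tokenizer_model_alignment(
    tokenizer: "BaseTokenizer | None",
    model_args: "BaseModelArgs",
) -> None:
    if tokenizer is None:
        return

    if hasattr(model_args, "vocab_size"):
        tokenizer_vocab_size = tokenizer.get_vocab_size()
        model_vocab_size = model_args.vocab_size
        if model_vocab_size < tokenizer_vocab_size:
            raise ValueError(
                f"Model vocab_size ({model_vocab_size}) is smaller than "
                f"tokenizer vocab_size ({tokenizer_vocab_size}). "
                f"Token ids from the tokenizer would index past the embedding "
                f"table and crash during training. Please increase the model's "
                f"vocab_size or use a tokenizer with a smaller vocabulary."
            )
```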
3 changes: 3 additions & 0 deletions torchtitan/train.py
@@ -25,6 +25,7 @@
)
from torchtitan.config import ConfigManager, JobConfig, TORCH_DTYPE_MAP
from torchtitan.distributed import ParallelDims, utils as dist_utils
from torchtitan.models.utils import validate_tokenizer_model_alignment
from torchtitan.protocols.model_converter import build_model_converters
from torchtitan.tools import utils
from torchtitan.tools.logging import init_logger, logger
@@ -134,6 +135,8 @@ def __init__(self, job_config: JobConfig):
model_args.update_from_config(job_config)
self.model_args = model_args

validate_tokenizer_model_alignment(self.tokenizer, model_args)

logger.info(
f"Building {job_config.model.name} {job_config.model.flavor} with {model_args}"
)
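
For reference, this is the failure mode the early check is meant to surface: when a token id produced by the tokenizer is at or beyond the embedding table size, the lookup fails deep inside the training step (an IndexError on CPU, typically a device-side assert on CUDA, i.e. the "vague CUDA error" mentioned in the thread). A small standalone illustration, not part of the PR:

```python
import torch
import torch.nn as nn

# Embedding table sized for a model vocab of 8 tokens.
embedding = nn.Embedding(num_embeddings=8, embedding_dim=4)

# A token id >= num_embeddings (e.g. produced by a tokenizer with a larger
# vocab) is out of range for the table. On CPU this raises an IndexError;
# on CUDA it usually surfaces as a device-side assert during the forward pass.
bad_ids = torch.tensor([3, 7, 8])  # 8 is out of range
try:
    embedding(bad_ids)
except IndexError as e:
    print(f"Out-of-range token id: {e}")
```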