fix: pass revision parameter to tokenizer and processor loaders#3388
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR adds support for configurable model revision resolution to both processor and tokenizer loaders. A revision parameter (defaulting to Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/axolotl/loaders/tokenizer.py (1)
30-45:⚠️ Potential issue | 🟡 MinorDocument the new
revisionparameter.The docstring Args list doesn’t include
revision, so callers won’t know how to use it.✏️ Proposed doc update
@@ - Args: - tokenizer_path: Path or name of the original tokenizer - token_mappings: Dict mapping {token_id (int): new_token_string} - output_dir: Directory to save the modified tokenizer + Args: + tokenizer_path: Path or name of the original tokenizer + token_mappings: Dict mapping {token_id (int): new_token_string} + output_dir: Directory to save the modified tokenizer + revision: Model revision/branch/tag/commit to load from (HF Hub)
🤖 Fix all issues with AI agents
In `@src/axolotl/loaders/processor.py`:
- Around line 51-58: The current code always sets revision="main" and also has
VoxtralProcessor's early return ignoring cfg.revision_of_model; change both to
only pass revision into processor_cls.from_pretrained when cfg.revision_of_model
is truthy (i.e., compute a revision variable or build kwargs for from_pretrained
that include revision only if cfg.revision_of_model is set) and update the
VoxtralProcessor early-return path to honor cfg.revision_of_model the same way
so both tokenizer/model loading and VoxtralProcessor creation consistently use
conditional revision passing; refer to processor_cls.from_pretrained,
cfg.revision_of_model, cfg.processor_config, cfg.trust_remote_code, tokenizer,
and VoxtralProcessor to locate the affected code.
| revision = cfg.revision_of_model or "main" | ||
|
|
||
| processor = processor_cls.from_pretrained( | ||
| cfg.processor_config, | ||
| trust_remote_code=cfg.trust_remote_code or False, | ||
| tokenizer=tokenizer, | ||
| revision=revision, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for revision_of_model in the codebase to understand its default behavior
rg -n -C5 'revision_of_model' --type py | head -100Repository: axolotl-ai-cloud/axolotl
Length of output: 6823
🏁 Script executed:
#!/bin/bash
# Read the processor.py file to understand the full context
head -100 src/axolotl/loaders/processor.py | cat -nRepository: axolotl-ai-cloud/axolotl
Length of output: 3264
🌐 Web query:
What is the default behavior of AutoProcessor.from_pretrained in Hugging Face Transformers when revision parameter is omitted? Does it use the repository's default branch?
💡 Result:
In Transformers, AutoProcessor.from_pretrained(..., revision=...) defaults to revision="main" when you omit the parameter (i.e., it explicitly targets the "main" branch unless you override it). [1]
Under the hood this relies on huggingface_hub downloading logic, whose revision default is also the head of the "main" branch. [2]
So: it does not mean “use the repository’s default branch” in general—it means use "main" by default (which only coincides with the repo’s default branch if the default branch is "main"). [1][2]
Sources: [1] [2]
Use conditional revision passing to match codebase pattern; add revision support to VoxtralProcessor.
The code forces revision="main" unconditionally (line 51), but throughout the codebase (tokenizer.py, model.py, utils.py), the established pattern is to pass revision only when cfg.revision_of_model is set, letting HF Hub's default handle the rest. Additionally, the VoxtralProcessor early return (lines 40–43) completely ignores revision_of_model, creating an inconsistency.
🛠️ Proposed fix
if processor_cls == VoxtralProcessor:
+ kwargs = {}
+ if cfg.revision_of_model:
+ kwargs["revision"] = cfg.revision_of_model
return VoxtralProcessor.from_pretrained(
cfg.processor_config,
+ **kwargs
)
- revision = cfg.revision_of_model or "main"
-
+ processor_kwargs = {
+ "trust_remote_code": cfg.trust_remote_code or False,
+ "tokenizer": tokenizer,
+ }
+ if cfg.revision_of_model:
+ processor_kwargs["revision"] = cfg.revision_of_model
processor = processor_cls.from_pretrained(
cfg.processor_config,
- trust_remote_code=cfg.trust_remote_code or False,
- tokenizer=tokenizer,
- revision=revision,
+ **processor_kwargs,
)🤖 Prompt for AI Agents
In `@src/axolotl/loaders/processor.py` around lines 51 - 58, The current code
always sets revision="main" and also has VoxtralProcessor's early return
ignoring cfg.revision_of_model; change both to only pass revision into
processor_cls.from_pretrained when cfg.revision_of_model is truthy (i.e.,
compute a revision variable or build kwargs for from_pretrained that include
revision only if cfg.revision_of_model is set) and update the VoxtralProcessor
early-return path to honor cfg.revision_of_model the same way so both
tokenizer/model loading and VoxtralProcessor creation consistently use
conditional revision passing; refer to processor_cls.from_pretrained,
cfg.revision_of_model, cfg.processor_config, cfg.trust_remote_code, tokenizer,
and VoxtralProcessor to locate the affected code.
c8f676f to
f9e4362
Compare
| kwargs = {} | ||
| if cfg.revision_of_model: | ||
| kwargs["revision"] = cfg.revision_of_model | ||
| return VoxtralProcessor.from_pretrained( | ||
| cfg.processor_config, | ||
| **kwargs, | ||
| ) |
There was a problem hiding this comment.
We can probably init kwargs earlier to refactor the below as well
There was a problem hiding this comment.
Thanks, I made both updates. Please let me know if there are more changes required.
| revision = cfg.revision_of_model or "main" | ||
| tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config, revision=revision) |
There was a problem hiding this comment.
It may be better to pass in if not None.
f9e4362 to
49e3acf
Compare
|
Can you add the test I made here NanoCode012@220118e I can't push to your branch from local / create a PR for some reason |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Thanks for the commit and the test. I've incorporated both your formatting fix (e68344c) and added the test file from your commit. all 6 tests pass locally |
- Reformat modify_tokenizer_files signature and from_pretrained call - Use kwargs pattern for modify_tokenizer_files call to avoid passing None revision - Add 6 unit tests for revision parameter in tokenizer/processor loaders
1946cc0 to
447b193
Compare
Summary
Fixes the
revision_of_modelconfig parameter to be properly passed to tokenizer and processor loaders, allowing users to load models from specific revisions/branches.Problem
When using
revision_of_modelin the config, the revision was not being passed to:load_tokenizer()- tokenizer loadingload_processor()- processor loading for multimodal modelsmodify_tokenizer_files()- when usingadded_tokens_overridesThis caused the tokenizer/processor to always load from the default branch even when a specific revision was specified.
Changes
src/axolotl/loaders/tokenizer.py:revisiontoAutoTokenizer.from_pretrained()whencfg.revision_of_modelis setrevisiontoHFMistralTokenizer.from_pretrained()for Mistral modelsrevisionparameter tomodify_tokenizer_files()functionsrc/axolotl/loaders/processor.py:revisiontoprocessor_cls.from_pretrained()when loading processorsSummary by CodeRabbit