
fix: pass revision parameter to tokenizer and processor loaders #3388

Merged
NanoCode012 merged 3 commits into axolotl-ai-cloud:main from edgerunner-ai:fix/model-revision-support
Feb 25, 2026

Conversation

@madScientist10 madScientist10 commented Feb 3, 2026

Contributor

Summary

Fixes the revision_of_model config parameter so that it is passed to the tokenizer and processor loaders, allowing users to load tokenizers and processors from a specific revision or branch.

Problem

When using revision_of_model in the config, the revision was not being passed to:

  • load_tokenizer() - tokenizer loading
  • load_processor() - processor loading for multimodal models
  • modify_tokenizer_files() - when using added_tokens_overrides

This caused the tokenizer/processor to always load from the default branch even when a specific revision was specified.
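
A minimal sketch of the failure mode described above, not axolotl's actual code. `FAKE_HUB`, `fake_from_pretrained`, `load_tokenizer_before`, and `load_tokenizer_after` are hypothetical stand-ins, and the revision name `extended-vocab` is made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a Hub repo that has two revisions.
FAKE_HUB = {
    "main": {"vocab_size": 32000},
    "extended-vocab": {"vocab_size": 32064},
}


def fake_from_pretrained(name, revision="main"):
    """Hypothetical stand-in for a `from_pretrained` call."""
    return {"name": name, "revision": revision, **FAKE_HUB[revision]}


def load_tokenizer_before(cfg):
    # Behaviour described in the Problem section: the configured
    # revision is never forwarded, so the loader falls back to "main".
    return fake_from_pretrained(cfg.tokenizer_config)


def load_tokenizer_after(cfg):
    # Fixed behaviour: forward the revision only when one is configured.
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model
    return fake_from_pretrained(cfg.tokenizer_config, **kwargs)


cfg = SimpleNamespace(tokenizer_config="org/model", revision_of_model="extended-vocab")
print(load_tokenizer_before(cfg)["revision"])  # main
print(load_tokenizer_after(cfg)["revision"])   # extended-vocab
```

With the same config, the first loader silently returns the `main` tokenizer while the second returns the requested revision, which is the mismatch this PR addresses.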

Changes

  • src/axolotl/loaders/tokenizer.py:

    • Pass revision to AutoTokenizer.from_pretrained() when cfg.revision_of_model is set
    • Pass revision to HFMistralTokenizer.from_pretrained() for Mistral models
    • Add revision parameter to modify_tokenizer_files() function
  • src/axolotl/loaders/processor.py:

    • Pass revision to processor_cls.from_pretrained() when loading processors

Summary by CodeRabbit

  • New Features
    • Added configurable model revision support for processor and tokenizer components. Users can now specify which model revision is loaded during system initialization. Defaults to "main" for backward compatibility, enabling flexible version management and improved control over model dependencies across the loading pipeline.

coderabbitai Bot commented Feb 3, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The PR adds support for configurable model revision resolution to both processor and tokenizer loaders. A revision parameter (defaulting to cfg.revision_of_model or "main") is threaded through the loader functions, enabling users to specify which model revision gets loaded from Hugging Face Hub.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Model Revision Support: src/axolotl/loaders/processor.py, src/axolotl/loaders/tokenizer.py | Added revision parameter to processor and tokenizer loaders with default fallback to "main". The revision is resolved from cfg.revision_of_model and propagated to AutoProcessor.from_pretrained and AutoTokenizer.from_pretrained calls. Method signature updated for modify_tokenizer_files to include revision parameter with default value. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

ready to merge

Suggested reviewers

  • winglian
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding revision parameter support to tokenizer and processor loaders to enable loading from specific model revisions. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/axolotl/loaders/tokenizer.py (1)

30-45: ⚠️ Potential issue | 🟡 Minor

Document the new revision parameter.

The docstring Args list doesn’t include revision, so callers won’t know how to use it.

✏️ Proposed doc update
@@
-    Args:
-        tokenizer_path: Path or name of the original tokenizer
-        token_mappings: Dict mapping {token_id (int): new_token_string}
-        output_dir: Directory to save the modified tokenizer
+    Args:
+        tokenizer_path: Path or name of the original tokenizer
+        token_mappings: Dict mapping {token_id (int): new_token_string}
+        output_dir: Directory to save the modified tokenizer
+        revision: Model revision/branch/tag/commit to load from (HF Hub)
🤖 Fix all issues with AI agents
In `@src/axolotl/loaders/processor.py`:
- Around line 51-58: The current code always sets revision="main" and also has
VoxtralProcessor's early return ignoring cfg.revision_of_model; change both to
only pass revision into processor_cls.from_pretrained when cfg.revision_of_model
is truthy (i.e., compute a revision variable or build kwargs for from_pretrained
that include revision only if cfg.revision_of_model is set) and update the
VoxtralProcessor early-return path to honor cfg.revision_of_model the same way
so both tokenizer/model loading and VoxtralProcessor creation consistently use
conditional revision passing; refer to processor_cls.from_pretrained,
cfg.revision_of_model, cfg.processor_config, cfg.trust_remote_code, tokenizer,
and VoxtralProcessor to locate the affected code.

Comment thread src/axolotl/loaders/processor.py Outdated
Comment on lines 51 to 58
    revision = cfg.revision_of_model or "main"

    processor = processor_cls.from_pretrained(
        cfg.processor_config,
        trust_remote_code=cfg.trust_remote_code or False,
        tokenizer=tokenizer,
        revision=revision,
    )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for revision_of_model in the codebase to understand its default behavior
rg -n -C5 'revision_of_model' --type py | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 6823


🏁 Script executed:

#!/bin/bash
# Read the processor.py file to understand the full context
head -100 src/axolotl/loaders/processor.py | cat -n

Repository: axolotl-ai-cloud/axolotl

Length of output: 3264


🌐 Web query:

What is the default behavior of AutoProcessor.from_pretrained in Hugging Face Transformers when revision parameter is omitted? Does it use the repository's default branch?

💡 Result:

In Transformers, AutoProcessor.from_pretrained(..., revision=...) defaults to revision="main" when you omit the parameter (i.e., it explicitly targets the "main" branch unless you override it). [1]

Under the hood this relies on huggingface_hub downloading logic, whose revision default is also the head of the "main" branch. [2]

So: it does not mean “use the repository’s default branch” in general—it means use "main" by default (which only coincides with the repo’s default branch if the default branch is "main"). [1][2]

Sources: [1] [2]


Use conditional revision passing to match codebase pattern; add revision support to VoxtralProcessor.

The code forces revision="main" unconditionally (line 51), but throughout the codebase (tokenizer.py, model.py, utils.py), the established pattern is to pass revision only when cfg.revision_of_model is set, letting HF Hub's default handle the rest. Additionally, the VoxtralProcessor early return (lines 40–43) completely ignores revision_of_model, creating an inconsistency.

🛠️ Proposed fix
        if processor_cls == VoxtralProcessor:
+           kwargs = {}
+           if cfg.revision_of_model:
+               kwargs["revision"] = cfg.revision_of_model
            return VoxtralProcessor.from_pretrained(
                cfg.processor_config,
+               **kwargs
            )

-    revision = cfg.revision_of_model or "main"
-
+    processor_kwargs = {
+        "trust_remote_code": cfg.trust_remote_code or False,
+        "tokenizer": tokenizer,
+    }
+    if cfg.revision_of_model:
+        processor_kwargs["revision"] = cfg.revision_of_model
     processor = processor_cls.from_pretrained(
         cfg.processor_config,
-        trust_remote_code=cfg.trust_remote_code or False,
-        tokenizer=tokenizer,
-        revision=revision,
+        **processor_kwargs,
     )
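
To make the difference between the two approaches concrete, here is a small sketch of which keyword arguments would reach `from_pretrained` in each case. The helper names `kwargs_with_fallback` and `kwargs_conditional` and the revision string `v1.2` are hypothetical and exist only for this illustration.

```python
from types import SimpleNamespace


def kwargs_with_fallback(cfg):
    # Pattern from the first version of the PR: an explicit revision is
    # always sent, even when the user configured nothing.
    return {"revision": cfg.revision_of_model or "main"}


def kwargs_conditional(cfg):
    # Suggested pattern: send revision only when the user configured one,
    # and otherwise leave the loader's own default in effect.
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model
    return kwargs


unset = SimpleNamespace(revision_of_model=None)
pinned = SimpleNamespace(revision_of_model="v1.2")

print(kwargs_with_fallback(unset))  # {'revision': 'main'}
print(kwargs_conditional(unset))    # {}
print(kwargs_conditional(pinned))   # {'revision': 'v1.2'}
```

Both variants behave the same for a pinned revision; they differ only in whether an unset config produces an explicit `revision="main"` or no `revision` argument at all.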

@madScientist10 madScientist10 force-pushed the fix/model-revision-support branch 2 times, most recently from c8f676f to f9e4362 Compare February 3, 2026 15:37
Comment thread src/axolotl/loaders/processor.py Outdated
Comment on lines 41 to 47
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model
    return VoxtralProcessor.from_pretrained(
        cfg.processor_config,
        **kwargs,
    )


We can probably initialize kwargs earlier and reuse it to refactor the call below as well.
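
One way this suggestion could look, as a sketch under assumptions rather than the merged code: `FakeVoxtralProcessor` and `FakeGenericProcessor` are hypothetical stand-ins that only record their arguments, and this `load_processor` signature is simplified for illustration.

```python
from types import SimpleNamespace


class _RecordingProcessor:
    """Hypothetical base class: returns the arguments it was loaded with."""

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        return (cls.__name__, name, kwargs)


class FakeVoxtralProcessor(_RecordingProcessor):
    """Hypothetical stand-in for the special-cased processor."""


class FakeGenericProcessor(_RecordingProcessor):
    """Hypothetical stand-in for any other processor class."""


def load_processor(cfg, tokenizer, processor_cls):
    # Build the shared kwargs once, before either branch uses them.
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model

    # Early-return branch reuses the same kwargs.
    if processor_cls is FakeVoxtralProcessor:
        return processor_cls.from_pretrained(cfg.processor_config, **kwargs)

    # General branch adds its own arguments on top of the shared ones.
    return processor_cls.from_pretrained(
        cfg.processor_config,
        trust_remote_code=cfg.trust_remote_code or False,
        tokenizer=tokenizer,
        **kwargs,
    )


cfg = SimpleNamespace(
    processor_config="org/model", trust_remote_code=None, revision_of_model="dev"
)
print(load_processor(cfg, "tok", FakeVoxtralProcessor))
print(load_processor(cfg, "tok", FakeGenericProcessor))
```

Building the dictionary once keeps the revision handling in a single place, so the two code paths cannot drift apart.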

@madScientist10 madScientist10 Feb 5, 2026


Thanks, I made both updates. Please let me know if there are more changes required.

Comment thread src/axolotl/loaders/tokenizer.py Outdated
Comment on lines +138 to +139
revision = cfg.revision_of_model or "main"
tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config, revision=revision)


It may be better to pass revision in only if it is not None.
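
A short sketch of two ways to read this suggestion. The thread does not show which check the merged code uses for the Mistral path, so both helpers below (`revision_kwargs_truthy`, `revision_kwargs_not_none`) are hypothetical illustrations. They differ only for an empty string.

```python
def revision_kwargs_truthy(revision):
    # Truthiness check: skips None and also the empty string.
    return {"revision": revision} if revision else {}


def revision_kwargs_not_none(revision):
    # Identity check: skips only None, so "" would still be forwarded.
    return {"revision": revision} if revision is not None else {}


for value in (None, "", "v1.2"):
    print(repr(value), revision_kwargs_truthy(value), revision_kwargs_not_none(value))
```

Either form avoids passing `revision=None`; the truthiness form is the one that appears in the proposed processor fix earlier in this conversation.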

@madScientist10 madScientist10 force-pushed the fix/model-revision-support branch from f9e4362 to 49e3acf Compare February 5, 2026 13:55
@NanoCode012

Collaborator

Can you add the test I made here: NanoCode012@220118e?

For some reason, I can't push to your branch from local or create a PR.

@codecov

codecov Bot commented Feb 13, 2026


Codecov Report

❌ Patch coverage is 66.66667% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/loaders/tokenizer.py 50.00% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@madScientist10

Contributor Author

Thanks for the commit and the test. I've incorporated both your formatting fix (e68344c) and the test file from your commit. All 6 tests pass locally.

madScientist10 and others added 3 commits February 15, 2026 21:06
- Reformat modify_tokenizer_files signature and from_pretrained call
- Use kwargs pattern for modify_tokenizer_files call to avoid passing None revision
- Add 6 unit tests for revision parameter in tokenizer/processor loaders
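
The six tests themselves are not shown in this conversation. As a rough idea of how such a test can be written, here is a sketch using `unittest.mock`; `load_with_revision` is a hypothetical wrapper that follows the conditional-kwargs pattern and is not a function from axolotl.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import MagicMock


def load_with_revision(loader_cls, cfg):
    """Hypothetical loader wrapper using the conditional-kwargs pattern."""
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model
    return loader_cls.from_pretrained(cfg.tokenizer_config, **kwargs)


class RevisionPassingTest(unittest.TestCase):
    def test_revision_forwarded_when_set(self):
        loader = MagicMock()
        cfg = SimpleNamespace(tokenizer_config="org/model", revision_of_model="dev")
        load_with_revision(loader, cfg)
        loader.from_pretrained.assert_called_once_with("org/model", revision="dev")

    def test_revision_omitted_when_unset(self):
        loader = MagicMock()
        cfg = SimpleNamespace(tokenizer_config="org/model", revision_of_model=None)
        load_with_revision(loader, cfg)
        # No revision keyword at all, rather than revision=None.
        loader.from_pretrained.assert_called_once_with("org/model")


def run_tests():
    # Run the two cases programmatically and return the result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RevisionPassingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Mocking the loader class keeps the test offline: it checks only which arguments are forwarded, not what the Hub returns.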
@madScientist10 madScientist10 force-pushed the fix/model-revision-support branch from 1946cc0 to 447b193 Compare February 15, 2026 21:07
@NanoCode012 NanoCode012 merged commit 8f54b4e into axolotl-ai-cloud:main Feb 25, 2026
18 checks passed
3 participants