fix: pass revision parameter to tokenizer and processor loaders by madScientist10 · Pull Request #3388 · axolotl-ai-cloud/axolotl

madScientist10 · 2026-02-03T14:44:11Z

Summary

Fixes the revision_of_model config parameter to be properly passed to tokenizer and processor loaders, allowing users to load models from specific revisions/branches.

Problem

When using revision_of_model in the config, the revision was not being passed to:

load_tokenizer() - tokenizer loading
load_processor() - processor loading for multimodal models
modify_tokenizer_files() - when using added_tokens_overrides

This caused the tokenizer/processor to always load from the default branch even when a specific revision was specified.

Changes

src/axolotl/loaders/tokenizer.py:
- Pass revision to AutoTokenizer.from_pretrained() when cfg.revision_of_model is set
- Pass revision to HFMistralTokenizer.from_pretrained() for Mistral models
- Add revision parameter to modify_tokenizer_files() function
src/axolotl/loaders/processor.py:
- Pass revision to processor_cls.from_pretrained() when loading processors

Summary by CodeRabbit

New Features
- Added configurable model revision support for processor and tokenizer components. Users can now specify which model revision is loaded during system initialization. Defaults to "main" for backward compatibility, enabling flexible version management and improved control over model dependencies across the loading pipeline.

coderabbitai · 2026-02-03T14:44:30Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.


Walkthrough

The PR adds support for configurable model revision resolution to both processor and tokenizer loaders. A revision parameter (defaulting to cfg.revision_of_model or "main") is threaded through the loader functions, enabling users to specify which model revision gets loaded from Hugging Face Hub.

Changes

Cohort / File(s)	Summary
Model Revision Support `src/axolotl/loaders/processor.py`, `src/axolotl/loaders/tokenizer.py`	Added revision parameter to processor and tokenizer loaders with default fallback to "main". The revision is resolved from `cfg.revision_of_model` and propagated to `AutoProcessor.from_pretrained` and `AutoTokenizer.from_pretrained` calls. Method signature updated for `modify_tokenizer_files` to include `revision` parameter with default value.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

ready to merge

Suggested reviewers

winglian

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding revision parameter support to tokenizer and processor loaders to enable loading from specific model revisions.


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/axolotl/loaders/tokenizer.py (1)

30-45: ⚠️ Potential issue | 🟡 Minor

Document the new revision parameter.

The docstring Args list doesn’t include revision, so callers won’t know how to use it.

✏️ Proposed doc update

@@
-    Args:
-        tokenizer_path: Path or name of the original tokenizer
-        token_mappings: Dict mapping {token_id (int): new_token_string}
-        output_dir: Directory to save the modified tokenizer
+    Args:
+        tokenizer_path: Path or name of the original tokenizer
+        token_mappings: Dict mapping {token_id (int): new_token_string}
+        output_dir: Directory to save the modified tokenizer
+        revision: Model revision/branch/tag/commit to load from (HF Hub)
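One reason to document the parameter is that a default value only applies when the argument is omitted. The following is a minimal sketch, assuming a hypothetical stand-in function `modify_files_sketch` with a `revision` default of "main" (not the real `modify_tokenizer_files`), of why a caller might build keyword arguments instead of forwarding a possibly unset config value:

```python
def modify_files_sketch(tokenizer_path: str, revision: str = "main") -> str:
    """Hypothetical stand-in: reports which revision it would load from."""
    return f"{tokenizer_path}@{revision}"


def call_with_optional_revision(tokenizer_path, revision_of_model=None):
    """Forward revision only when a value is configured (illustrative helper)."""
    kwargs = {}
    if revision_of_model is not None:
        kwargs["revision"] = revision_of_model
    return modify_files_sketch(tokenizer_path, **kwargs)


# Passing None explicitly replaces the default instead of falling back to it.
print(modify_files_sketch("org/model", revision=None))        # org/model@None
# Omitting the argument keeps the function's own default.
print(call_with_optional_revision("org/model"))               # org/model@main
print(call_with_optional_revision("org/model", "my-branch"))  # org/model@my-branch
```

The names `org/model` and `my-branch` are placeholders chosen for the example.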

🤖 Fix all issues with AI agents

In `@src/axolotl/loaders/processor.py`:
- Around line 51-58: The current code always sets revision="main" and also has
VoxtralProcessor's early return ignoring cfg.revision_of_model; change both to
only pass revision into processor_cls.from_pretrained when cfg.revision_of_model
is truthy (i.e., compute a revision variable or build kwargs for from_pretrained
that include revision only if cfg.revision_of_model is set) and update the
VoxtralProcessor early-return path to honor cfg.revision_of_model the same way
so both tokenizer/model loading and VoxtralProcessor creation consistently use
conditional revision passing; refer to processor_cls.from_pretrained,
cfg.revision_of_model, cfg.processor_config, cfg.trust_remote_code, tokenizer,
and VoxtralProcessor to locate the affected code.

coderabbitai · 2026-02-03T14:54:36Z

+    revision = cfg.revision_of_model or "main"
+
    processor = processor_cls.from_pretrained(
        cfg.processor_config,
        trust_remote_code=cfg.trust_remote_code or False,
        tokenizer=tokenizer,
+        revision=revision,
    )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for revision_of_model in the codebase to understand its default behavior
rg -n -C5 'revision_of_model' --type py | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 6823

🏁 Script executed:

#!/bin/bash
# Read the processor.py file to understand the full context
head -100 src/axolotl/loaders/processor.py | cat -n

Repository: axolotl-ai-cloud/axolotl

Length of output: 3264

🌐 Web query:

What is the default behavior of AutoProcessor.from_pretrained in Hugging Face Transformers when revision parameter is omitted? Does it use the repository's default branch?

💡 Result:

In Transformers, AutoProcessor.from_pretrained(..., revision=...) defaults to revision="main" when you omit the parameter (i.e., it explicitly targets the "main" branch unless you override it). [1]

Under the hood this relies on huggingface_hub downloading logic, whose revision default is also the head of the "main" branch. [2]

So: it does not mean “use the repository’s default branch” in general—it means use "main" by default (which only coincides with the repo’s default branch if the default branch is "main"). [1][2]

Sources: [1] [2]

Use conditional revision passing to match codebase pattern; add revision support to VoxtralProcessor.

The code forces revision="main" unconditionally (line 51), but throughout the codebase (tokenizer.py, model.py, utils.py), the established pattern is to pass revision only when cfg.revision_of_model is set, letting HF Hub's default handle the rest. Additionally, the VoxtralProcessor early return (lines 40–43) completely ignores revision_of_model, creating an inconsistency.
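The difference between the two patterns can be sketched without any network access. This is an illustrative example only: `RecordingLoader`, `load_forced`, and `load_conditional` are hypothetical names, and the loader is a stand-in that records the keyword arguments it receives rather than a real Transformers class.

```python
class RecordingLoader:
    """Stand-in for a class exposing from_pretrained; it only records its inputs."""

    calls: list = []

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        cls.calls.append((name, kwargs))
        return cls()


def load_forced(name, revision_of_model=None):
    # Pattern flagged in the review: "main" is sent even when nothing is configured.
    revision = revision_of_model or "main"
    return RecordingLoader.from_pretrained(name, revision=revision)


def load_conditional(name, revision_of_model=None):
    # Pattern the review recommends: forward revision only when it is set.
    kwargs = {}
    if revision_of_model:
        kwargs["revision"] = revision_of_model
    return RecordingLoader.from_pretrained(name, **kwargs)


load_forced("org/processor")
load_conditional("org/processor")
load_conditional("org/processor", "v1.0")

for name, kwargs in RecordingLoader.calls:
    print(name, kwargs)
```

With the conditional form, an unset config leaves the choice of revision to the library being called, which is the behavior the review describes as the established pattern elsewhere in the codebase.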

🛠️ Proposed fix

     if processor_cls == VoxtralProcessor:
+        kwargs = {}
+        if cfg.revision_of_model:
+            kwargs["revision"] = cfg.revision_of_model
         return VoxtralProcessor.from_pretrained(
             cfg.processor_config,
+            **kwargs
         )

-    revision = cfg.revision_of_model or "main"
-
+    processor_kwargs = {
+        "trust_remote_code": cfg.trust_remote_code or False,
+        "tokenizer": tokenizer,
+    }
+    if cfg.revision_of_model:
+        processor_kwargs["revision"] = cfg.revision_of_model
     processor = processor_cls.from_pretrained(
         cfg.processor_config,
-        trust_remote_code=cfg.trust_remote_code or False,
-        tokenizer=tokenizer,
-        revision=revision,
+        **processor_kwargs,
     )


NanoCode012 · 2026-02-05T06:24:47Z

+            kwargs = {}
+            if cfg.revision_of_model:
+                kwargs["revision"] = cfg.revision_of_model
            return VoxtralProcessor.from_pretrained(
                cfg.processor_config,
+                **kwargs,
            )


We can probably initialize kwargs earlier so that the call below can be refactored as well.

madScientist10 · 2026-02-05

Thanks, I made both updates. Please let me know if any more changes are required.

NanoCode012 · 2026-02-05T06:25:42Z

+        revision = cfg.revision_of_model or "main"
+        tokenizer = HFMistralTokenizer.from_pretrained(cfg.tokenizer_config, revision=revision)


It may be better to pass revision only if it is not None.

NanoCode012 · 2026-02-13T18:15:46Z

Can you add the test I made here: NanoCode012@220118e?

For some reason I can't push to your branch from my local checkout or create a PR against it.

codecov · 2026-02-13T18:24:42Z

Codecov Report

❌ Patch coverage is 66.66667% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/loaders/tokenizer.py	50.00%	5 Missing ⚠️

📢 Thoughts on this report? Let us know!

madScientist10 · 2026-02-15T20:44:05Z

Thanks for the commit and the test. I've incorporated your formatting fix (e68344c) and added the test file from your commit. All 6 tests pass locally.
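The added test file is not shown in this thread. As a rough idea of what such tests can look like, here is a minimal sketch using `unittest.mock`; the loader function `load_with_optional_revision` and the config stand-in are hypothetical and are not the tests from the referenced commit.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def load_with_optional_revision(loader_cls, cfg):
    """Hypothetical loader: forwards revision only when cfg.revision_of_model is set."""
    kwargs = {}
    if cfg.revision_of_model:
        kwargs["revision"] = cfg.revision_of_model
    return loader_cls.from_pretrained(cfg.tokenizer_config, **kwargs)


def test_revision_forwarded_when_set():
    loader_cls = MagicMock()
    cfg = SimpleNamespace(tokenizer_config="org/model", revision_of_model="abc123")
    load_with_optional_revision(loader_cls, cfg)
    loader_cls.from_pretrained.assert_called_once_with("org/model", revision="abc123")


def test_revision_omitted_when_unset():
    loader_cls = MagicMock()
    cfg = SimpleNamespace(tokenizer_config="org/model", revision_of_model=None)
    load_with_optional_revision(loader_cls, cfg)
    # No revision keyword should reach from_pretrained when the config leaves it unset.
    loader_cls.from_pretrained.assert_called_once_with("org/model")
    _, kwargs = loader_cls.from_pretrained.call_args
    assert "revision" not in kwargs
```

Mocking the loader class keeps the tests offline: they check only which arguments are forwarded, not what the Hub returns.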


coderabbitai Bot reviewed Feb 3, 2026


madScientist10 force-pushed the fix/model-revision-support branch 2 times, most recently from c8f676f to f9e4362 Compare February 3, 2026 15:37

NanoCode012 reviewed Feb 5, 2026


madScientist10 force-pushed the fix/model-revision-support branch from f9e4362 to 49e3acf Compare February 5, 2026 13:55

NanoCode012 added the under review label Feb 13, 2026

madScientist10 and others added 3 commits February 15, 2026 21:06

fix: pass revision parameter to tokenizer and processor loaders

ae6ebfc

fix: address revision=None passed to .from_pretrained

27ac1c6

add tests and address review feedback for revision parameter

447b193

- Reformat modify_tokenizer_files signature and from_pretrained call
- Use kwargs pattern for modify_tokenizer_files call to avoid passing None revision
- Add 6 unit tests for revision parameter in tokenizer/processor loaders

madScientist10 force-pushed the fix/model-revision-support branch from 1946cc0 to 447b193 Compare February 15, 2026 21:07

NanoCode012 approved these changes Feb 16, 2026


NanoCode012 added ready to merge and removed under review labels Feb 16, 2026

NanoCode012 merged commit 8f54b4e into axolotl-ai-cloud:main Feb 25, 2026
18 checks passed

winglian removed the ready to merge label Mar 22, 2026

coderabbitai Bot mentioned this pull request Apr 19, 2026

feat: add processor_kwargs YAML field forwarded to from_pretrained #3612

Merged

5 tasks
