feat: add processor_kwargs YAML field forwarded to from_pretrained by thad0ctor · Pull Request #3612 · axolotl-ai-cloud/axolotl

thad0ctor · 2026-04-19T20:32:08Z

Description

Adds a new processor_kwargs: dict[str, Any] | None field on ModelInputConfig. Its contents are forwarded as **kwargs to the processor class's from_pretrained() call inside load_processor. This gives users a YAML hook for overriding fields that live in a model's on-disk processor_config.json, without having to maintain a local directory that shadows the real weights just to swap one file, or to edit that file in place.

Scope: applies to any model whose load path goes through load_processor (i.e. cfg.processor_type is set). HF's ProcessorMixin.from_pretrained propagates top-level kwargs to sub-processors (image / video / feature-extractor / tokenizer), so the same field works across image, video, and audio processors with no model-specific wiring.

Changes:

src/axolotl/utils/schemas/model.py: new processor_kwargs field on ModelInputConfig, placed next to processor_type and mirroring the existing model_quantization_config / model_quantization_config_kwargs pairing. Docstring notes (a) the name overlap with transformers' own call-time processor_kwargs argument and (b) that revision / trust_remote_code should stay on the top-level keys to avoid inconsistent precedence across loader branches.
src/axolotl/loaders/processor.py: forwards cfg.processor_kwargs into the shared kwargs dict before the Voxtral and main from_pretrained() calls. The Mistral3Processor path does not call from_pretrained and is intentionally untouched.
tests/test_revision_parameter.py: two new tests covering the forward path and the no-op default.

Example (Gemma-4):

processor_kwargs:
  image_seq_length: 1120
  max_soft_tokens: 1120

Default behavior is preserved when the field is unset (None): the guard in load_processor skips the update and nothing new reaches from_pretrained.
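The forwarding and its no-op default can be sketched roughly as below. This is a minimal illustration, not the code in src/axolotl/loaders/processor.py: `FakeProcessor`, `load_processor_sketch`, and the use of a `base_model` attribute as the model identifier are assumptions made here so the snippet runs without transformers installed.

```python
from types import SimpleNamespace
from typing import Any


class FakeProcessor:
    """Hypothetical stand-in for a transformers processor class."""

    def __init__(self, name: str, **kwargs: Any) -> None:
        self.name = name
        self.kwargs = kwargs

    @classmethod
    def from_pretrained(cls, name: str, **kwargs: Any) -> "FakeProcessor":
        # Records the kwargs it receives so they can be inspected afterwards.
        return cls(name, **kwargs)


def load_processor_sketch(cfg: SimpleNamespace) -> FakeProcessor:
    # Loader-owned kwargs are assembled first (revision shown as one example).
    processor_kwargs: dict[str, Any] = {}
    if getattr(cfg, "revision_of_model", None):
        processor_kwargs["revision"] = cfg.revision_of_model

    # Guard: when the YAML field is unset (None) or empty, nothing is merged.
    if getattr(cfg, "processor_kwargs", None):
        processor_kwargs.update(cfg.processor_kwargs)

    return FakeProcessor.from_pretrained(cfg.base_model, **processor_kwargs)


with_override = load_processor_sketch(
    SimpleNamespace(
        base_model="google/gemma-4-31B-it",
        processor_kwargs={"image_seq_length": 1120, "max_soft_tokens": 1120},
    )
)
default = load_processor_sketch(
    SimpleNamespace(base_model="google/gemma-4-31B-it", processor_kwargs=None)
)
```

In this sketch `with_override.kwargs` carries both overrides, while `default.kwargs` stays empty, which is the behavior the guard is meant to preserve.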

Motivation and Context

There is currently no YAML hook for per-processor-field overrides. Users who want to change defaults like Gemma-4's image_seq_length / max_soft_tokens or Qwen-VL's min_pixels / max_pixels have two options: maintain a local model directory that symlinks every real file except processor_config.json, which they override on disk, or edit the model files directly.

How has this been tested?

Unit tests (mocked AutoProcessor, added in tests/test_revision_parameter.py):
- test_load_processor_forwards_processor_kwargs asserts image_seq_length and max_soft_tokens reach AutoProcessor.from_pretrained call kwargs.
- test_load_processor_omits_processor_kwargs_when_unset asserts no-op default.
- All 8 tests in the file pass (6 pre-existing + 2 new).
Schema round-trip against the real Gemma-4 YAML config: AxolotlInputConfig(**yaml.safe_load(...)) with processor_kwargs populated preserves the value intact through Pydantic validation.
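The shape of the two mocked unit tests might look roughly like the following. It is a sketch under assumptions, not the contents of tests/test_revision_parameter.py: `load_with` is a hypothetical stand-in for load_processor that takes the processor class as an argument, and the `processor_config` key is used here only as an illustrative model identifier.

```python
from unittest.mock import MagicMock


def load_with(cfg: dict, auto_processor) -> object:
    """Hypothetical stand-in for load_processor; not axolotl's implementation."""
    kwargs = {}
    if cfg.get("processor_kwargs"):
        kwargs.update(cfg["processor_kwargs"])
    return auto_processor.from_pretrained(cfg["processor_config"], **kwargs)


def check_forwards_processor_kwargs() -> None:
    auto_processor = MagicMock()
    cfg = {
        "processor_config": "some/model",
        "processor_kwargs": {"image_seq_length": 1120, "max_soft_tokens": 1120},
    }
    load_with(cfg, auto_processor)
    # call_args.kwargs holds the keyword arguments of the most recent call.
    call_kwargs = auto_processor.from_pretrained.call_args.kwargs
    assert call_kwargs["image_seq_length"] == 1120
    assert call_kwargs["max_soft_tokens"] == 1120


def check_omits_processor_kwargs_when_unset() -> None:
    auto_processor = MagicMock()
    load_with({"processor_config": "some/model"}, auto_processor)
    call_kwargs = auto_processor.from_pretrained.call_args.kwargs
    # With the field unset, no override keys should leak into the call.
    assert "image_seq_length" not in call_kwargs
    assert "max_soft_tokens" not in call_kwargs


check_forwards_processor_kwargs()
check_omits_processor_kwargs_when_unset()
```

Mocking the processor class keeps the checks fast and offline, at the cost of not exercising the real from_pretrained kwarg propagation; the end-to-end run covers that part.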

End-to-end image tokenization against google/gemma-4-31B-it: same 896×896 test image fed through two processors loaded via load_processor.

metric	baseline	with override `image_seq_length=1120, max_soft_tokens=1120`	ratio
`image_seq_length` attr	280	1120	4×
`pixel_values` patches	2520	10080	4×
`image_position_ids` entries	2520	10080	4×

The patch-count scaling exactly matches the configured token budget, confirming the override reaches the live image processor's encode path.
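The scaling claim can be re-derived from the table with a few lines of arithmetic. The dictionary keys below are informal labels chosen for this sketch; the numbers are copied from the table.

```python
# Values copied from the results table above.
baseline = {"image_seq_length": 280, "pixel_values_patches": 2520, "image_position_ids": 2520}
override = {"image_seq_length": 1120, "pixel_values_patches": 10080, "image_position_ids": 10080}

# Ratio of override to baseline for every metric.
ratios = {key: override[key] / baseline[key] for key in baseline}

# Patches per configured token, before and after the override.
patches_per_token = {
    "baseline": baseline["pixel_values_patches"] / baseline["image_seq_length"],
    "override": override["pixel_values_patches"] / override["image_seq_length"],
}

print(ratios)             # every ratio is 4.0
print(patches_per_token)  # 9.0 in both configurations
```

Every metric grows by the same factor as the token budget (1120 / 280 = 4), and the patches-per-token figure is unchanged between the two runs, which is consistent with the override simply raising the budget.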

Environment: Python 3.12, torch 2.9 / transformers 5.5.4, tested on CPU (no GPU-specific paths touched).

AI Usage Disclaimer

Yes — Claude (Anthropic) was used to assist with tracing the config-loading plumbing, drafting the schema field and forwarder, and authoring the tests.

Screenshots (if appropriate)

N/A.

Types of changes

New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactor / code cleanup

Social Handles (Optional)

N/A.

Summary by CodeRabbit

Release Notes

New Features
- Added a new configuration option to customize processor initialization, allowing you to override default processor settings at load time.
- Processor loading now properly forwards and applies custom configuration values for enhanced control over processor behavior.

Add a new `processor_kwargs` field on ModelInputConfig, forwarded as kwargs to the processor class's from_pretrained() call in load_processor. This gives users a clean YAML hook for overriding fields that live in a model's on-disk processor_config.json (e.g. image_seq_length / max_soft_tokens for Gemma-4, min_pixels / max_pixels for Qwen-VL, or any kwargs a speech/audio processor like Whisper accepts) without maintaining a local model directory that shadows the real weights solely to swap processor_config.json.

Scope: applies to any model whose load path goes through load_processor (i.e. cfg.processor_type is set). HF propagates top-level kwargs to sub-processors (image_processor, video_processor, feature_extractor, tokenizer) via the standard ProcessorMixin.from_pretrained mechanism, so the same field works across image, video, and audio processors with no model-specific wiring.

Changes:

- src/axolotl/utils/schemas/model.py: new `processor_kwargs: dict[str, Any] | None` field on ModelInputConfig, placed next to `processor_type` and mirroring the existing `model_quantization_config` / `model_quantization_config_kwargs` pairing. Docstring notes (a) the name overlap with transformers' own call-time `processor_kwargs` argument and (b) that `revision` / `trust_remote_code` should stay on the top-level keys to avoid inconsistent precedence across loader branches.
- src/axolotl/loaders/processor.py: forward cfg.processor_kwargs into the shared kwargs dict before the Voxtral and main from_pretrained() calls. The Mistral3Processor path does not call from_pretrained and is intentionally untouched.
- tests/test_revision_parameter.py: two new tests covering the forward path and the no-op default.

Example (Gemma-4):

processor_kwargs:
  image_seq_length: 1120
  max_soft_tokens: 1120

coderabbitai · 2026-04-19T20:32:25Z

Walkthrough

This PR introduces support for processor-specific configuration kwargs through a new processor_kwargs config field. The loader was updated to conditionally forward these kwargs to the processor's from_pretrained method, with appropriate schema documentation and test coverage.

Changes

Cohort / File(s)	Summary
Processor Loading Enhancement `src/axolotl/loaders/processor.py`	Modified `load_processor` to conditionally merge `cfg.processor_kwargs` into processor construction kwargs before calling `from_pretrained`, enabling processor-specific configuration to be forwarded at load time.
Configuration Schema `src/axolotl/utils/schemas/model.py`	Added optional `processor_kwargs: dict[str, Any] | None` field to `ModelInputConfig` with schema documentation specifying its use as load-time kwargs for processor's `from_pretrained()`, with guidance to use top-level `revision_of_model` and `trust_remote_code` instead of nesting them inside `processor_kwargs`.
Test Coverage `tests/test_revision_parameter.py`	Added two unit tests verifying that `cfg["processor_kwargs"]` values are correctly forwarded to `AutoProcessor.from_pretrained` and that absent `processor_kwargs` does not introduce unexpected keys in the call.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

fix: pass revision parameter to tokenizer and processor loaders #3388: Both PRs modify load_processor in similar ways to forward additional config-based kwargs into Processor.from_pretrained.
Save processor in quantizer CLI #3290: The quantize CLI PR imports and calls load_processor, so it will be affected by changes to how processor kwargs are handled.

Suggested reviewers

NanoCode012

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: adding a new `processor_kwargs` YAML field that gets forwarded to the processor's `from_pretrained()` method.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/loaders/processor.py`:
- Around line 24-27: The YAML-provided processor_kwargs must not be allowed to
override reserved loader keys like "revision" and "trust_remote_code"; in
src/axolotl/loaders/processor.py (and similarly in other loaders), before doing
processor_kwargs.update(cfg.processor_kwargs) filter out any keys in a reserved
set (at least "revision" and "trust_remote_code") so that cfg.revision_of_model
and the loader's trust_remote_code setting always take precedence; implement the
guard by creating reserved = {"revision","trust_remote_code"} and merging only
cfg.processor_kwargs keys not in reserved (or explicitly pop/ignore them) and
apply the same pattern to model, tokenizer, and adapter loader code to keep
behavior consistent across branches.

In `@src/axolotl/utils/schemas/model.py`:
- Around line 67-86: The processor_kwargs Field currently only documents that
'revision' and 'trust_remote_code' are forbidden but does not enforce it; add a
Pydantic validator on processor_kwargs (e.g., `@validator`("processor_kwargs") or
a root_validator) in the same Model class in src/axolotl/utils/schemas/model.py
to reject any dict containing the reserved keys 'revision' or
'trust_remote_code' by raising a ValueError with a clear message; ensure the
validator handles None and non-dict inputs gracefully and keeps the existing
json_schema_extra description unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 323da79 and 576ab3a.

📒 Files selected for processing (3)

src/axolotl/loaders/processor.py
src/axolotl/utils/schemas/model.py
tests/test_revision_parameter.py

Address CodeRabbit review feedback by promoting the reserved-key rule from a docstring warning to a hard Pydantic validator. Setting `revision` or `trust_remote_code` inside `processor_kwargs` now raises a ValueError at config parse time instead of causing inconsistent precedence across loader branches. No loader-side guard was added in processor.py: axolotl's real flow always parses cfg through Pydantic, so the schema validator is the authoritative gate and a duplicate loader check would be unreachable.
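The reserved-key rule described here can be sketched as a plain function. In the PR the check lives in a Pydantic validator on the schema; the standalone function below, its name, and the exact error wording are assumptions made for illustration so the example runs with the standard library alone.

```python
from typing import Any, Optional

# Keys that must stay on the top-level config rather than inside processor_kwargs.
RESERVED_PROCESSOR_KWARGS = frozenset({"revision", "trust_remote_code"})


def validate_processor_kwargs(
    processor_kwargs: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical standalone version of the reserved-key check."""
    if processor_kwargs is None:
        # Unset field: nothing to validate.
        return None

    clashes = sorted(processor_kwargs.keys() & RESERVED_PROCESSOR_KWARGS)
    if clashes:
        raise ValueError(
            f"processor_kwargs must not contain {clashes}; use the top-level "
            "`revision_of_model` / `trust_remote_code` config keys instead."
        )
    return processor_kwargs


validate_processor_kwargs({"image_seq_length": 1120})  # accepted unchanged

try:
    validate_processor_kwargs({"revision": "main"})
except ValueError as err:
    print(f"rejected: {err}")
```

Failing at parse time means a conflicting config is caught before any model files are fetched, rather than surfacing later as a precedence difference between loader branches.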

NanoCode012

I think this looks good overall, will let CI run

NanoCode012 · 2026-04-22T07:17:59Z

+        default=None,
+        json_schema_extra={
+            "description": (
+                "Extra kwargs forwarded to the processor's from_pretrained(), overriding "


I don't think we need such long descriptions :)

Thank you, haha noted :) I kept the (very) verbose comments in there to give you a bit more context as part of the review, I shortened this in commit - 96d8536

NanoCode012 · 2026-04-22T07:24:41Z

+                "use the top-level `revision_of_model` / `trust_remote_code` "
+                "config keys instead."
+            )
+        return processor_kwargs


Should also add a check that this is not compatible with cfg.tokenizer_use_mistral_common

Added via 96d8536
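A cross-field check of the kind requested might look like the sketch below. The function name, the dict-based config, and the error message are hypothetical; only the two config keys come from this thread. The rationale is presumably that the mistral-common path does not call from_pretrained, so the kwargs would otherwise be silently ignored, but that link is an inference rather than something stated in the review.

```python
from typing import Any


def check_processor_kwargs_vs_mistral_common(cfg: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical config-level check: reject the incompatible combination."""
    if cfg.get("processor_kwargs") and cfg.get("tokenizer_use_mistral_common"):
        raise ValueError(
            "processor_kwargs cannot be combined with tokenizer_use_mistral_common."
        )
    return cfg


# Either option on its own passes through unchanged.
check_processor_kwargs_vs_mistral_common({"processor_kwargs": {"max_pixels": 1024}})
check_processor_kwargs_vs_mistral_common({"tokenizer_use_mistral_common": True})
```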

codecov · 2026-04-22T13:23:43Z

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/utils/schemas/validation.py	50.00%	1 Missing ⚠️

updated per reviewer feedback

winglian

thanks!

thad0ctor · 2026-04-22T21:18:57Z

thanks!

you're welcome, keep up the good work!

coderabbitai Bot reviewed Apr 19, 2026

Comment thread src/axolotl/loaders/processor.py

Comment thread src/axolotl/utils/schemas/model.py

winglian requested a review from NanoCode012 April 22, 2026 05:17

NanoCode012 reviewed Apr 22, 2026

chore: lint

964a471

reject processor_kwargs with tokenizer_use_mistral_common

96d8536

updated per reviewer feedback

winglian approved these changes Apr 22, 2026

winglian merged commit 1bf65c5 into axolotl-ai-cloud:main Apr 23, 2026
18 checks passed
