
feat: add processor_kwargs YAML field forwarded to from_pretrained #3612

Merged
winglian merged 4 commits into axolotl-ai-cloud:main from thad0ctor:feat/processor-kwargs on Apr 23, 2026

Conversation

@thad0ctor thad0ctor commented Apr 19, 2026

Description

Adds a new processor_kwargs: dict[str, Any] | None field on ModelInputConfig. Its contents are forwarded as **kwargs to the processor class's from_pretrained() call inside load_processor. This gives users a YAML hook for overriding fields that live in a model's on-disk processor_config.json, without having to either maintain a local directory that shadows the real weights just to swap one file or edit the file directly.

Scope: applies to any model whose load path goes through load_processor (i.e. cfg.processor_type is set). HF's ProcessorMixin.from_pretrained propagates top-level kwargs to sub-processors (image / video / feature-extractor / tokenizer), so the same field works across image, video, and audio processors with no model-specific wiring.

Changes:

  • src/axolotl/utils/schemas/model.py: new processor_kwargs field on ModelInputConfig, placed next to processor_type and mirroring the existing model_quantization_config / model_quantization_config_kwargs pairing. Docstring notes (a) the name overlap with transformers' own call-time processor_kwargs argument and (b) that revision / trust_remote_code should stay on the top-level keys to avoid inconsistent precedence across loader branches.
  • src/axolotl/loaders/processor.py: forwards cfg.processor_kwargs into the shared kwargs dict before the Voxtral and main from_pretrained() calls. The Mistral3Processor path does not call from_pretrained and is intentionally untouched.
  • tests/test_revision_parameter.py: two new tests covering the forward path and the no-op default.

Example (Gemma-4):

processor_kwargs:
  image_seq_length: 1120
  max_soft_tokens: 1120

Default behavior is preserved when the field is unset (None): the guard in load_processor skips the update and nothing new reaches from_pretrained.
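The forwarding and the unset-field guard described above can be sketched roughly as follows. This is a minimal stand-in, not the code in src/axolotl/loaders/processor.py: the helper name `build_from_pretrained_kwargs` and the `SimpleNamespace` config objects are hypothetical. The only behavior taken from the description is that cfg.processor_kwargs is merged into the shared kwargs dict when set and skipped when it is None.

```python
from types import SimpleNamespace
from typing import Any


def build_from_pretrained_kwargs(cfg: Any) -> dict[str, Any]:
    """Hypothetical helper: assemble the kwargs passed to from_pretrained()."""
    kwargs: dict[str, Any] = {}

    # Guard: merge only when the YAML field was actually set, so the
    # default (None) leaves the from_pretrained() call unchanged.
    if getattr(cfg, "processor_kwargs", None):
        kwargs.update(cfg.processor_kwargs)
    return kwargs


# Config with the Gemma-4 override from the example above.
with_override = SimpleNamespace(
    processor_kwargs={"image_seq_length": 1120, "max_soft_tokens": 1120}
)
# Config where the field is left unset.
unset = SimpleNamespace(processor_kwargs=None)

override_kwargs = build_from_pretrained_kwargs(with_override)
default_kwargs = build_from_pretrained_kwargs(unset)
```

With the override set, both keys end up in the kwargs dict; with the field unset, the dict stays empty and nothing new reaches from_pretrained().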

Motivation and Context

There is currently no YAML hook for per-processor-field overrides. Users who want to change defaults like Gemma-4's image_seq_length / max_soft_tokens or Qwen-VL's min_pixels / max_pixels have to either maintain a local model directory that symlinks every real file except processor_config.json, which they override on disk, or edit the model files directly.

How has this been tested?

  1. Unit tests (mocked AutoProcessor, added in tests/test_revision_parameter.py):

    • test_load_processor_forwards_processor_kwargs asserts image_seq_length and max_soft_tokens reach AutoProcessor.from_pretrained call kwargs.
    • test_load_processor_omits_processor_kwargs_when_unset asserts no-op default.
    • All 8 tests in the file pass (6 pre-existing + 2 new).
  2. Schema round-trip against the real Gemma-4 YAML config: AxolotlInputConfig(**yaml.safe_load(...)) with processor_kwargs populated preserves the value intact through Pydantic validation.

  3. End-to-end image tokenization against google/gemma-4-31B-it: same 896×896 test image fed through two processors loaded via load_processor.

    | metric | baseline | with override (image_seq_length=1120, max_soft_tokens=1120) | ratio |
    | --- | --- | --- | --- |
    | image_seq_length attr | 280 | 1120 | 4× |
    | pixel_values patches | 2520 | 10080 | 4× |
    | image_position_ids entries | 2520 | 10080 | 4× |

    The patch-count scaling exactly matches the configured token budget, confirming the override reaches the live image processor's encode path.

Environment: Python 3.12, torch 2.9 / transformers 5.5.4, tested on CPU (no GPU-specific paths touched).
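For readers who want to see the shape of the mocked unit tests, here is a rough sketch of the assertion pattern. It is not the test code added in tests/test_revision_parameter.py: `load_processor_sketch`, `forwarded_kwargs`, and the `model_id` config key are hypothetical names, and a bare `MagicMock` stands in for the mocked AutoProcessor.

```python
from unittest.mock import MagicMock


def load_processor_sketch(cfg: dict, processor_cls):
    """Hypothetical stand-in for load_processor; not axolotl's implementation."""
    kwargs = {}
    if cfg.get("processor_kwargs"):
        kwargs.update(cfg["processor_kwargs"])
    return processor_cls.from_pretrained(cfg["model_id"], **kwargs)


def forwarded_kwargs(cfg: dict) -> dict:
    """Run the sketch against a mocked processor class and return the call kwargs."""
    mock_cls = MagicMock()
    load_processor_sketch(cfg, mock_cls)
    mock_cls.from_pretrained.assert_called_once()
    return mock_cls.from_pretrained.call_args.kwargs


base_cfg = {"model_id": "google/gemma-4-31B-it"}  # `model_id` is an assumed key name

# Forward path: the override keys should appear in the call kwargs.
forwarded = forwarded_kwargs(
    {**base_cfg, "processor_kwargs": {"image_seq_length": 1120, "max_soft_tokens": 1120}}
)
# No-op default: nothing extra should be passed.
default = forwarded_kwargs(base_cfg)
```

Inspecting `call_args.kwargs` on the mock is what lets the test check both the forward path and the no-op default without downloading a real processor.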

AI Usage Disclaimer

Yes — Claude (Anthropic) was used to assist with tracing the config-loading plumbing, drafting the schema field and forwarder, and authoring the tests.

Screenshots (if appropriate)

N/A.

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactor / code cleanup

Social Handles (Optional)

N/A.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a new configuration option to customize processor initialization, allowing you to override default processor settings at load time.
    • Processor loading now properly forwards and applies custom configuration values for enhanced control over processor behavior.

  Add a new `processor_kwargs` field on ModelInputConfig, forwarded as kwargs
  to the processor class's from_pretrained() call in load_processor. This
  gives users a clean YAML hook for overriding fields that live in a model's
  on-disk processor_config.json (e.g. image_seq_length / max_soft_tokens for
  Gemma-4, min_pixels / max_pixels for Qwen-VL, or any kwargs a speech/audio
  processor like Whisper accepts) without maintaining a local model directory
  that shadows the real weights solely to swap processor_config.json.

  Scope: applies to any model whose load path goes through load_processor
  (i.e. cfg.processor_type is set). HF propagates top-level kwargs to
  sub-processors (image_processor, video_processor, feature_extractor,
  tokenizer) via the standard ProcessorMixin.from_pretrained mechanism, so
  the same field works across image, video, and audio processors with no
  model-specific wiring.

  Changes:
  - src/axolotl/utils/schemas/model.py: new `processor_kwargs:
    dict[str, Any] | None` field on ModelInputConfig, placed next to
    `processor_type` and mirroring the existing
    `model_quantization_config` / `model_quantization_config_kwargs`
    pairing. Docstring notes (a) the name overlap with transformers'
    own call-time `processor_kwargs` argument and (b) that `revision` /
    `trust_remote_code` should stay on the top-level keys to avoid
    inconsistent precedence across loader branches.
  - src/axolotl/loaders/processor.py: forward cfg.processor_kwargs into
    the shared kwargs dict before the Voxtral and main from_pretrained()
    calls. The Mistral3Processor path does not call from_pretrained and
    is intentionally untouched.
  - tests/test_revision_parameter.py: two new tests covering the forward
    path and the no-op default.

  Example (gemma 4):

    processor_kwargs:
      image_seq_length: 1120
      max_soft_tokens: 1120

coderabbitai Bot commented Apr 19, 2026

Walkthrough

This PR introduces support for processor-specific configuration kwargs through a new processor_kwargs config field. The loader was updated to conditionally forward these kwargs to the processor's from_pretrained method, with appropriate schema documentation and test coverage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Processor Loading Enhancement: src/axolotl/loaders/processor.py | Modified load_processor to conditionally merge cfg.processor_kwargs into processor construction kwargs before calling from_pretrained, enabling processor-specific configuration to be forwarded at load time. |
| Configuration Schema: src/axolotl/utils/schemas/model.py | Added optional processor_kwargs: dict[str, Any] \| None field to ModelInputConfig with schema documentation specifying its use as load-time kwargs for processor's from_pretrained(), with guidance to use top-level revision_of_model and trust_remote_code instead of nesting them inside processor_kwargs. |
| Test Coverage: tests/test_revision_parameter.py | Added two unit tests verifying that cfg["processor_kwargs"] values are correctly forwarded to AutoProcessor.from_pretrained and that absent processor_kwargs does not introduce unexpected keys in the call. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • NanoCode012

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/loaders/processor.py`:
- Around line 24-27: The YAML-provided processor_kwargs must not be allowed to
override reserved loader keys like "revision" and "trust_remote_code"; in
src/axolotl/loaders/processor.py (and similarly in other loaders), before doing
processor_kwargs.update(cfg.processor_kwargs) filter out any keys in a reserved
set (at least "revision" and "trust_remote_code") so that cfg.revision_of_model
and the loader's trust_remote_code setting always take precedence; implement the
guard by creating reserved = {"revision","trust_remote_code"} and merging only
cfg.processor_kwargs keys not in reserved (or explicitly pop/ignore them) and
apply the same pattern to model, tokenizer, and adapter loader code to keep
behavior consistent across branches.

In `@src/axolotl/utils/schemas/model.py`:
- Around line 67-86: The processor_kwargs Field currently only documents that
'revision' and 'trust_remote_code' are forbidden but does not enforce it; add a
Pydantic validator on processor_kwargs (e.g., `@validator`("processor_kwargs") or
a root_validator) in the same Model class in src/axolotl/utils/schemas/model.py
to reject any dict containing the reserved keys 'revision' or
'trust_remote_code' by raising a ValueError with a clear message; ensure the
validator handles None and non-dict inputs gracefully and keeps the existing
json_schema_extra description unchanged.
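
The loader-side filter suggested in the first inline comment could look roughly like the sketch below. It only illustrates the suggestion, under stated assumptions: `merge_processor_kwargs` is a hypothetical helper name, and the reserved set contains just the two keys the comment names.

```python
from typing import Any, Optional

# Keys the loader should keep control of, per the review comment.
RESERVED_KEYS = {"revision", "trust_remote_code"}


def merge_processor_kwargs(
    loader_kwargs: dict[str, Any],
    user_kwargs: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Hypothetical helper: merge user kwargs without touching reserved keys."""
    merged = dict(loader_kwargs)
    for key, value in (user_kwargs or {}).items():
        if key in RESERVED_KEYS:
            # Silently ignore; the top-level config keys take precedence.
            continue
        merged[key] = value
    return merged


merged = merge_processor_kwargs(
    {"revision": "main", "trust_remote_code": False},
    {"revision": "dev", "image_seq_length": 1120},
)
```

Here the user-supplied `revision` is dropped while `image_seq_length` is merged. Silently ignoring a key is one design choice; rejecting it outright at validation time is the stricter alternative raised in the second comment.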

📥 Commits

Reviewing files that changed from the base of the PR and between 323da79 and 576ab3a.

📒 Files selected for processing (3)
  • src/axolotl/loaders/processor.py
  • src/axolotl/utils/schemas/model.py
  • tests/test_revision_parameter.py

  Address CodeRabbit review feedback by promoting the reserved-key rule
  from a docstring warning to a hard Pydantic validator. Setting
  `revision` or `trust_remote_code` inside `processor_kwargs` now raises
  a ValueError at config parse time instead of causing inconsistent
  precedence across loader branches.

  No loader-side guard was added in processor.py: axolotl's real flow
  always parses cfg through Pydantic, so the schema validator is the
  authoritative gate and a duplicate loader check would be unreachable.
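
As a rough illustration of the reserved-key rule, the check can be written as a plain function. This is a sketch only: the real change is a Pydantic validator in the schema, whose exact code is not shown in this thread, and `check_processor_kwargs` is a hypothetical name. The error wording borrows the "use the top-level ... config keys instead" phrasing from the reviewed snippet.

```python
from typing import Any, Optional

RESERVED_PROCESSOR_KEYS = ("revision", "trust_remote_code")


def check_processor_kwargs(
    processor_kwargs: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical plain-function version of the schema-level check."""
    if processor_kwargs is None:
        # Field unset: nothing to validate.
        return None
    conflicts = [key for key in RESERVED_PROCESSOR_KEYS if key in processor_kwargs]
    if conflicts:
        raise ValueError(
            f"processor_kwargs must not set {conflicts}; use the top-level "
            "`revision_of_model` / `trust_remote_code` config keys instead."
        )
    return processor_kwargs


# A non-reserved key passes through unchanged.
accepted = check_processor_kwargs({"image_seq_length": 1120})

# A reserved key is rejected at parse time.
try:
    check_processor_kwargs({"revision": "dev"})
    rejected = False
except ValueError:
    rejected = True
```

Raising at config parse time means a bad config fails before any loader branch runs, which is why a second guard in the loader would be redundant under the flow described above.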
@winglian winglian requested a review from NanoCode012 April 22, 2026 05:17

@NanoCode012 NanoCode012 left a comment

I think this looks good overall, will let CI run

Comment thread src/axolotl/utils/schemas/model.py Outdated
default=None,
json_schema_extra={
"description": (
"Extra kwargs forwarded to the processor's from_pretrained(), overriding "

I don't think we need such long descriptions :)

Thank you, haha noted :) I kept the (very) verbose comments in there to give you a bit more context as part of the review. I shortened this in commit 96d8536.

"use the top-level `revision_of_model` / `trust_remote_code` "
"config keys instead."
)
return processor_kwargs

Should also add a check that this is not compatible with cfg.tokenizer_use_mistral_common

Added via 96d8536
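
A compatibility check of the kind requested here might look like the following. This is a guess at the shape, not the code from 96d8536: `check_mistral_common_compat` is a hypothetical name, the config is modeled as a plain dict, and the error message is invented for illustration. One plausible motivation is that a load path which never calls from_pretrained() would silently ignore the overrides.

```python
from typing import Any


def check_mistral_common_compat(cfg: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical check: reject processor_kwargs combined with mistral-common."""
    if cfg.get("processor_kwargs") and cfg.get("tokenizer_use_mistral_common"):
        raise ValueError(
            "processor_kwargs is not compatible with tokenizer_use_mistral_common"
        )
    return cfg


# Either option alone is accepted.
ok_cfg = check_mistral_common_compat({"processor_kwargs": {"image_seq_length": 1120}})

# Setting both is rejected.
try:
    check_mistral_common_compat(
        {"processor_kwargs": {"image_seq_length": 1120}, "tokenizer_use_mistral_common": True}
    )
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```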

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/utils/schemas/validation.py | 50.00% | 1 Missing ⚠️ |

@winglian winglian left a comment

thanks!

@thad0ctor

thanks!

you're welcome, keep up the good work!

@winglian winglian merged commit 1bf65c5 into axolotl-ai-cloud:main Apr 23, 2026
18 checks passed