feat: add processor_kwargs YAML field forwarded to from_pretrained #3612
Add a new `processor_kwargs` field on ModelInputConfig, forwarded as kwargs
to the processor class's from_pretrained() call in load_processor. This
gives users a clean YAML hook for overriding fields that live in a model's
on-disk processor_config.json (e.g. image_seq_length / max_soft_tokens for
Gemma-4, min_pixels / max_pixels for Qwen-VL, or any kwargs a speech/audio
processor like Whisper accepts) without maintaining a local model directory
that shadows the real weights solely to swap processor_config.json.
Scope: applies to any model whose load path goes through load_processor
(i.e. cfg.processor_type is set). HF propagates top-level kwargs to
sub-processors (image_processor, video_processor, feature_extractor,
tokenizer) via the standard ProcessorMixin.from_pretrained mechanism, so
the same field works across image, video, and audio processors with no
model-specific wiring.
Changes:
- src/axolotl/utils/schemas/model.py: new `processor_kwargs:
  dict[str, Any] | None` field on ModelInputConfig, placed next to
  `processor_type` and mirroring the existing
  `model_quantization_config` / `model_quantization_config_kwargs`
  pairing. Docstring notes (a) the name overlap with transformers'
  own call-time `processor_kwargs` argument and (b) that `revision` /
  `trust_remote_code` should stay on the top-level keys to avoid
  inconsistent precedence across loader branches.
- src/axolotl/loaders/processor.py: forward cfg.processor_kwargs into
  the shared kwargs dict before the Voxtral and main from_pretrained()
  calls. The Mistral3Processor path does not call from_pretrained and
  is intentionally untouched.
- tests/test_revision_parameter.py: two new tests covering the forward
  path and the no-op default.
Example (Gemma-4):

    processor_kwargs:
      image_seq_length: 1120
      max_soft_tokens: 1120
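
The forwarding described in this commit message can be pictured with a small sketch. It is not the code from `src/axolotl/loaders/processor.py`: `StubProcessor`, `load_processor_sketch`, and the `model_name` attribute are hypothetical stand-ins so the example runs without transformers installed. It assumes the loader builds one shared kwargs dict and merges `cfg.processor_kwargs` into it only when the field is set.

```python
from types import SimpleNamespace
from typing import Any


class StubProcessor:
    """Hypothetical stand-in for a processor class; records what it receives."""

    def __init__(self, name: str, kwargs: dict[str, Any]) -> None:
        self.name = name
        self.kwargs = kwargs

    @classmethod
    def from_pretrained(cls, name: str, **kwargs: Any) -> "StubProcessor":
        return cls(name, kwargs)


def load_processor_sketch(cfg: SimpleNamespace, processor_cls: type = StubProcessor):
    # Shared kwargs the loader would pass regardless of user overrides.
    kwargs: dict[str, Any] = {"trust_remote_code": bool(cfg.trust_remote_code)}
    # Assumption: overrides are merged only when the YAML field is set,
    # so the default (None) leaves the call unchanged.
    if cfg.processor_kwargs:
        kwargs.update(cfg.processor_kwargs)
    return processor_cls.from_pretrained(cfg.model_name, **kwargs)


cfg = SimpleNamespace(
    model_name="google/gemma-4-31B-it",
    trust_remote_code=False,
    processor_kwargs={"image_seq_length": 1120, "max_soft_tokens": 1120},
)
processor = load_processor_sketch(cfg)
print(processor.kwargs)
```

With a real processor class, the same keyword arguments would reach its `from_pretrained()` in place of the stub.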
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/axolotl/loaders/processor.py`:
- Around line 24-27: The YAML-provided processor_kwargs must not be allowed to
override reserved loader keys like "revision" and "trust_remote_code"; in
src/axolotl/loaders/processor.py (and similarly in other loaders), before doing
processor_kwargs.update(cfg.processor_kwargs) filter out any keys in a reserved
set (at least "revision" and "trust_remote_code") so that cfg.revision_of_model
and the loader's trust_remote_code setting always take precedence; implement the
guard by creating reserved = {"revision","trust_remote_code"} and merging only
cfg.processor_kwargs keys not in reserved (or explicitly pop/ignore them) and
apply the same pattern to model, tokenizer, and adapter loader code to keep
behavior consistent across branches.
In `@src/axolotl/utils/schemas/model.py`:
- Around line 67-86: The processor_kwargs Field currently only documents that
'revision' and 'trust_remote_code' are forbidden but does not enforce it; add a
Pydantic validator on processor_kwargs (e.g., `@validator`("processor_kwargs") or
a root_validator) in the same Model class in src/axolotl/utils/schemas/model.py
to reject any dict containing the reserved keys 'revision' or
'trust_remote_code' by raising a ValueError with a clear message; ensure the
validator handles None and non-dict inputs gracefully and keeps the existing
json_schema_extra description unchanged.
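
The loader-side guard suggested in the first inline comment could look roughly like the sketch below. `merge_processor_kwargs` and `RESERVED_KEYS` are hypothetical names, not existing axolotl helpers; the only behavior taken from the comment is that `revision` and `trust_remote_code` from the user dict must never override the loader's own values.

```python
from typing import Any, Optional

# Keys the loader owns; user-supplied values for these are ignored.
RESERVED_KEYS = frozenset({"revision", "trust_remote_code"})


def merge_processor_kwargs(
    base: dict[str, Any], user: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Return a copy of base with non-reserved user kwargs merged in."""
    merged = dict(base)
    for key, value in (user or {}).items():
        if key in RESERVED_KEYS:
            # Skip silently here; a real loader might log a warning instead.
            continue
        merged[key] = value
    return merged


merged = merge_processor_kwargs(
    {"revision": "main", "trust_remote_code": False},
    {"revision": "dev", "min_pixels": 256},
)
print(merged)
```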
Address CodeRabbit review feedback by promoting the reserved-key rule from a docstring warning to a hard Pydantic validator. Setting `revision` or `trust_remote_code` inside `processor_kwargs` now raises a ValueError at config parse time instead of causing inconsistent precedence across loader branches. No loader-side guard was added in processor.py: axolotl's real flow always parses cfg through Pydantic, so the schema validator is the authoritative gate and a duplicate loader check would be unreachable.
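
A minimal sketch of the reserved-key check described in this reply, written as a plain function so it runs without Pydantic. The function name, constant, and error wording are illustrative assumptions, not the code that was merged; in a Pydantic v2 schema, logic like this would typically be attached to the field through `field_validator`.

```python
from typing import Any, Optional

RESERVED_PROCESSOR_KWARGS = ("revision", "trust_remote_code")


def check_processor_kwargs(
    value: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Reject reserved keys at config-parse time; pass everything else through."""
    if value is None:
        # Unset field: nothing to validate.
        return None
    if not isinstance(value, dict):
        raise ValueError("processor_kwargs must be a mapping of keyword arguments")
    clashes = sorted(key for key in value if key in RESERVED_PROCESSOR_KWARGS)
    if clashes:
        raise ValueError(
            f"processor_kwargs must not contain {clashes}; use the top-level "
            "`revision_of_model` / `trust_remote_code` config keys instead."
        )
    return value


print(check_processor_kwargs({"image_seq_length": 1120}))
```

Failing at parse time means a bad config is rejected before any loader branch runs, which is the property the reply relies on when it skips the loader-side guard.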
NanoCode012 left a comment:
I think this looks good overall, will let CI run
        default=None,
        json_schema_extra={
            "description": (
                "Extra kwargs forwarded to the processor's from_pretrained(), overriding "
I don't think we need such long descriptions :)
Thank you, haha noted :) I kept the (very) verbose comments in there to give you a bit more context as part of the review. I shortened this in commit 96d8536.
                "use the top-level `revision_of_model` / `trust_remote_code` "
                "config keys instead."
            )
        return processor_kwargs
Should also add a check that this is not compatible with cfg.tokenizer_use_mistral_common
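
One way the requested check could be sketched, assuming the concern is that the mistral-common path never calls `from_pretrained` and would therefore silently drop the overrides. `check_mistral_common_compat` is a hypothetical name and the function operates on a plain dict of config values; only the two field names come from this thread.

```python
from typing import Any


def check_mistral_common_compat(data: dict[str, Any]) -> dict[str, Any]:
    """Reject configs that set processor_kwargs together with mistral-common."""
    if data.get("tokenizer_use_mistral_common") and data.get("processor_kwargs"):
        raise ValueError(
            "processor_kwargs is not compatible with "
            "tokenizer_use_mistral_common: true"
        )
    return data


ok = check_mistral_common_compat(
    {"tokenizer_use_mistral_common": False, "processor_kwargs": {"min_pixels": 256}}
)
print(ok)
```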
updated per reviewer feedback
you're welcome, keep up the good work!
Description
Adds a new `processor_kwargs: dict[str, Any] | None` field on `ModelInputConfig`. Its contents are forwarded as `**kwargs` to the processor class's `from_pretrained()` call inside `load_processor`, giving users a YAML hook for overriding fields that live in a model's on-disk `processor_config.json` without having to maintain a local directory that shadows the real weights just to swap one file, or to edit the file directly.

Scope: applies to any model whose load path goes through `load_processor` (i.e. `cfg.processor_type` is set). HF's `ProcessorMixin.from_pretrained` propagates top-level kwargs to sub-processors (image / video / feature-extractor / tokenizer), so the same field works across image, video, and audio processors with no model-specific wiring.

Changes:
- `src/axolotl/utils/schemas/model.py`: new `processor_kwargs` field on `ModelInputConfig`, placed next to `processor_type` and mirroring the existing `model_quantization_config` / `model_quantization_config_kwargs` pairing. Docstring notes (a) the name overlap with transformers' own call-time `processor_kwargs` argument and (b) that `revision` / `trust_remote_code` should stay on the top-level keys to avoid inconsistent precedence across loader branches.
- `src/axolotl/loaders/processor.py`: forwards `cfg.processor_kwargs` into the shared kwargs dict before the Voxtral and main `from_pretrained()` calls. The `Mistral3Processor` path does not call `from_pretrained` and is intentionally untouched.
- `tests/test_revision_parameter.py`: two new tests covering the forward path and the no-op default.

Example (Gemma-4):

    processor_kwargs:
      image_seq_length: 1120
      max_soft_tokens: 1120

Default behavior is preserved when the field is unset (`None`): the guard in `load_processor` skips the update and nothing new reaches `from_pretrained`.

Motivation and Context
There is currently no YAML hook for per-processor-field overrides. Users who want to change defaults like Gemma-4's `image_seq_length` / `max_soft_tokens` or Qwen-VL's `min_pixels` / `max_pixels` have to either maintain a local model directory that symlinks every real file except `processor_config.json`, which they override on disk, or edit the model files directly.

How has this been tested?
Unit tests (mocked `AutoProcessor`, added in `tests/test_revision_parameter.py`):
- `test_load_processor_forwards_processor_kwargs` asserts that `image_seq_length` and `max_soft_tokens` reach the `AutoProcessor.from_pretrained` call kwargs.
- `test_load_processor_omits_processor_kwargs_when_unset` asserts the no-op default.

Schema round-trip against the real Gemma-4 YAML config: `AxolotlInputConfig(**yaml.safe_load(...))` with `processor_kwargs` populated preserves the value intact through Pydantic validation.

End-to-end image tokenization against `google/gemma-4-31B-it`: the same 896×896 test image was fed through two processors loaded via `load_processor`, one of them with `image_seq_length=1120, max_soft_tokens=1120`. Three quantities were compared: the `image_seq_length` attribute, the number of `pixel_values` patches, and the number of `image_position_ids` entries. The patch-count scaling exactly matches the configured token budget, confirming the override reaches the live image processor's encode path.

Environment: Python 3.12, torch 2.9 / transformers 5.5.4, tested on CPU (no GPU-specific paths touched).
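
The two mocked unit tests can be approximated as follows. This is a sketch rather than the contents of `tests/test_revision_parameter.py`: `forward_processor_kwargs` is a hypothetical stand-in for the loader, and a `MagicMock` plays the role of `AutoProcessor` so the example runs with the standard library alone.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def forward_processor_kwargs(cfg, auto_processor):
    """Hypothetical stand-in for the loader: merge overrides, then load."""
    kwargs = {}
    if cfg.processor_kwargs:
        kwargs.update(cfg.processor_kwargs)
    return auto_processor.from_pretrained(cfg.model_name, **kwargs)


def test_forwards_processor_kwargs():
    auto_processor = MagicMock()
    cfg = SimpleNamespace(
        model_name="some/model",
        processor_kwargs={"image_seq_length": 1120, "max_soft_tokens": 1120},
    )
    forward_processor_kwargs(cfg, auto_processor)
    # call_args unpacks into (positional args, keyword args).
    _, call_kwargs = auto_processor.from_pretrained.call_args
    assert call_kwargs["image_seq_length"] == 1120
    assert call_kwargs["max_soft_tokens"] == 1120


def test_omits_processor_kwargs_when_unset():
    auto_processor = MagicMock()
    cfg = SimpleNamespace(model_name="some/model", processor_kwargs=None)
    forward_processor_kwargs(cfg, auto_processor)
    # No extra keyword arguments should reach from_pretrained.
    auto_processor.from_pretrained.assert_called_once_with("some/model")


test_forwards_processor_kwargs()
test_omits_processor_kwargs_when_unset()
```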
AI Usage Disclaimer
Yes — Claude (Anthropic) was used to assist with tracing the config-loading plumbing, drafting the schema field and forwarder, and authoring the tests.
Screenshots (if appropriate)
N/A.
Social Handles (Optional)
N/A.