feat support with autoprocessor by ved1beta · Pull Request #3656 · axolotl-ai-cloud/axolotl

ved1beta · 2026-05-15T07:18:53Z

Description

Keep the processor dict intact and augment it in place instead of
splitting into input_ids + extras and reassembling. Replace
_unwrap_build_prompt with a one-line _extract_input_ids helper for
the find_turn call sites. Behavior unchanged.

Motivation and Context

#3655

How has this been tested?

Training runs with AutoProcessor.

AI Usage Disclaimer

Claude helped diagnose.

Summary by CodeRabbit

Improvements
- Enhanced tokenization to better handle auxiliary processor outputs
- Improved compatibility with processor-generated fields (such as image-related data)
- Increased robustness of multi-turn conversation boundary detection

coderabbitai · 2026-05-15T07:19:06Z

📝 Walkthrough

This PR adds a _unwrap_build_prompt normalization helper to the chat template prompt strategy, enabling consistent handling of build_prompt outputs regardless of whether they return raw token lists or processor-wrapped dicts. The helper and its call sites are updated in tokenization and turn boundary detection to extract and preserve processor auxiliary fields like attention_mask and image-related outputs.

Changes

Processor Output Normalization

| Layer / File(s) | Summary |
| --- | --- |
| Normalization helper for build_prompt outputs (`src/axolotl/prompt_strategies/chat_template.py`) | Introduced `_unwrap_build_prompt` to extract `input_ids` and auxiliary fields from `build_prompt` return values, normalizing raw token lists and processor dicts into a consistent `(input_ids, extras)` interface. |
| Tokenization integration (`src/axolotl/prompt_strategies/chat_template.py`) | Updated `_tokenize_single_prompt` to call `build_prompt` via the normalization helper, capturing processor extras alongside token IDs. Enhanced tokenization output construction to derive `attention_mask` from processor extras and preserve additional processor fields without overwriting `input_ids`/`labels`. |
| Turn boundary detection (`src/axolotl/prompt_strategies/chat_template.py`) | Updated `find_turn` to generate token IDs using `_unwrap_build_prompt`, ensuring content-boundary comparisons work correctly for both raw token lists and processor dict outputs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'feat support with autoprocessor' is vague and doesn't clearly describe the main change, which is fixing multimodal training handling when a processor is configured by properly unwrapping and preserving processor outputs. | Revise the title to be more specific and descriptive of the fix, such as 'Fix multimodal training with autoprocessor by preserving processor outputs' or similar, to clearly communicate the primary change. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

codecov · 2026-05-15T07:30:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

support with autoprocessor

e61521c

simple add dict

cb96718

ved1beta requested a review from NanoCode012 May 19, 2026 11:06

NanoCode012 approved these changes May 19, 2026

NanoCode012 added the ready to merge label May 19, 2026

NanoCode012 merged commit d198094 into axolotl-ai-cloud:main May 22, 2026
14 of 15 checks passed

NanoCode012 merged 2 commits into axolotl-ai-cloud:main from ved1beta:gemma-autoprocessor

2 participants
