feat support with autoprocessor#3656

Merged
NanoCode012 merged 2 commits into
axolotl-ai-cloud:mainfrom
ved1beta:gemma-autoprocessor
May 22, 2026
ved1beta commented May 15, 2026

Member

Description

Keep the processor dict intact and augment it in place instead of
splitting into input_ids + extras and reassembling. Replace
_unwrap_build_prompt with a one-line _extract_input_ids helper for
the find_turn call sites. Behavior unchanged.

Motivation and Context

#3655

How has this been tested?

Training runs with AutoProcessor configured.

AI Usage Disclaimer

Claude helped diagnose.

Summary by CodeRabbit

  • Improvements
    • Enhanced tokenization to better handle auxiliary processor outputs
    • Improved compatibility with processor-generated fields (such as image-related data)
    • Increased robustness of multi-turn conversation boundary detection

coderabbitai Bot commented May 15, 2026

Contributor
📝 Walkthrough

This PR adds a _unwrap_build_prompt normalization helper to the chat template prompt strategy, enabling consistent handling of build_prompt outputs regardless of whether they return raw token lists or processor-wrapped dicts. The helper and its call sites are updated in tokenization and turn boundary detection to extract and preserve processor auxiliary fields like attention_mask and image-related outputs.

Changes

Processor Output Normalization

| Layer / File(s) | Summary |
|---|---|
| Normalization helper for build_prompt outputs (src/axolotl/prompt_strategies/chat_template.py) | Introduced _unwrap_build_prompt to extract input_ids and auxiliary fields from build_prompt return values, normalizing raw token lists and processor dicts into a consistent (input_ids, extras) interface. |
| Tokenization integration (src/axolotl/prompt_strategies/chat_template.py) | Updated _tokenize_single_prompt to call build_prompt via the normalization helper, capturing processor extras alongside token IDs. Enhanced tokenization output construction to derive attention_mask from processor extras and preserve additional processor fields without overwriting input_ids/labels. |
| Turn boundary detection (src/axolotl/prompt_strategies/chat_template.py) | Updated find_turn to generate token IDs using _unwrap_build_prompt, ensuring content-boundary comparisons work correctly for both raw token lists and processor dict outputs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'feat support with autoprocessor' is vague and doesn't clearly describe the main change, which is fixing multimodal training handling when a processor is configured by properly unwrapping and preserving processor outputs. | Revise the title to be more specific and descriptive of the fix, such as 'Fix multimodal training with autoprocessor by preserving processor outputs' or similar, to clearly communicate the primary change. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@ved1beta ved1beta requested a review from NanoCode012 May 19, 2026 11:06
@NanoCode012 NanoCode012 merged commit d198094 into axolotl-ai-cloud:main May 22, 2026
14 of 15 checks passed

2 participants