Add Granite 4.1 Vision (granite4_vision) by tarekziade · Pull Request #16 · tarekziade/tarekziade-transformers-reviewer-test

tarekziade · 2026-05-07T13:17:27Z

What does this PR do?

Adds built-in support for Granite 4.1 Vision (granite4_vision), IBM's multimodal vision-language model for enterprise document understanding.

Architecture highlights

- Vision encoder: SigLIP2 (google/siglip2-so400m-patch16-384), tiled 384×384 patches
- Window Q-Former projector: 4×4 patch windows compressed to 2×2 query tokens via cross-attention (downsample_rate="4/8")
- DeepStack feature injection: 8 vision-to-LLM injection points across two mechanisms:
  - LayerDeepstack: features from 4 vision encoder depths injected at 4 LLM layers (reversed order: deepest vision → earliest LLM)
  - SpatialDeepstack: deepest features split into 4 spatial offset groups (TL/TR/BL/BR), injected at 4 later LLM layers
- Language model: GraniteForCausalLM (3.5B) with a rank-256 LoRA adapter (same-repo, LM-only)
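
To make the projector and DeepStack bullets concrete, the sketch below works through the token arithmetic for one tile and the reversed layer pairing. It assumes a patch size of 16 (read off the checkpoint name `siglip2-so400m-patch16-384`), non-overlapping 4×4 windows, and made-up layer indices. The function names are illustrative and are not taken from `modular_granite4_vision.py`.

```python
def tokens_per_tile(tile_size: int = 384, patch_size: int = 16,
                    window: int = 4, queries_per_side: int = 2) -> tuple[int, int]:
    """Return (patches_in, tokens_out) for one square tile."""
    patches_per_side = tile_size // patch_size
    if patches_per_side % window != 0:
        raise ValueError("window must evenly divide the patch grid")
    windows_per_side = patches_per_side // window
    patches_in = patches_per_side ** 2
    # Each window is summarised by queries_per_side x queries_per_side tokens.
    tokens_out = (windows_per_side * queries_per_side) ** 2
    return patches_in, tokens_out


def pair_layer_deepstack(vision_layers: list[int],
                         llm_layers: list[int]) -> list[tuple[int, int]]:
    """Pair the deepest vision layer with the earliest LLM layer, and so on."""
    if len(vision_layers) != len(llm_layers):
        raise ValueError("need one LLM layer per vision layer")
    return list(zip(sorted(vision_layers, reverse=True), sorted(llm_layers)))


print(tokens_per_tile())  # (576, 144)
# Hypothetical layer indices, chosen only for illustration.
print(pair_layer_deepstack([6, 12, 18, 24], [0, 1, 2, 3]))
```

Under these assumptions each 4×4 window of 16 patches becomes 4 query tokens, so a tile drops from 576 patch embeddings to 144 tokens.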

Files added

| File | Purpose |
| --- | --- |
| `modular_granite4_vision.py` | Source of truth: inherits from LLaVA-Next, overrides novel components |
| `configuration_granite4_vision.py` | Config (generated) |
| `modeling_granite4_vision.py` | Model (generated) |
| `processing_granite4_vision.py` | Unified processor (generated) |
| `image_processing_granite4_vision.py` | Torchvision-based image processor |
| `image_processing_pil_granite4_vision.py` | PIL/NumPy image processor |
| `tests/models/granite4_vision/` | Modeling, image processing, and processor tests |
| `docs/source/en/model_doc/granite4_vision.md` | Model documentation |

Auto-registration

Config: auto-generated via configuration_granite4_vision.py model_type
Modeling: MODEL_MAPPING_NAMES + MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
Processing + image processing: registered in respective auto files

Tests

Unit tests pass locally (`pytest tests/models/granite4_vision/ -x -q`)
`@slow` integration tests load the real checkpoint and assert outputs within tolerance
`make style` and `make check-repo` pass, apart from 3 remaining failures that are pre-existing upstream issues (mlinter version mismatch and incomplete Sam3Lite model)

Before submitting

This PR is not a duplicate
I have read the contributor guidelines
The documentation reflects the changes
The tests pass

The PR changes test expectations and image loading in the Granite4Vision model tests. Two issues stand out: (1) The switch from load_image to Image.open(url_to_local_path(...)) is inconsistent with the rest of the test suite, which uses load_image for images loaded via url_to_local_path. load_image also handles EXIF transposition and RGB conversion, so this change may cause subtle image preprocessing differences. (2) The removal of Expectations in favor of hardcoded strings and loose assertIn checks weakens the integration test — the rest of the VLM tests use Expectations to handle device-specific output variance. The PR description does not mention either change, so they appear as unrelated or regressive modifications bundled with the model addition.

9 LLM turns · 16 tool calls · 54.7s · 93594 in / 3822 out tokens

github-actions · 2026-05-07T13:25:05Z

        self.processor = AutoProcessor.from_pretrained(self.model_id)
        url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-        self.image = load_image(url_to_local_path(url))
+        self.image = Image.open(url_to_local_path(url))


Switching from load_image to Image.open(url_to_local_path(...)) diverges from the convention used across the rest of the test suite (e.g., tests/models/bridgetower, tests/models/eomt, tests/models/llava). load_image additionally performs EXIF transposition and forces RGB conversion, so this change may introduce subtle preprocessing differences. Prefer load_image(url_to_local_path(url)) unless there is a specific reason to bypass those transforms.
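
For reference, the two transforms mentioned above can be reproduced with Pillow alone. This is a minimal sketch of the behaviour the comment describes, not the `load_image` source. `open_normalized` is a hypothetical helper name, and Pillow is assumed to be installed.

```python
from PIL import Image, ImageOps


def open_normalized(source):
    """Open an image, apply its EXIF orientation, and force RGB mode.

    `source` may be a path or a binary file object, as accepted by Image.open.
    """
    with Image.open(source) as img:
        # exif_transpose returns a copy, rotated/flipped if an orientation tag is set.
        upright = ImageOps.exif_transpose(img)
        # convert() yields a new image, so the result stays valid after the file closes.
        return upright.convert("RGB")
```

A bare `Image.open(...)` skips both steps, so a grayscale, palette, or EXIF-rotated file would reach the processor in a different mode or orientation than it would through the normalized path.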

github-actions · 2026-05-07T13:25:05Z

-            ("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a",
-        }).get_expectation()  # fmt: skip
-
+        EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a"  # fmt: skip


Replacing the Expectations device-aware assertion with a single hardcoded string removes coverage for non-CUDA hardware. Sibling VLM tests (e.g., llava, aria, qwen2_vl) use Expectations so that integration tests pass on XPU, ROCm, etc. without manual adjustment. If the model only targets CUDA, document that; otherwise restore Expectations with entries for the supported platforms.
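
The idea behind a device-aware expectation table can be sketched in plain Python. This is an illustrative stand-in, not the `Expectations` implementation from the test utilities: `pick_expectation` and its fallback order are assumptions, the strings are shortened from the diff, and the `(None, None)` key for the "pink couch" variant is a guess because the original key is not visible in the hunk.

```python
from typing import Dict, Optional, Tuple, Union

Capability = Union[int, Tuple[int, int], None]
Key = Tuple[Optional[str], Capability]


def pick_expectation(table: Dict[Key, str], device: str,
                     capability: Tuple[int, int]) -> str:
    """Return the most specific entry for this device, falling back to a default."""
    major = capability[0]
    candidates = [(device, capability), (device, major), (device, None), (None, None)]
    for key in candidates:
        if key in table:
            return table[key]
    raise KeyError(f"no expectation for {device} {capability}")


EXPECTED = {
    # Variant removed by the diff (string shortened).
    ("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket.",
    # Assumed default key; the string is the one now hardcoded (shortened).
    (None, None): "The image depicts two cats resting on a pink couch.",
}

print(pick_expectation(EXPECTED, "cuda", (8, 6)))
print(pick_expectation(EXPECTED, "xpu", (1, 0)))
```

With a single hardcoded string there is no lookup step at all, so any device that produces the other variant fails the comparison.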

github-actions · 2026-05-07T13:25:05Z

-
-        self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0])
-        self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1])
+        self.assertIn("cat", responses[0].lower())


The batch integration test was weakened from exact string matching (via Expectations) to loose assertIn checks. This makes the test less useful as a regression guard — it will pass even if the model starts producing gibberish that happens to contain the words "cat" or "tennis". Integration tests for generative models in this repo typically assert exact decoded outputs. Restore exact expectations, or at least explain in the PR why exact matching is impossible here.
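
The gap between the two assertion styles is easy to demonstrate with `unittest` alone. The strings below are invented for illustration and are not model outputs.

```python
import unittest

EXPECTED = "the image depicts two cats resting on a pink couch."
GIBBERISH = "cat cat cat ### tennis ???"  # degenerate output that still mentions "cat"


class LooseVersusExact(unittest.TestCase):
    def test_loose_check_accepts_gibberish(self):
        # The substring check cannot tell a degraded output from a good one.
        self.assertIn("cat", GIBBERISH)

    def test_exact_check_rejects_gibberish(self):
        # Exact matching fails, which is what a regression guard should do.
        with self.assertRaises(AssertionError):
            self.assertEqual(GIBBERISH, EXPECTED)


def run_demo() -> unittest.TestResult:
    """Run both cases and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LooseVersusExact)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_demo()` runs both cases; both pass, which shows that the substring check lets the degenerate string through while the exact comparison rejects it.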

sergereview

🤗 Serge says:

This PR adds Granite 4.1 Vision support with a proper modular file (modular_granite4_vision.py) and generated artifacts. The modular structure is correct: the source file inherits from LLaVA-Next and overrides novel components (Window Q-Former projector, deepstack injection). Configuration, modeling, and processor auto-registration are wired correctly, and a docs entry exists.

However, the diff only touches integration tests and shows two concerning patterns:

Test robustness regression: Expectations (used across Transformers to handle different CUDA architecture outputs) is removed in favor of a single hardcoded string. This will cause failures on GPUs that produce the alternative output (e.g., "pink blanket" vs "pink couch").
Assertion weakening: The batch integration test drops exact output matching and replaces it with loose assertIn("cat") / assertIn("tennis") checks. If exact determinism cannot be guaranteed, Expectations should be retained rather than removing precise verification entirely.

Additionally, the PR description claims custom image_processing_granite4_vision.py / image_processing_pil_granite4_vision.py files were added, but they do not exist in the repo; instead the model reuses LlavaNextImageProcessor via auto-registration. This is a mismatch between the description and the actual content, though reusing the LLaVA-Next processor is architecturally reasonable.

Overall the model code looks structurally sound, but the integration test changes should be reverted to maintain the repo's cross-hardware testing standards.

16 LLM turns · 31 tool calls · 249.5s · 478974 in / 5277 out tokens

sergereview · 2026-05-11T14:13:53Z

-            ("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a",
-        }).get_expectation()  # fmt: skip
-
+        EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a"  # fmt: skip


Removing Expectations breaks the established pattern for handling hardware-dependent outputs across different CUDA architectures. The original code explicitly handled two variants ("pink couch" vs "pink blanket"). With a single hardcoded string, this test will fail on GPUs that produce the alternative output. Please restore Expectations unless there is evidence that output is now identical across all supported CUDA compute capabilities.

sergereview · 2026-05-11T14:13:53Z

-
-        self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0])
-        self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1])
+        self.assertIn("cat", responses[0].lower())


Replacing exact output matching with loose assertIn substring checks significantly weakens this integration test. The original assertions verified the full generated response. If exact determinism cannot be guaranteed across architectures, use Expectations (as was done before) rather than dropping precise verification entirely.

tarekziade added 10 commits May 7, 2026 12:23

Add AI review workflow and configuration

a7aac3d

Adds the ai-review GitHub Actions workflow plus the .ai/ context script and review rules used by the local AI review bot setup.

Remove all GitHub workflows except ai-review

fa328aa

Keeps only the ai-review workflow used by this fork's review bot setup; upstream CI/release/benchmark workflows are not needed here.

ai-review: shallow checkout PR head for browse tools

76004a3

The reviewer now exposes read_file/list_dir/grep tools rooted at GITHUB_WORKSPACE, so the workflow checks out the PR head (shallow) instead of the default branch.

otel proto

f41ee42

split out transformers-ci

96f4770

wire OLTP env

86802bd

added test_torch with otel

7563963

safe workspace

8f564e4

seeding a few tests on main

b216083

Add Granite 4.1 Vision (granite4_vision)

1f58e5f

tarekziade added the ai-review-demo label May 7, 2026

github-actions Bot reviewed May 7, 2026

tarekziade removed the ai-review-demo label May 11, 2026

sergereview Bot reviewed May 11, 2026

tarekziade force-pushed the main branch from de8fa3e to 99cf592 on May 11, 2026 14:44

tarekziade wants to merge 10 commits into main from pr-45597-clone
