Skip to content

Add Granite 4.1 Vision (granite4_vision)#16

Open
tarekziade wants to merge 10 commits into
mainfrom
pr-45597-clone
Open

Add Granite 4.1 Vision (granite4_vision)#16
tarekziade wants to merge 10 commits into
mainfrom
pr-45597-clone

Conversation

@tarekziade

Copy link
Copy Markdown
Owner

What does this PR do?

Adds built-in support for Granite 4.1 Vision (granite4_vision), IBM's multimodal vision-language model for enterprise document understanding.

Architecture highlights

  • Vision encoder: SigLIP2 (google/siglip2-so400m-patch16-384), tiled 384×384 patches
  • Window Q-Former projector: 4×4 patch windows compressed to 2×2 query tokens via cross-attention (downsample_rate="4/8")
  • DeepStack feature injection: 8 vision-to-LLM injection points across two mechanisms:
    • LayerDeepstack: features from 4 vision encoder depths injected at 4 LLM layers (reversed order — deepest vision → earliest LLM)
    • SpatialDeepstack: deepest features split into 4 spatial offset groups (TL/TR/BL/BR), injected at 4 later LLM layers
  • Language model: GraniteForCausalLM (3.5B) with a rank-256 LoRA adapter (same-repo, LM-only)

Files added

File Purpose
modular_granite4_vision.py Source of truth — inherits from LLaVA-Next, overrides novel components
configuration_granite4_vision.py Config (generated)
modeling_granite4_vision.py Model (generated)
processing_granite4_vision.py Unified processor (generated)
image_processing_granite4_vision.py Torchvision-based image processor
image_processing_pil_granite4_vision.py PIL/NumPy image processor
tests/models/granite4_vision/ Modeling, image processing, and processor tests
docs/source/en/model_doc/granite4_vision.md Model documentation

Auto-registration

  • Config: auto-generated via configuration_granite4_vision.py model_type
  • Modeling: MODEL_MAPPING_NAMES + MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
  • Processing + image processing: registered in respective auto files

Tests

  • Unit tests pass locally (pytest tests/models/granite4_vision/ -x -q)
  • @slow integration tests load real checkpoint and assert outputs within tolerance
  • make style and make check-repo pass (3 remaining failures are pre-existing upstream issues: mlinter version mismatch and Sam3Lite incomplete model)

Before submitting

  • This PR is not a duplicate
  • I have read the contributor guidelines
  • The documentation reflects the changes
  • The tests pass

Related

Original PR: huggingface/transformers#45597

tarekziade added 10 commits May 7, 2026 12:23
Adds the ai-review GitHub Actions workflow plus the .ai/ context script
and review rules used by the local AI review bot setup.
Keeps only the ai-review workflow used by this fork's review bot setup;
upstream CI/release/benchmark workflows are not needed here.
The reviewer now exposes read_file/list_dir/grep tools rooted at
GITHUB_WORKSPACE, so the workflow checks out the PR head (shallow)
instead of the default branch.
@tarekziade

Copy link
Copy Markdown
Owner Author

@serge please carefully review this model addition

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤗 Serge says:

The PR changes test expectations and image loading in the Granite4Vision model tests. Two issues stand out: (1) The switch from load_image to Image.open(url_to_local_path(...)) is inconsistent with the rest of the test suite, which uses load_image for images loaded via url_to_local_path. load_image also handles EXIF transposition and RGB conversion, so this change may cause subtle image preprocessing differences. (2) The removal of Expectations in favor of hardcoded strings and loose assertIn checks weakens the integration test — the rest of the VLM tests use Expectations to handle device-specific output variance. The PR description does not mention either change, so they appear as unrelated or regressive modifications bundled with the model addition.

9 LLM turns · 16 tool calls · 54.7s · 93594 in / 3822 out tokens

self.processor = AutoProcessor.from_pretrained(self.model_id)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
self.image = load_image(url_to_local_path(url))
self.image = Image.open(url_to_local_path(url))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching from load_image to Image.open(url_to_local_path(...)) diverges from the convention used across the rest of the test suite (e.g., tests/models/bridgetower, tests/models/eomt, tests/models/llava). load_image additionally performs EXIF transposition and forces RGB conversion, so this change may introduce subtle preprocessing differences. Prefer load_image(url_to_local_path(url)) unless there is a specific reason to bypass those transforms.

("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a",
}).get_expectation() # fmt: skip

EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a" # fmt: skip

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing the Expectations device-aware assertion with a single hardcoded string removes coverage for non-CUDA hardware. Sibling VLM tests (e.g., llava, aria, qwen2_vl) use Expectations so that integration tests pass on XPU, ROCm, etc. without manual adjustment. If the model only targets CUDA, document that; otherwise restore Expectations with entries for the supported platforms.


self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0])
self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1])
self.assertIn("cat", responses[0].lower())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch integration test was weakened from exact string matching (via Expectations) to loose assertIn checks. This makes the test less useful as a regression guard — it will pass even if the model starts producing gibberish that happens to contain the words "cat" or "tennis". Integration tests for generative models in this repo typically assert exact decoded outputs. Restore exact expectations, or at least explain in the PR why exact matching is impossible here.

@sergereview sergereview Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤗 Serge says:

This PR adds Granite 4.1 Vision support with a proper modular file (modular_granite4_vision.py) and generated artifacts. The modular structure is correct: the source file inherits from LLaVA-Next and overrides novel components (Window Q-Former projector, deepstack injection). Configuration, modeling, and processor auto-registration are wired correctly, and a docs entry exists.

However, the diff only touches integration tests and shows two concerning patterns:

  1. Test robustness regression: Expectations (used across Transformers to handle different CUDA architecture outputs) is removed in favor of a single hardcoded string. This will cause failures on GPUs that produce the alternative output (e.g., "pink blanket" vs "pink couch").
  2. Assertion weakening: The batch integration test drops exact output matching and replaces it with loose assertIn("cat") / assertIn("tennis") checks. If exact determinism cannot be guaranteed, Expectations should be retained rather than removing precise verification entirely.

Additionally, the PR description claims custom image_processing_granite4_vision.py / image_processing_pil_granite4_vision.py files were added, but they do not exist in the repo; instead the model reuses LlavaNextImageProcessor via auto-registration. This is a mismatch between the description and the actual content, though reusing the LLaVA-Next processor is architecturally reasonable.

Overall the model code looks structurally sound, but the integration test changes should be reverted to maintain the repo's cross-hardware testing standards.

16 LLM turns · 31 tool calls · 249.5s · 478974 in / 5277 out tokens

("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a",
}).get_expectation() # fmt: skip

EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a" # fmt: skip

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing Expectations breaks the established pattern for handling hardware-dependent outputs across different CUDA architectures. The original code explicitly handled two variants ("pink couch" vs "pink blanket"). With a single hardcoded string, this test will fail on GPUs that produce the alternative output. Please restore Expectations unless there is evidence that output is now identical across all supported CUDA compute capabilities.


self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0])
self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1])
self.assertIn("cat", responses[0].lower())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing exact output matching with loose assertIn substring checks significantly weakens this integration test. The original assertions verified the full generated response. If exact determinism cannot be guaranteed across architectures, use Expectations (as was done before) rather than dropping precise verification entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant