Add Granite 4.1 Vision (granite4_vision)#16
Conversation
Adds the ai-review GitHub Actions workflow plus the .ai/ context script and review rules used by the local AI review bot setup.
Keeps only the ai-review workflow used by this fork's review bot setup; upstream CI/release/benchmark workflows are not needed here.
The reviewer now exposes read_file/list_dir/grep tools rooted at GITHUB_WORKSPACE, so the workflow checks out the PR head (shallow) instead of the default branch.
|
@serge please carefully review this model addition |
There was a problem hiding this comment.
🤗 Serge says:
The PR changes test expectations and image loading in the Granite4Vision model tests. Two issues stand out: (1) The switch from load_image to Image.open(url_to_local_path(...)) is inconsistent with the rest of the test suite, which uses load_image for images loaded via url_to_local_path. load_image also handles EXIF transposition and RGB conversion, so this change may cause subtle image preprocessing differences. (2) The removal of Expectations in favor of hardcoded strings and loose assertIn checks weakens the integration test — the rest of the VLM tests use Expectations to handle device-specific output variance. The PR description does not mention either change, so they appear as unrelated or regressive modifications bundled with the model addition.
9 LLM turns · 16 tool calls · 54.7s · 93594 in / 3822 out tokens
| self.processor = AutoProcessor.from_pretrained(self.model_id) | ||
| url = "http://images.cocodataset.org/val2017/000000039769.jpg" | ||
| self.image = load_image(url_to_local_path(url)) | ||
| self.image = Image.open(url_to_local_path(url)) |
There was a problem hiding this comment.
Switching from load_image to Image.open(url_to_local_path(...)) diverges from the convention used across the rest of the test suite (e.g., tests/models/bridgetower, tests/models/eomt, tests/models/llava). load_image additionally performs EXIF transposition and forces RGB conversion, so this change may introduce subtle preprocessing differences. Prefer load_image(url_to_local_path(url)) unless there is a specific reason to bypass those transforms.
| ("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a", | ||
| }).get_expectation() # fmt: skip | ||
|
|
||
| EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a" # fmt: skip |
There was a problem hiding this comment.
Replacing the Expectations device-aware assertion with a single hardcoded string removes coverage for non-CUDA hardware. Sibling VLM tests (e.g., llava, aria, qwen2_vl) use Expectations so that integration tests pass on XPU, ROCm, etc. without manual adjustment. If the model only targets CUDA, document that; otherwise restore Expectations with entries for the supported platforms.
|
|
||
| self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0]) | ||
| self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1]) | ||
| self.assertIn("cat", responses[0].lower()) |
There was a problem hiding this comment.
The batch integration test was weakened from exact string matching (via Expectations) to loose assertIn checks. This makes the test less useful as a regression guard — it will pass even if the model starts producing gibberish that happens to contain the words "cat" or "tennis". Integration tests for generative models in this repo typically assert exact decoded outputs. Restore exact expectations, or at least explain in the PR why exact matching is impossible here.
There was a problem hiding this comment.
🤗 Serge says:
This PR adds Granite 4.1 Vision support with a proper modular file (modular_granite4_vision.py) and generated artifacts. The modular structure is correct: the source file inherits from LLaVA-Next and overrides novel components (Window Q-Former projector, deepstack injection). Configuration, modeling, and processor auto-registration are wired correctly, and a docs entry exists.
However, the diff only touches integration tests and shows two concerning patterns:
- Test robustness regression:
Expectations(used across Transformers to handle different CUDA architecture outputs) is removed in favor of a single hardcoded string. This will cause failures on GPUs that produce the alternative output (e.g., "pink blanket" vs "pink couch"). - Assertion weakening: The batch integration test drops exact output matching and replaces it with loose
assertIn("cat")/assertIn("tennis")checks. If exact determinism cannot be guaranteed,Expectationsshould be retained rather than removing precise verification entirely.
Additionally, the PR description claims custom image_processing_granite4_vision.py / image_processing_pil_granite4_vision.py files were added, but they do not exist in the repo; instead the model reuses LlavaNextImageProcessor via auto-registration. This is a mismatch between the description and the actual content, though reusing the LLaVA-Next processor is architecturally reasonable.
Overall the model code looks structurally sound, but the integration test changes should be reverted to maintain the repo's cross-hardware testing standards.
16 LLM turns · 31 tool calls · 249.5s · 478974 in / 5277 out tokens
| ("cuda", (8, 6)): "The image depicts two cats resting on a pink blanket. They are lying in a relaxed, sprawled position, with one cat appearing to be in a", | ||
| }).get_expectation() # fmt: skip | ||
|
|
||
| EXPECTED_RESPONSE = "The image depicts two cats resting on a pink couch. They are lying in a relaxed, sprawled position, with one cat appearing to be in a" # fmt: skip |
There was a problem hiding this comment.
Removing Expectations breaks the established pattern for handling hardware-dependent outputs across different CUDA architectures. The original code explicitly handled two variants ("pink couch" vs "pink blanket"). With a single hardcoded string, this test will fail on GPUs that produce the alternative output. Please restore Expectations unless there is evidence that output is now identical across all supported CUDA compute capabilities.
|
|
||
| self.assertEqual(responses[0].lower(), EXPECTED_RESPONSE[0]) | ||
| self.assertEqual(responses[1].lower(), EXPECTED_RESPONSE[1]) | ||
| self.assertIn("cat", responses[0].lower()) |
There was a problem hiding this comment.
Replacing exact output matching with loose assertIn substring checks significantly weakens this integration test. The original assertions verified the full generated response. If exact determinism cannot be guaranteed across architectures, use Expectations (as was done before) rather than dropping precise verification entirely.
What does this PR do?
Adds built-in support for Granite 4.1 Vision (
granite4_vision), IBM's multimodal vision-language model for enterprise document understanding.Architecture highlights
google/siglip2-so400m-patch16-384), tiled 384×384 patchesdownsample_rate="4/8")Files added
modular_granite4_vision.pyconfiguration_granite4_vision.pymodeling_granite4_vision.pyprocessing_granite4_vision.pyimage_processing_granite4_vision.pyimage_processing_pil_granite4_vision.pytests/models/granite4_vision/docs/source/en/model_doc/granite4_vision.mdAuto-registration
configuration_granite4_vision.pymodel_typeMODEL_MAPPING_NAMES+MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMESTests
pytest tests/models/granite4_vision/ -x -q)@slowintegration tests load real checkpoint and assert outputs within tolerancemake styleandmake check-repopass (3 remaining failures are pre-existing upstream issues:mlinterversion mismatch andSam3Liteincomplete model)Before submitting
Related
Original PR: huggingface/transformers#45597