[llava] one pixel is missing from padding when length is odd #37819
zucchini-nlp merged 15 commits into huggingface:main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the
zucchini-nlp
left a comment
Thanks for submitting PR, I agree it has to be padded in square. Can you update the model code as well, where we do unpadding and mark the PR as ready for review when finalized?
max_num_patches = int(vision_aspect_ratio.strip("anyres_max_"))
ratio = math.sqrt(current_height * current_width / (max_num_patches * patches_height**2))
afaik we don't have hf-style checkpoints with different anyres patching. In that case there's no need to support it, since we try not to bloat the code with unused features
Oh, I understood your intention.
However, I think supporting the "vision_aspect_ratio" argument is a good idea for consistency, as the following code already allows a variable vision_aspect_ratio.
Ah oke, if code uses different anyres we can keep it :)
Okay I will. Thanks for the review tho.
Maybe the logic in the LLaVA-NeXT project should be fixed too
Thanks a lot, looks good to me! Hm, just checked LLaVA NeXT repo and seems they unpad image by removing same pixels from height/width. We try to stay close to the original implementation, so would be great to flag it to authors yeah
Also, could you add a small test for edge cases with odd shapes? As an example, see a previous issue (#34522) on an edge-case image size where the model fails to unpad correctly. After testing, we're good to merge :)
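For reference, a minimal sketch of what unpadding looks like when it mirrors the padding split. This is an illustration of the idea discussed above, not the actual transformers implementation; the function name and signature are assumed:

```python
import numpy as np

def unpad_image(padded: np.ndarray, original_height: int, original_width: int) -> np.ndarray:
    """Inverse of center-padding to a square (illustrative sketch).

    Uses the same //2 split as the padding step, so when the size
    difference is odd the extra pixel is removed from the bottom/right,
    i.e. from the same side where padding placed it.
    """
    pad_top = (padded.shape[0] - original_height) // 2
    pad_left = (padded.shape[1] - original_width) // 2
    return padded[pad_top:pad_top + original_height,
                  pad_left:pad_left + original_width]
```

If pad and unpad use different splits for the odd pixel, a row or column of real image content is cut off, which is exactly the class of bug the requested odd-shape test would catch.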
Looks like "vision_aspect_ratio" is better to put in the processor, not in the image_processor
zucchini-nlp
left a comment
Oke, sounds good to keep in processor. Thanks for the tests!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…face#37819)
* [fix] one pixel should be added when length is odd
* [fix] add vision_aspect_ratio args & typo
* [fix] style
* [fix] do not fix fast file directly
* [fix] convert using modular
* remove duplicate codes
* match unpad logic with pad logic
* test odd-sized images for llava & aria
* test unpad odd-sized padding for llava family
* fix style
* add kwarg to onvision modular
* move vision_aspect_ratio from image_processor to processor (llava_onevision)
One pixel is missing from the padding when the length is odd in the llava family and the Aria model, which differs from the logic in the LLaVA-NeXT repo. In LLaVA-NeXT this is handled automatically, since its logic resizes the image first and then pastes (or overwrites) the original image onto it (https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/mm_utils.py#L186).
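The fix described above can be sketched as follows. This is a simplified illustration of the padding arithmetic, assuming a NumPy (H, W, C) image; the real transformers code operates inside the image processors and differs in detail:

```python
import numpy as np

def pad_to_square(image: np.ndarray) -> np.ndarray:
    """Pad an (H, W, C) image to a square, centering the original.

    When the difference between the sides is odd, the leftover pixel
    goes to the bottom/right pad instead of being silently dropped by
    using integer division for both sides (the bug this PR fixes).
    """
    height, width = image.shape[:2]
    side = max(height, width)
    pad_top = (side - height) // 2
    pad_bottom = side - height - pad_top   # absorbs the odd pixel
    pad_left = (side - width) // 2
    pad_right = side - width - pad_left    # absorbs the odd pixel
    return np.pad(image, ((pad_top, pad_bottom), (pad_left, pad_right), (0, 0)))
```

With the buggy variant (`pad_bottom = (side - height) // 2` on both sides), a 4x7 image would be padded to 6x7 instead of 7x7, losing one row.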
Also add a vision_aspect_ratio argument to LlavaOnevisionImageProcessor to regulate max_num_patches during unpadding; it was hard-coded to "9" before. (https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/llava_arch.py#L387)
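The parsing step behind this argument, as quoted in the review snippet above, can be isolated like this. The helper name is hypothetical; only the `strip` expression comes from the actual code:

```python
def max_patches_from_aspect_ratio(vision_aspect_ratio: str = "anyres_max_9") -> int:
    # "anyres_max_9" -> 9. Note that str.strip removes *characters* from
    # the set "anyres_max_" at both ends; it works here only because
    # digits are not in that set. removeprefix("anyres_max_") would
    # express the intent more directly.
    return int(vision_aspect_ratio.strip("anyres_max_"))
```

Making this configurable means checkpoints with a different anyres grid (e.g. a hypothetical "anyres_max_6") no longer silently use the hard-coded 9.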
Also fix the typo "processinf" → "processing" in the documentation.
Who can review?
@amyeroberts, @qubvel