Conversation
@kamila-chay is this ready for review?

@jackzhxng sorry for the delay, I had trouble with GPU access; today/tomorrow I should be able to wrap everything up and it will be ready :)
Force-pushed 57adb18 to 15ceb09
Hi @zucchini-nlp, we can start the review process; everything is ready except for the tests. I'm writing them now, but I think I can do that in the meantime while responding to any comments from your end :)
zucchini-nlp left a comment:

Thanks @kamila-chay, looks really nice and clean. Let's standardize a bit and then we can request a core maintainer's review.
docs/source/en/model_doc/fast_vlm.md (Outdated)
```md
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with image-to-text transformers (here using Llava as an example).

<PipelineTag pipeline="image-to-text"/>

- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference.
- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎
```
not needed unless it is directly related to FastVLM
```python
# only this value makes sense in FastVLM (we can't have a CLS token in conv layers)
if vision_feature_select_strategy != "full":
    raise ValueError(
        f"Unexpected select feature strategy: {vision_feature_select_strategy}. Only 'full' is supported in FastVLM."
    )

if any(
    layer >= 0
    for layer in (
        vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
    )
):
    raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")
```
let's delete this. Having it in the config init is enough for now
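As a side note, a minimal sketch of what that check looks like when it lives only in the config `__init__` (the class name, defaults, and surrounding code are illustrative assumptions, not the actual FastVLM implementation):

```python
from collections.abc import Iterable

from transformers import PretrainedConfig


class FastVlmConfigSketch(PretrainedConfig):  # hypothetical name, for illustration only
    def __init__(self, vision_feature_select_strategy="full", vision_feature_layer=-1, **kwargs):
        # Validate once, at config-creation time, instead of on every forward call.
        if vision_feature_select_strategy != "full":
            raise ValueError("Only 'full' is supported in FastVLM.")
        layers = vision_feature_layer if isinstance(vision_feature_layer, Iterable) else [vision_feature_layer]
        if any(layer >= 0 for layer in layers):
            raise ValueError(f"Only negative vision feature layer values are supported. Got {vision_feature_layer}")
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        super().__init__(**kwargs)
```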
```python
>>> prompt = "<|im_start|>user\n<image>\nWhat's the content of the image?<|im_end|>\n<|im_start|>assistant\n"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
```
let's use the chat template for the code snippet
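For reference, a chat-template version of the snippet could look roughly like this (the checkpoint name is a placeholder, and the message schema assumes a recent transformers release):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("apple/FastVLM-0.5B")  # placeholder checkpoint name

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What's the content of the image?"},
        ],
    }
]

# apply_chat_template builds the prompt (image placeholders included) from the template
# shipped with the processor, so the snippet no longer hardcodes <|im_start|> tokens.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
```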
Force-pushed ed83be0 to ef4bce6
I did, I am sorry.
ArthurZucker left a comment:

LGTM, perfect 🤗 Thanks @zucchini-nlp for the review!
run-slow: auto, fast_vlm

This comment contains models: ["models/auto", "models/fast_vlm"]
CI Results
Model CI Report: ❌ Failed tests
@kamila-chay we can merge when the CI turns green. The weight-tying failures aren't related to the model specifically; I will rebase and see if it was fixed on main already. There was a huge refactor recently and the main branch is a bit unstable.

upd: we need to run fix-copies
@bot /style
Style fix runs successfully without any file modified.
oh wow, style bot doesn't run fix-copies, didn't know about that 🥲
OK, let me look at this issue and fix it, I've been away for a few days and didn't see it.
A simple fix-copies might fix it all :)
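For anyone following along, this refers to the repository's standard make targets, run from the root of a transformers checkout:

```bash
make fix-copies   # regenerate files that are declared as copies of other modules
make style        # apply the repo's formatters
```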
run-slow: fast_vlm

This comment contains models: ["models/fast_vlm"]
CI Results
Model CI Report: ❌ Failed tests
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, fast_vlm
Force-pushed c53f1c5 to e73894c
Force-pushed e73894c to 8e8c12a
Everything's green now, slow tests should be ok too :) @zucchini-nlp
run-slow: fast_vlm

This comment contains models: ["models/fast_vlm"]
CI Results
✅ No failing test specific to this PR 🎉!
Thanks, great work @kamila-chay! Let's merge 🚀
* Added an initial conversion script
* Added a modular where FastVLM is different from LlaVA
* Improved the conversion script
* Adjusted the conversion script
* Removed redundant labels from FastViT & improved the template
* Added docs and changed default config
* Fix default config
* Fix default config
* Fixed layer feature handling and more docs
* Fixed documentation
* Style fixed
* Some small fixes
* Improved the example script to be more inclusive
* Fixes after the rebase
* Made the code and docs more readable and consistent
* Some fixes from the review
* Reverted back to last layer only
* Typos fixed
* Added initial tests - some still failing
* Style and quality fixes
* Updated modular according to the review
* Tests passing and some suggested generic improvements
* Docs updated with another usage tip and an auto model
* Reversed changes to test_can_initialize_on_meta because it's not fully compatible with one existing model
* Some tweaks
* Typo fix
* Consistency fixed
* Review comment
* Redundant config attr deleted
* Consistency fixed
* Fixed integration tests after rebase

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
What does this PR do?
This PR adds FastVLM from Apple. The model's architecture is very similar to LLaVA; the main difference is that it uses a very fast hybrid encoder called FastViTHD. timm's FastViT implementation is used, and the LLaVA modality connector has been slightly modified.
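As a rough sketch of that design (the backbone variant, dimensions, and module names below are assumptions for illustration, not the actual FastVLM code):

```python
import timm
import torch
from torch import nn

# Stand-in for FastViTHD: any timm FastViT variant with feature-map outputs.
vision_tower = timm.create_model("fastvit_sa12", pretrained=False, features_only=True)


class ConnectorSketch(nn.Module):
    """LLaVA-style MLP projector mapping vision features into the LLM embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


pixel_values = torch.randn(1, 3, 256, 256)
feature_map = vision_tower(pixel_values)[-1]          # (B, C, H, W) from the last conv stage
tokens = feature_map.flatten(2).transpose(1, 2)       # (B, H*W, C) image "tokens"
image_embeds = ConnectorSketch(tokens.shape[-1], 4096)(tokens)  # 4096 = assumed LLM hidden size
```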
Addresses (#38765)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp @ariG23498