Feat: Image Processor for CLIP #1
Conversation
@@ -0,0 +1,28 @@
CLIP_IMAGE_CONFIG = {'openai/clip-vit-base-patch32': {
We want to pull these from HuggingFace instead of hardcoding constants. This solves the problem, but hurts maintenance.
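For example, the preprocessing defaults could be fetched from the Hub at load time instead of being hardcoded. A minimal sketch, assuming huggingface_hub is available and the repo ships the standard preprocessor_config.json (the helper name is just for illustration):

import json

from huggingface_hub import hf_hub_download


def load_clip_image_config(model_id='openai/clip-vit-base-patch32'):
    # preprocessor_config.json carries image_mean, image_std, crop_size, etc.
    config_path = hf_hub_download(model_id, 'preprocessor_config.json')
    with open(config_path) as f:
        return json.load(f)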
@@ -0,0 +1,5 @@
from typing import List, Union

ImageInput = Union[
We can include this within the image_processor.
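For reference, a sketch of what folding the alias into image_processor.py could look like; the union members are assumed here, since the diff only shows the opening line:

from typing import List, Union

import numpy as np
from PIL.Image import Image

# Accept a single image or a list of images, as PIL images or numpy arrays.
ImageInput = Union[Image, np.ndarray, List[Image], List[np.ndarray]]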
@@ -0,0 +1,252 @@
from image_config import CLIP_IMAGE_CONFIG
from jmage_constants import ImageInput
typo: jmage_constants should be image_constants
gboduljak left a comment:
Thanks for the contribution. I will merge your change since my comments are nits. However, please make sure we also test correctness; it is quite easy to get this wrong, and errors in preprocessing can propagate downstream and be difficult to debug. Here is a concrete test:
import mlx.core as mx
import numpy as np
import transformers
from PIL import Image
from image_processor import CLIPImageProcessor  # module path assumed (the processor added in this PR)


def test_image_processor():
    mx_image_proc = CLIPImageProcessor('openai/clip-vit-base-patch32')
    tf_image_proc = transformers.CLIPImageProcessor.from_pretrained(
        'openai/clip-vit-base-patch32')
    image = Image.open("cats.jpeg")
    data = mx_image_proc([image])
    mx_pixels = mx.array(data['pixel_values'])
    # transformers returns NCHW; transpose to NHWC to match the MLX layout.
    tf_pixels = mx.array(np.array(tf_image_proc([image])[
        'pixel_values']).transpose((0, 2, 3, 1)))
    assert mx.array_equal(mx_pixels, tf_pixels)
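If exact equality turns out to be brittle because of small floating-point differences between the two resize/normalize pipelines (an assumption, not something reported in this thread), a tolerance-based comparison is a reasonable fallback:

assert mx.allclose(mx_pixels, tf_pixels, atol=1e-6)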
@nkasmanoff:
* clip image processor
* added example usage
* probably approximately correct CLIPTextEncoder
* implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer
* replaced embedding layer with simple matrix
* implemented ViT
* added ViT tests
* fixed tests
* added pooler_output for text
* implemented complete CLIPModel
* implemented init
* implemented convert.py and from_pretrained
* fixed some minor bugs and added the README.md
* removed tokenizer unused comments
* removed unused deps
* updated ACKNOWLEDGEMENTS.md
* Feat: Image Processor for CLIP (#1)

@nkasmanoff:
* clip image processor
* added example usage
* refactored image preprocessing
* deleted unused image_config.py
* removed preprocessing port
* added dependency to mlx-data
* fixed attribution and moved photos to assets
* implemented a simple port of CLIPImageProcessor
* review changes
* PR review changes
* renamed too verbose arg
* updated README.md
* nits in readme / conversion
* simplify some stuff, remove unneeded inits
* remove more init stuff
* more simplify
* make test a unit test
* update main readme
* readme nits

Co-authored-by: Noah Kasmanoff <[email protected]>
Co-authored-by: Awni Hannun <[email protected]>