
Conversation

@nkasmanoff

No description provided.

@@ -0,0 +1,28 @@
CLIP_IMAGE_CONFIG = {'openai/clip-vit-base-patch32': {
@gboduljak (Owner):

We want to pull these from HuggingFace instead of hardcoding constants. This solves the problem, but it hurts maintainability.
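
A minimal sketch of that suggestion, assuming the standard huggingface_hub API (the load_image_config helper name is hypothetical):

import json

from huggingface_hub import hf_hub_download

def load_image_config(model_name: str) -> dict:
    # Fetch the model's preprocessor_config.json from the Hub instead
    # of duplicating its values in a hardcoded CLIP_IMAGE_CONFIG.
    config_path = hf_hub_download(model_name, "preprocessor_config.json")
    with open(config_path) as f:
        return json.load(f)

For 'openai/clip-vit-base-patch32' this yields the same fields (image_mean, image_std, crop_size, and so on) that the hardcoded dict carries.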

@@ -0,0 +1,5 @@
from typing import List, Union

ImageInput = Union[
Copy link

We can include this within the image_processor.
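
For reference, a hedged sketch of folding the alias into image_processor.py; the exact Union members are assumed here to mirror the transformers convention, not taken from the diff:

from typing import List, Union

import numpy as np
from PIL import Image

# Inputs the processor accepts: a single image or a list of images,
# as PIL images or numpy arrays (members assumed, not from the diff).
ImageInput = Union[Image.Image, np.ndarray, List[Image.Image], List[np.ndarray]]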

@@ -0,0 +1,252 @@
from image_config import CLIP_IMAGE_CONFIG
from jmage_constants import ImageInput
@gboduljak (Owner):

Typo: jmage_constants should be image_constants.

@gboduljak (Owner) left a comment:

Thanks for the contribution. I will merge your change because my comments are nits. However, please ensure we test correctness as well. It is quite easy to get this wrong: errors in preprocessing propagate downstream and can be difficult to debug. Here is a concrete test:

import mlx.core as mx
import numpy as np
import transformers
from PIL import Image

# Assuming the port added in this PR lives in image_processor.py.
from image_processor import CLIPImageProcessor


def test_image_processor():
    # Preprocess the same image with the MLX port and the reference
    # transformers implementation, then compare pixel values.
    mx_image_proc = CLIPImageProcessor('openai/clip-vit-base-patch32')
    tf_image_proc = transformers.CLIPImageProcessor.from_pretrained(
        'openai/clip-vit-base-patch32')
    image = Image.open("cats.jpeg")

    data = mx_image_proc([image])
    mx_pixels = mx.array(data['pixel_values'])
    # transformers returns NCHW; transpose to NHWC to match MLX.
    tf_pixels = mx.array(np.array(tf_image_proc([image])[
        'pixel_values']).transpose((0, 2, 3, 1)))

    assert mx.array_equal(mx_pixels, tf_pixels)
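
Note that transformers returns pixel values in NCHW layout while the MLX port produces NHWC, hence the transpose. If exact equality proves flaky across library versions, comparing with mx.allclose and a small tolerance is a reasonable fallback, since the preprocessing involves floating-point resizing and normalization.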

gboduljak merged this pull request into gboduljak:clip on Jan 16, 2024.
awni pushed a commit that referenced this pull request on Jan 31, 2024:
@nkasmanoff:
* clip image processor
* added example usage
gboduljak added a commit that referenced this pull request on Feb 3, 2024:
* probably approximately correct CLIPTextEncoder

* implemented CLIPEncoderLayer as built-in nn.TransformerEncoderLayer

* replaced embedding layer with simple matrix

* implemented ViT

* added ViT tests

* fixed tests

* added pooler_output for text

* implemented complete CLIPModel

* implemented init

* implemented convert.py and from_pretrained

* fixed some minor bugs and added the README.md

* removed tokenizer unused comments

* removed unused deps

* updated ACKNOWLEDGEMENTS.md

* Feat: Image Processor for CLIP (#1)

@nkasmanoff:
* clip image processor
* added example usage

* refactored image preprocessing

* deleted unused image_config.py

* removed preprocessing port

* added dependency to mlx-data

* fixed attribution and moved photos to assets

* implemented a simple port of CLIPImageProcessor

* review changes

* PR review changes

* renamed too verbose arg

* updated README.md

* nits in readme / conversion

* simplify some stuff, remove unneeded inits

* remove more init stuff

* more simplify

* make test a unit test

* update main readme

* readme nits

---------

Co-authored-by: Noah Kasmanoff <[email protected]>
Co-authored-by: Awni Hannun <[email protected]>