Merged

35 commits
4d964bd
add: llava mlx first draft
nkasmanoff Feb 19, 2024
0e2a054
add: weights comparision
nkasmanoff Feb 19, 2024
6e4a7ee
add forward pass skeleton
nkasmanoff Feb 19, 2024
ed9d376
update: now imports weights correctly
nkasmanoff Feb 22, 2024
b83b1e5
delete base
nkasmanoff Feb 22, 2024
6e23847
latest
nkasmanoff Feb 22, 2024
bb5b898
adding config
nkasmanoff Feb 22, 2024
95f9df1
fix: use config
nkasmanoff Feb 22, 2024
a1c6fe6
add mlx config
nkasmanoff Feb 22, 2024
cec0639
feat: add image processor for llava processor
mzbac Feb 23, 2024
4dd8bca
wip
mzbac Feb 24, 2024
c4ea94f
feat: llava working example
mzbac Feb 24, 2024
b9aeade
chore: refactor generate script
mzbac Feb 24, 2024
d8f7b89
chore: clean up
mzbac Feb 24, 2024
7fb1a39
Merge pull request #1 from mzbac/llava
nkasmanoff Feb 24, 2024
371a807
add: warning to user if no <image> token despite using one
nkasmanoff Feb 24, 2024
449f7d0
add: __call__ to LlavaModel
nkasmanoff Feb 24, 2024
a1cab2b
add: call to LlavaModel
nkasmanoff Feb 24, 2024
8e6b2f5
update fp
nkasmanoff Feb 26, 2024
823411c
clean up var names
nkasmanoff Feb 26, 2024
6bc06c8
update: native GeLU
nkasmanoff Feb 26, 2024
feec5ec
Cleanup
nkasmanoff Feb 28, 2024
d76fd40
update generate and readme
nkasmanoff Feb 28, 2024
49f928a
remove todo comment
nkasmanoff Feb 28, 2024
c2b8463
rearrange tests
nkasmanoff Feb 28, 2024
25a65cf
fix example code
nkasmanoff Feb 28, 2024
c2c9411
nits in README
awni Feb 28, 2024
8301c43
update readme
nkasmanoff Feb 28, 2024
5c8f67d
nit in readme
awni Feb 28, 2024
cd77bcf
nits in README
awni Feb 28, 2024
b39c251
chore(llava): refactor image embedding merging logic
mzbac Feb 28, 2024
935ebb5
min mlx version
awni Mar 1, 2024
683b7c4
nits in readmes
awni Mar 1, 2024
b37891d
fix cli prompt, some nits
awni Mar 1, 2024
7ace6ea
updates, slight simplify
awni Mar 1, 2024
1 change: 1 addition & 0 deletions README.md
@@ -31,6 +31,7 @@ Some more useful examples are listed below.
### Multimodal models

- Joint text and image embeddings with [CLIP](clip).
- Text generation from image and text inputs with [LLaVA](llava).

### Other Models

1 change: 1 addition & 0 deletions llava/.gitignore
@@ -0,0 +1 @@
**.ipynb
61 changes: 61 additions & 0 deletions llava/README.md
@@ -0,0 +1,61 @@
# LLaVA

An example of LLaVA: Large Language and Vision Assistant in MLX.[^1] LLaVA is
a multimodal model that can generate text given combined image and text inputs.

## Setup

Install the dependencies:

```bash
pip install -r requirements.txt
```

## Run

You can use LLaVA to ask questions about images.

For example, using the command line:

```bash
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
  --prompt "USER: <image>\nWhat are these?\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```
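
Note the literal `\n` sequences in the prompt: `generate.py` decodes
shell-escaped sequences with `codecs.decode(prompt, "unicode_escape")`, so a
`\n` typed on the command line becomes a real newline before tokenization.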

This uses the following image:

![alt text](http://images.cocodataset.org/val2017/000000039769.jpg)

And generates the output:

```
These are two cats lying on a pink couch.
```

You can also use LLaVA in Python:

```python
from generate import load_model, prepare_inputs, generate_text

processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"
input_ids, pixel_values = prepare_inputs(processor, image, prompt)

reply = generate_text(
    input_ids, pixel_values, model, processor, max_tokens, temperature
)

print(reply)
```
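
Since `prepare_inputs` accepts either a URL or a local path (strings are
routed through `load_image`), the same session works with an image on disk.
A minimal sketch, continuing from the example above and assuming a
hypothetical local file `cats.jpg`:

```python
# "cats.jpg" is a hypothetical local file; any image PIL can open works.
prompt = "USER: <image>\nDescribe the image.\nASSISTANT:"
input_ids, pixel_values = prepare_inputs(processor, "cats.jpg", prompt)

# A temperature of 0.0 makes sampling greedy (argmax), so output is deterministic.
reply = generate_text(input_ids, pixel_values, model, processor, 128, 0.0)
print(reply)
```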

[^1]:
Refer to [LLaVA project webpage](https://llava-vl.github.io/) for more
information.
130 changes: 130 additions & 0 deletions llava/generate.py
@@ -0,0 +1,130 @@
# Copyright © 2024 Apple Inc.

import argparse
import codecs
from pathlib import Path

import mlx.core as mx
import requests
from PIL import Image
from transformers import AutoProcessor

from llava import LlavaModel


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Generate text from an image using a model."
    )
    parser.add_argument(
        "--model",
        type=str,
        default="llava-hf/llava-1.5-7b-hf",
        help="The path to the local model directory or Hugging Face repo.",
    )
    parser.add_argument(
        "--image",
        type=str,
        default="http://images.cocodataset.org/val2017/000000039769.jpg",
        help="URL or path of the image to process.",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default="USER: <image>\nWhat are these?\nASSISTANT:",
        help="Message to be processed by the model.",
    )
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=100,
        help="Maximum number of tokens to generate.",
    )
    parser.add_argument(
        "--temp", type=float, default=0.3, help="Temperature for sampling."
    )
    return parser.parse_args()


def load_image(image_source):
    """
    Helper function to load an image from either a URL or file.
    """
    if image_source.startswith(("http://", "https://")):
        try:
            response = requests.get(image_source, stream=True)
            response.raise_for_status()
            return Image.open(response.raw)
        except Exception as e:
            raise ValueError(
                f"Failed to load image from URL: {image_source} with error {e}"
            )
    elif Path(image_source).is_file():
        try:
            return Image.open(image_source)
        except IOError as e:
            raise ValueError(f"Failed to load image {image_source} with error: {e}")
    else:
        raise ValueError(
            f"The image {image_source} must be a valid URL or existing file."
        )


def prepare_inputs(processor, image, prompt):
    if isinstance(image, str):
        image = load_image(image)
    inputs = processor(prompt, image, return_tensors="np")
    pixel_values = mx.array(inputs["pixel_values"])
    input_ids = mx.array(inputs["input_ids"])
    return input_ids, pixel_values


def load_model(model_path):
    processor = AutoProcessor.from_pretrained(model_path)
    model = LlavaModel.from_pretrained(model_path)
    return processor, model


def sample(logits, temperature=0.0):
    if temperature == 0:
        return mx.argmax(logits, axis=-1)
    else:
        return mx.random.categorical(logits * (1 / temperature))


def generate_text(input_ids, pixel_values, model, processor, max_tokens, temperature):
    # Prefill: run the full prompt (text and image) once to build the KV cache.
    logits, cache = model(input_ids, pixel_values)
    logits = logits[:, -1, :]
    y = sample(logits, temperature=temperature)
    tokens = [y.item()]

    # Decode autoregressively, feeding back one token at a time with the cache.
    for n in range(max_tokens - 1):
        logits, cache = model.language_model(y[None], cache=cache)
        logits = logits[:, -1, :]
        y = sample(logits, temperature)
        token = y.item()
        if token == processor.tokenizer.eos_token_id:
            break
        tokens.append(token)

    return processor.tokenizer.decode(tokens)


def main():
    args = parse_arguments()
    processor, model = load_model(args.model)

    # Unescape sequences like "\n" passed literally from the shell.
    prompt = codecs.decode(args.prompt, "unicode_escape")

    input_ids, pixel_values = prepare_inputs(processor, args.image, prompt)

    print(prompt)
    generated_text = generate_text(
        input_ids, pixel_values, model, processor, args.max_tokens, args.temp
    )
    print(generated_text)


if __name__ == "__main__":
    main()