
llava: fix clip-model-is-vision flag in README.md #5509

Merged
merged 2 commits into master from llava-clip-model-is-vision on Feb 16, 2024

Conversation

@danbev (Collaborator) commented Feb 15, 2024

This commit fixes the flag `--clip_model_is_vision` in README.md, which does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
  --clip-model-is-vision
                        The clip model is a pure vision model
                        (ShareGPT4V vision extract for example)
```

Signed-off-by: Daniel Bevenius <[email protected]>
@cmp-nct (Contributor) commented Feb 15, 2024

The arg is checked like this in the python script:
`if args.clip_model_is_vision:`
But maybe both ways work in python.

That said: the argument as I wrote it in the readme works. Maybe both variants work?
If only the original variant works, we should update the python script instead (the help message shown).

@danbev (Collaborator, Author) commented Feb 15, 2024

> That said: the argument as I wrote it in the readme works. Maybe both variants work?

I ran into an issue copying the command from README and got the following error:

```console
(llava-venv) $ python ./examples/llava/convert-image-encoder-to-gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip_model_is_vision
usage: convert-image-encoder-to-gguf.py [-h] -m MODEL_DIR [--use-f32] [--text-only] [--vision-only]
                                        [--clip-model-is-vision] [--clip-model-is-openclip]
                                        [--llava-projector LLAVA_PROJECTOR] [--projector-type {mlp,ldp}]
                                        [-o OUTPUT_DIR] [--image-mean IMAGE_MEAN [IMAGE_MEAN ...]]
                                        [--image-std IMAGE_STD [IMAGE_STD ...]]
convert-image-encoder-to-gguf.py: error: unrecognized arguments: --clip_model_is_vision
```

I might be missing something as it works for you. I'd be happy to update if needed, just wondering about consistency with the other options, which seem to use `-` and not `_`.

@cmp-nct (Contributor) commented Feb 15, 2024

It looks like the python file was indeed changed; on my side it was still with underscores!

It looks like python adapts arguments with a "-" internally to underscores, as the minus is illegal in variable names.
It feels a bit wrong to me that the argument differs from the variable name it turns into, but your change indeed fixes it.
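For illustration, a minimal argparse sketch of this dash-to-underscore conversion (the flag name matches the script; the rest is a standalone example, not the actual conversion script):

```python
import argparse

parser = argparse.ArgumentParser()
# Dashes in the option name become underscores in the parsed namespace,
# so the flag is typed with dashes but read back with underscores.
parser.add_argument("--clip-model-is-vision", action="store_true")

args = parser.parse_args(["--clip-model-is-vision"])
print(args.clip_model_is_vision)  # True

# The underscore spelling is rejected on the command line:
# parser.parse_args(["--clip_model_is_vision"])
# -> error: unrecognized arguments: --clip_model_is_vision
```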

@danbev (Collaborator, Author) commented Feb 15, 2024

@cmp-nct May I bother you with another issue I ran into (after the issue in this PR) which is:

```console
Traceback (most recent call last):
  File "/home/danielbevenius/work/ai/llama.cpp/./examples/llava/convert-image-encoder-to-gguf.py", line 199, in <module>
    fout.add_uint32("clip.vision.image_size", v_hparams["image_size"])
                                              ~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'image_size'
```

I was able to work around this by changing:

```console
(llava-venv) $ git diff examples
diff --git a/examples/llava/convert-image-encoder-to-gguf.py b/examples/llava/convert-image-encoder-to-gguf.py
index c69f89ac..94754a47 100644
--- a/examples/llava/convert-image-encoder-to-gguf.py
+++ b/examples/llava/convert-image-encoder-to-gguf.py
@@ -117,7 +117,7 @@ else:
 with open(dir_model + "/config.json", "r", encoding="utf-8") as f:
     config = json.load(f)
     if args.clip_model_is_vision:
-        v_hparams = config
+        v_hparams = config["vision_config"]
         t_hparams = None
     else:
         v_hparams = config["vision_config"]
```

I'm using the config.json suggested in the README but not sure if this is a correct "fix". With this change I'm able to run the llava-cli with llava-v1.6-vicuna-7b/ggml-model-f16.gguf 🎉
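For context, a hedged illustration of why the KeyError shows up here (the nesting follows the traceback and the reply below; the field values are invented):

```python
# Illustrative shape of a full CLIP config.json (field values invented):
full_clip_config = {
    "text_config": {"hidden_size": 768},
    "vision_config": {"image_size": 336, "patch_size": 14, "hidden_size": 1024},
}

# With --clip-model-is-vision the script sets v_hparams = config, so on a
# config shaped like the above, v_hparams["image_size"] raises KeyError;
# the vision fields only exist under config["vision_config"].
print(full_clip_config["vision_config"]["image_size"])  # 336
```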

@cmp-nct (Contributor) commented Feb 15, 2024

> @cmp-nct May I bother you with another issue I ran into (after the issue in this PR) [...] I'm using the config.json suggested in the README but not sure if this is a correct "fix".

The original was correct; what the script could use is automatic detection. Though there is a lot that needs to be done with the current llava and clip implementation ;)

When you use `--clip-model-is-vision` the conversion script expects a vision tower from CLIP and a configuration for that vision tower.
When you omit this argument the conversion script expects a full CLIP model and its configuration file.

So `config["vision_config"]` is the configuration of the vision tower inside a full CLIP model config.

Now the problem is that my config.json on Hugging Face is not the one I used for the ViT, I just noticed that.
I uploaded a new config_vit.json (https://huggingface.co/cmp-nct/llava-1.6-gguf/blob/main/config_vit.json) which contains the config you should use for the pure vision tower (it needs to be renamed).
Can you add that change to your PR?
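A minimal sketch of the automatic-detection idea mentioned above (not what the script currently does; it only assumes that a full CLIP config nests the vision fields under "vision_config"):

```python
import json

def load_vision_hparams(config_path: str) -> dict:
    """Return the vision-tower hparams from either config.json layout (sketch only)."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Full CLIP config: the vision tower's fields live under "vision_config".
    if "vision_config" in config:
        return config["vision_config"]
    # Pure vision-tower config: fields such as "image_size" are at top level.
    return config
```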

@ggerganov merged commit 60ed04c into ggerganov:master Feb 16, 2024
21 checks passed
@danbev deleted the llava-clip-model-is-vision branch February 16, 2024 11:11
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* llava: fix clip-model-is-vision flag in README.md

This commit fixes the flag `--clip_model_is_vision` in README.md, which
does not match the actual flag:
```console
$ python convert-image-encoder-to-gguf.py --help
...
  --clip-model-is-vision
                        The clip model is a pure vision model
                        (ShareGPT4V vision extract for example)
```

Signed-off-by: Daniel Bevenius <[email protected]>

* llava: update link to vit config in README.md

Signed-off-by: Daniel Bevenius <[email protected]>

---------

Signed-off-by: Daniel Bevenius <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024