WIP: Infer with LLaVA-RLHF #2

Open · wants to merge 1 commit into main
Conversation

@monatis (Owner) commented on Oct 1, 2023

This is still WIP

Now that I've implemented GGUF support in clip.cpp, it's time to combine clip.cpp + llama.cpp = llava.cpp (the first model to be supported in this repo).

For now, I'm copying the CLIP conversion, model loading, and inference code from clip.cpp and making the necessary changes. In the future, these changes may be merged upstream, and clip.cpp may become a submodule in this repo.

  • LLaVA surgery: merge the base and LoRA weights, then strip out the multimodal projector (see the first sketch below).
  • Convert the LLaMA part with llama.cpp.
  • Update the CLIP conversion script to save a LLaVA encoder model in GGUF.
  • Load the CLIP vision model with the LLaVA projector in clip.cpp.
  • Update the clip_image_encode function to get image hidden states from layers[-2] (see the second sketch below).
  • Write a simple example for end-to-end LLaVA inference.
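
For reference, here's a minimal sketch of the surgery step. It assumes the LLaVA-RLHF checkpoint is a Hugging Face base model plus a PEFT/LoRA adapter, and that the projector weights live under keys containing "mm_projector" as in upstream LLaVA checkpoints; the paths are placeholders, not the final script.

```python
# llava_surgery.py: a minimal sketch of the surgery step, not the final script.
# Assumes a HF base model + PEFT/LoRA adapter, and that the projector weights
# live under keys containing "mm_projector" (as in upstream LLaVA checkpoints).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "path/to/llava-rlhf-lora").merge_and_unload()

state = merged.state_dict()
# Split the projector off: it travels with the GGUF CLIP encoder file, while
# everything else is converted as a plain LLaMA model with llama.cpp.
projector = {k: v for k, v in state.items() if "mm_projector" in k}
language = {k: v for k, v in state.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")  # picked up by the CLIP conversion script
merged.save_pretrained("merged-llama", state_dict=language)  # input for llama.cpp
```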
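
And here is the layers[-2] behavior that clip_image_encode needs to reproduce, illustrated with the Hugging Face CLIP implementation rather than this repo's code (the model name and image path are just examples): LLaVA feeds the projector the hidden states of the penultimate encoder layer, not the final layer output.

```python
# Illustration of the layers[-2] selection, using HF transformers (not clip.cpp).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14"  # example; LLaVA uses a ViT-L/14 tower
processor = CLIPImageProcessor.from_pretrained(name)
model = CLIPVisionModel.from_pretrained(name)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one entry per encoder layer.
# LLaVA takes the penultimate layer's states and usually drops the CLS token
# before passing them through the multimodal projector.
features = out.hidden_states[-2][:, 1:]  # (1, 256, 1024) for ViT-L/14 at 224px
```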

I think this is enough for the initial release. I will streamline the implementation afterwards.
