WIP: Infer with LLaVA-RLHF #2

Open · wants to merge 1 commit into main
Conversation

@monatis (Owner) commented on Oct 1, 2023

This is still WIP

Now that I've implemented GGUF support in clip.cpp, it's time to combine clip.cpp + llama.cpp = llava.cpp (the first model to be supported in this repo).

For now, I'm copying the CLIP conversion, model loading, and inference code from clip.cpp and making the necessary changes. In the future, these changes may be merged upstream, and clip.cpp may become a submodule in this repo.

  • LLaVA surgery: merge the base and LoRA weights, then strip out the multimodal projector (see the first sketch below).
  • Convert the LLaMA part with llama.cpp.
  • Update the CLIP conversion script to save a LLaVA encoder model in GGUF.
  • Load the CLIP vision model with the LLaVA projector in clip.cpp.
  • Update the clip_image_encode function to get image hidden states from layers[-2] (see the second sketch below).
  • Write a simple example for end-to-end LLaVA inference.
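
For reference, here's a minimal sketch of the surgery step. It assumes the LLaVA-RLHF checkpoint is a Hugging Face base model plus a PEFT/LoRA adapter, and that the projector weights live under keys containing "mm_projector" as in upstream LLaVA checkpoints; the paths are placeholders, not the final script.

```python
# llava_surgery.py: a minimal sketch of the surgery step, not the final script.
# Assumes a HF base model + PEFT/LoRA adapter, and that the projector weights
# live under keys containing "mm_projector" (as in upstream LLaVA checkpoints).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/base", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "path/to/llava-rlhf-lora").merge_and_unload()

state = merged.state_dict()
# Split the projector off: it travels with the GGUF CLIP encoder file, while
# everything else is converted as a plain LLaMA model with llama.cpp.
projector = {k: v for k, v in state.items() if "mm_projector" in k}
language = {k: v for k, v in state.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")  # picked up by the CLIP conversion script
merged.save_pretrained("merged-llama", state_dict=language)  # input for llama.cpp
```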
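
And here is the layers[-2] behavior that clip_image_encode needs to reproduce, illustrated with the Hugging Face CLIP implementation rather than this repo's code (the model name and image path are just examples): LLaVA feeds the projector the hidden states of the penultimate encoder layer, not the final layer output.

```python
# Illustration of the layers[-2] selection, using HF transformers (not clip.cpp).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14"  # example; LLaVA uses a ViT-L/14 tower
processor = CLIPImageProcessor.from_pretrained(name)
model = CLIPVisionModel.from_pretrained(name)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one entry per encoder layer.
# LLaVA takes the penultimate layer's states and usually drops the CLS token
# before passing them through the multimodal projector.
features = out.hidden_states[-2][:, 1:]  # (1, 256, 1024) for ViT-L/14 at 224px
```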

I think this is enough for the initial release. I will streamline the implementation afterwards.
