
Multimodal Eval Enablement (Looking for Developer to Implement Design) #1334

Open

Olivia-liu opened this issue Oct 29, 2024 · 13 comments

Labels: actionable (Items in the backlog waiting for an appropriate impl/fix) · enhancement (New feature or request) · good first issue (Good for newcomers) · Llama 3.2- Multimodal (Issues related to Multimodal of Llama3.2)

@Olivia-liu (Contributor) commented Oct 29, 2024

🚀 The feature, motivation and pitch

Please note: since the actual implementation is going to be simple and the design has already been reviewed, the purpose of this GitHub issue is to find a developer to implement the feature ASAP.

LLM eval refers to the process of assessing the perplexity, performance, and capabilities of LLMs, usually by having the model complete one or a series of tasks and assigning it scores. Torchchat already uses EleutherAI’s lm-evaluation-harness to run eval on text LLMs (code pointer). Recently, torchtune worked with EleutherAI to enable eval on text-image models in the harness, and integrated this feature into torchtune (code pointer). Torchchat wants to reuse that solution from torchtune for text-image models.

Without the ability to do eval on multimodal LLMs, the enablement of multimodal LLMs on torchchat is incomplete. It’s critical to understand how well torchchat performs with image inputs.

Additional context

Assumptions

  • The eval for text LLMs is already enabled on torchchat. Code pointer to the core eval function and the main function.
  • The Llama 3.2-11b multimodal model has been onboarded to torchchat, and in the future there will be more multimodal LLMs on torchchat.
  • EleutherAI’s lm-evaluation-harness has enabled eval on llama3.2-11b, so we don’t need to make code changes in the EleutherAI repo.

The Main Goal

A torchchat user can run eval on the llama 3.2-11b model (which is image-text-in, text-out). Note that we don’t need to worry about the internals of how the eval happens, because we will only be calling EleutherAI’s eval libraries and reporting the metrics they return.

The user interface will be the command line: python torchchat.py eval <model-name>, with additional arguments specifying detailed requirements for the eval tasks (an example invocation follows the list below).

The results will be printed to the terminal and include the following metrics:

  • Tasks that have been run
  • The score for each task
  • The time it took to run each task
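
For illustration, an invocation might look like the following. The --tasks and --limit flags are assumed to mirror the existing text eval CLI, and the task name here is only a placeholder, not a committed choice:

    python torchchat.py eval llama3.2-11b --tasks mmmu_val_science --limit 10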

RFC (Optional)

Design

Overview

In this design, the multimodal eval in torchchat will borrow from the implementation of multimodal eval in torchtune which utilizes EleutherAI’s lm-evaluation-harness. The reason we can do this is that torchchat uses the same Llama 3.2-11b model definition as torchtune.

Details

The Core Eval Implementation

[Preferred] Approach A: import the implementation of HFMultimodalLM from torchtune directly

The easiest implementation is to import torchtune’s wrapper (which implements HFMultimodalLM) directly, then call evaluate() with an instance of this wrapper class passed in.

Here’s torchtune’s implementation of HFMultimodalLM: code pointer.

Pseudocode:

# In eval.py (pseudocode)
from torchtune.recipes.eleuther_eval import _VLMEvalWrapper
from lm_eval.evaluator import evaluate

if model is text-based:
    # keep the existing text-based model eval path
    ...
elif model is text-image-based:
    eval_results = evaluate(_VLMEvalWrapper(...), ...)

The pros and cons of this solution are discussed in the “Alternatives Discussion” section below. This is the solution to start with, given how quickly it can enable multimodal eval on torchchat. If for some unforeseen reason it doesn’t work, fall back to the following approach, which requires more work.

Approach B: copy the implementation of HFMultimodalLM from torchtune

  1. Create a wrapper class that extends HFMultimodalLM, the abstract Hugging Face model class for multimodal models in lm-evaluation-harness. The implementation of this class can be copied from torchtune, code pointer.
  2. Call evaluate() with an instance of this wrapper class passed in.

Pseudocode:

# In eval.py
from lm_eval.models.hf_vlms import HFMultimodalLM
from lm_eval.evaluator import evaluate

class VLMEvalWrapper(HFMultimodalLM):
    ...  # implementation copied from torchtune

if model is text-based:
    # keep the existing text-based model eval path
    ...
elif model is text-image-based:
    eval_results = evaluate(VLMEvalWrapper(...), ...)
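
To make the shape of this wrapper concrete, here is a minimal sketch. It assumes the constructor arguments and overridden properties mirror torchtune’s _VLMEvalWrapper; every name below is illustrative, not a committed API:

# Hedged sketch of the wrapper skeleton; names are assumptions
from lm_eval.models.hf_vlms import HFMultimodalLM

class VLMEvalWrapper(HFMultimodalLM):
    """Adapts torchchat's multimodal model to the interface evaluate() expects."""

    def __init__(self, model, transform, device, max_seq_length=2048):
        # Deliberately skip HFMultimodalLM.__init__, which would load a
        # Hugging Face checkpoint; torchchat already has the model in memory.
        self._model = model            # torchchat's Llama 3.2-11b instance
        self._transform = transform    # tokenizer + image transform
        self._device = device
        self._max_seq_length = max_seq_length

    @property
    def model(self):
        return self._model

    @property
    def max_length(self):
        return self._max_seq_length

    # The remaining overrides (tok_encode, tok_decode, batching, and the
    # multimodal generation path) would be copied or adapted from torchtune.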

The Commandline Arguments

The user command should be python torchchat.py eval llama3.2-11b plus some optional arguments.

In terms of implementation, reuse the same CLI entry points as the text eval: torchchat.py and eval.py. Then, in def eval(), use an if-else to decide which eval wrapper (GPTFastEvalWrapper or the new VLMEvalWrapper) to use based on the model type, as sketched below.
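
A minimal sketch of that dispatch, assuming a hypothetical model_is_multimodal() helper and the wrapper constructors used elsewhere in this design (argument lists are illustrative):

# Sketch of the dispatch in eval.py; helper and argument names are assumptions
from lm_eval.evaluator import evaluate
from lm_eval.tasks import get_task_dict

def eval(model, tokenizer, tasks, device, limit=None, max_seq_length=2048):
    if model_is_multimodal(model):  # hypothetical helper keyed off model config
        # for multimodal models the tokenizer acts as the full transform
        wrapper = VLMEvalWrapper(model, tokenizer, device, max_seq_length)
    else:
        wrapper = GPTFastEvalWrapper(model, tokenizer, max_seq_length, device)
    # evaluate() returns a dict of per-task metrics that we print afterwards
    return evaluate(wrapper, get_task_dict(tasks), limit=limit)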

Alternatives Discussion

Pros and cons of importing torchtune’s implementation directly

Pros:

  1. Easy to implement because it’s just an import
  2. Consistency between torchchat and torchtune
  3. Easy maintenance for us
  4. Torchtune has a better relationship with EleutherAI

Cons:

  1. Hard to customize the implementation for torchchat’s needs
  2. For some models, we use model definitions that are different from torchtune’s
  3. We rely on torchtune maintaining compatibility on their side
  4. It increases our dependency on torchtune

Testing & Tooling Plan

Run the command python torchchat.py eval llama3.2-11b with different parameter combinations.

The expected output is the list of tasks that have been run, their scores, and the time it took to run each task.

@Olivia-liu added the enhancement, good first issue, actionable, and Llama 3.2- Multimodal labels Oct 29, 2024
@Gasoonjia (Contributor) commented

Thanks Olivia for the RFC!

I would like to offer a third option: creating our own VLMEvalWrapper, but instead of copying-and-pasting the implementation of HFMultimodalLM from torchtune, making it inherit from HFMultimodalLM.

I think this approach has the following benefits:

  1. Easy to implement. For the first version we can make VLMEvalWrapper a simple wrapper on top of HFMultimodalLM without any other add-ons, so at the very beginning it is little more than an import (see the sketch after this list).
  2. Easy for us to maintain, and it deduplicates code across different repos.
  3. Keeps torchchat as lean as possible. The starting point of torchchat should always be a small codebase showcasing the ability to run large language models (LLMs) seamlessly.
  4. Keeps the ability to customize. We can always add new functions, or even override existing ones, for our own model definitions or other purposes.
  5. Keeps eval.py lean and easy to maintain. Personally, I would like to eliminate as many if-else statements as possible in the eval process, to avoid bloated code and unclear logic (that’s what we plan to do in generate.py and build.py). Creating our own VLMEvalWrapper can help us build structured eval logic (e.g., absorbing the current text-only eval logic into the class).
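
A first version along these lines could be as small as the sketch below (nothing here is final API; it only illustrates point 1):

# Thin first version: inherit from the harness class, add nothing yet
from lm_eval.models.hf_vlms import HFMultimodalLM

class VLMEvalWrapper(HFMultimodalLM):
    # Start as a pure subclass; add torchchat-specific overrides
    # (model definitions, eval logic) incrementally as needs appear.
    pass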

Please let me know how that feels.

@Vishnu-sai-teja commented
I would like to take this up and can try to help out with this.

@Olivia-liu (Contributor, Author) commented

> [quotes @Gasoonjia’s third-option proposal above]

I think this makes a lot of sense! Thanks for writing it up. Let’s prefer this over Approach B above. I’d still prefer to get Approach A working first if possible, given how simple it can be.

@Olivia-liu (Contributor, Author) commented

> I would like to take this up and can try to help out with this.

@Vishnu-sai-teja That'd be awesome! Please go ahead and take it. Looking forward to it!

@Olivia-liu (Contributor, Author) commented


@Vishnu-sai-teja Once you have an ETA for a PR, please kindly let us know!

@Jack-Khuu added the RFC (Request for Comment) label Oct 30, 2024
@Vishnu-sai-teja commented
Hi @Olivia-liu,

I plan to submit the initial PR in 3-4 days. Here's the brief timeline:

Day 1-2: Study existing eval implementation and HFMultimodalLM class
Day 3-4: Implement VLMEvalWrapper and integrate with eval.py, along with basic testing

As I’m new to the codebase, this timeline gives me room to thoroughly understand it while ensuring a quality implementation. Let me know if any adjustments are needed.

Thanks!

@Jack-Khuu (Contributor) commented

Sounds great, thanks for contributing!

@Olivia-liu is OOO for a bit, so I can help with any questions/blockers you might run into

@Gasoonjia (Contributor) commented

You are also welcome to join our Slack channel to chat with us.

Please see https://github.com/pytorch/torchchat?tab=readme-ov-file#community-contributions for more info.

@byjlw (Contributor) commented Nov 5, 2024

@Vishnu-sai-teja, hey checking in! Still able to take this one on?

@Vishnu-sai-teja commented

> @Vishnu-sai-teja, hey checking in! Still able to take this one on?

Hey, I tried to implement it both ways, but I’m getting errors in the torchtune imports while running the evaluation for torchchat.

@byjlw (Contributor) commented Nov 6, 2024

> Hey, I tried to implement it both ways, but I’m getting errors in the torchtune imports while running the evaluation for torchchat.

Easiest to discuss if you join the slack channel.

But we can do it here if that works best.

Can you share a link to your branch, the commands you ran, and the full output you got?

@Gasoonjia (Contributor) commented Nov 11, 2024

Hi @Vishnu-sai-teja, just wanted to check if everything is OK here. If you need help, feel free to share your branch or any other info here, or use the Slack channel to reach us.

@Olivia-liu removed the RFC (Request for Comment) label Nov 12, 2024
@Olivia-liu changed the title from “RFC: Multimodal Eval Enablement (Looking for Developer to Implement Design)” to “Multimodal Eval Enablement (Looking for Developer to Implement Design)” Nov 12, 2024
@Olivia-liu (Contributor, Author) commented Nov 12, 2024

Looking for new assignee(s) of this Issue. Is anyone interested in taking it?
