Lisa Dunlap, Krishna Mandal, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez
Paper link here, joke version of paper coming soon
Still cleaning this up: I got distracted trying to implement some causal inference stuff...
- Link to chatbot arena data
- Human VS GPT (HC3)
- HELM Predictions (fair warning, this is a real pain to download)
- (Recommended) Create a new conda environment.
conda create -n myenv python=3.10 -y
conda activate myenv
- Installation (please make a PR if I forgot any imports!)
pip install -r requirements.txt
- Create a Weights & Biases account if you don't already have one
- Copy this into a file named serve/global_vars.py and set your OpenAI key
# LLM API (if you want to use a local LLM, use vLLM)
LLAMA_URL = "http://localhost:8001/v1"
VICUNA_URL = "http://localhost:8001"
LLM_CACHE_FILE = "cache/cache_llm"
LLM_EMBED_CACHE_FILE = "cache/cache_llm_embed"
OPENAI_API_KEY = "put your key here"
ANTHROPIC_API_KEY = "put your key here"
- Run a config
python main.py --config configs/base.yaml wandb=True
This runs a toy example on LLM outputs: one model is prompted to be friendly, the other to be cold and factual. I randomly assigned preferences so that the friendly outputs are favored 80% of the time.
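For intuition, here is a minimal sketch of how an 80/20 random preference assignment like the one described above could be simulated. The function name, labels, and seed are hypothetical; this is not the script used to build the toy data.

```python
import random

def assign_preferences(n, favored="friendly", other="cold", p_favored=0.8, seed=0):
    """Randomly label n samples, picking `favored` with probability p_favored.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    return [favored if rng.random() < p_favored else other for _ in range(n)]

prefs = assign_preferences(1000)
share = prefs.count("friendly") / len(prefs)
print(f"friendly preferred in {share:.1%} of samples")
```

With enough samples the friendly share lands close to 80%, which gives the pipeline a known signal to recover.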
All data needs to contain the columns "question", model_name_1, model_name_2, and optionally "preference". If the preference column is not provided, running main.py will compute the preference using an LLM as a judge (warning: the judge LLMs are hardcoded in the file).
Say your two models are gpt-4o and gemini-1.5-flash. Your CSV should have the columns "question", "gpt-4o", and "gemini-1.5-flash"; in your config, set your data path and set models: [gpt-4o, gemini-1.5-flash]. Sometime soon I will add an option to only optimize for model matching if you only care to find differentiating qualities, so get excited for that.
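As a rough illustration of the CSV layout for the gpt-4o / gemini-1.5-flash example, the sketch below writes one row and checks that the required columns are present. The `check_columns` helper and the sample row are hypothetical and not part of the repo; it uses only the standard library.

```python
import csv
import io

MODELS = ["gpt-4o", "gemini-1.5-flash"]

def check_columns(fieldnames, models):
    """Raise if a required column is missing; return True if 'preference' exists.

    Hypothetical helper for illustration only.
    """
    required = ["question", *models]
    missing = [c for c in required if c not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return "preference" in fieldnames

# One made-up row: the question plus one output column per model.
rows = [
    {
        "question": "What is the capital of France?",
        "gpt-4o": "Paris.",
        "gemini-1.5-flash": "The capital of France is Paris!",
    },
]

# Write to an in-memory buffer; swap in open("my_data.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", *MODELS])
writer.writeheader()
writer.writerows(rows)

# Read it back and validate the header.
buffer.seek(0)
reader = csv.DictReader(buffer)
has_preference = check_columns(reader.fieldnames, MODELS)
print("preference column present:", has_preference)
```

Since this file has no "preference" column, preferences would be computed by the LLM judge as described above.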
This code structure is loosely modeled off the VisDiff repo
Here are the core components:
- Proposer: takes in (prompt, output_a, output_b) triplets and returns a list of axes
- Reducer: takes a long list of axes and returns a shorter list of representative axes
- Ranker: takes in a triplet and an axis and produces a score
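To make the data flow between the three components concrete, here is a toy sketch. The real components call LLMs; the functions below are hypothetical stand-ins that use simple string heuristics, and the score convention (+1 if output_a is higher on the axis, -1 if output_b is, 0 if tied) is an assumption for illustration.

```python
from collections import Counter

def propose(prompt, output_a, output_b):
    """Toy proposer: return candidate axes on which the two outputs differ."""
    axes = []
    if len(output_a.split()) != len(output_b.split()):
        axes.append("Verbosity")
    if output_a.count("!") != output_b.count("!"):
        axes.append("enthusiasm")
    return axes

def reduce_axes(axes, top_k=5):
    """Toy reducer: merge case-insensitive duplicates, keep the most frequent."""
    counts = Counter(a.strip().lower() for a in axes)
    return [axis for axis, _ in counts.most_common(top_k)]

def rank(prompt, output_a, output_b, axis):
    """Toy ranker: +1 if output_a is higher on the axis, -1 if output_b, 0 if tied."""
    if axis == "verbosity":
        a, b = len(output_a.split()), len(output_b.split())
    elif axis == "enthusiasm":
        a, b = output_a.count("!"), output_b.count("!")
    else:
        return 0  # unknown axis: no opinion
    return (a > b) - (a < b)

# Made-up (prompt, output_a, output_b) triplets.
triplets = [
    ("Say hi", "Hello there, great to meet you!", "Hi."),
    ("Capital of France?", "It's Paris, a wonderful city!", "Paris."),
]

# Propose axes per triplet, reduce to a short list, then score every triplet on each axis.
all_axes = [axis for t in triplets for axis in propose(*t)]
axes = reduce_axes(all_axes)
scores = {axis: [rank(*t, axis) for t in triplets] for axis in axes}
print(scores)
```

The point is only the shape of the pipeline: many proposed axes go in, a short deduplicated list comes out, and each surviving axis gets one score per triplet.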
If you use this repo in your research, please cite it as follows and ideally use the word 'vibe' in said research:
@article{dunlap_vibecheck,
title={VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models},
author={Lisa Dunlap and Krishna Mandal and Trevor Darrell and Jacob Steinhardt and Joseph E Gonzalez},
journal={arXiv preprint arXiv:2410.12851},
year={2024},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.12851},
}