A comprehensive framework for aggregating, comparing, and evaluating Large Language Models (LLMs) through benchmark performance data from multiple sources.
GTLLMZoo provides a unified platform for comparing LLMs across multiple dimensions including performance, efficiency, and safety. The framework aggregates data from various benchmark sources to enable researchers, developers, and decision-makers to make informed selections based on their specific requirements.
Key features:
- Unified Benchmarks: Combines data from Open LLM Leaderboard, LLM Safety Leaderboard, LLM Performance Leaderboard, and Chatbot Arena
- Interactive UI: Intuitive filtering and selection interface built with Gradio (see the sketch after this list)
- Comprehensive Metrics: Compare models across performance, safety, efficiency, and user preference metrics
- Customizable Views: Select specific metrics and model attributes for focused comparison
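As a minimal sketch of the kind of Gradio interface involved (the data, column names, and filter choices below are hypothetical stand-ins, not the app's actual schema; see `app.py` for the real UI):

```python
# Minimal Gradio sketch: a model table filtered by a checkbox group.
# All data and column names here are illustrative placeholders.
import gradio as gr
import pandas as pd

df = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "license": ["apache-2.0", "mit", "apache-2.0"],
    "mmlu": [62.1, 70.4, 55.3],
})

def filter_models(selected_licenses):
    # Keep only rows whose license is among the selected choices.
    return df[df["license"].isin(selected_licenses)]

with gr.Blocks() as demo:
    license_filter = gr.CheckboxGroup(
        choices=["apache-2.0", "mit"],
        value=["apache-2.0", "mit"],
        label="License",
    )
    table = gr.Dataframe(value=df)
    license_filter.change(filter_models, inputs=license_filter, outputs=table)

demo.launch()
```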
Requirements:

- Python >= 3.9
- Gradio
- Pandas
- Beautiful Soup (for data scraping)
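Beautiful Soup is used to scrape leaderboard pages (see `scrape_llm_lb.py`). As a rough, generic sketch of that kind of table scraping (the URL and selectors below are placeholders, not the real targets):

```python
# Generic table-scraping sketch with Beautiful Soup.
# The URL is a placeholder; scrape_llm_lb.py holds the actual scraping logic.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/leaderboard").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr"):
    # Collect the text of every header/data cell in the row.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)
```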
Installation:

```bash
git clone https://github.com/git-disl/GTLLMZoo.git
cd GTLLMZoo
pip install -r requirements.txt
```
To run the application locally:

```bash
python app.py
```

For development with hot reloading:

```bash
gradio app.py
```
Compare LLMs based on:
- Basic Information: Model name, parameter count, hub popularity
- Benchmark Performance: Scores on ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
- Efficiency Metrics: Prefill time, decode speed, memory usage, energy efficiency
- Safety Metrics: Non-toxicity, non-stereotype, fairness, ethics
- Arena Performance: Chatbot Arena ranking, Elo scores, user votes
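For example, assuming the merged table `merged.csv` uses columns along these lines (the names are illustrative, not the file's guaranteed schema), the metrics can be loaded and ranked with pandas:

```python
# Load the merged leaderboard data and focus on a few metrics.
# Column names are assumptions about merged.csv's schema.
import pandas as pd

df = pd.read_csv("merged.csv")
columns = ["model", "mmlu", "gsm8k", "elo"]  # hypothetical column names
print(df[columns].sort_values("elo", ascending=False).head(10))
```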
Filter models by:
- Model type
- Architecture
- Precision
- License
- Weight type
Export filtered data to CSV for further analysis.
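A minimal sketch of that filter-then-export flow, again with assumed column names:

```python
# Filter models by attributes and export the result to CSV.
# Column names and values are illustrative assumptions.
import pandas as pd

df = pd.read_csv("merged.csv")
mask = (df["license"] == "apache-2.0") & (df["precision"] == "float16")
df[mask].to_csv("filtered_models.csv", index=False)
```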
GTLLMZoo aggregates data from the Open LLM Leaderboard, the LLM Safety Leaderboard, the LLM Performance Leaderboard, and Chatbot Arena.
Project structure:

- `app.py`: Main Gradio UI application
- `leaderboard.py`: Functions to load and process leaderboard data
- `control.py`: UI control callbacks and filtering functions
- `data_structures.py`: Data structure definitions for LLMs and datasets
- `utils.py`: Utility functions and enum classes
- `scrape_llm_lb.py`: Scripts to scrape the latest leaderboard data
- `merge.py`: Functions to merge data from different sources
- `assets.py`: Custom CSS and UI assets
- `llm.json`: LLM metadata
- `dset.json`: Dataset information
- `merged.csv`: Merged data from all leaderboards
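As a rough illustration of the kind of join `merge.py` performs (the real sources and keys may differ), leaderboard tables can be merged on a shared model identifier:

```python
# Sketch: outer-join two leaderboard tables on a shared model name.
# Illustrative only; merge.py defines the actual sources and keys.
import pandas as pd

perf = pd.DataFrame({"model": ["model-a", "model-b"], "mmlu": [62.1, 70.4]})
safety = pd.DataFrame({"model": ["model-a", "model-c"], "non_toxicity": [0.91, 0.88]})

# Outer join keeps models that appear in only one source, with NaNs elsewhere.
merged = perf.merge(safety, on="model", how="outer")
merged.to_csv("merged.csv", index=False)
```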
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Project Link: https://github.com/git-disl/GTLLMZoo
- HuggingFace for hosting the original leaderboards
- All benchmark creators and maintainers
- The open-source LLM community