GTLLMZoo 🦙

A comprehensive framework for aggregating, comparing, and evaluating Large Language Models (LLMs) through benchmark performance data from multiple sources.


📋 Overview

GTLLMZoo provides a unified platform for comparing LLMs across multiple dimensions including performance, efficiency, and safety. The framework aggregates data from various benchmark sources to enable researchers, developers, and decision-makers to make informed selections based on their specific requirements.

Key features:

  • Unified Benchmarks: Combines data from Open LLM Leaderboard, LLM Safety Leaderboard, LLM Performance Leaderboard, and Chatbot Arena
  • Interactive UI: Intuitive filtering and selection interface built with Gradio
  • Comprehensive Metrics: Compare models across performance, safety, efficiency, and user preference metrics
  • Customizable Views: Select specific metrics and model attributes for focused comparison

🚀 Getting Started

Prerequisites

  • Python >= 3.9
  • Gradio
  • Pandas
  • Beautiful Soup (for data scraping)

Installation

```bash
git clone https://github.com/git-disl/GTLLMZoo.git
cd GTLLMZoo
pip install -r requirements.txt
```

Running the Application

To run the application locally:

```bash
python app.py
```

For development with hot reloading:

```bash
gradio app.py
```

🔍 Features

LLM Comparison Tab

Compare LLMs based on:

  • Basic Information: Model name, parameter count, hub popularity
  • Benchmark Performance: Scores on ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
  • Efficiency Metrics: Prefill time, decode speed, memory usage, energy efficiency
  • Safety Metrics: Non-toxicity, non-stereotype, fairness, ethics
  • Arena Performance: Chatbot Arena ranking, Elo scores, user votes
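
The same comparison can be reproduced offline with pandas against the merged table the repo ships. A minimal sketch, assuming you run it from the repository root; the benchmark column names below are illustrative and may differ from the actual headers in merged.csv:

```python
import pandas as pd

# Load the merged leaderboard table shipped with the repository.
df = pd.read_csv("merged.csv")

# Inspect the real headers first -- the column name below is a guess.
print(df.columns.tolist())

if "mmlu" in df.columns:
    # Rank the ten strongest models on MMLU, highest score first.
    print(df.sort_values("mmlu", ascending=False).head(10))
```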

Control Panel

Filter models by:

  • Model type
  • Architecture
  • Precision
  • License
  • Weight type
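
These filters amount to boolean masks over the merged table. A rough pandas equivalent, assuming merged.csv carries columns such as license and precision (hypothetical names; adjust to the real headers):

```python
import pandas as pd

df = pd.read_csv("merged.csv")

# Build up a combined mask; each attribute check is skipped if the
# hypothetical column is absent from this version of the data.
mask = pd.Series(True, index=df.index)
if "license" in df.columns:
    mask &= df["license"].isin(["apache-2.0", "mit"])
if "precision" in df.columns:
    mask &= df["precision"].eq("float16")

filtered = df[mask]
print(f"{len(filtered)} of {len(df)} models match the filters")
```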

Data Export

Export filtered data to CSV for further analysis.
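
Since the export is a plain CSV, the same artifact can be produced headlessly; a minimal sketch (filtered_llms.csv is an arbitrary output name):

```python
import pandas as pd

df = pd.read_csv("merged.csv")
# ... apply whatever filters you need here ...
df.to_csv("filtered_llms.csv", index=False)
```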

📊 Data Sources

GTLLMZoo aggregates data from:

  • Open LLM Leaderboard (benchmark performance)
  • LLM Safety Leaderboard (safety and alignment metrics)
  • LLM Performance Leaderboard (inference efficiency)
  • Chatbot Arena (crowd-sourced rankings, Elo scores, and user votes)

🏗️ Project Structure

  • app.py: Main Gradio UI application
  • leaderboard.py: Functions to load and process leaderboard data
  • control.py: UI control callbacks and filtering functions
  • data_structures.py: Data structure definitions for LLMs and datasets
  • utils.py: Utility functions and enum classes
  • scrape_llm_lb.py: Scripts to scrape latest leaderboard data
  • merge.py: Functions to merge data from different sources
  • assets.py: Custom CSS and UI assets
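
As a rough illustration of how these pieces fit together, here is a minimal Gradio layout in the spirit of app.py; it uses the real Gradio API but is a sketch, not the project's actual code:

```python
import gradio as gr
import pandas as pd

df = pd.read_csv("merged.csv")

with gr.Blocks(title="GTLLMZoo") as demo:
    with gr.Tab("LLM Comparison"):
        table = gr.Dataframe(value=df, interactive=False)
    with gr.Tab("Control Panel"):
        cols = gr.CheckboxGroup(
            choices=df.columns.tolist(),
            value=df.columns.tolist(),
            label="Columns to show",
        )
        # Narrow the displayed table whenever the selection changes.
        cols.change(lambda selected: df[selected], inputs=cols, outputs=table)

demo.launch()
```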

💾 Data Files

  • llm.json: LLM metadata
  • dset.json: Dataset information
  • merged.csv: Merged data from all leaderboards
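
All three files load with the standard library and pandas (paths assume the repository root as the working directory):

```python
import json
import pandas as pd

with open("llm.json") as f:
    llms = json.load(f)              # per-model metadata
with open("dset.json") as f:
    dsets = json.load(f)             # benchmark dataset descriptions

merged = pd.read_csv("merged.csv")   # joined view across all leaderboards
print(f"{len(merged)} rows, {len(merged.columns)} columns in merged.csv")
```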

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (`git checkout -b feature/amazing-feature`)
  3. Commit your changes (`git commit -m 'Add some amazing feature'`)
  4. Push to the branch (`git push origin feature/amazing-feature`)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📬 Contact

Project Link: https://github.com/git-disl/GTLLMZoo

🙏 Acknowledgements

  • HuggingFace for hosting the original leaderboards
  • All benchmark creators and maintainers
  • The open-source LLM community
