GenAI Evaluation Framework

A comprehensive testing framework for LLMs and their serving implementations, providing automated sanity checks for version mismatches, response consistency, and performance validation across different model implementations.

Overview

The GenAI Evaluation Framework ensures:

  • Version Consistency: Compare outputs between different versions of LLM models and servers
  • Request Consistency: Compare responses across multiple requests to the same endpoint
  • Embedding Alignment: Validate embedding consistency across platforms
  • LM Evaluation: Run language model accuracy checks using lm-evaluation-harness
  • Multimodality Tests: Test model handling of images with different dimensions and quantities

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd genai-eval

# Install as a package
pip install -e .

Basic Usage

Using the CLI (Recommended):

# Use the `.sanity-check.yaml` file to configure hyperparameters.
genai-eval evaluation      # Run the full evaluation suite (vllm_completion, csv_similarity, chat_consistency)
genai-eval vllm_completion # Run the vLLM completion check
genai-eval semantic_similarity  # Run the semantic similarity check
genai-eval chat_consistency # Run the chat consistency check
genai-eval lm_eval # Run the LM evaluation harness check
genai-eval image_size_check # Run the handling of images with different dimensions check
genai-eval image_number_check # Run the handling of multiple images in a single request check

Using Individual Scripts:

# Version comparison
python3 genai_evaluation/version_consistency/vllm_completion_sanity_check.py \
  --total-runs 3 \
  --output-folder /Users/results \
  --vllm-old-version 0.7.3.post1 --vllm-new-version 0.9.1 \
  --vllm-old-api http://localhost:8081/v1/chat/completions \
  --vllm-new-api http://localhost:8080/v1/chat/completions \
  --model-name /models/Llama-3-70B-Instruct

# Semantic similarity analysis
python3 genai_evaluation/version_consistency/semantic_similarity_check.py \
  --folder ./results/vllm_completion_detailed_report_20250805_162453 \
  --old-version-col "vllm 0.7.3.post1 output" \
  --new-version-col "vllm 0.9.1 output" \
  --mismatch-file "0.9.1_0.7.3.post1_all_run_details.csv"

# Request consistency
python genai_evaluation/request_consistency/openai_chat_consistency_check.py

# Check handling of images with different dimensions
python genai_evaluation/multimodality/image_size_check.py

# Check handling of multiple images in a single request
python genai_evaluation/multimodality/image_number_check.py

LM Evaluation Harness Integration:

The framework includes direct integration with lm-eval-harness, which allows evaluating language models on various benchmark tasks.

python3 genai_evaluation/external/lm-eval-harness/lm_eval_harness.py \
  --model_args "model=vllm-model,base_url=http://localhost:8081/v1/completions,tokenizer=/path/to/tokenizer" \
  --tasks mmlu \
  --output_path lm-eval-results

Common benchmark tasks include arc_challenge, hellaswag, truthfulqa_mc, winogrande, gsm8k, and mmlu. Multiple tasks can be specified as a comma-separated list, e.g. --tasks mmlu,gsm8k.

See genai_evaluation/external/lm-eval-harness/README.md for more detailed usage information.

Available Validation Checks

Default CLI Workflow

When run without specifying individual checks, the CLI executes a comprehensive three-stage workflow:

  1. vLLM Completion: Compares outputs between old and new models or servers
  2. CSV Similarity: Analyzes semantic similarity of the generated outputs
  3. Chat Consistency: Tests response consistency for each endpoint

Additionally, the following checks can be run individually:

  1. LM Eval: Runs language model benchmarks using lm-evaluation-harness
  2. Image Size Check: Tests whether the model can process images of different dimensions
  3. Image Number Check: Tests whether the model can process multiple images in a single request

Individual Validation Scripts

Script                               Purpose
semantic_similarity_check.py         Analyzes semantic similarity of the generated outputs
vllm_completion_sanity_check.py      Compares outputs between old and new models or servers
openai_chat_consistency_check.py     Tests response consistency for each endpoint
openai_embeddings_sanity_check.py    Compares embeddings from the same model hosted on HuggingFace and OpenAI
lm_eval_harness.py                   Runs lm-eval-harness benchmarks for model assessment
image_size_check.py                  Tests whether the model can process images of different dimensions
image_number_check.py                Tests whether the model can process multiple images in a single request

CLI Check Options

# List all available checks
python sanity_check_cli/cli.py check list

# Run specific individual checks
python sanity_check_cli/cli.py check run --checks csv_similarity
python sanity_check_cli/cli.py check run --checks vllm_completion  
python sanity_check_cli/cli.py check run --checks chat_consistency
python sanity_check_cli/cli.py check run --checks embeddings_check
python sanity_check_cli/cli.py check run --checks lm_eval
python sanity_check_cli/cli.py check run --checks image_size_check
python sanity_check_cli/cli.py check run --checks image_number_check

Command Reference

CLI Commands

# Basic usage
python sanity_check_cli/cli.py --help                    # Show help
python sanity_check_cli/cli.py check list                # List available checks
python sanity_check_cli/cli.py check providers           # List configured providers

# Advanced options
python sanity_check_cli/cli.py check run \
  --old-endpoint http://server1:8000/v1/chat/completions \
  --new-endpoint http://server2:8000/v1/chat/completions \
  --format json,csv,table,chart \
  --output-dir ./custom-results

Parameter Reference

Parameter        Description                    Default
--num-runs       Number of runs per prompt      5
--max-tokens     Maximum tokens per response    100
--temperature    Model temperature              1.0
--threshold      Similarity threshold           0.8
--verbose, -v    Enable verbose output          False

Script Output Files

vllm_completion_sanity_check.py

  • {new_version}_{old_version}_all_run_details.csv: Complete comparison data between the new and old versions; every run is documented as a row.

    • Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {old_version} output, {new_version} top 5 token logprobs, {old_version} top 5 token logprobs
  • {new_version}_{old_version}_mismatch.csv: Mismatch data between the new and old versions; contains only records where a mismatch occurred and the new version's output appeared for the first time.

    • Columns: Same as all_run_details.csv, but only containing records where outputs differ
  • inconsistency.csv: Cases where the new version's output in the current run differs from its output in the previous run.

    • Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {new_version} previous run, {new_version} top 5 token logprobs, {new_version} previous run top 5 token logprobs
  • summary_overall.csv: Aggregated metrics across all runs

    • Columns: metric_name, value, percentage
    • Metrics include: Total Cases, Mismatch Cases, Inconsistency Cases, Average Mismatch First Diff Tokens, Average Inconsistency First Diff Tokens
  • summary_per_prompt.csv: Detailed metrics broken down by individual prompts (see the sketch after this list for loading this file)

    • Columns: prompt_idx, parameters, accuracy, consistency, total_runs, min_first_diff_tokens, max_first_diff_tokens, avg_first_diff_tokens, mode_first_diff_tokens, frequency
    • Metrics include: accuracy (percentage of runs without mismatch), consistency (percentage of runs without inconsistency), min/max/avg/mode first diff tokens (for consistency), and frequency (of the first diff token)
  • accuracy hist.png: Histogram visualization of accuracy across prompts.

  • consistency hist.png: Histogram visualization of consistency across prompts.
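
The summary files are plain CSVs and can be consumed programmatically. The following is a minimal sketch, not part of the framework, that loads summary_per_prompt.csv with pandas and flags prompts whose accuracy or consistency falls below a cut-off; the 0.8 value mirrors the default --threshold from the parameter reference, and the report path reuses the example folder shown earlier in this README.

# Minimal sketch (not part of the framework): flag low-scoring prompts in
# summary_per_prompt.csv produced by vllm_completion_sanity_check.py.
# Column names come from the list above; the folder path is an example.
import pandas as pd

report_dir = "./results/vllm_completion_detailed_report_20250805_162453"
df = pd.read_csv(f"{report_dir}/summary_per_prompt.csv")

# Assumes accuracy/consistency are stored as fractions in [0, 1]; adjust if they are percentages.
threshold = 0.8
flagged = df[(df["accuracy"] < threshold) | (df["consistency"] < threshold)]

print(f"{len(flagged)} of {len(df)} prompts fall below the {threshold} cut-off")
print(flagged[["prompt_idx", "accuracy", "consistency", "avg_first_diff_tokens"]].to_string(index=False))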

semantic_similarity_check.py

Enhances existing CSV files (usually the mismatch file) with additional semantic metrics (see the sketch after this list for the kind of scoring involved):

  • Added columns:

    • Similarity Score: Numeric similarity score between outputs (0-1)
    • Semantic Similarity: Pass/Fail classification based on configured threshold
    • average similarity score: Mean similarity across all comparisons
    • similarity score std: Standard deviation of similarity scores
    • confidence interval: Statistical confidence interval for similarity scores
  • similarity score hist.png: Histogram visualization of similarity scores across prompts.
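
The README does not pin down which embedding model or metric produces the Similarity Score; the snippet below is only a hedged illustration of the kind of 0-1 scoring and Pass/Fail labelling described above, using sentence-transformers as an assumed backend and the default 0.8 threshold from the parameter reference.

# Illustration only: one way to produce a 0-1 score and a Pass/Fail label like
# the "Similarity Score" / "Semantic Similarity" columns above. The embedding
# model is an assumption, not necessarily what semantic_similarity_check.py uses.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

old_output = "The capital of France is Paris."  # e.g. the "vllm 0.7.3.post1 output" column
new_output = "Paris is the capital of France."  # e.g. the "vllm 0.9.1 output" column

embeddings = model.encode([old_output, new_output], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

threshold = 0.8  # default --threshold from the parameter reference
label = "Pass" if score >= threshold else "Fail"
print(f"Similarity Score: {score:.3f}  Semantic Similarity: {label}")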

openai_chat_consistency_check.py

  • all_run_details_{version}.csv: Details of all runs for a specific model or server version

    • Columns: test_type (sequential/concurrent), endpoint, server_version, prompt_idx, prompt, run_idx, response
  • summary_overall_{version}.csv: Aggregated metrics across all runs

    • Columns: test_type, consistent_count, total_count, consistent_rate, server_version
  • summary_per_prompt_{version}.csv: Prompt-level metrics

    • Columns: test_type, endpoint, server_version, prompt_idx, prompt, consistent_count, total_count, consistent_ratio
  • consistent_ratio hist.png: Histogram visualization of consistency ratios across prompts
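
Conceptually, the consistency numbers above come from sending the same prompt to one endpoint several times and comparing the responses. The sketch below illustrates that idea with the openai Python client; the endpoint URL, API key, model name, and exact-string comparison are assumptions for illustration, not the script's actual logic.

# Illustration of the request-consistency idea: send one prompt N times to a
# single OpenAI-compatible endpoint and count the distinct responses.
# Endpoint, API key, model name, and exact-string comparison are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
prompt = "Name the three primary colors."
num_runs = 5  # matches the default --num-runs

responses = []
for _ in range(num_runs):
    completion = client.chat.completions.create(
        model="/models/Llama-3-70B-Instruct",  # model name reused from the earlier example
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=100,
    )
    responses.append(completion.choices[0].message.content)

distinct = len(set(responses))
print(f"{distinct} distinct response(s) across {num_runs} runs -> consistent: {distinct == 1}")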

lm_eval_harness.py

  • lm-eval-results/<model_name>/lmeval_results_<timestamp>.csv: Processed evaluation results
    • Columns: task, n_shot, n_samples_effective, plus all metrics from the evaluation (e.g., acc_norm)
    • One row per task, or multiple rows when using comma-separated tasks
  • lm-eval-results/<model_name>/*.json: Raw benchmark results from lm-eval-harness

image_size_check.py

  • image_size_check_{server_engine}_{server_version}.csv: CSV file containing test results
    • Columns: image_dimensions, width, height, success, response
    • Each row represents results for a different image dimension
    • success indicates whether the model processed the image correctly
    • response contains the model's output or error message

image_number_check.py

  • image_number_check_{server_engine}_{server_version}.csv: CSV file containing test results
    • Columns: num_images, image_size, success, response
    • Each row represents a test with a different number of images
    • success indicates whether the model processed all images
    • response contains the model's output describing the images or error message
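
Both image checks boil down to attaching generated images to OpenAI-style chat requests. The sketch below only illustrates that general pattern using base64 data URLs; the endpoint, model name, image size, image count, and prompt are placeholder assumptions, and the actual request construction lives in the scripts themselves.

# Sketch of the pattern behind the image checks: attach several generated
# images of one size to a single OpenAI-style chat request as data URLs.
# Endpoint, model name, image size, image count, and prompt are placeholders.
import base64
import io

from openai import OpenAI
from PIL import Image

def make_image_data_url(width: int, height: int) -> str:
    """Create a blank PNG of the given size and return it as a base64 data URL."""
    buffer = io.BytesIO()
    Image.new("RGB", (width, height), color="white").save(buffer, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
num_images, width, height = 3, 512, 512  # placeholder values

content = [{"type": "text", "text": "Describe each image you receive."}]
content += [
    {"type": "image_url", "image_url": {"url": make_image_data_url(width, height)}}
    for _ in range(num_images)
]

response = client.chat.completions.create(
    model="/models/Llama-3-70B-Instruct",  # reused from the earlier example
    messages=[{"role": "user", "content": content}],
    max_tokens=100,
)
print(response.choices[0].message.content)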

Project Structure

genai-eval/
├── README.md                           # This comprehensive guide
├── requirements.txt                    # Python dependencies
├── pyproject.toml                      # Project configuration
├── .sanity-check.yaml                  # Default configuration
│
├── sanity_check_cli/                   # Modern CLI interface
│   ├── cli.py                          # Main CLI entry point
│   ├── commands/                       # Command implementations
│   │   ├── check.py                    # Check command logic
│   │   └── report.py                   # Report generation
│   ├── runners/                        # Test execution engines
│   │   ├── vllm_completion.py          # vLLM comparison runner
│   │   ├── csv_similarity.py           # Similarity analysis runner
│   │   ├── chat_consistency.py         # Consistency test runner
│   │   ├── embeddings_check.py         # Embedding validation runner
│   │   ├── image_size_check.py         # Image dimensions runner
│   │   ├── image_number_check.py       # Multiple images runner
│   │   └── lm_eval.py                  # LM evaluation harness runner
│   ├── clients/                        # API client implementations
│   │   ├── openai_client.py            # OpenAI API client
│   │   └── vllm_client.py              # vLLM API client
│   ├── output/                         # Output formatting
│   │   ├── charts.py                   # Chart generation
│   │   ├── reports.py                  # Report generation
│   │   └── tables.py                   # Table formatting
│   └── utils/                          # Utility functions
│       ├── config.py                   # Configuration handling
│       └── results.py                  # Result processing
│
└── genai_evaluation/                   # Core evaluation scripts
    ├── version_consistency/            # Version comparison
    │   ├── vllm_completion_sanity_check.py
    │   └── semantic_similarity_check.py
    ├── request_consistency/            # Request consistency
    │   ├── openai_chat_consistency_check.py
    │   ├── openai_embeddings_sanity_check.py
    │   └── test_data.py                # Test prompts and data
    ├── image_check/                    # Multimodality testing
    │   ├── image_size_check.py         # Tests different image dimensions
    │   └── image_number_check.py       # Tests multiple images in requests
    └── external/                       # External integrations
        ├── cohere_sanity_check/
        └── lm-eval-harness/             # lm-eval-harness integration
            ├── README.md                # Usage documentation
            └── lm_eval_harness.py       # Implementation
