A comprehensive testing framework for LLMs and their serving stacks, providing automated sanity checks for version mismatches, response consistency, and performance validation across different model implementations.
The GenAI Evaluation Framework provides:
- Version Consistency: Compare outputs between different versions of LLM models and servers.
- Request Consistency: Compare responses across multiple requests to the same endpoint.
- Embedding Alignment: Validate embedding consistency across platforms.
- LM Evaluation: Run language model accuracy checks using lm-evaluation-harness.
- Multimodality Tasks: Test model handling of images with different dimensions and quantities.
# Clone the repository
git clone <repository-url>
cd genai-eval
# Install as a package
pip install -e .

Using the CLI (Recommended):
# Use the `.sanity-check.yaml` file to configure hyperparameters.
genai-eval evaluation # Runs the full evaluation suite of vllm_completion, csv_similarity, chat_consistency
genai-eval vllm_completion # Run the vLLM completion check
genai-eval semantic_similarity # Run the semantic similarity check
genai-eval chat_consistency # Run the chat consistency check
genai-eval lm_eval # Run the LM evaluation harness check
genai-eval image_size_check # Run the image dimension handling check
genai-eval image_number_check # Run the multiple image handling check
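For CI pipelines, the same CLI commands can be driven from a small script. Below is a minimal sketch using subprocess, assuming `genai-eval` is installed on the PATH as shown above and that each check exits non-zero on failure (an assumption, not a documented contract).

```python
import subprocess
import sys

# Check names as documented above; trim the list to what your pipeline needs.
CHECKS = ["vllm_completion", "semantic_similarity", "chat_consistency"]

failed = []
for check in CHECKS:
    print(f"--- running {check} ---")
    # Assumes a non-zero exit code signals a failed check.
    if subprocess.run(["genai-eval", check]).returncode != 0:
        failed.append(check)

if failed:
    print("Failed checks:", ", ".join(failed))
    sys.exit(1)
print("All checks passed")
```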
Using Individual Scripts:

# Version comparison
python3 genai_evaluation/version_consistency/vllm_completion_sanity_check.py \
--total-runs 3 \
--output-folder /Users/results \
--vllm-old-version 0.7.3.post1 --vllm-new-version 0.9.1 \
--vllm-old-api http://localhost:8081/v1/chat/completions \
--vllm-new-api http://localhost:8080/v1/chat/completions \
--model-name /models/Llama-3-70B-Instruct
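The version comparison drives both servers through an OpenAI-compatible chat completions API. The sketch below shows the kind of request involved (not the script itself); the prompt, sampling parameters, and endpoint URLs are placeholders reused from the example above.

```python
import requests

ENDPOINTS = {
    "0.7.3.post1": "http://localhost:8081/v1/chat/completions",
    "0.9.1": "http://localhost:8080/v1/chat/completions",
}
PROMPT = "Explain the difference between latency and throughput."

outputs = {}
for version, url in ENDPOINTS.items():
    payload = {
        "model": "/models/Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 100,
        "temperature": 0.0,  # greedy decoding keeps the two outputs comparable
        "logprobs": True,    # request per-token logprobs
        "top_logprobs": 5,   # top-5 alternatives, as in the CSV reports
    }
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    outputs[version] = response.json()["choices"][0]["message"]["content"]

print("Outputs match" if outputs["0.9.1"] == outputs["0.7.3.post1"] else "Mismatch detected")
```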
# Semantic similarity analysis
python3 genai_evaluation/version_consistency/semantic_similarity_check.py \
--folder ./results/vllm_completion_detailed_report_20250805_162453 \
--old-version-col "vllm 0.7.3.post1 output" \
--new-version-col "vllm 0.9.1 output" \
--mismatch-file "0.9.1_0.7.3.post1_all_run_details.csv"
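The similarity analysis can be approximated with sentence embeddings. A minimal sketch follows, assuming the `sentence-transformers` package and the column names passed above; the actual script may use a different embedding model or metric.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

CSV_PATH = "0.9.1_0.7.3.post1_all_run_details.csv"
OLD_COL = "vllm 0.7.3.post1 output"
NEW_COL = "vllm 0.9.1 output"
THRESHOLD = 0.8  # default similarity threshold documented below

df = pd.read_csv(CSV_PATH)
model = SentenceTransformer("all-MiniLM-L6-v2")

old_emb = model.encode(df[OLD_COL].fillna("").tolist(), convert_to_tensor=True)
new_emb = model.encode(df[NEW_COL].fillna("").tolist(), convert_to_tensor=True)

# Cosine similarity between paired outputs (row i of old vs. row i of new).
df["Similarity Score"] = util.cos_sim(old_emb, new_emb).diagonal().cpu().numpy()
df["Semantic Similarity"] = df["Similarity Score"].apply(
    lambda s: "Pass" if s >= THRESHOLD else "Fail"
)

print(df["Similarity Score"].describe())
print(df["Semantic Similarity"].value_counts())
```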
# Request consistency
python genai_evaluation/request_consistency/openai_chat_consistency_check.py
# Check handling of images with different dimensions
python genai_evaluation/multimodality/image_size_check.py
# Check handling of multiple images in a single request
python genai_evaluation/multimodality/image_number_check.py
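Both image checks exercise a multimodal chat completions endpoint. The sketch below shows the kind of request they issue, assuming an OpenAI-compatible server that accepts base64-encoded images; the endpoint, model name, dimensions, and prompt are placeholders.

```python
import base64
import io

import requests
from PIL import Image

ENDPOINT = "http://localhost:8080/v1/chat/completions"
MODEL = "your-multimodal-model"  # placeholder model name

def encode_test_image(width: int, height: int) -> str:
    """Generate a solid-color test image and return it base64-encoded."""
    buffer = io.BytesIO()
    Image.new("RGB", (width, height), color=(128, 64, 200)).save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

for width, height in [(64, 64), (512, 512), (1024, 2048)]:
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_test_image(width, height)}"}},
            ],
        }],
        "max_tokens": 50,
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    print(width, height, "success" if resp.ok else f"failed ({resp.status_code})")
```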
LM Evaluation Harness Integration:
The framework includes direct integration with lm-eval-harness, which allows evaluating language models on various benchmark tasks.
python3 genai_evaluation/external/lm-eval-harness/lm_eval_harness.py \
--model_args "model=vllm-model,base_url=http://localhost:8081/v1/completions,tokenizer=/path/to/tokenizer" \
--tasks mmlu \
--output_path lm-eval-results

Common benchmark tasks include: arc_challenge, hellaswag, truthfulqa_mc, winogrande, gsm8k, mmlu. You can specify multiple tasks using comma-separated values.
See genai_evaluation/external/lm-eval-harness/README.md for more detailed usage information.
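For programmatic use, lm-eval-harness can also be called directly from Python rather than through the wrapper script. The sketch below uses `lm_eval.simple_evaluate` with the "local-completions" backend; backend and argument names vary between harness versions, so treat this as illustrative rather than as the wrapper's actual implementation.

```python
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",  # OpenAI-compatible completions endpoint
    model_args=(
        "model=vllm-model,"
        "base_url=http://localhost:8081/v1/completions,"
        "tokenizer=/path/to/tokenizer"
    ),
    tasks=["mmlu", "gsm8k"],
)

# "results" maps each task name to its metric dict (e.g., acc_norm).
print(json.dumps(results["results"], indent=2, default=str))
```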
When the CLI is run without specifying individual checks, it executes a comprehensive three-stage workflow:
- vLLM Completion: Compares outputs between old and new models or servers.
- CSV Similarity: Analyzes semantic similarity of the generated outputs.
- Chat Consistency: Tests consistency for each endpoint.
Additionally, the following checks can be run individually:
- LM Eval: Runs language model benchmarks using lm-evaluation-harness.
- Image Size Check: Tests whether the model can process images of different dimensions.
- Image Number Check: Tests whether the model can process multiple images in a single request.
| Script | Purpose |
|---|---|
| semantic_similarity_check.py | Analyzes semantic similarity of the generated outputs |
| vllm_completion_sanity_check.py | Compares outputs between old and new models or servers |
| openai_chat_consistency_check.py | Tests consistency for each endpoint |
| openai_embeddings_sanity_check.py | Compares embeddings from the same model hosted on HuggingFace and OpenAI |
| lm_eval_harness.py | Runs lm-eval-harness benchmarks for model assessment |
| image_size_check.py | Tests whether the model can process images of different dimensions |
| image_number_check.py | Tests whether the model can process multiple images in a single request |
# List all available checks
python sanity_check_cli/cli.py check list
# Run specific individual checks
python sanity_check_cli/cli.py check run --checks csv_similarity
python sanity_check_cli/cli.py check run --checks vllm_completion
python sanity_check_cli/cli.py check run --checks chat_consistency
python sanity_check_cli/cli.py check run --checks embeddings_check
python sanity_check_cli/cli.py check run --checks lm_eval
python sanity_check_cli/cli.py check run --checks image_size_check
python sanity_check_cli/cli.py check run --checks image_number_check

# Basic usage
python sanity_check_cli/cli.py --help # Show help
python sanity_check_cli/cli.py check list # List available checks
python sanity_check_cli/cli.py check providers # List configured providers
# Advanced options
python sanity_check_cli/cli.py check run \
--old-endpoint http://server1:8000/v1/chat/completions \
--new-endpoint http://server2:8000/v1/chat/completions \
--format json,csv,table,chart \
--output-dir ./custom-results

| Parameter | Description | Default |
|---|---|---|
| --num-runs | Number of runs per prompt | 5 |
| --max-tokens | Maximum tokens per response | 100 |
| --temperature | Model temperature | 1.0 |
| --threshold | Similarity threshold | 0.8 |
| --verbose, -v | Enable verbose output | False |
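The `.sanity-check.yaml` file mentioned earlier is where hyperparameters like these can be pinned instead of passing flags on every invocation. Below is a minimal sketch of loading such a config with PyYAML; the key names are hypothetical placeholders, not the framework's documented schema, so check them against the default file shipped in the repository.

```python
import yaml  # PyYAML

# Key names below are hypothetical placeholders, not the framework's actual schema.
with open(".sanity-check.yaml") as f:
    config = yaml.safe_load(f) or {}

num_runs = config.get("num_runs", 5)          # mirrors --num-runs
max_tokens = config.get("max_tokens", 100)    # mirrors --max-tokens
temperature = config.get("temperature", 1.0)  # mirrors --temperature
threshold = config.get("threshold", 0.8)      # mirrors --threshold

print(f"runs={num_runs} max_tokens={max_tokens} "
      f"temperature={temperature} threshold={threshold}")
```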
- {new_version}_{old_version}_all_run_details.csv: Complete comparison data between the new and old versions; every run is documented as a row.
  - Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {old_version} output, {new_version} top 5 token logprobs, {old_version} top 5 token logprobs
- {new_version}_{old_version}_mismatch.csv: Mismatch data between the new and old versions, containing only records where a mismatch occurred and the new version's output appeared for the first time.
  - Columns: same as all_run_details.csv, but only containing records where outputs differ
- inconsistency.csv: Cases where the new version produces different outputs in the current run and the previous run.
  - Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {new_version} previous run, {new_version} top 5 token logprobs, {new_version} previous run top 5 token logprobs
- summary_overall.csv: Aggregated metrics across all runs.
  - Columns: metric_name, value, percentage
  - Metrics include: Total Cases, Mismatch Cases, Inconsistency Cases, Average Mismatch First Diff Tokens, Average Inconsistency First Diff Tokens
- summary_per_prompt.csv: Detailed metrics broken down by individual prompt.
  - Columns: prompt_idx, parameters, accuracy, consistency, total_runs, min_first_diff_tokens, max_first_diff_tokens, avg_first_diff_tokens, mode_first_diff_tokens, frequency
  - Metrics include: accuracy (percentage of runs without mismatch), consistency (percentage of runs without inconsistency), min/max/avg/mode first diff tokens (for consistency), and frequency (frequency of the first diff token)
- accuracy hist.png: Histogram visualization of accuracy across prompts.
- consistency hist.png: Histogram visualization of consistency across prompts.
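The summary CSVs are plain files, so they can also be inspected outside the framework. A minimal sketch that recreates the accuracy and consistency histograms with pandas and matplotlib, assuming the column names listed above; the framework's own plots may differ in binning and styling.

```python
import matplotlib
matplotlib.use("Agg")  # write image files without a display
import matplotlib.pyplot as plt
import pandas as pd

# Column names as documented for summary_per_prompt.csv.
df = pd.read_csv("summary_per_prompt.csv")

fig, (ax_acc, ax_con) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.hist(df["accuracy"], bins=20)
ax_acc.set_title("accuracy per prompt")
ax_con.hist(df["consistency"], bins=20)
ax_con.set_title("consistency per prompt")
fig.tight_layout()
fig.savefig("accuracy_consistency_hist.png")

print(df[["accuracy", "consistency", "avg_first_diff_tokens"]].describe())
```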
Enhances existing CSV files (usually the mismatch file) with additional semantic metrics:
- Added columns:
  - Similarity Score: Numeric similarity score between outputs (0-1)
  - Semantic Similarity: Pass/Fail classification based on the configured threshold
  - average similarity score: Mean similarity across all comparisons
  - similarity score std: Standard deviation of similarity scores
  - confidence interval: Statistical confidence interval for similarity scores
- similarity score hist.png: Histogram visualization of similarity scores across prompts.
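The aggregate statistics in the enhanced CSV can be reproduced from the per-row scores. A minimal sketch follows, assuming a normal-approximation 95% confidence interval; the check's own interval calculation and the file name used here are assumptions.

```python
import math

import pandas as pd

df = pd.read_csv("mismatch_with_similarity.csv")  # hypothetical file name
scores = df["Similarity Score"].dropna()

mean = scores.mean()
std = scores.std()
# Normal-approximation 95% confidence interval for the mean.
half_width = 1.96 * std / math.sqrt(len(scores))

print(f"average similarity score: {mean:.4f}")
print(f"similarity score std:     {std:.4f}")
print(f"confidence interval:      [{mean - half_width:.4f}, {mean + half_width:.4f}]")
```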
- all_run_details_{version}.csv: Details of all runs for a specific model or server version.
  - Columns: test_type (sequential/concurrent), endpoint, server_version, prompt_idx, prompt, run_idx, response
- summary_overall_{version}.csv: Aggregated metrics across all runs.
  - Columns: test_type, consistent_count, total_count, consistent_rate, server_version
- summary_per_prompt_{version}.csv: Prompt-level metrics.
  - Columns: test_type, endpoint, server_version, prompt_idx, prompt, consistent_count, total_count, consistent_ratio
- consistent_ratio hist.png: Histogram visualization of consistency ratios across prompts.
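The per-prompt consistency ratios can be recomputed from the detailed run log. A minimal sketch, assuming the columns listed above; the version suffix in the file name and the exact definition of consistency used here are assumptions.

```python
import pandas as pd

# The version suffix in the file name is a placeholder.
df = pd.read_csv("all_run_details_0.9.1.csv")

# One plausible definition: a run is "consistent" when its response matches the
# most common response seen for that prompt; the check's own definition may differ.
def consistent_ratio(responses: pd.Series) -> float:
    counts = responses.value_counts()
    return counts.iloc[0] / len(responses)

summary = (
    df.groupby(["test_type", "prompt_idx"])["response"]
      .apply(consistent_ratio)
      .rename("consistent_ratio")
      .reset_index()
)
print(summary.to_string(index=False))
```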
- lm-eval-results/<model_name>/lmeval_results_<timestamp>.csv: Processed evaluation results.
  - Columns: task, n_shot, n_samples_effective, plus all metrics from the evaluation (e.g., acc_norm)
  - One row per task, or multiple rows when using comma-separated tasks
- lm-eval-results/<model_name>/*.json: Raw benchmark results from lm-eval-harness.
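The processed CSV is derived from the raw JSON results. A minimal sketch of that flattening, assuming the raw files follow lm-eval-harness's usual layout with a top-level `results` mapping of task names to metric dictionaries; field names can vary between harness versions, and the output file name here is a placeholder.

```python
import glob
import json

import pandas as pd

MODEL_DIR = "lm-eval-results/<model_name>"  # replace with the actual model directory

rows = []
for path in glob.glob(f"{MODEL_DIR}/*.json"):
    with open(path) as f:
        raw = json.load(f)
    # lm-eval-harness stores per-task metrics under the "results" key.
    for task, metrics in raw.get("results", {}).items():
        row = {"task": task}
        row.update({k: v for k, v in metrics.items() if isinstance(v, (int, float))})
        rows.append(row)

pd.DataFrame(rows).to_csv("lmeval_results_flat.csv", index=False)
print(f"wrote {len(rows)} task rows")
```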
- image_size_check_{server_engine}_{server_version}.csv: CSV file containing test results.
  - Columns: image_dimensions, width, height, success, response
  - Each row represents results for a different image dimension
  - success indicates whether the model processed the image correctly; response contains the model's output or error message
- image_number_check_{server_engine}_{server_version}.csv: CSV file containing test results.
  - Columns: num_images, image_size, success, response
  - Each row represents a test with a different number of images
  - success indicates whether the model processed all images; response contains the model's output describing the images or the error message
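Both multimodality CSVs can be summarised the same way. A minimal sketch with pandas, assuming the column names documented above; the file names below are placeholders for the actual {server_engine} and {server_version} values.

```python
import pandas as pd

# Placeholder file names; substitute the real {server_engine}_{server_version} suffix.
size_df = pd.read_csv("image_size_check_vllm_0.9.1.csv")
count_df = pd.read_csv("image_number_check_vllm_0.9.1.csv")

# How often each image dimension / image count succeeded.
print(size_df.groupby("image_dimensions")["success"].value_counts())
print(count_df.groupby("num_images")["success"].value_counts())

# List failing cases with the returned error message (success may be stored
# as a boolean or as the strings "True"/"False", so compare on text).
failures = size_df[size_df["success"].astype(str).str.lower() != "true"]
print(failures[["image_dimensions", "response"]].to_string(index=False))
```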
genai-eval/
├── README.md                         # This comprehensive guide
├── requirements.txt                  # Python dependencies
├── pyproject.toml                    # Project configuration
├── .sanity-check.yaml                # Default configuration
│
├── sanity_check_cli/                 # Modern CLI interface
│   ├── cli.py                        # Main CLI entry point
│   ├── commands/                     # Command implementations
│   │   ├── check.py                  # Check command logic
│   │   └── report.py                 # Report generation
│   ├── runners/                      # Test execution engines
│   │   ├── vllm_completion.py        # vLLM comparison runner
│   │   ├── csv_similarity.py         # Similarity analysis runner
│   │   ├── chat_consistency.py       # Consistency test runner
│   │   ├── embeddings_check.py       # Embedding validation runner
│   │   ├── image_size_check.py       # Image dimensions runner
│   │   ├── image_number_check.py     # Multiple images runner
│   │   └── lm_eval.py                # LM evaluation harness runner
│   ├── clients/                      # API client implementations
│   │   ├── openai_client.py          # OpenAI API client
│   │   └── vllm_client.py            # vLLM API client
│   ├── output/                       # Output formatting
│   │   ├── charts.py                 # Chart generation
│   │   ├── reports.py                # Report generation
│   │   └── tables.py                 # Table formatting
│   └── utils/                        # Utility functions
│       ├── config.py                 # Configuration handling
│       └── results.py                # Result processing
│
└── genai_evaluation/                 # Core evaluation scripts
    ├── version_consistency/          # Version comparison
    │   ├── vllm_completion_sanity_check.py
    │   └── semantic_similarity_check.py
    ├── request_consistency/          # Request consistency
    │   ├── openai_chat_consistency_check.py
    │   ├── openai_embeddings_sanity_check.py
    │   └── test_data.py              # Test prompts and data
    ├── image_check/                  # Multimodality testing
    │   ├── image_size_check.py       # Tests different image dimensions
    │   └── image_number_check.py     # Tests multiple images in requests
    └── external/                     # External integrations
        ├── cohere_sanity_check/
        └── lm-eval-harness/          # lm-eval-harness integration
            ├── README.md             # Usage documentation
            └── lm_eval_harness.py    # Implementation