A comprehensive testing framework for LLMs and their serving stacks, providing automated sanity checks for version mismatches, response consistency, and performance validation across different model implementations.
The GenAI Evaluation Framework provides:
- Version Consistency: Compare outputs between different versions of LLM models and servers.
- Request Consistency: Compare responses across multiple requests to the same endpoint.
- Embedding Alignment: Validate embedding consistency across platforms.
- LM Evaluation: Run language model accuracy checks using lm-evaluation-harness.
- Multimodality Tasks: Test model handling of images with different dimensions and quantities.
# Clone the repository
git clone <repository-url>
cd genai-eval
# Install as a package
pip install -e .

Using the CLI (Recommended):
# Use the `.sanity-check.yaml` file to configure hyperparameters.
genai-eval evaluation # Runs the full evaluation suite of vllm_completion, csv_similarity, chat_consistency
genai-eval vllm_completion # Run the vLLM completion check
genai-eval semantic_similarity # Run the semantic similarity check
genai-eval chat_consistency # Run the chat consistency check
genai-eval lm_eval # Run the LM evaluation harness check
genai-eval image_size_check # Run the image dimension handling check
genai-eval image_number_check # Run the multiple image handling check
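For CI pipelines, the same CLI commands can be driven from a small script. Below is a minimal sketch using subprocess, assuming `genai-eval` is installed on the PATH as shown above and that each check exits non-zero on failure (an assumption, not a documented contract).

```python
import subprocess
import sys

# Check names as documented above; trim the list to what your pipeline needs.
CHECKS = ["vllm_completion", "semantic_similarity", "chat_consistency"]

failed = []
for check in CHECKS:
    print(f"--- running {check} ---")
    # Assumes a non-zero exit code signals a failed check.
    if subprocess.run(["genai-eval", check]).returncode != 0:
        failed.append(check)

if failed:
    print("Failed checks:", ", ".join(failed))
    sys.exit(1)
print("All checks passed")
```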
Using Individual Scripts:

# Version comparison
python3 genai_evaluation/version_consistency/vllm_completion_sanity_check.py \
--total-runs 3 \
--output-folder /Users/results \
--vllm-old-version 0.7.3.post1 --vllm-new-version 0.9.1 \
--vllm-old-api http://localhost:8081/v1/chat/completions \
--vllm-new-api http://localhost:8080/v1/chat/completions \
--model-name /models/Llama-3-70B-Instruct
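The version comparison drives both servers through an OpenAI-compatible chat completions API. The sketch below shows the kind of request involved (not the script itself); the prompt, sampling parameters, and endpoint URLs are placeholders reused from the example above.

```python
import requests

ENDPOINTS = {
    "0.7.3.post1": "http://localhost:8081/v1/chat/completions",
    "0.9.1": "http://localhost:8080/v1/chat/completions",
}
PROMPT = "Explain the difference between latency and throughput."

outputs = {}
for version, url in ENDPOINTS.items():
    payload = {
        "model": "/models/Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 100,
        "temperature": 0.0,  # greedy decoding keeps the two outputs comparable
        "logprobs": True,    # request per-token logprobs
        "top_logprobs": 5,   # top-5 alternatives, as in the CSV reports
    }
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    outputs[version] = response.json()["choices"][0]["message"]["content"]

print("Outputs match" if outputs["0.9.1"] == outputs["0.7.3.post1"] else "Mismatch detected")
```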
# Semantic similarity analysis
python3 genai_evaluation/version_consistency/semantic_similarity_check.py \
--folder ./results/vllm_completion_detailed_report_20250805_162453 \
--old-version-col "vllm 0.7.3.post1 output" \
--new-version-col "vllm 0.9.1 output" \
--mismatch-file "0.9.1_0.7.3.post1_all_run_details.csv"
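The similarity analysis can be approximated with sentence embeddings. A minimal sketch follows, assuming the `sentence-transformers` package and the column names passed above; the actual script may use a different embedding model or metric.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

CSV_PATH = "0.9.1_0.7.3.post1_all_run_details.csv"
OLD_COL = "vllm 0.7.3.post1 output"
NEW_COL = "vllm 0.9.1 output"
THRESHOLD = 0.8  # default similarity threshold documented below

df = pd.read_csv(CSV_PATH)
model = SentenceTransformer("all-MiniLM-L6-v2")

old_emb = model.encode(df[OLD_COL].fillna("").tolist(), convert_to_tensor=True)
new_emb = model.encode(df[NEW_COL].fillna("").tolist(), convert_to_tensor=True)

# Cosine similarity between paired outputs (row i of old vs. row i of new).
df["Similarity Score"] = util.cos_sim(old_emb, new_emb).diagonal().cpu().numpy()
df["Semantic Similarity"] = df["Similarity Score"].apply(
    lambda s: "Pass" if s >= THRESHOLD else "Fail"
)

print(df["Similarity Score"].describe())
print(df["Semantic Similarity"].value_counts())
```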
# Request consistency
python genai_evaluation/request_consistency/openai_chat_consistency_check.py
# Check handling of images with different dimensions
python genai_evaluation/multimodality/image_size_check.py
# Check handling of multiple images in a single request
python genai_evaluation/multimodality/image_number_check.py
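Both image checks exercise a multimodal chat completions endpoint. The sketch below shows the kind of request they issue, assuming an OpenAI-compatible server that accepts base64-encoded images; the endpoint, model name, dimensions, and prompt are placeholders.

```python
import base64
import io

import requests
from PIL import Image

ENDPOINT = "http://localhost:8080/v1/chat/completions"
MODEL = "your-multimodal-model"  # placeholder model name

def encode_test_image(width: int, height: int) -> str:
    """Generate a solid-color test image and return it base64-encoded."""
    buffer = io.BytesIO()
    Image.new("RGB", (width, height), color=(128, 64, 200)).save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

for width, height in [(64, 64), (512, 512), (1024, 2048)]:
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_test_image(width, height)}"}},
            ],
        }],
        "max_tokens": 50,
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    print(width, height, "success" if resp.ok else f"failed ({resp.status_code})")
```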
LM Evaluation Harness Integration:
The framework includes direct integration with lm-eval-harness, which allows evaluating language models on various benchmark tasks.
python3 genai_evaluation/external/lm-eval-harness/lm_eval_harness.py \
--model_args "model=vllm-model,base_url=http://localhost:8081/v1/completions,tokenizer=/path/to/tokenizer" \
--tasks mmlu \
--output_path lm-eval-results

Common benchmark tasks include: arc_challenge, hellaswag, truthfulqa_mc, winogrande, gsm8k, mmlu. You can specify multiple tasks using comma-separated values.
See genai_evaluation/external/lm-eval-harness/README.md for more detailed usage information.
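For programmatic use, lm-eval-harness can also be called directly from Python rather than through the wrapper script. The sketch below uses `lm_eval.simple_evaluate` with the "local-completions" backend; backend and argument names vary between harness versions, so treat this as illustrative rather than as the wrapper's actual implementation.

```python
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",  # OpenAI-compatible completions endpoint
    model_args=(
        "model=vllm-model,"
        "base_url=http://localhost:8081/v1/completions,"
        "tokenizer=/path/to/tokenizer"
    ),
    tasks=["mmlu", "gsm8k"],
)

# "results" maps each task name to its metric dict (e.g., acc_norm).
print(json.dumps(results["results"], indent=2, default=str))
```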
When the CLI is run without specifying individual checks, it executes a comprehensive three-stage workflow:
- vLLM Completion: Compares outputs between old and new models or servers.
- CSV Similarity: Analyzes semantic similarity of the generated outputs.
- Chat Consistency: Tests consistency for each endpoint.
Additionally, the following checks can be run individually:
- LM Eval: Runs language model benchmarks using lm-evaluation-harness.
- Image Size Check: Tests whether the model can process images of different dimensions.
- Image Number Check: Tests whether the model can process multiple images in a single request.
| Script | Purpose |
|---|---|
| semantic_similarity_check.py | Analyzes semantic similarity of the generated outputs |
| vllm_completion_sanity_check.py | Compares outputs between old and new models or servers |
| openai_chat_consistency_check.py | Tests consistency for each endpoint |
| openai_embeddings_sanity_check.py | Compares embeddings from the same model hosted on HuggingFace and OpenAI |
| lm_eval_harness.py | Runs lm-eval-harness benchmarks for model assessment |
| image_size_check.py | Tests whether the model can process images of different dimensions |
| image_number_check.py | Tests whether the model can process multiple images in a single request |
# List all available checks
python sanity_check_cli/cli.py check list
# Run specific individual checks
python sanity_check_cli/cli.py check run --checks csv_similarity
python sanity_check_cli/cli.py check run --checks vllm_completion
python sanity_check_cli/cli.py check run --checks chat_consistency
python sanity_check_cli/cli.py check run --checks embeddings_check
python sanity_check_cli/cli.py check run --checks lm_eval
python sanity_check_cli/cli.py check run --checks image_size_check
python sanity_check_cli/cli.py check run --checks image_number_check

# Basic usage
python sanity_check_cli/cli.py --help # Show help
python sanity_check_cli/cli.py check list # List available checks
python sanity_check_cli/cli.py check providers # List configured providers
# Advanced options
python sanity_check_cli/cli.py check run \
--old-endpoint http://server1:8000/v1/chat/completions \
--new-endpoint http://server2:8000/v1/chat/completions \
--format json,csv,table,chart \
--output-dir ./custom-results

| Parameter | Description | Default |
|---|---|---|
| --num-runs | Number of runs per prompt | 5 |
| --max-tokens | Maximum tokens per response | 100 |
| --temperature | Model temperature | 1.0 |
| --threshold | Similarity threshold | 0.8 |
| --verbose, -v | Enable verbose output | False |
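The `.sanity-check.yaml` file mentioned earlier is where hyperparameters like these can be pinned instead of passing flags on every invocation. Below is a minimal sketch of loading such a config with PyYAML; the key names are hypothetical placeholders, not the framework's documented schema, so check them against the default file shipped in the repository.

```python
import yaml  # PyYAML

# Key names below are hypothetical placeholders, not the framework's actual schema.
with open(".sanity-check.yaml") as f:
    config = yaml.safe_load(f) or {}

num_runs = config.get("num_runs", 5)          # mirrors --num-runs
max_tokens = config.get("max_tokens", 100)    # mirrors --max-tokens
temperature = config.get("temperature", 1.0)  # mirrors --temperature
threshold = config.get("threshold", 0.8)      # mirrors --threshold

print(f"runs={num_runs} max_tokens={max_tokens} "
      f"temperature={temperature} threshold={threshold}")
```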
- {new_version}_{old_version}_all_run_details.csv: Complete comparison data between the new and old versions; every run is documented as a row.
  - Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {old_version} output, {new_version} top 5 token logprobs, {old_version} top 5 token logprobs
- {new_version}_{old_version}_mismatch.csv: Mismatch data between the new and old versions, containing only records where a mismatch occurred and the new version's output appeared for the first time.
  - Columns: same as all_run_details.csv, but only containing records where outputs differ
- inconsistency.csv: Cases where the new version produces different outputs in the current run and the previous run.
  - Columns: run #, prompt #, max_tokens, first_diff_token_index, vllm {new_version} output, vllm {new_version} previous run, {new_version} top 5 token logprobs, {new_version} previous run top 5 token logprobs
- summary_overall.csv: Aggregated metrics across all runs.
  - Columns: metric_name, value, percentage
  - Metrics include: Total Cases, Mismatch Cases, Inconsistency Cases, Average Mismatch First Diff Tokens, Average Inconsistency First Diff Tokens
- summary_per_prompt.csv: Detailed metrics broken down by individual prompt.
  - Columns: prompt_idx, parameters, accuracy, consistency, total_runs, min_first_diff_tokens, max_first_diff_tokens, avg_first_diff_tokens, mode_first_diff_tokens, frequency
  - Metrics include: accuracy (percentage of runs without mismatch), consistency (percentage of runs without inconsistency), min/max/avg/mode first diff tokens (for consistency), and frequency (frequency of the first diff token)
- accuracy hist.png: Histogram visualization of accuracy across prompts.
- consistency hist.png: Histogram visualization of consistency across prompts.
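The summary CSVs are plain files, so they can also be inspected outside the framework. A minimal sketch that recreates the accuracy and consistency histograms with pandas and matplotlib, assuming the column names listed above; the framework's own plots may differ in binning and styling.

```python
import matplotlib
matplotlib.use("Agg")  # write image files without a display
import matplotlib.pyplot as plt
import pandas as pd

# Column names as documented for summary_per_prompt.csv.
df = pd.read_csv("summary_per_prompt.csv")

fig, (ax_acc, ax_con) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.hist(df["accuracy"], bins=20)
ax_acc.set_title("accuracy per prompt")
ax_con.hist(df["consistency"], bins=20)
ax_con.set_title("consistency per prompt")
fig.tight_layout()
fig.savefig("accuracy_consistency_hist.png")

print(df[["accuracy", "consistency", "avg_first_diff_tokens"]].describe())
```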
Enhances existing CSV files (usually the mismatch file) with additional semantic metrics:
- Added columns:
  - Similarity Score: Numeric similarity score between outputs (0-1)
  - Semantic Similarity: Pass/Fail classification based on the configured threshold
  - average similarity score: Mean similarity across all comparisons
  - similarity score std: Standard deviation of similarity scores
  - confidence interval: Statistical confidence interval for similarity scores
- similarity score hist.png: Histogram visualization of similarity scores across prompts.
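The aggregate statistics in the enhanced CSV can be reproduced from the per-row scores. A minimal sketch follows, assuming a normal-approximation 95% confidence interval; the check's own interval calculation and the file name used here are assumptions.

```python
import math

import pandas as pd

df = pd.read_csv("mismatch_with_similarity.csv")  # hypothetical file name
scores = df["Similarity Score"].dropna()

mean = scores.mean()
std = scores.std()
# Normal-approximation 95% confidence interval for the mean.
half_width = 1.96 * std / math.sqrt(len(scores))

print(f"average similarity score: {mean:.4f}")
print(f"similarity score std:     {std:.4f}")
print(f"confidence interval:      [{mean - half_width:.4f}, {mean + half_width:.4f}]")
```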
- all_run_details_{version}.csv: Details of all runs for a specific model or server version.
  - Columns: test_type (sequential/concurrent), endpoint, server_version, prompt_idx, prompt, run_idx, response
- summary_overall_{version}.csv: Aggregated metrics across all runs.
  - Columns: test_type, consistent_count, total_count, consistent_rate, server_version
- summary_per_prompt_{version}.csv: Prompt-level metrics.
  - Columns: test_type, endpoint, server_version, prompt_idx, prompt, consistent_count, total_count, consistent_ratio
- consistent_ratio hist.png: Histogram visualization of consistency ratios across prompts.
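The per-prompt consistency ratios can be recomputed from the detailed run log. A minimal sketch, assuming the columns listed above; the version suffix in the file name and the exact definition of consistency used here are assumptions.

```python
import pandas as pd

# The version suffix in the file name is a placeholder.
df = pd.read_csv("all_run_details_0.9.1.csv")

# One plausible definition: a run is "consistent" when its response matches the
# most common response seen for that prompt; the check's own definition may differ.
def consistent_ratio(responses: pd.Series) -> float:
    counts = responses.value_counts()
    return counts.iloc[0] / len(responses)

summary = (
    df.groupby(["test_type", "prompt_idx"])["response"]
      .apply(consistent_ratio)
      .rename("consistent_ratio")
      .reset_index()
)
print(summary.to_string(index=False))
```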
- lm-eval-results/<model_name>/lmeval_results_<timestamp>.csv: Processed evaluation results.
  - Columns: task, n_shot, n_samples_effective, plus all metrics from the evaluation (e.g., acc_norm)
  - One row per task, or multiple rows when using comma-separated tasks
- lm-eval-results/<model_name>/*.json: Raw benchmark results from lm-eval-harness.
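The processed CSV is derived from the raw JSON results. A minimal sketch of that flattening, assuming the raw files follow lm-eval-harness's usual layout with a top-level `results` mapping of task names to metric dictionaries; field names can vary between harness versions, and the output file name here is a placeholder.

```python
import glob
import json

import pandas as pd

MODEL_DIR = "lm-eval-results/<model_name>"  # replace with the actual model directory

rows = []
for path in glob.glob(f"{MODEL_DIR}/*.json"):
    with open(path) as f:
        raw = json.load(f)
    # lm-eval-harness stores per-task metrics under the "results" key.
    for task, metrics in raw.get("results", {}).items():
        row = {"task": task}
        row.update({k: v for k, v in metrics.items() if isinstance(v, (int, float))})
        rows.append(row)

pd.DataFrame(rows).to_csv("lmeval_results_flat.csv", index=False)
print(f"wrote {len(rows)} task rows")
```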
- image_size_check_{server_engine}_{server_version}.csv: CSV file containing test results.
  - Columns: image_dimensions, width, height, success, response
  - Each row represents results for a different image dimension
  - success indicates whether the model processed the image correctly; response contains the model's output or error message
- image_number_check_{server_engine}_{server_version}.csv: CSV file containing test results.
  - Columns: num_images, image_size, success, response
  - Each row represents a test with a different number of images
  - success indicates whether the model processed all images; response contains the model's output describing the images or the error message
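Both multimodality CSVs can be summarised the same way. A minimal sketch with pandas, assuming the column names documented above; the file names below are placeholders for the actual {server_engine} and {server_version} values.

```python
import pandas as pd

# Placeholder file names; substitute the real {server_engine}_{server_version} suffix.
size_df = pd.read_csv("image_size_check_vllm_0.9.1.csv")
count_df = pd.read_csv("image_number_check_vllm_0.9.1.csv")

# How often each image dimension / image count succeeded.
print(size_df.groupby("image_dimensions")["success"].value_counts())
print(count_df.groupby("num_images")["success"].value_counts())

# List failing cases with the returned error message (success may be stored
# as a boolean or as the strings "True"/"False", so compare on text).
failures = size_df[size_df["success"].astype(str).str.lower() != "true"]
print(failures[["image_dimensions", "response"]].to_string(index=False))
```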
genai-eval/
├── README.md                         # This comprehensive guide
├── requirements.txt                  # Python dependencies
├── pyproject.toml                    # Project configuration
├── .sanity-check.yaml                # Default configuration
│
├── sanity_check_cli/                 # Modern CLI interface
│   ├── cli.py                        # Main CLI entry point
│   ├── commands/                     # Command implementations
│   │   ├── check.py                  # Check command logic
│   │   └── report.py                 # Report generation
│   ├── runners/                      # Test execution engines
│   │   ├── vllm_completion.py        # vLLM comparison runner
│   │   ├── csv_similarity.py         # Similarity analysis runner
│   │   ├── chat_consistency.py       # Consistency test runner
│   │   ├── embeddings_check.py       # Embedding validation runner
│   │   ├── image_size_check.py       # Image dimensions runner
│   │   ├── image_number_check.py     # Multiple images runner
│   │   └── lm_eval.py                # LM evaluation harness runner
│   ├── clients/                      # API client implementations
│   │   ├── openai_client.py          # OpenAI API client
│   │   └── vllm_client.py            # vLLM API client
│   ├── output/                       # Output formatting
│   │   ├── charts.py                 # Chart generation
│   │   ├── reports.py                # Report generation
│   │   └── tables.py                 # Table formatting
│   └── utils/                        # Utility functions
│       ├── config.py                 # Configuration handling
│       └── results.py                # Result processing
│
└── genai_evaluation/                 # Core evaluation scripts
    ├── version_consistency/          # Version comparison
    │   ├── vllm_completion_sanity_check.py
    │   └── semantic_similarity_check.py
    ├── request_consistency/          # Request consistency
    │   ├── openai_chat_consistency_check.py
    │   ├── openai_embeddings_sanity_check.py
    │   └── test_data.py              # Test prompts and data
    ├── image_check/                  # Multimodality testing
    │   ├── image_size_check.py       # Tests different image dimensions
    │   └── image_number_check.py     # Tests multiple images in requests
    └── external/                     # External integrations
        ├── cohere_sanity_check/
        └── lm-eval-harness/          # lm-eval-harness integration
            ├── README.md             # Usage documentation
            └── lm_eval_harness.py    # Implementation