
LightSpeed Evaluation Framework

A comprehensive framework for evaluating GenAI applications.

This is a work in progress; we are actively adding features, fixing issues, and expanding the examples. Please give it a try, share your feedback, and report any bugs.

🎯 Key Features

  • Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
  • Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
  • LLM Provider Flexibility: OpenAI, Anthropic, Watsonx, Azure, Gemini, Ollama via LiteLLM
  • Flexible Configuration: Configurable environment & metric metadata
  • Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
  • Early Validation: Catch configuration errors before expensive LLM calls
  • Statistical Analysis: Per-metric summary statistics with score distribution analysis
  • Agent Evaluation: Framework for evaluating AI agent performance (future integration planned)

🚀 Quick Start

Installation

# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git

# Local Development
pip install uv
uv sync

Basic Usage

# Set API key
export OPENAI_API_KEY="your-key"

# Run evaluation
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

📊 Supported Metrics

Turn-Level (Single Query)

  • Ragas
    • Response Evaluation
      • faithfulness
      • response_relevancy
    • Context Evaluation
      • context_recall
      • context_relevance
      • context_precision_without_reference
      • context_precision_with_reference
  • Custom
    • Response Evaluation
      • answer_correctness

Conversation-Level (Multi-turn)

  • DeepEval
    • conversation_completeness
    • conversation_relevancy
    • knowledge_retention

⚙️ Configuration

System Config (config/system.yaml)

llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0
  timeout: 120

metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      type: "turn"
      framework: "ragas"
  
  conversation_level:
    "deepeval:conversation_completeness":
      threshold: 0.8
      type: "conversation"
      framework: "deepeval"

Evaluation Data (config/evaluation_data.yaml)

- conversation_group_id: "test_conversation"
  description: "Sample evaluation"
  
  # Turn-level metrics (empty list = skip turn evaluation)
  turn_metrics:
    - "ragas:faithfulness"
    - "custom:answer_correctness"
  
  # Turn-level metrics metadata (threshold + other properties)
  turn_metrics_metadata:
    "ragas:response_relevancy": 
      threshold: 0.8
      weight: 1.0
    "custom:answer_correctness": 
      threshold: 0.75
  
  # Conversation-level metrics (empty list = skip conversation evaluation)   
  conversation_metrics:
    - "deepeval:conversation_completeness"
  
  turns:
    - turn_id: 1
      query: "What is OpenShift?"
      response: "Red Hat OpenShift powers the entire application lifecycle...."
      contexts:
        - content: "Red Hat OpenShift powers...."
      expected_response: "Red Hat OpenShift...."
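
The framework validates configuration before making any LLM calls (see Early Validation above). If you also want to sanity-check a data file yourself before a run, a minimal sketch using PyYAML is shown below, assuming the field names from the example above; it is illustrative only and not part of the framework's API.

# check_eval_data.py -- illustrative structural check, not part of the framework
import yaml  # PyYAML

with open("config/evaluation_data.yaml", encoding="utf-8") as f:
    groups = yaml.safe_load(f)

for group in groups:
    # Every conversation group needs an id and at least one turn
    assert group.get("conversation_group_id"), "missing conversation_group_id"
    assert group.get("turns"), "no turns defined"
    for turn in group["turns"]:
        # Each turn carries the query and the response under evaluation
        assert {"turn_id", "query", "response"} <= turn.keys(), "incomplete turn"

print(f"OK: {len(groups)} conversation group(s) look structurally valid")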

📈 Output & Visualization

Generated Reports

  • CSV: Detailed results with status, scores, reasons
  • JSON: Summary statistics with score distributions
  • TXT: Human-readable summary
  • PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)

Key Metrics in Output

  • PASS/FAIL/ERROR: Status based on thresholds
  • Actual Reasons: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning
  • Score Statistics: Mean, median, standard deviation, and min/max for every metric (see the sketch below)
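
As a rough illustration of how these numbers relate (an assumed sketch, not the framework's internal implementation): each score is compared against its metric's threshold to decide PASS/FAIL, and summary statistics are computed over all scores for that metric.

# Illustrative only: threshold-based status plus summary statistics for one metric
from statistics import mean, median, stdev

scores = [0.91, 0.84, 0.77, 0.95]  # hypothetical ragas:faithfulness scores
threshold = 0.8                    # from metrics_metadata in system.yaml

statuses = ["PASS" if score >= threshold else "FAIL" for score in scores]
summary = {
    "mean": mean(scores),
    "median": median(scores),
    "stdev": stdev(scores),
    "min": min(scores),
    "max": max(scores),
    "pass_rate": statuses.count("PASS") / len(statuses),
}
print(statuses)
print(summary)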

🧪 Development

Development Tools

# Install development dependencies
uv sync --group dev

# Formatting, linting, type and docstring checks
make format
make pylint
make pyright
make docstyle
make check-types

# Run tests with coverage
uv run pytest tests --cov=src

Agent Evaluation

For a detailed walkthrough of the new agent-evaluation framework, refer to lsc_agent_eval/README.md.

Generate Answers (optional, for creating test data)

To generate answers for test data, refer to README-generate-answers.

📄 License & Contributing

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Contributions are welcome; see the Development section above for setup and code quality tools.
