PR 3/3 — Unit tests for GEval #98
Walkthrough

The PR integrates GEval metrics into the evaluation system through a registry-based approach. It introduces configuration models, a new GEvalHandler for metric evaluation, registry loading logic, and routing updates to support both turn-level and conversation-level GEval metrics alongside existing DeepEval metrics.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Evaluator
    participant DeepEvalMetrics
    participant GEvalHandler
    participant Registry
    participant GEval
    Evaluator->>DeepEvalMetrics: evaluate(metric_name)
    alt metric in standard metrics
        DeepEvalMetrics->>DeepEvalMetrics: use standard handler
    else metric not found
        DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, ...)
        GEvalHandler->>GEvalHandler: _get_geval_config(metric_name)
        alt turn metadata exists
            GEvalHandler->>GEvalHandler: use turn metadata config
        else conversation metadata exists
            GEvalHandler->>GEvalHandler: use conversation metadata config
        else registry available
            GEvalHandler->>Registry: load metric config
        end
        GEvalHandler->>GEvalHandler: _convert_evaluation_params(params)
        alt is_conversation
            GEvalHandler->>GEvalHandler: _evaluate_conversation()
        else turn-level
            GEvalHandler->>GEvalHandler: _evaluate_turn()
        end
        GEvalHandler->>GEval: metric.measure(test_case)
        GEval-->>GEvalHandler: score, reason
        GEvalHandler-->>DeepEvalMetrics: score, reason
    end
    DeepEvalMetrics-->>Evaluator: result
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- config/registry/geval_metrics.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
- src/lightspeed_evaluation/core/models/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/models/system.py (2 hunks)
- src/lightspeed_evaluation/core/system/loader.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
- tests/unit/core/metrics/test_geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/system/loader.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/system/loader.py
config/system.yaml
📄 CodeRabbit inference engine (AGENTS.md)
config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Files:
config/system.yaml
config/**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use YAML for configuration files under config/
Files:
- config/system.yaml
- config/registry/geval_metrics.yaml
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
tests/unit/core/metrics/test_geval.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
tests/unit/core/metrics/test_geval.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
tests/unit/core/metrics/test_geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Applied to files:
- config/system.yaml
- config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)
Applied to files:
- config/system.yaml
- config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Applied to files:
config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: When adding a new feature, also add a new sample evaluation data YAML file
Applied to files:
config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
tests/unit/core/metrics/test_geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/system/loader.py
🧬 Code graph analysis (6)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - DeepEvalMetrics (30-184)

tests/unit/core/metrics/test_geval.py (2)
- src/lightspeed_evaluation/core/metrics/geval.py (3)
  - GEvalHandler (20-499)
  - _convert_evaluation_params (195-238)
  - _get_geval_config (435-499)
- tests/conftest.py (1)
  - mock_llm_manager (84-103)

src/lightspeed_evaluation/core/metrics/deepeval.py (3)
- src/lightspeed_evaluation/core/metrics/geval.py (2)
  - GEvalHandler (20-499)
  - evaluate (120-193)
- src/lightspeed_evaluation/core/llm/manager.py (4)
  - LLMManager (10-117)
  - get_config (105-107)
  - get_model_name (91-93)
  - get_llm_params (95-103)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - DeepEvalLLMManager (8-42)

src/lightspeed_evaluation/core/models/__init__.py (1)
- src/lightspeed_evaluation/core/models/system.py (1)
  - GEvalConfig (267-287)

src/lightspeed_evaluation/core/metrics/geval.py (2)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - DeepEvalLLMManager (8-42)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - evaluate (92-129)

src/lightspeed_evaluation/core/system/loader.py (1)
- src/lightspeed_evaluation/core/models/system.py (1)
  - GEvalConfig (267-287)
```python
# Handle missing or invalid registry
if path is None or not path.exists():
    logger.warning(
        f"GEval metric registry not found at expected locations. "
        f"Tried: {[str(p) for p in possible_paths]}. "
        f"Will fall back to runtime metadata only."
    )
    GEvalHandler._registry = {}
    return
```
Fix NameError when registry_path is provided.
If callers supply a registry path that doesn’t exist, this logger.warning references possible_paths, which is undefined in that code path (only set in the else branch above). That raises a NameError, preventing the handler from falling back to runtime metadata. Wrap the message to use either the provided path or the fallback list, e.g.:
```diff
-        if path is None or not path.exists():
-            logger.warning(
-                f"GEval metric registry not found at expected locations. "
-                f"Tried: {[str(p) for p in possible_paths]}. "
-                f"Will fall back to runtime metadata only."
-            )
+        if path is None or not path.exists():
+            tried = (
+                [str(path)] if registry_path else [str(p) for p in possible_paths]
+            )
+            logger.warning(
+                "GEval metric registry not found at expected locations. "
+                f"Tried: {tried}. Will fall back to runtime metadata only."
+            )
```

🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/metrics/geval.py around lines 95 to 103, the
logger.warning uses possible_paths which is only defined in the other branch and
causes a NameError when a caller passes a non-existent registry path; change the
warning to reference the provided path when present (e.g., str(path) or the
provided path variable) or fall back to the possible_paths list only when it
exists — construct the message conditionally so it never references an undefined
variable, then leave GEvalHandler._registry = {} and return as before.
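The fixed branch can be sketched as a self-contained helper (the function name `handle_missing_registry` and its exact signature are hypothetical; only the `tried` construction mirrors the suggested diff):

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def handle_missing_registry(
    registry_path: Optional[str],
    path: Optional[Path],
    possible_paths: Optional[list[Path]],
) -> dict:
    """Build the 'Tried: ...' message without referencing possible_paths
    when the caller supplied an explicit registry path.

    possible_paths is only populated in the auto-discovery branch, so the
    warning must not touch it when registry_path was provided.
    """
    tried = [str(path)] if registry_path else [str(p) for p in (possible_paths or [])]
    logger.warning(
        "GEval metric registry not found at expected locations. "
        "Tried: %s. Will fall back to runtime metadata only.",
        tried,
    )
    return {}  # empty registry -> runtime-metadata fallback
```

With this shape, a caller-supplied but nonexistent path logs `Tried: ['<that path>']` instead of raising NameError, and the fallback to runtime metadata proceeds as intended.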
```python
registry_path: str = Field(
    default="config/geval_metrics.yaml",
    description="Path to GEval metrics registry YAML file",
)
default_turn_metrics: list[str] = Field(
    default_factory=list,
    description="Default turn-level GEval metrics to auto-apply (e.g., ['geval:technical_accuracy'])",
)
default_conversation_metrics: list[str] = Field(
    default_factory=list,
    description="Default conversation-level GEval metrics to auto-apply (e.g., ['geval:conversation_coherence'])",
)
```
Fix the default GEval registry path
The default should point to the actual registry at config/registry/geval_metrics.yaml. As written it targets config/geval_metrics.yaml, so GEvalHandler loads an empty registry and caches that result; subsequent attempts with the correct path never reload because the class-level cache is already set. This breaks default configs that rely on the built-in registry. Please update the default path.
```diff
-    registry_path: str = Field(
-        default="config/geval_metrics.yaml",
+    registry_path: str = Field(
+        default="config/registry/geval_metrics.yaml",
```

🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/models/system.py around lines 276 to 287, the
default registry_path is incorrect; change the Field default from
"config/geval_metrics.yaml" to "config/registry/geval_metrics.yaml" so the
built-in GEval registry is loaded by default (this prevents the class-level
cache from storing an empty registry); update the default value only and keep
the rest of the Field definition unchanged.
```python
# Load GEval metrics from registry if enabled
if system_config.geval.enabled:
    _load_geval_metrics(system_config.geval.registry_path)
```
Registry path mismatch breaks GEval discovery.
GEvalConfig.registry_path still defaults to "config/geval_metrics.yaml" (see src/lightspeed_evaluation/core/models/system.py), but this PR ships the registry at config/registry/geval_metrics.yaml. With the current code, system_config.geval.enabled + default settings will never load the new metrics, so every geval:* metric will fail validation. Please either move the YAML to the default location or update both the default value and any config files to the new path before merging.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/system/loader.py around lines 108-111 the code
calls _load_geval_metrics(system_config.geval.registry_path) but the
GEvalConfig.registry_path default still points to "config/geval_metrics.yaml"
while the shipped registry lives at "config/registry/geval_metrics.yaml"; update
the GEvalConfig default to "config/registry/geval_metrics.yaml" (edit
src/lightspeed_evaluation/core/models/system.py) and update any config files
that set the old path to the new one so system_config.geval.registry_path
matches the actual file location.
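For reference, a `geval` section in config/system.yaml matching the shipped registry location could look like the sketch below (field names follow the GEvalConfig model introduced in this PR; treat the exact schema as an assumption):

```yaml
# Hypothetical geval section for config/system.yaml
geval:
  enabled: true
  # Must point at the shipped registry, not config/geval_metrics.yaml
  registry_path: config/registry/geval_metrics.yaml
  default_turn_metrics:
    - "geval:technical_accuracy"
  default_conversation_metrics:
    - "geval:conversation_coherence"
```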
```python
from pathlib import Path
from unittest.mock import MagicMock, patch
```
Align mocks with pytest guidelines.
Please switch these mocks to pytest-mock (the mocker fixture) instead of importing MagicMock/patch from unittest.mock. The repo guideline for tests/**/*.py explicitly requires pytest-based mocking, so sticking with MagicMock here will block compliance. Refactor to accept mocker where needed (e.g., mocker.MagicMock(), mocker.patch(...)) and drop the direct unittest.mock imports.
🤖 Prompt for AI Agents
In tests/unit/core/metrics/test_geval.py lines 3-4, remove the direct imports of
MagicMock and patch from unittest.mock and refactor tests to use the pytest-mock
fixture instead: replace any MagicMock() calls with mocker.MagicMock() and any
patch(...) usages with mocker.patch(...), add the mocker parameter to the test
functions that use these mocks, and delete the unused unittest.mock imports from
the file.
Consolidating into one comprehensive PR: #97
Part 3 of 3 in the GEval Integration Series
This is the third and final pull request in the series introducing GEval (DeepEval) LLM-as-a-judge support into the LSC Evaluation Framework. This PR focuses on robust unit testing and validation coverage.
Summary
Adds a comprehensive unit test suite for all major GEval functionality
Why? 🤔
Key Changes
tests/unit/core/metrics/test_geval.py

Summary by CodeRabbit
New Features
Tests