
Conversation

@bsatapat-jpg (Collaborator) commented Sep 1, 2025

[DESC] [0] Generated test cases using Cursor
[1] Removed unnecessary test cases generated by Cursor

Summary by CodeRabbit

  • New Features
    • Added a CLI test runner to easily run all, fast, or category-specific tests, with optional coverage and parallel execution.
  • Tests
    • Introduced comprehensive test suites covering configuration, data validation, models, LLM manager behaviors, custom metrics, and output statistics.
    • Standardized Pytest configuration with discovery settings, strict options, durations, colors, and custom markers (unit, integration, slow, llm, etc.).
  • Documentation
    • Added a Test Suite README with structure, markers, fixtures, running instructions, CI guidance, and contributor tips.

@coderabbitai bot (Contributor) commented Sep 1, 2025

Walkthrough

Adds a Pytest setup and a comprehensive test suite for the LSC Evaluation Framework, including fixtures, unit/integration tests across core, LLM managers, metrics, and output utilities, plus a CLI test runner. Documentation for tests and package initializers are included. No production code changes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Pytest configuration**<br>`lsc_eval/pytest.ini` | Introduces Pytest config: discovery patterns, addopts, markers, minversion, ignore paths, and warning filters; optional commented coverage. |
| **Test documentation**<br>`lsc_eval/tests/README.md` | Adds README detailing test structure, markers, running strategies, fixtures, mock data, CI usage, and contribution guidelines. |
| **Test package initializers**<br>`lsc_eval/tests/__init__.py`, `lsc_eval/tests/core/__init__.py`, `lsc_eval/tests/llm_managers/__init__.py`, `lsc_eval/tests/metrics/__init__.py`, `lsc_eval/tests/output/__init__.py` | Adds module docstrings to mark packages; no logic changes. |
| **Shared fixtures**<br>`lsc_eval/tests/conftest.py` | Adds extensive fixtures: temp dirs, sample/invalid configs and data, YAML writers, and environment cleanup. |
| **Core tests: config & models**<br>`lsc_eval/tests/core/test_config_loader.py`, `lsc_eval/tests/core/test_models.py` | Adds tests for config loading, logging, metric mappings/validation, defaults, Pydantic models, and output config handling. |
| **Core tests: data validation**<br>`lsc_eval/tests/core/test_data_validator.py` | Adds tests for DataValidator: load, validation flows, metrics availability/requirements, error aggregation. |
| **LLM manager tests**<br>`lsc_eval/tests/llm_managers/test_llm_manager.py` | Adds tests for LLMConfig/LLMManager: provider init, env handling, errors, accessors, model naming, and messages. |
| **Metrics tests**<br>`lsc_eval/tests/metrics/test_custom_metrics.py` | Adds tests for CustomMetrics and EvaluationPromptParams: prompt creation, LLM call flow (mocked), score parsing/normalization, evaluation paths. |
| **Output utilities tests**<br>`lsc_eval/tests/output/test_utils_streamlined.py` | Adds tests for basic/detailed stats, score statistics, and evaluation scope across edge cases. |
| **Test runner CLI**<br>`lsc_eval/tests/test_runner.py` | Adds pytest-based CLI with commands (all/unit/integration/etc.), optional coverage, parallel, fail-fast, and path execution. |
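For reference, the Pytest configuration entry above might look roughly like this (values are illustrative, not the actual `lsc_eval/pytest.ini`):

```ini
[pytest]
minversion = 7.0
testpaths = tests
python_files = test_*.py
addopts = -v --tb=short --strict-config --strict-markers --durations=10 --color=yes
markers =
    unit: unit tests for individual components
    integration: integration tests across components
    slow: tests that take longer to run
    llm: tests requiring LLM API calls
filterwarnings =
    ignore::DeprecationWarning
; optional coverage (commented out in the real file):
; addopts = --cov=lsc_eval --cov-report=html
```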

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Dev as Developer
    participant CLI as test_runner.py (CLI)
    participant Pytest as pytest
    participant Cov as pytest-cov (optional)

    Dev->>CLI: main(cmd, args)
    alt command == "file"/"all"/marker-specific
        CLI->>CLI: build pytest args (-v, --tb=short, --strict-config, --durations)
        opt coverage requested
            CLI->>Cov: import pytest_cov
            alt available
                CLI->>CLI: add --cov, html/xml/term reports
            else not available
                CLI-->>Dev: print("coverage plugin not installed")
            end
        end
        opt parallel requested
            CLI->>CLI: add -n auto
        end
        opt fail-fast requested
            CLI->>CLI: add -x
        end
        CLI->>Pytest: run(args)
        Pytest-->>CLI: exit code
        CLI-->>Dev: exit code
    else unknown command
        CLI-->>Dev: print usage + non-zero code
    end

    note over CLI,Pytest: Markers (unit, integration, slow, llm, etc.) are passed via -m
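The branching in the diagram can be sketched as a small argument builder (the function name and exact flag values are assumptions; the real test_runner.py may differ):

```python
import importlib.util


def build_pytest_args(coverage=False, parallel=False, fail_fast=False):
    """Assemble a pytest argument list mirroring the CLI flow sketched above."""
    args = ["-v", "--tb=short", "--strict-config", "--durations=10"]
    if coverage:
        # Only add coverage flags when the plugin is importable.
        if importlib.util.find_spec("pytest_cov") is not None:
            args += ["--cov=lsc_eval", "--cov-report=term", "--cov-report=html"]
        else:
            print("coverage plugin not installed")
    if parallel:
        args += ["-n", "auto"]  # requires pytest-xdist
    if fail_fast:
        args.append("-x")
    return args
```

A command handler would then call something like `pytest.main(build_pytest_args(...) + ["tests/"])` and forward its exit code to the shell.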

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Generic eval tool #28 — Adds a similar comprehensive test suite and pytest tooling targeting the same modules and flows (core, llm managers, metrics, output, runner).

Suggested reviewers

  • tisnik
  • VladimirKadlec
  • Anxhela21

Poem

A rabbit taps keys in a testing spree,
Markers hop: unit, slow, and LLM with glee.
Fixtures burrow configs deep,
Pytest wakes from docile sleep.
Stats crunch, prompts rhyme—
Green ticks bloom, right on time. 🐇✅

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (31)
lsc_eval/tests/core/__init__.py (1)

1-2: Unify docstring phrasing across test packages.

Consider standardizing on “Core tests package.”, matching others.

lsc_eval/tests/__init__.py (1)

1-2: Minor consistency nit: align wording with subpackages.

Other subpackages end with “ tests package.” Suggest a shorter, consistent phrasing to match.

Apply:

-"""Test package for LSC Evaluation Framework."""
+"""LSC evaluation tests package."""
lsc_eval/tests/llm_managers/__init__.py (1)

1-2: Tiny style nit: capitalize consistently.

If you adopt the previous suggestion, ensure all subpackages follow the same pattern (e.g., “LLM managers tests package.”).

lsc_eval/tests/metrics/__init__.py (1)

1-2: Optional: standardize wording.

Keep phrasing consistent across all test packages.

lsc_eval/tests/output/__init__.py (1)

1-2: Optional: consistency pass.

If you rename the root docstring, mirror the style here as well.

lsc_eval/tests/README.md (2)

7-22: Add language to fenced block and fix filename in tree.

Addresses MD040 and aligns with actual file added.

-```
+```text
 tests/
 ├── conftest.py              # Pytest fixtures and configuration
 ├── test_runner.py           # Test runner script
 ├── README.md               # This file
 ├── core/                   # Core functionality tests
 │   ├── test_config_loader.py    # ConfigLoader class tests
 │   ├── test_models.py           # Pydantic models tests
 │   └── test_data_validator.py   # DataValidator class tests
 ├── llm_managers/           # LLM manager tests
 │   └── test_llm_manager.py      # LLMManager class tests
 ├── metrics/                # Metrics component tests
 │   └── test_custom_metrics.py   # Custom metrics tests
 └── output/                 # Output component tests
-    └── test_utils.py            # Output utilities tests
+    └── test_utils_streamlined.py # Output utilities tests

39-61: Clarify working directory and config usage in commands.

The ini lives at lsc_eval/pytest.ini; from repo root prefer passing -c or using full paths.

```diff
-# Run all tests
-pytest tests/
+# Run all tests (repo root)
+pytest -c lsc_eval/pytest.ini lsc_eval/tests/
 
-# Run with verbose output
-pytest -v tests/
+# Run with verbose output
+pytest -c lsc_eval/pytest.ini -v lsc_eval/tests/
 
-# Run specific test file
-pytest tests/core/test_config_loader.py
+# Run specific test file
+pytest -c lsc_eval/pytest.ini lsc_eval/tests/core/test_config_loader.py

```

Also applies to: 63-83

lsc_eval/tests/conftest.py (3)

195-201: Use safe_dump for YAML.

Safer serializer; same output for these structures.

 with open(config_file, "w", encoding="utf-8") as f:
-    yaml.dump(sample_system_config, f)
+    yaml.safe_dump(sample_system_config, f, sort_keys=False)

204-209: Use safe_dump for YAML.

 with open(data_file, "w", encoding="utf-8") as f:
-    yaml.dump(sample_evaluation_data, f)
+    yaml.safe_dump(sample_evaluation_data, f, sort_keys=False)
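The rationale for preferring safe_dump, sketched (assumes PyYAML is installed; the Custom class is purely illustrative):

```python
import yaml


class Custom:
    """An arbitrary Python object that YAML should not round-trip."""


data = {"llm": {"provider": "openai", "model": "gpt-4o-mini"}}

# safe_dump restricts output to plain YAML types, so whatever it writes,
# safe_load can always read back unchanged.
assert yaml.safe_load(yaml.safe_dump(data, sort_keys=False)) == data

# yaml.dump would emit a !!python/object tag for Custom(), which safe_load
# then refuses; safe_dump rejects the object up front instead.
refused = False
try:
    yaml.safe_dump({"obj": Custom()})
except yaml.representer.RepresenterError:
    refused = True
assert refused
```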

128-131: Logger names casing (nit).

“LiteLLM”/“DeepEval” may not match actual logger names (“litellm”/“deepeval”). If log filtering relies on exact names, consider normalizing.

Also applies to: 213-229
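A quick check of the stdlib behavior behind this nit: logger names are exact, case-sensitive keys, so "LiteLLM" and "litellm" are distinct loggers:

```python
import logging

# Logger names are case-sensitive keys in the logging module's registry,
# so filtering configured for "LiteLLM" never touches the "litellm" logger.
assert logging.getLogger("LiteLLM") is not logging.getLogger("litellm")
assert logging.getLogger("litellm") is logging.getLogger("litellm")

# A level set on the mixed-case name does not apply to the real logger.
logging.getLogger("LiteLLM").setLevel(logging.ERROR)
assert logging.getLogger("litellm").getEffectiveLevel() != logging.ERROR
```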

lsc_eval/tests/core/test_data_validator.py (3)

8-14: Mark tests for selection via -m.

Matches your pytest.ini markers and README flows.

 import pytest
 import yaml
 
+pytestmark = [pytest.mark.unit, pytest.mark.validation]

79-86: Specify encoding when writing YAML.

Prevents platform-dependent defaults.

 invalid_file = temp_dir / "invalid_structure.yaml"
-with open(invalid_file, "w") as f:
+with open(invalid_file, "w", encoding="utf-8") as f:
     yaml.dump(invalid_data, f)

112-120: Specify encoding when writing YAML.

-with open(invalid_file, "w") as f:
-    yaml.dump(invalid_data, f)
+with open(invalid_file, "w", encoding="utf-8") as f:
+    yaml.safe_dump(invalid_data, f, sort_keys=False)
lsc_eval/tests/output/test_utils_streamlined.py (1)

3-10: Mark tests for selection via -m output.

 import pytest
 from lsc_eval.output.utils import (
     EvaluationScope,
     calculate_basic_stats,
     calculate_detailed_stats,
 )
 from lsc_eval.core.models import EvaluationResult, TurnData
 
+pytestmark = [pytest.mark.unit, pytest.mark.output]
lsc_eval/tests/llm_managers/test_llm_manager.py (3)

95-115: Parametrize provider-init test for clearer failures and simpler loops

Use pytest.parametrize instead of a manual loop to get per-case reporting and simpler debugging.

-    def test_llm_manager_provider_initialization(self):
-        """Test LLMManager initialization with different providers."""
-        test_cases = [
-            ("openai", "gpt-4o-mini", {"OPENAI_API_KEY": "key"}, "gpt-4o-mini"),
-            ("azure", "gpt-4", {"AZURE_OPENAI_API_KEY": "key", "AZURE_OPENAI_ENDPOINT": "endpoint"}, "azure/gpt-4"),
-            ("anthropic", "claude-3-sonnet", {"ANTHROPIC_API_KEY": "key"}, "anthropic/claude-3-sonnet"),
-            ("gemini", "gemini-pro", {"GOOGLE_API_KEY": "key"}, "gemini/gemini-pro"),
-            ("watsonx", "llama-2", {"WATSONX_API_KEY": "key", "WATSONX_API_BASE": "base", "WATSONX_PROJECT_ID": "proj"}, "watsonx/llama-2"),
-            ("ollama", "llama2", {}, "ollama/llama2"),
-            ("custom", "model", {}, "custom/model"),
-        ]
-
-        for provider, model, env_vars, expected_model_name in test_cases:
-            config = LLMConfig(provider=provider, model=model)
-            
-            with patch.dict(os.environ, env_vars, clear=True):
-                with patch('builtins.print'):
-                    manager = LLMManager(config)
-
-            assert manager.model_name == expected_model_name
+    @pytest.mark.parametrize(
+        "provider,model,env_vars,expected_model_name",
+        [
+            ("openai", "gpt-4o-mini", {"OPENAI_API_KEY": "key"}, "gpt-4o-mini"),
+            ("azure", "gpt-4", {"AZURE_OPENAI_API_KEY": "key", "AZURE_OPENAI_ENDPOINT": "endpoint"}, "azure/gpt-4"),
+            ("anthropic", "claude-3-sonnet", {"ANTHROPIC_API_KEY": "key"}, "anthropic/claude-3-sonnet"),
+            ("gemini", "gemini-pro", {"GOOGLE_API_KEY": "key"}, "gemini/gemini-pro"),
+            ("watsonx", "llama-2", {"WATSONX_API_KEY": "key", "WATSONX_API_BASE": "base", "WATSONX_PROJECT_ID": "proj"}, "watsonx/llama-2"),
+            ("ollama", "llama2", {}, "ollama/llama2"),
+            ("custom", "model", {}, "custom/model"),
+        ],
+    )
+    def test_llm_manager_provider_initialization(self, provider, model, env_vars, expected_model_name):
+        """Test LLMManager initialization with different providers."""
+        config = LLMConfig(provider=provider, model=model)
+        with patch.dict(os.environ, env_vars, clear=True):
+            with patch('builtins.print'):
+                manager = LLMManager(config)
+        assert manager.model_name == expected_model_name

262-265: Avoid confusion with variable name ‘call’

‘call’ is also a unittest.mock helper. Rename the loop variable for readability.

-        print_calls = [call.args[0] for call in mock_print.call_args_list]
-        assert any("⚠️ Using generic provider format for unknown_provider" in call for call in print_calls)
-        assert any("✅ LLM Manager: unknown_provider/custom-model -> unknown_provider/custom-model" in call for call in print_calls)
+        print_calls = [entry.args[0] for entry in mock_print.call_args_list]
+        assert any("⚠️ Using generic provider format for unknown_provider" in s for s in print_calls)
+        assert any("✅ LLM Manager: unknown_provider/custom-model -> unknown_provider/custom-model" in s for s in print_calls)

287-306: Add a test for Azure deployment-name override

Cover AZURE_OPENAI_DEPLOYMENT_NAME behavior to lock provider-specific model-name construction.

def test_azure_uses_deployment_name_when_set():
    config = LLMConfig(provider="azure", model="ignored-model")
    with patch.dict(os.environ, {
        "AZURE_OPENAI_API_KEY": "k",
        "AZURE_OPENAI_ENDPOINT": "e",
        "AZURE_OPENAI_DEPLOYMENT_NAME": "dep-name",
    }, clear=True):
        with patch('builtins.print'):
            manager = LLMManager(config)
    assert manager.model_name == "azure/dep-name"
lsc_eval/tests/core/test_models.py (4)

155-160: Assert error message for whitespace-only conversation_group_id

Strengthens the contract and keeps checks consistent with the empty-string case.

-        with pytest.raises(ValidationError) as exc_info:
-            EvaluationData(
-                conversation_group_id="   ",  # Invalid: whitespace only
-                turns=[turn_data]
-            )
+        with pytest.raises(ValidationError) as exc_info:
+            EvaluationData(
+                conversation_group_id="   ",  # Invalid: whitespace only
+                turns=[turn_data]
+            )
+        assert "Conversation group ID cannot be empty" in str(exc_info.value)

344-353: Also assert the validation message for score > 1

This mirrors the check already done for score < 0.

         with pytest.raises(ValidationError) as exc_info:
             EvaluationResult(
                 conversation_group_id="conv_group_1",
                 turn_id=1,
                 metric_identifier="ragas:faithfulness",
                 result="PASS",
                 score=1.1  # Invalid: above 1
             )
+        assert "Score must be between 0 and 1" in str(exc_info.value)

445-456: Validate temperature bounds error messages

Make expectations explicit for both out-of-range cases.

         with pytest.raises(ValidationError) as exc_info:
             LLMConfig(
                 model_name="gpt-4o-mini",
                 temperature=-0.1  # Invalid: below 0
             )
+        assert "Temperature must be between 0.0 and 2.0" in str(exc_info.value)
 
         with pytest.raises(ValidationError) as exc_info:
             LLMConfig(
                 model_name="gpt-4o-mini",
                 temperature=2.1  # Invalid: above 2
             )
+        assert "Temperature must be between 0.0 and 2.0" in str(exc_info.value)

20-52: Consider adding negative-context tests for contexts validator

Add cases for a context missing “content” and for empty “content” to fully cover TurnData.contexts validation.

def test_turn_data_context_missing_content_key():
    with pytest.raises(ValidationError) as exc_info:
        TurnData(turn_id=1, query="q", response="r", contexts=[{}])
    assert 'must have a "content" field' in str(exc_info.value)

def test_turn_data_context_empty_content():
    with pytest.raises(ValidationError) as exc_info:
        TurnData(turn_id=1, query="q", response="r", contexts=[{"content": "   "}])
    assert "content cannot be empty" in str(exc_info.value)
lsc_eval/tests/metrics/test_custom_metrics.py (3)

140-144: Assert LiteLLM call parameters are passed through

Verify model/temperature/max_tokens/timeout/num_retries wiring.

         result = custom_metrics._call_llm("Test prompt")
 
-        assert result == "Test response content"
-        mock_completion.assert_called_once()
+        assert result == "Test response content"
+        mock_completion.assert_called_once()
+        kwargs = mock_completion.call_args.kwargs
+        assert kwargs["model"] == "gpt-4o-mini"
+        assert kwargs["temperature"] == 0.0
+        assert kwargs["max_tokens"] == 512
+        assert kwargs["timeout"] == 300
+        assert kwargs["num_retries"] == 3

158-165: Access mock call kwargs directly (clearer than tuple indices)

Small readability improvement when inspecting messages.

-        call_args = mock_completion.call_args
-        messages = call_args[1]['messages']
+        kwargs = mock_completion.call_args.kwargs
+        messages = kwargs['messages']

223-234: Add zero-denominator cases to parsing tests

Guard against division-by-zero returning None as intended.

def test_parse_score_response_zero_denominator_fraction(custom_metrics):
    score, reason = custom_metrics._parse_score_response("Score: 8/0 - invalid")
    assert score is None

def test_extract_score_from_text_zero_denominator(custom_metrics):
    assert custom_metrics._extract_score_from_text("Rating 5 out of 0") is None
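A hedged sketch of the guard these tests expect (the helper name and regex are assumptions, not the actual _parse_score_response implementation):

```python
import re


def parse_fraction_score(text):
    """Extract an X/Y score from text, returning None for zero denominators."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", text)
    if not match:
        return None
    numerator, denominator = float(match.group(1)), float(match.group(2))
    if denominator == 0:  # guard against ZeroDivisionError
        return None
    return numerator / denominator


assert parse_fraction_score("Score: 8/10 - good") == 0.8
assert parse_fraction_score("Score: 8/0 - invalid") is None
```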
lsc_eval/tests/core/test_config_loader.py (3)

31-44: Use pytest monkeypatch for env isolation

Manipulating os.environ directly can leak across tests if a failure occurs mid-assertion. Prefer monkeypatch.setenv/delenv for safety and readability.

Example for one test (pattern applies to others):

-    def test_setup_environment_variables_success(self, system_config_file: Path):
+    def test_setup_environment_variables_success(self, system_config_file: Path, monkeypatch):
         """Test successful environment variable setup."""
-        # Clear any existing env vars
-        env_vars = ["DEEPEVAL_TELEMETRY_OPT_OUT", "DEEPEVAL_DISABLE_PROGRESS_BAR", "LITELLM_LOG_LEVEL"]
-        for var in env_vars:
-            if var in os.environ:
-                del os.environ[var]
+        for var in ["DEEPEVAL_TELEMETRY_OPT_OUT", "DEEPEVAL_DISABLE_PROGRESS_BAR", "LITELLM_LOG_LEVEL"]:
+            monkeypatch.delenv(var, raising=False)

Also applies to: 45-53, 54-65, 66-85


91-110: Case-sensitivity of package overrides may surprise users

setup_logging treats logger names case-sensitively, while sample config uses mixed-case keys (e.g., "LiteLLM", "DeepEval"). Consider normalizing override keys to match expected logger names (lowercase) or documenting case sensitivity. Optionally add a test asserting behavior for mixed-case overrides.

Would you like me to submit a follow-up patch to normalize override keys within setup_logging?

Also applies to: 111-132
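One way setup_logging could normalize override keys as suggested (a sketch; the real function's signature and config shape may differ):

```python
import logging


def apply_logger_overrides(overrides):
    """Apply per-package log levels with case-insensitive logger names."""
    for name, level in overrides.items():
        # Normalize "LiteLLM" -> "litellm" so overrides hit the real loggers.
        logging.getLogger(name.lower()).setLevel(getattr(logging, level.upper()))


apply_logger_overrides({"LiteLLM": "error", "DeepEval": "warning"})
assert logging.getLogger("litellm").level == logging.ERROR
assert logging.getLogger("deepeval").level == logging.WARNING
```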


118-122: Brittle expectation on invalid source_level raising AttributeError

Today getattr(...) raises, but if setup_logging later hardens to default instead of throwing, this test will fail. Consider asserting that invalid levels are rejected (via exception or warning + fallback) using capsys to capture the warning, or scoping the expectation to current behavior in a comment.
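A sketch of the hardened behavior this comment anticipates, where invalid level names fall back instead of raising (helper name and default are assumptions):

```python
import logging


def resolve_level(name, default=logging.INFO):
    """Map a level name to its numeric value, falling back on invalid input."""
    level = getattr(logging, str(name).upper(), None)
    if not isinstance(level, int):
        return default  # invalid names like "verbose" fall back quietly
    return level


assert resolve_level("debug") == logging.DEBUG
assert resolve_level("verbose") == logging.INFO  # fallback, no AttributeError
```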

lsc_eval/tests/test_runner.py (4)

70-73: Gate parallel execution on pytest-xdist availability

Avoid hard-fail when xdist isn’t installed.

-    if parallel:
-        args.extend(["-n", "auto"])
+    if parallel:
+        try:
+            import xdist  # type: ignore
+            args.extend(["-n", "auto"])
+        except ImportError:
+            print("⚠️  pytest-xdist not installed. Install with: pip install pytest-xdist")
+            print("Running tests without parallelization...")

129-139: Deduplicate coverage plugin check

run_tests already guards coverage via import pytest_cov; run_with_coverage can directly call run_tests(coverage=True) and let run_tests handle fallback to reduce drift.

 def run_with_coverage() -> int:
     """Run all tests with coverage reporting."""
-    try:
-        import pytest_cov
-        return run_tests(coverage=True)
-    except ImportError:
-        print("⚠️  pytest-cov not installed for coverage reporting.")
-        print("Install with: pip install pytest-cov")
-        print("Running tests without coverage instead...")
-        return run_tests()
+    return run_tests(coverage=True)

167-178: Tweak sys.path insertion and remove unused variable

  • Avoid duplicate entries in sys.path.
  • test_dir is unused.
-    test_dir = Path(__file__).parent  # tests/ directory
@@
-    if src_path.exists():
-        sys.path.insert(0, str(src_path))
+    if src_path.exists():
+        src_str = str(src_path)
+        if src_str not in sys.path:
+            sys.path.insert(0, src_str)

141-164: Return 0 on usage to avoid false negatives in tooling

Printing usage with exit code 1 can be treated as a failure by wrappers. Consider exit 0 unless you intend to signal error.

-        return 1
+        return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between ca9a863 and 6afd14f.

⛔ Files ignored due to path filters (1)
  • lsc_eval/pdm.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • lsc_eval/pytest.ini (1 hunks)
  • lsc_eval/tests/README.md (1 hunks)
  • lsc_eval/tests/__init__.py (1 hunks)
  • lsc_eval/tests/conftest.py (1 hunks)
  • lsc_eval/tests/core/__init__.py (1 hunks)
  • lsc_eval/tests/core/test_config_loader.py (1 hunks)
  • lsc_eval/tests/core/test_data_validator.py (1 hunks)
  • lsc_eval/tests/core/test_models.py (1 hunks)
  • lsc_eval/tests/llm_managers/__init__.py (1 hunks)
  • lsc_eval/tests/llm_managers/test_llm_manager.py (1 hunks)
  • lsc_eval/tests/metrics/__init__.py (1 hunks)
  • lsc_eval/tests/metrics/test_custom_metrics.py (1 hunks)
  • lsc_eval/tests/output/__init__.py (1 hunks)
  • lsc_eval/tests/output/test_utils_streamlined.py (1 hunks)
  • lsc_eval/tests/test_runner.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • lsc_eval/tests/__init__.py
🧬 Code graph analysis (6)
lsc_eval/tests/core/test_data_validator.py (4)
lsc_eval/src/lsc_eval/core/data_validator.py (5)
  • DataValidator (11-82)
  • load_evaluation_data (19-37)
  • validate_evaluation_data (39-54)
  • _validate_metrics_availability (56-72)
  • _validate_metric_requirements (74-82)
lsc_eval/src/lsc_eval/core/models.py (2)
  • EvaluationData (47-130)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/core/config_loader.py (1)
  • populate_metric_mappings (114-129)
lsc_eval/tests/conftest.py (3)
  • evaluation_data_file (204-209)
  • sample_system_config (19-124)
  • temp_dir (12-15)
lsc_eval/tests/core/test_config_loader.py (2)
lsc_eval/src/lsc_eval/core/config_loader.py (7)
  • ConfigLoader (193-275)
  • SystemConfig (151-190)
  • setup_environment_variables (30-47)
  • setup_logging (50-111)
  • populate_metric_mappings (114-129)
  • load_system_config (202-259)
  • get_llm_config_dict (261-275)
lsc_eval/tests/conftest.py (3)
  • system_config_file (195-200)
  • temp_dir (12-15)
  • sample_system_config (19-124)
lsc_eval/tests/llm_managers/test_llm_manager.py (1)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (5)
  • LLMManager (36-173)
  • LLMError (8-9)
  • from_dict (24-33)
  • get_litellm_params (154-162)
  • get_config (164-166)
lsc_eval/tests/core/test_models.py (1)
lsc_eval/src/lsc_eval/core/models.py (5)
  • TurnData (8-44)
  • EvaluationData (47-130)
  • EvaluationSystemConfig (199-208)
  • OutputConfig (211-224)
  • validate_metric_requirements (93-130)
lsc_eval/tests/metrics/test_custom_metrics.py (4)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (7)
  • CustomMetrics (29-251)
  • EvaluationPromptParams (14-26)
  • _call_llm (65-89)
  • _parse_score_response (91-130)
  • _extract_score_from_text (132-162)
  • _create_evaluation_prompt (164-197)
  • _evaluate_answer_correctness (199-245)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)
  • LLMManager (36-173)
  • get_litellm_params (154-162)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/tests/output/test_utils_streamlined.py (2)
lsc_eval/src/lsc_eval/output/utils.py (3)
  • EvaluationScope (11-16)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
🪛 LanguageTool
lsc_eval/tests/README.md

[grammar] ~28-~28: There might be a mistake here.
Context: ...**: Unit tests for individual components - integration: Integration tests across components - ...

(QB_NEW_EN)


[grammar] ~29-~29: There might be a mistake here.
Context: ...**: Integration tests across components - **config`**: Configuration loading and validation t...

(QB_NEW_EN)


[grammar] ~30-~30: There might be a mistake here.
Context: ...nfiguration loading and validation tests - models: Pydantic model validation tests - **`v...

(QB_NEW_EN)


[grammar] ~31-~31: There might be a mistake here.
Context: ...dels**: Pydantic model validation tests - **validation**: Data validation tests - **output`**: ...

(QB_NEW_EN)


[grammar] ~32-~32: There might be a mistake here.
Context: ... validation: Data validation tests - output: Output generation and formatting tests...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ...: Output generation and formatting tests - slow: Tests that take longer to run - **`llm...

(QB_NEW_EN)


[grammar] ~34-~34: There might be a mistake here.
Context: ...*slow**: Tests that take longer to run - llm: Tests requiring LLM API calls (may be ...

(QB_NEW_EN)


[grammar] ~91-~91: There might be a mistake here.
Context: ...environment variables before/after tests - temp_dir: Provides temporary directory for test ...

(QB_NEW_EN)


[grammar] ~92-~92: There might be a mistake here.
Context: ...vides temporary directory for test files - sample_system_config: Provides sample system configuration -...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...**: Provides sample system configuration - sample_evaluation_data: Provides sample evaluation data ### M...

(QB_NEW_EN)


[grammar] ~100-~100: There might be a mistake here.
Context: ...ropic, Gemini, WatsonX, Ollama providers - Metrics: Ragas, DeepEval, and Custom m...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...Custom metrics as defined in system.yaml - Output Formats: CSV, JSON, TXT formats...

(QB_NEW_EN)


[grammar] ~102-~102: There might be a mistake here.
Context: ..., TXT formats with visualization options - Evaluation Data: Multi-turn conversati...

(QB_NEW_EN)


[grammar] ~109-~109: There might be a mistake here.
Context: ... test suite covers: ### Core Components - ConfigLoader: System configuration loa...

(QB_NEW_EN)


🪛 markdownlint-cli2 (0.17.2)
lsc_eval/tests/README.md

7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (22)
lsc_eval/tests/core/__init__.py (1)

1-2: LGTM — explicit test package marker.

Docstring is fine and helps tooling; no runtime impact.

lsc_eval/tests/__init__.py (1)

1-2: LGTM.

Package init present; harmless and aids discovery.

lsc_eval/tests/llm_managers/__init__.py (1)

1-2: LGTM — matches style used elsewhere.

lsc_eval/tests/metrics/__init__.py (1)

1-2: LGTM.

Clear, minimal package marker.

lsc_eval/tests/output/__init__.py (1)

1-2: LGTM.

Consistent with other test package initializers.

lsc_eval/pytest.ini (1)

5-9: Confirm config discovery path (pytest.ini under lsc_eval/).

Running pytest from repo root won’t load this ini unless CI/tools pass -c lsc_eval/pytest.ini or cd into lsc_eval/. Consider moving this file to repo root, or update docs/test runner to use -c.

Also applies to: 11-17, 29-31, 38-45, 46-47

lsc_eval/tests/conftest.py (1)

127-131: Typing requires Python ≥3.9.

You use built-in generics (list[...]). Confirm CI Python version; else switch to typing.List for broader compatibility.

-def sample_evaluation_data() -> list[Dict[str, Any]]:
+from typing import List
+def sample_evaluation_data() -> List[Dict[str, Any]]:
 
-def invalid_evaluation_data() -> list[Dict[str, Any]]:
+def invalid_evaluation_data() -> List[Dict[str, Any]]:

Also applies to: 232-257
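For context, the built-in generic syntax `list[Dict[str, Any]]` only parses on Python 3.9+, while `typing.List` works back to 3.5. A minimal sketch of the compatible form (the fixture body here is illustrative, not the real conftest.py content):

```python
from typing import Any, Dict, List


def sample_evaluation_data() -> List[Dict[str, Any]]:
    # Illustrative payload only; the real fixture lives in conftest.py.
    return [{"query": "What is RAG?", "response": "Retrieval-augmented generation."}]
```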

lsc_eval/tests/core/test_data_validator.py (1)

121-149: Solid coverage and structure.

Good separation of success/failure paths, internal helper testing, and error clearing.

Also applies to: 178-205, 206-247, 248-271, 272-295, 296-323, 324-351, 352-384, 385-420

lsc_eval/tests/output/test_utils_streamlined.py (2)

35-85: LGTM — scenarios are representative and assertions are precise.

Also applies to: 87-192


15-33: EvaluationScope correctly supports kwargs
The class is decorated with @dataclass, so its generated __init__ accepts turn_idx, turn_data, and is_conversation as used in the tests—no changes needed.

lsc_eval/tests/metrics/test_custom_metrics.py (1)

100-106: EvaluationScope is a dataclass—keyword args are supported.
No workaround needed; the generated `__init__` accepts turn_idx, turn_data, and is_conversation.

lsc_eval/tests/core/test_config_loader.py (9)

1-24: Good coverage and structure for config loader tests

Solid end-to-end and unit coverage across environment, logging, metrics, model, and loader paths. Nice use of fixtures and patching.


139-160: Metric mapping tests look correct and isolate state

populate_metric_mappings clears sets; your tests also probe edge-cases. LGTM.

Also applies to: 161-173, 174-192


200-249: Validation test suite is precise and readable

Clear expectations and error messages. LGTM.


257-270: SystemConfig model defaults and overrides validated well

Covers defaults, overrides, and basic type validation. LGTM.

Also applies to: 271-288, 289-303


318-331: Loader success/error-path tests are on point

Happy-path plus file-not-found and invalid YAML are exercised correctly. LGTM.

Also applies to: 332-348


349-377: Partial config defaults behavior covered well

Asserts both provided values and fallbacks. LGTM.


378-397: LLM config dict accessors validated, including error path

Good coverage for both presence and absence of system_config. LGTM.

Also applies to: 398-404


405-419: Post-load side effects (metrics + logger + integration with setup_logging) verified

The combination of direct assertions and a targeted patch on setup_logging is effective. LGTM.

Also applies to: 420-428, 429-440


441-456: CSV columns constant and custom override checks are pragmatic

Subset membership avoids ordering brittleness; custom override path validated. LGTM.

Also applies to: 457-475, 476-494

lsc_eval/tests/test_runner.py (2)

78-86: Nice baseline options

--strict-config, colored output, durations are helpful for CI signal.


85-87: Overall runner is practical and CI-friendly

Clear UX, sensible defaults, and minimal dependencies. LGTM after marker fix.

Also applies to: 141-208

@@ -0,0 +1,47 @@
[tool:pytest]

⚠️ Potential issue

Fix pytest.ini header: use [pytest], not [tool:pytest].

Pytest won’t read options from this section; config will be ignored.

Apply:

-[tool:pytest]
+[pytest]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
[tool:pytest]
# lsc_eval/pytest.ini
[pytest]
🤖 Prompt for AI Agents
In lsc_eval/pytest.ini around line 1, the file uses the wrong INI section header
"[tool:pytest]" so pytest will ignore the options; change the section header to
"[pytest]" (i.e., replace "[tool:pytest]" with "[pytest]") so pytest will read
the configuration, and keep any existing options under that header.

Comment on lines +11 to +16
addopts =
-v
--tb=short
--strict-config
--color=yes
--durations=10

🛠️ Refactor suggestion

Enforce registered markers.

Add --strict-markers to catch typos (you already define markers below).

 addopts = 
     -v
     --tb=short
     --strict-config
     --color=yes
+    --strict-markers
     --durations=10

Suggested change
addopts =
-v
--tb=short
--strict-config
--color=yes
--durations=10
addopts =
-v
--tb=short
--strict-config
--color=yes
--strict-markers
--durations=10
🤖 Prompt for AI Agents
In lsc_eval/pytest.ini around lines 11 to 16, the pytest addopts lacks the
--strict-markers flag so typos in user-defined markers aren’t enforced; update
the addopts value to include --strict-markers (e.g., append --strict-markers to
the existing options) so pytest will fail on unregistered/misspelled markers.

Comment on lines +32 to +37
# Test collection
collect_ignore =
setup.py
build
dist


⚠️ Potential issue

Replace invalid ini key collect_ignore (ini) with norecursedirs or move to conftest.

With --strict-config this will error once the header is fixed.

-# Test collection
-collect_ignore = 
-    setup.py
-    build
-    dist
+# Test collection
+norecursedirs = build dist

Note: setup.py is already out of scope with testpaths=tests.


Suggested change
# Test collection
collect_ignore =
setup.py
build
dist
# Test collection
norecursedirs = build dist
🤖 Prompt for AI Agents
In lsc_eval/pytest.ini around lines 32 to 37 the file uses the invalid ini key
`collect_ignore` (which will fail with --strict-config); replace that entry with
the valid `norecursedirs` option listing the same directories (e.g. build, dist)
or remove the directive from pytest.ini and implement the ignore via a
conftest.py (using pytest_ignore_collect or by configuring collection there);
also remove `setup.py` from the list (it's already excluded by testpaths=tests)
so the resulting pytest.ini only contains valid keys.

Comment on lines +437 to +444
scope1 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_expected, is_conversation=False)
score1, reason1 = custom_metrics.evaluate("answer_correctness", None, scope1)
assert score1 == 0.7

# Test with no contexts
scope2 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_contexts, is_conversation=False)
score2, reason2 = custom_metrics.evaluate("answer_correctness", None, scope2)
assert score2 == 0.7

⚠️ Potential issue

EvaluationScope keyword-init again — mirror the resilient pattern

Align with the fixture approach to avoid a TypeError when the class lacks a keyword `__init__`.

-            # Test with no expected response
-            scope1 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_expected, is_conversation=False)
+            # Test with no expected response
+            try:
+                scope1 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_expected, is_conversation=False)
+            except TypeError:
+                scope1 = EvaluationScope()
+                scope1.turn_idx = 0
+                scope1.turn_data = turn_data_no_expected
+                scope1.is_conversation = False
             score1, reason1 = custom_metrics.evaluate("answer_correctness", None, scope1)
             assert score1 == 0.7
 
-            # Test with no contexts
-            scope2 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_contexts, is_conversation=False)
+            # Test with no contexts
+            try:
+                scope2 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_contexts, is_conversation=False)
+            except TypeError:
+                scope2 = EvaluationScope()
+                scope2.turn_idx = 0
+                scope2.turn_data = turn_data_no_contexts
+                scope2.is_conversation = False
             score2, reason2 = custom_metrics.evaluate("answer_correctness", None, scope2)
             assert score2 == 0.7

Suggested change
scope1 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_expected, is_conversation=False)
score1, reason1 = custom_metrics.evaluate("answer_correctness", None, scope1)
assert score1 == 0.7
# Test with no contexts
scope2 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_contexts, is_conversation=False)
score2, reason2 = custom_metrics.evaluate("answer_correctness", None, scope2)
assert score2 == 0.7
# Test with no expected response
try:
scope1 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_expected, is_conversation=False)
except TypeError:
scope1 = EvaluationScope()
scope1.turn_idx = 0
scope1.turn_data = turn_data_no_expected
scope1.is_conversation = False
score1, reason1 = custom_metrics.evaluate("answer_correctness", None, scope1)
assert score1 == 0.7
# Test with no contexts
try:
scope2 = EvaluationScope(turn_idx=0, turn_data=turn_data_no_contexts, is_conversation=False)
except TypeError:
scope2 = EvaluationScope()
scope2.turn_idx = 0
scope2.turn_data = turn_data_no_contexts
scope2.is_conversation = False
score2, reason2 = custom_metrics.evaluate("answer_correctness", None, scope2)
assert score2 == 0.7
🤖 Prompt for AI Agents
In lsc_eval/tests/metrics/test_custom_metrics.py around lines 437 to 444, the
test instantiates EvaluationScope with keyword args which will raise a TypeError
if EvaluationScope does not define an __init__; change the pattern to mirror the
fixture-resilient approach by creating an empty instance (scope =
EvaluationScope() or however the class is default-constructible) and then assign
attributes explicitly (scope.turn_idx = 0; scope.turn_data =
turn_data_no_expected; scope.is_conversation = False) for scope1 and similarly
for scope2 so the test works regardless of whether EvaluationScope defines a
keyword __init__.
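The fallback pattern described above can be factored into one helper so both call sites stay short. This is a sketch; `_BareScope` is a hypothetical stand-in for a class without a keyword `__init__`:

```python
from typing import Any, Dict


def make_scope(cls, turn_idx: int, turn_data: Dict[str, Any], is_conversation: bool):
    """Construct via keyword arguments; fall back to attribute assignment on TypeError."""
    try:
        return cls(turn_idx=turn_idx, turn_data=turn_data, is_conversation=is_conversation)
    except TypeError:
        scope = cls()
        scope.turn_idx = turn_idx
        scope.turn_data = turn_data
        scope.is_conversation = is_conversation
        return scope


class _BareScope:
    """Stand-in without a keyword __init__, to exercise the fallback path."""


scope = make_scope(_BareScope, 0, {"query": "hi"}, False)
```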

Comment on lines +47 to +54
# Markers
if markers:
for marker in markers:
if marker.startswith("not "):
args.extend(["-m", marker])
else:
args.extend(["-m", marker])


🛠️ Refactor suggestion

Combine multiple markers into a single -m expression

Passing multiple -m options applies only the last one. Join markers with logical AND (or OR) for intended selection.

-    if markers:
-        for marker in markers:
-            if marker.startswith("not "):
-                args.extend(["-m", marker])
-            else:
-                args.extend(["-m", marker])
+    if markers:
+        expr = " and ".join(f"({m})" for m in markers)
+        args.extend(["-m", expr])

Suggested change
# Markers
if markers:
for marker in markers:
if marker.startswith("not "):
args.extend(["-m", marker])
else:
args.extend(["-m", marker])
# Markers
if markers:
expr = " and ".join(f"({m})" for m in markers)
args.extend(["-m", expr])
🤖 Prompt for AI Agents
In lsc_eval/tests/test_runner.py around lines 47–54, the code appends multiple
"-m" options separately which causes pytest to only honor the last marker;
instead build a single pytest marker expression by joining all markers with a
logical operator (e.g., " and ") and pass that as a single "-m" argument. For
markers starting with "not ", convert them to "not(<marker>)" form before
joining so negations are preserved, then replace the current loop with logic
that constructs the combined expression and does args.extend(["-m",
combined_expression]).
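A sketch of the combined-expression approach; the parenthesization preserves `not` markers without special-casing them (the marker names here are just examples):

```python
def build_marker_expr(markers):
    """Join pytest marker expressions into a single -m expression with logical AND."""
    return " and ".join(f"({m})" for m in markers)


expr = build_marker_expr(["unit", "not slow"])
args = ["-m", expr]
```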

@VladimirKadlec VladimirKadlec left a comment


I'd merge this PR first.

@bsatapat-jpg bsatapat-jpg deleted the dev branch September 1, 2025 16:22