Conversation

@asamal4
Collaborator

@asamal4 asamal4 commented Aug 26, 2025

Generic & configurable eval tool/framework (no dependency on lightspeed-stack)

Test cases are pending and will be added in follow-up PRs.
This is still a WIP; adding it to the repo so others can test.

Summary by CodeRabbit

  • New Features

    • Adds the LSC Evaluation Framework and CLI (lsc-eval) for running turn- and conversation-level evaluations; programmatic runner also available.
    • Built-in metric support: Ragas, DeepEval, and Custom LLM-based metrics with configurable thresholds.
    • Generates CSV/JSON/TXT reports and optional graphs (pass rates, score distribution, conversation heatmap).
    • Includes sample system and evaluation configuration files.
  • Documentation

    • Adds a comprehensive README with quick start, configuration, and usage guidance.
  • Chores

    • Updates .gitignore and expands tooling targets (type checks, linters, auto-format fix).

@coderabbitai
Contributor

coderabbitai bot commented Aug 26, 2025

Warning

Rate limit exceeded

@asamal4 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 8 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between e5fcdee and d94afec.

📒 Files selected for processing (17)
  • lsc_eval/README.md (1 hunks)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/models.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/utils.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 45

🧹 Nitpick comments (42)
.gitignore (2)

181-182: Remove redundant ignore: lsc_eval/eval_output*/ is unnecessary

The existing pattern eval_output*/ (Line 180) already matches directories named eval_output* at any depth in the repo. The added lsc_eval/eval_output*/ is redundant and can be dropped to avoid drift and duplication.

Apply this minimal diff:

- lsc_eval/eval_output*/
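A quick way to verify the claim that the slash-free pattern already matches at any depth is `git check-ignore` in a scratch repo (a sketch; the directory names are illustrative):

```shell
# Verify that 'eval_output*/' alone ignores nested eval_output* directories.
# A gitignore pattern with no internal slash matches at any depth; the
# trailing slash only restricts the match to directories.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf 'eval_output*/\n' > .gitignore
mkdir -p lsc_eval/eval_output_run1
# Prints the path, confirming it is ignored by the single root pattern
git check-ignore lsc_eval/eval_output_run1/
```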

183-184: Broaden .deepeval ignore to any location, not just under lsc_eval/

DeepEval may create .deepeval/ at the repo root (or elsewhere). Scoping the ignore to lsc_eval/.deepeval/ could miss those artifacts. Prefer ignoring .deepeval/ globally.

Proposed change:

-# DeepEval telemetry and configuration
-lsc_eval/.deepeval/
+# DeepEval telemetry and configuration
+.deepeval/

If you explicitly want to scope it to any subdir (not only root), you could use **/.deepeval/ instead.

lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)

29-29: Use logging instead of print for library code.

Printing in a library is noisy for CLI/consumers; prefer logging.

-        print(f"✅ DeepEval LLM Manager: {self.model_name}")
+        logger.info("DeepEval LLM Manager initialized for model %s", self.model_name)
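The diffs above reference a `logger` that the snippets never define; a conventional module-level setup (the class name below is a hypothetical stand-in for the real manager) would be:

```python
import logging

# Module-level logger; inherits handlers and level from the application's
# logging configuration, so library code stays quiet by default.
logger = logging.getLogger(__name__)


class DeepEvalLLMManagerSketch:  # hypothetical stand-in for DeepEvalLLMManager
    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        # Lazy %-formatting defers interpolation until a handler emits the record
        logger.info("DeepEval LLM Manager initialized for model %s", self.model_name)
```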

31-33: Stringize or postpone annotations to avoid import-time typing issues.

If DeepEval isn’t installed, exported annotations referencing LiteLLMModel can still cause tooling churn. With the guard above, you’re mostly fine; optionally stringize the return type to be extra-safe.

-    def get_llm(self) -> LiteLLMModel:
+    def get_llm(self) -> "LiteLLMModel":
         """Get the configured DeepEval LiteLLM model."""
         return self.llm_model
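A file-wide alternative to stringizing individual annotations is postponed evaluation via `from __future__ import annotations`, optionally combined with a `TYPE_CHECKING` guard so the dependency is only imported by static type checkers (a sketch; the `LiteLLMModel` import path is assumed from the diff above):

```python
from __future__ import annotations  # all annotations are evaluated lazily as strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; no runtime dependency on deepeval
    from deepeval.models import LiteLLMModel


class ManagerSketch:  # hypothetical stand-in for the LLM manager
    def __init__(self, llm_model) -> None:
        self.llm_model = llm_model

    def get_llm(self) -> LiteLLMModel:
        """Return the configured model; the annotation is never evaluated at runtime."""
        return self.llm_model
```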
lsc_eval/README.md (5)

3-3: Tighten intro grammar.

-A comprehensive framework/tool to evaluate GenAI application.
+A comprehensive tool to evaluate GenAI applications.

19-25: Unify install instructions with repo org and mention optional deps.

Ensure readers install the correct repo and, when using Ragas/DeepEval metrics, pull optional dependencies.

-# From Git
-pdm add git+https://github.com/your-org/lightspeed-evaluation.git#subdirectory=lsc_eval
-# or pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# From Git (PDM)
+pdm add git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# Or via pip
+# pip install "git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval"
+
+# Optional metric backends:
+# Ragas + LiteLLM
+# pdm add ragas litellm
+# DeepEval + LiteLLM
+# pdm add deepeval litellm

32-34: Prefer the CLI entrypoint if provided, keep runner.py as an alternative.

The PR summary mentions a console script. Consider showcasing it first.

-# Run evaluation (Create your own data)
-python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Run evaluation (create your own data)
+# If installed as a package with console script:
+# lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Or directly via module:
+python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml

121-125: Clarify “Actual Reasons” wording.

Minor readability tweak.

-**Actual Reasons**: DeepEval provides LLM-generated explanations, Custom metrics provide detailed reasoning
+**Reasons/Explanations**: DeepEval provides LLM-generated explanations; custom metrics include detailed reasoning

126-133: Pin the dev tools order and add pyright/pylint notes.

Not required, but it helps new contributors reproduce CI locally.

 pdm run black .
 pdm run ruff check .
 pdm run mypy .
-pdm run pyright .
-pdm run pylint .
+pdm run pyright .
+pdm run pylint .
+# Tip: if you don't use DeepEval/Ragas, you can still run linters;
+# optional imports in code are guarded to avoid hard failures.
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (3)

57-59: Use logging for errors instead of print.

Keep libraries quiet unless explicitly configured.

-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.error("Ragas LLM failed: %s", e)
+            raise RuntimeError(f"Ragas LLM evaluation failed: {e}") from e

18-18: Use logging instead of print on initialization.

-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        logger.info("Ragas Custom LLM initialized for model %s", self.model_name)

91-95: Avoid global metric side effects or make them opt-in.

Assigning to answer_relevancy.llm and faithfulness.llm mutates module-level singletons and can surprise callers if multiple managers exist. Prefer per-metric instances or make this opt-in.

-        # Configure Ragas metrics to use our custom LLM
-        answer_relevancy.llm = self.custom_llm
-        faithfulness.llm = self.custom_llm
-        print("✅ Ragas LLM Manager configured")
+        # Optional: configure default Ragas metrics to use our custom LLM
+        if self.litellm_params.get("inject_global_llm", True):
+            answer_relevancy.llm = self.custom_llm
+            faithfulness.llm = self.custom_llm
+        logger.info("Ragas LLM Manager configured (inject_global_llm=%s)", self.litellm_params.get("inject_global_llm", True))

If you keep global assignment, document it in README to set expectations.
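One low-risk way to avoid mutating the module-level singletons is to copy them before attaching the manager's LLM (sketched here with a dummy metric object, since only the `.llm` attribute assignment is known from the code above):

```python
from copy import deepcopy


class DummyMetric:  # stand-in for a module-level Ragas metric singleton
    def __init__(self) -> None:
        self.llm = None


answer_relevancy = DummyMetric()  # module-level "singleton"


def make_local_metric(shared_metric, llm):
    """Return a private copy wired to `llm`, leaving the shared instance untouched."""
    local = deepcopy(shared_metric)
    local.llm = llm
    return local


local_metric = make_local_metric(answer_relevancy, llm="my-custom-llm")
```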

lsc_eval/src/lsc_eval/metrics/__init__.py (1)

3-7: Consider lazy imports to avoid heavyweight import costs

If startup time matters for CLI runs that only use a subset of frameworks, switch to local (on-demand) imports inside the evaluation dispatch or use a lightweight registry to avoid importing ragas/deepeval unless needed.

lsc_eval/src/lsc_eval/llm_managers/__init__.py (1)

3-7: Re-export LLMConfig to avoid deep imports

Consumers often need the config type; re-exporting it keeps imports consistent with this package’s public surface.

Apply:

-from .llm_manager import LLMManager
+from .llm_manager import LLMManager, LLMConfig
@@
-__all__ = ["LLMManager", "RagasLLMManager", "DeepEvalLLMManager"]
+__all__ = ["LLMManager", "LLMConfig", "RagasLLMManager", "DeepEvalLLMManager"]
lsc_eval/pyproject.toml (1)

8-8: Use a valid SPDX license identifier

“Apache-2.0” is the canonical SPDX identifier and improves license detection.

-license = {text = "Apache"}
+license = {text = "Apache-2.0"}
lsc_eval/config/evaluation_data.yaml (1)

46-47: Add missing group description for conv_group_3 (consistency with other groups)

Earlier groups include a description. If your model validation requires it, its absence here will fail validation; add one for consistency.

 - conversation_group_id: "conv_group_3"
- 
+  description: "conversation group description"
lsc_eval/src/lsc_eval/core/__init__.py (3)

3-5: Public API is coherent; one naming collision to consider.

Re-exporting core types is handy. However, exposing LLMConfig from .models can be confused with the different LLMConfig defined in llm_managers/llm_manager.py. Consider dropping LLMConfig from this module’s public surface or renaming one of the two to avoid accidental imports.


7-17: Black reformat failure — run formatter.

CI shows Black would reformat this file. Please run the project’s formatter to unblock CI.


1-1: Black reformat failure — run formatter.

CI flagged Black on this file too. After edits, run Black to satisfy CI.

lsc_eval/src/lsc_eval/output/visualization.py (2)

1-1: Black reformat failure — run formatter.

CI flagged Black on this file. After applying the above changes, please run Black to normalize formatting.


204-213: Duplicate-code warning is acceptable; consider a helper later.

Pylint reports duplicate lines vs. another module. Given this is a new, decoupled tool, we can defer, but extracting shared plotting helpers under output/utils.py would reduce drift long-term.

lsc_eval/runner.py (1)

1-1: Black reformat failure — run formatter.

After the above edits, run Black to satisfy CI.

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (4)

51-61: Type hints for _evaluate_metric.

Apply:

-    def _evaluate_metric(self, metric, test_case) -> Tuple[float, str]:
+    def _evaluate_metric(self, metric: Any, test_case: Any) -> Tuple[float, str]:

79-94: Type hints for conversation_completeness.

Apply:

-    def _evaluate_conversation_completeness(
+    def _evaluate_conversation_completeness(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:

95-114: Type hints for conversation_relevancy.

Apply:

-    def _evaluate_conversation_relevancy(
+    def _evaluate_conversation_relevancy(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:

115-133: Type hints for knowledge_retention.

Apply:

-    def _evaluate_knowledge_retention(
+    def _evaluate_knowledge_retention(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
lsc_eval/src/lsc_eval/core/data_validator.py (1)

35-36: Prefer structured logging over prints

Switch to the project’s logging setup for consistent, filterable output (especially useful in CI). If logging is not wired here yet, keep prints for now and follow up later.

lsc_eval/src/lsc_eval/output/output_handler.py (1)

80-82: Remove stale comment about DataFrame

The CSV writer doesn’t use a DataFrame. Removing this avoids confusion.

-        # Move to dataframe for better aggregation
         csv_file = self.output_dir / f"{base_filename}_detailed.csv"
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)

66-72: Use logger.warning for generic provider path; include remediation hint

Using a generic provider silently can lead to confusing runtime errors when LiteLLM rejects the prefix. Emit a warning via logging and suggest valid providers.

-        print(f"⚠️ Using generic provider format for {provider}")
+        logger.warning("Using generic provider format for '%s'. Verify LiteLLM supports this prefix.", provider)

106-112: Demote noisy Ollama notice to debug and route through logger

Running locally without OLLAMA_HOST is normal; keep logs clean by using debug level.

-        if not os.environ.get("OLLAMA_HOST"):
-            print("ℹ️ OLLAMA_HOST not set, using default localhost:11434")
+        if not os.environ.get("OLLAMA_HOST"):
+            logger.debug("OLLAMA_HOST not set; defaulting to localhost:11434")

12-33: Naming collision with another LLMConfig; consolidate or rename

There’s a second LLMConfig (Pydantic) in core/models.py. Two different LLMConfig types across modules will cause confusion and type mismatches in IDEs and reviews.

Options:

  • Prefer a single canonical config model (e.g., the Pydantic one) and derive the LiteLLM params from it.
  • Or rename this dataclass to LLMManagerConfig to prevent ambiguity.

148-156: Double-check LiteLLM parameter names: timeout/request_timeout, num_retries

LiteLLM’s completion kwargs use specific names. Some versions use request_timeout instead of timeout. Validate and adjust to avoid silent no-ops.

If needed:

-            "timeout": self.config.timeout,
+            "request_timeout": self.config.timeout,

Please verify against the LiteLLM version in your pyproject.

lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)

14-23: Type the contexts field precisely

Use Optional[List[Dict[str, str]]] to align with TurnData.contexts and improve static checks.

-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[list[dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )

If Python <3.9 support is needed, use typing.List/Dict instead.


77-81: Harden choice/content extraction across providers

Some providers return dicts instead of objects. Add a safe fallback; raise if missing.

-            content = response.choices[0].message.content  # type: ignore
+            choice0 = response.choices[0]  # type: ignore[index]
+            msg = getattr(choice0, "message", getattr(choice0, "delta", None))
+            content = None if msg is None else (getattr(msg, "content", None) or getattr(msg, "text", None) or (isinstance(msg, dict) and msg.get("content")))

And keep the existing empty-response check.
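A standalone version of that fallback, written as a hypothetical helper (the dict/object shapes are assumptions about provider variance, not a documented LiteLLM contract):

```python
from types import SimpleNamespace
from typing import Any


def extract_content(response: Any) -> str:
    """Best-effort extraction of message content from object- or dict-style responses."""
    choices = response["choices"] if isinstance(response, dict) else response.choices
    choice = choices[0]
    if isinstance(choice, dict):
        msg = choice.get("message") or choice.get("delta") or {}
        content = msg.get("content") or msg.get("text")
    else:
        msg = getattr(choice, "message", None) or getattr(choice, "delta", None)
        content = getattr(msg, "content", None) or getattr(msg, "text", None)
    if not content:
        raise ValueError("LLM response contained no message content")
    return content


# Object-style response, as LiteLLM typically returns
obj_resp = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))])
```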


1-41: Run Black; minor nits only

Black check is failing. After the above diffs, please run Black to settle whitespace/line-wrapping.

lsc_eval/src/lsc_eval/core/config_loader.py (3)

179-184: Add return type for init and prefer explicit None assignments

Minor typing/readability polish.

-    def __init__(self):
+    def __init__(self) -> None:
         """Initialize Config Loader."""
         self.system_config = None
         self.evaluation_data = None
         self.logger = None

140-176: Duplicate “system config” models across modules

This file defines SystemConfig while core/models.py defines EvaluationSystemConfig. Two overlapping system config models increase drift risk.

Consider keeping only one canonical system config model (e.g., in core/models.py) and have ConfigLoader populate it. If you keep both, add conversion helpers and tests to ensure parity.


1-1: Run Black to fix formatting

CI shows Black would reformat this file.

lsc_eval/src/lsc_eval/core/models.py (4)

89-124: Validate metric requirements only for turn metrics; consider conversation-level too

Today only turn_metrics are checked for context/expected_response. If any conversation-level metrics require contexts/answers, this will miss them.

Add analogous checks for conversation_metrics if needed by your metric set.
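Such a check could look like the following post-init validation (a stdlib dataclass is shown for brevity; in the actual Pydantic models this would live in a `@model_validator(mode="after")` hook, and the field names and metric requirements here are illustrative):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversationSketch:  # hypothetical stand-in for EvaluationData
    conversation_metrics: List[str] = field(default_factory=list)
    turns: List[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Illustrative rule: completeness/retention metrics need multi-turn history
        needs_history = {
            "deepeval:conversation_completeness",
            "deepeval:knowledge_retention",
        }
        if needs_history.intersection(self.conversation_metrics) and len(self.turns) < 2:
            raise ValueError("conversation-level metrics require at least two turns of data")
```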


165-183: Name clash: LLMConfig here vs llm_managers.LLMConfig

Two different LLMConfig classes exist across modules. This is confusing for maintainers and tooling.

  • Prefer a single shared LLMConfig (likely this Pydantic model) and have LLMManager accept it.
  • If you need a lightweight runtime class for LLMManager, rename it to LLMManagerConfig.
  • Add tests to ensure config serialization/deserialization remains consistent.

185-212: Duplication with SystemConfig in config_loader

EvaluationSystemConfig overlaps with SystemConfig. Consolidate to one to avoid divergence in future changes.


1-1: Run Black to fix formatting

Black check is failing on this file as well.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ffe1d6a and eac9f72.

📒 Files selected for processing (25)
  • .gitignore (1 hunks)
  • Makefile (3 hunks)
  • lsc_eval/README.md (1 hunks)
  • lsc_eval/config/evaluation_data.yaml (1 hunks)
  • lsc_eval/config/system.yaml (1 hunks)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/models.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/utils.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_eval/runner.py
🧬 Code graph analysis (19)
lsc_eval/src/lsc_eval/output/__init__.py (2)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
  • OutputHandler (15-224)
lsc_eval/src/lsc_eval/output/visualization.py (1)
  • GraphGenerator (17-412)
lsc_eval/src/lsc_eval/output/utils.py (1)
lsc_eval/src/lsc_eval/core/models.py (2)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/metrics/__init__.py (3)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
  • CustomMetrics (25-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1)
  • DeepEvalMetrics (19-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1)
  • RagasMetrics (23-247)
lsc_eval/src/lsc_eval/core/__init__.py (4)
lsc_eval/src/lsc_eval/core/config_loader.py (4)
  • ConfigLoader (176-256)
  • SystemConfig (140-173)
  • setup_environment_variables (27-44)
  • validate_metrics (123-137)
lsc_eval/src/lsc_eval/core/models.py (5)
  • validate_metrics (82-87)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • LLMConfig (165-182)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
  • DataValidator (11-82)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMConfig (13-33)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
lsc_eval/src/lsc_eval/core/models.py (1)
  • LLMConfig (165-182)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1)
  • from_system_config (244-247)
lsc_eval/src/lsc_eval/output/output_handler.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • EvaluationResult (126-162)
lsc_eval/src/lsc_eval/output/utils.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
lsc_eval/src/lsc_eval/output/visualization.py (2)
  • GraphGenerator (17-412)
  • generate_all_graphs (53-103)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (6)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
  • evaluate (42-57)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (2)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (5)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_dict (24-33)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)
  • RagasLLMManager (78-106)
  • get_llm (97-99)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)
  • get_llm (97-99)
  • get_model_info (101-106)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • DeepEvalLLMManager (8-43)
  • get_llm (31-33)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/visualization.py (2)
lsc_eval/src/lsc_eval/core/models.py (1)
  • EvaluationResult (126-162)
lsc_eval/src/lsc_eval/output/utils.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
lsc_eval/src/lsc_eval/llm_managers/__init__.py (3)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
  • DeepEvalLLMManager (8-43)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMManager (36-167)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
  • RagasLLMManager (78-106)
lsc_eval/src/lsc_eval/core/config_loader.py (1)
lsc_eval/src/lsc_eval/core/models.py (1)
  • validate_metrics (82-87)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/__init__.py (6)
lsc_eval/src/lsc_eval/core/config_loader.py (2)
  • ConfigLoader (176-256)
  • SystemConfig (140-173)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
  • DataValidator (11-82)
lsc_eval/src/lsc_eval/core/models.py (3)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/evaluation_engine.py (1)
  • EvaluationEngine (104-294)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMManager (36-167)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
  • OutputHandler (15-224)
lsc_eval/runner.py (5)
lsc_eval/src/lsc_eval/core/config_loader.py (3)
  • ConfigLoader (176-256)
  • setup_environment_variables (27-44)
  • load_system_config (185-240)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • load_evaluation_data (19-37)
lsc_eval/src/lsc_eval/evaluation_engine.py (1)
  • EvaluationEngine (104-294)
lsc_eval/src/lsc_eval/output/output_handler.py (2)
  • OutputHandler (15-224)
  • generate_reports (32-76)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • calculate_basic_stats (19-45)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
lsc_eval/src/lsc_eval/core/models.py (2)
  • EvaluationData (47-123)
  • validate_metric_requirements (89-123)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • get_llm (31-33)
  • get_model_info (35-43)
lsc_eval/src/lsc_eval/evaluation_engine.py (8)
lsc_eval/src/lsc_eval/core/config_loader.py (2)
  • ConfigLoader (176-256)
  • get_llm_config_dict (242-256)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • validate_evaluation_data (39-54)
lsc_eval/src/lsc_eval/core/models.py (3)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)
  • LLMManager (36-167)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
  • CustomMetrics (25-242)
  • evaluate (42-57)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (3)
  • DeepEvalMetrics (19-138)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (3)
  • RagasMetrics (23-247)
  • evaluate (78-107)
  • from_system_config (244-247)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/core/models.py (2)
lsc_eval/src/lsc_eval/core/config_loader.py (1)
  • validate_metrics (123-137)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMConfig (13-33)
🪛 YAMLlint (1.37.1)
lsc_eval/config/evaluation_data.yaml

[error] trailing spaces on lines 10, 11, 12, 16, 31, 33, 37, 47, 50, 52, 56, 58, 66 (trailing-spaces)

lsc_eval/config/system.yaml

[error] trailing spaces on lines 16, 23, 26, 30, 51, 57, 64, 70, 76, 82, 89, 98, 101, 104, 120, 124, 138 (trailing-spaces)

🪛 LanguageTool
lsc_eval/README.md

[grammar] QB_NEW_EN ("There might be a mistake here") flagged at lines ~5, ~15, ~36, ~38, ~39, ~48, ~52, ~113, ~115, ~116, ~117, ~118, ~121, ~122, ~135, and twice at ~139.

🪛 GitHub Actions: Black
lsc_eval/src/lsc_eval/output/utils.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/__init__.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/llm_managers/llm_manager.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/output/output_handler.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/output/visualization.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/config_loader.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/runner.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/evaluation_engine.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/models.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

🪛 GitHub Actions: Type checks
lsc_eval/src/lsc_eval/output/output_handler.py

[error] 18-18: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

[error] 64-64: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 78-78: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 109-109: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 128-128: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 145-145: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 167-167: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 197-197: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 224-224: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 40-40: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 51-51: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 62-62: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 79-79: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 95-95: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 115-115: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/core/config_loader.py

[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 47-47: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 105-105: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 179-179: Mypy: Function is missing a return type annotation. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 42-42: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 193-193: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/runner.py

[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'ConfigLoader'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'DataValidator'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'EvaluationEngine'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'OutputHandler'. [attr-defined]


[error] 99-99: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 135-135: Mypy: Call to untyped function 'main' in typed context. [no-untyped-call]

lsc_eval/src/lsc_eval/core/data_validator.py

[error] 14-14: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 56-56: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 74-74: Mypy: Function is missing a return type annotation. [no-untyped-def]

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 20-20: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 61-61: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/evaluation_engine.py

[error] 83-83: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 118-118: Mypy: Call to untyped function 'DataValidator' in typed context. [no-untyped-call]


[error] 158-158: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 178-178: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 186-186: Mypy: Function is missing a return type annotation. [no-untyped-def]

🪛 GitHub Actions: Pyright
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py

[error] 5-5: Import "deepeval.models" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 5-5: Import "deepeval.metrics" could not be resolved (reportMissingImports)


[error] 10-10: Import "deepeval.test_case" could not be resolved (reportMissingImports)


[error] 11-11: Import "deepeval.test_case" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/output/visualization.py

[warning] 10-10: Import "seaborn" could not be resolved from source (reportMissingModuleSource)

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 6-6: Import "litellm" could not be resolved (reportMissingImports)

lsc_eval/runner.py

[error] 20-20: "ConfigLoader" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "DataValidator" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "EvaluationEngine" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "OutputHandler" is unknown import symbol (reportAttributeAccessIssue)


[error] 21-21: Import "lsc_eval.core" could not be resolved (reportMissingImports)


[error] 22-22: Import "lsc_eval.output.utils" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 5-5: Import "litellm" could not be resolved (reportMissingImports)

🪛 GitHub Actions: Python linter
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py

[error] 5-5: Pylint: E0401: Unable to import 'deepeval.models' (import-error).

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 5-5: Pylint: E0401: Unable to import 'deepeval.metrics' (import-error).


[error] 10-10: Pylint: E0401: Unable to import 'deepeval.test_case' (import-error).

lsc_eval/src/lsc_eval/output/visualization.py

[error] 10-10: Pylint: E0401: Unable to import 'seaborn' (import-error).


[warning] 204-213: Pylint: duplicate-code detected between lightspeed_core_evaluation.utils.plot [26:35] and lsc_eval/src/lsc_eval/output/visualization.py [204:213].

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 6-6: Pylint: E0401: Unable to import 'litellm' (import-error).

lsc_eval/runner.py

[warning] 1-1: Pylint: R0801: Similar lines in 2 files.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 5-5: Pylint: E0401: Unable to import 'litellm' (import-error).

🔇 Additional comments (32)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)

35-43: Model info helper looks good.

The info map mirrors constructor params and includes sensible defaults.

Makefile (2)

42-44: Type-check scope expansion looks good.

Including lsc_eval/src/ and lsc_eval/runner.py in mypy is appropriate for the new module.


75-80: pyright target expansion is correct; keep optional-import guards aligned.

With the guards added in code, this target should pass even without installing optional deps.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)

1-7: Format and Lint: Run Black and Re-run Static Analysis

File: lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
Lines: 1–7

Please ensure the file is properly formatted and free of style or type-checking errors:

  • Run Black to normalize formatting:
    black lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
  • Re-run static analysis checks and address any new issues:
    pylint lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
    mypy lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
    pyright lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

Once formatting and lint/type errors are cleared, this can be merged.

lsc_eval/src/lsc_eval/output/__init__.py (1)

1-6: Public surface looks clean and minimal.

Re-exporting OutputHandler and GraphGenerator with all is tidy and conventional.

lsc_eval/src/lsc_eval/metrics/__init__.py (1)

1-7: Clean public API for metrics — LGTM

Concise docstring, clear re-exports, and stable surface for from lsc_eval.metrics import .... Looks good.

lsc_eval/src/lsc_eval/llm_managers/__init__.py (1)

1-7: LLM managers aggregator — LGTM

Good, straightforward re-exports that match usage patterns elsewhere.

lsc_eval/pyproject.toml (2)

35-37: Console script entry point validated

lsc_eval/runner.py defines main() on line 99, so the lsc-eval = "lsc_eval.runner:main" entry point is valid and will install/run successfully.


23-33: Dev dependency versions are valid
All specified dev-tool versions exist on PyPI and none exceed the current latest releases:

  • black >=25.1.0 (latest 25.1.0)
  • mypy >=1.15.0 (latest 1.17.1)
  • ruff >=0.8.0 (latest 0.12.10)
  • pyright >=1.1.401 (latest 1.1.404)
  • pydocstyle >=6.3.0 (latest 6.3.0)
  • pylint >=3.3.2 (latest 3.3.8)
  • pytest >=8.3.2 (latest 8.4.1)
  • pytest-cov >=5.0.0 (latest 6.2.1)

No adjustments needed.

lsc_eval/config/evaluation_data.yaml (1)

1-73: All trailing whitespace removed in lsc_eval/config/evaluation_data.yaml

The sed cleanup has stripped all trailing spaces and a ripgrep check confirms there are none remaining. No further action is needed.

lsc_eval/config/system.yaml (2)

1-145: Trailing spaces removed – no further action needed

All trailing whitespace has been cleaned from lsc_eval/config/system.yaml and a subsequent search found no remaining instances. CI should now pass without yamllint errors.


41-110: It looks like the combined install + inspect script didn’t emit any output, so we can’t confirm whether those classes actually exist. Could you please run just the inspection portion and share what it prints? For example:

python3 - << 'EOF'
import importlib

# Ragas
mod = importlib.import_module('ragas.evaluation.metrics')
print("Ragas exports:",
      [n for n in ('LLMContextPrecisionWithReference',
                   'LLMContextPrecisionWithoutReference')
       if hasattr(mod, n)])

# DeepEval
mod2 = importlib.import_module('deepeval.metrics')
print("DeepEval exports:",
      [n for n in ('ConversationCompletenessMetric',
                   'TurnRelevancyMetric')
       if hasattr(mod2, n)])
EOF

That will confirm definitively whether you can safely reference these metric identifiers against ragas>=0.3.0 and deepeval>=1.3.0.

lsc_eval/runner.py (1)

14-19: Ignore the src/-layout shim suggestion

This repository uses a flat lsc_eval/ package, not a src/ layout, so there is no …/lsc_eval/src directory to insert on sys.path. Rather than hacking sys.path, you should:

  • Install the package in editable/dev mode (e.g. pip install -e .) so that
    python -m lsc_eval.runner and python lsc_eval/runner.py both work out of the box.
  • Or add a proper entrypoint/__main__.py inside lsc_eval/ to support
    python -m lsc_eval without manual path manipulation.

Remove the proposed sys.path shim and adopt one of the above approaches to ensure the runner is discoverable in development.

Likely an incorrect or invalid review comment.

lsc_eval/src/lsc_eval/core/data_validator.py (2)

39-55: LGTM on the validation flow

Clear separation: availability check, then per-metric field requirements. The boolean return plus aggregated error printout is practical.


1-1: ✅ Type-check passed for data_validator.py

  • Installed mypy and its pydantic plugin dependency.
  • Ran mypy lsc_eval/src/lsc_eval/core/data_validator.py and confirmed no type errors.

All checks are green—no further action needed.

lsc_eval/src/lsc_eval/output/utils.py (3)

19-46: Basic stats calculation looks correct and robust to empty input

Guarding empty results and returning 0.0 rates is sensible; integer counts + float rates are fine.


71-116: Detailed stats aggregation is solid; nice touch on stddev guard

Using stdev only when count > 1 avoids StatisticsError. Good structure for consumers.
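The guard described above can be sketched as follows (hypothetical helper name, not the PR's actual function):

```python
from statistics import mean, stdev


def score_stats(scores: list[float]) -> dict[str, float]:
    """Aggregate scores; only compute stdev when there are 2+ samples."""
    if not scores:
        # Empty input: return zeroed stats instead of raising
        return {"count": 0, "mean": 0.0, "stdev": 0.0}
    return {
        "count": len(scores),
        "mean": mean(scores),
        # statistics.stdev raises StatisticsError for fewer than two values,
        # so fall back to 0.0 for a single sample
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }
```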


1-1: Black formatting failure — reformat to pass CI

I attempted to verify formatting but the pdm command wasn’t available in this environment. Please re-format lsc_eval/src/lsc_eval/output/utils.py with Black in your development setup and confirm it passes CI. For example:

  • Install dev dependencies (if not already):
    pdm install --dev
  • Reformat the file:
    pdm run black lsc_eval/src/lsc_eval/output/utils.py
  • Verify no changes are needed:
    pdm run black --check lsc_eval/src/lsc_eval/output/utils.py
lsc_eval/src/lsc_eval/output/output_handler.py (2)

61-77: Graph generation error handling is appropriate

Gracefully catching and logging graph errors keeps the reporting path resilient.


1-1: Please verify formatting and type checks
Since the sandbox couldn’t run pdm, black, or mypy, please re-run your formatter and type checker locally to ensure no regressions were introduced by the recent signature change.

• From your project root, run:

black --check lsc_eval/src/lsc_eval/output/output_handler.py
mypy    lsc_eval/src/lsc_eval/output/output_handler.py

• Address any formatting diffs or type errors before merging.

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (2)

1-1: Action Required: Manually verify Black formatting and mypy checks on ragas_metrics.py

It looks like pdm isn’t available in this environment, so the Black and mypy commands couldn’t run. To ensure there are no lingering formatting or typing errors, please perform the following steps locally:

  • Install Black and mypy (and PDM if you rely on it):
    pip install black mypy            # Install formatting and type-checking tools
    # pip install pdm black mypy     # If you use PDM for dependency management
  • Apply Black in “check” mode:
    black --check lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
  • Re-run mypy for this file to catch any remaining no-untyped-def errors:
    mypy lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

Once you’ve confirmed the file is properly formatted and free of mypy errors, please re-run the CI pipeline to ensure everything passes.


1-248: Verify Ragas Metrics DataFrame Column Names
To prevent run-time key errors, confirm that the columns produced by your locked ragas version match the keys used in these evaluation calls. In most recent versions, the mappings are:

  • ResponseRelevancy → answer_relevancy
  • Faithfulness → faithfulness
  • ContextRelevance → nv_context_relevance
  • LLMContextRecall → context_recall
  • LLMContextPrecisionWithReference → llm_context_precision_with_reference
  • LLMContextPrecisionWithoutReference → llm_context_precision_without_reference

If your pinned version uses different column names, update the result_key arguments (and any downstream logic) accordingly.

lsc_eval/src/lsc_eval/evaluation_engine.py (2)

211-226: Good user-facing progress logging

Clear, concise status lines with emoji are helpful during long runs.


1-1: Ensure Black and mypy checks are installed and passing on the evaluation engine

I attempted to run the formatting and type‐checking pipeline, but neither Black nor mypy was available in this environment. Please install the project’s dev dependencies and verify that this file passes both checks:

  • Install development tools (via PDM or your preferred method):
    pdm install --dev
  • Run Black in check mode:
    pdm run black --check lsc_eval/src/lsc_eval/evaluation_engine.py
  • Run mypy type checks:
    pdm run mypy lsc_eval/src/lsc_eval/evaluation_engine.py

Once installed, confirm there are no remaining formatting or type‐annotation errors in this file.

lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)

52-65: LGTM: provider handler dispatch is clean and extensible

The handler map + small provider-specific validators make this maintainable.


118-123: Please verify LiteLLM Azure model prefix and required environment variables

  • Confirm that the model string format "azure/<deployment_name>" is correct for your target Azure OpenAI API version.
  • Ensure _validate_azure_env() enforces presence of:
    • AZURE_OPENAI_API_KEY
    • AZURE_OPENAI_ENDPOINT
    • AZURE_OPENAI_API_VERSION
  • If LiteLLM requires passing additional parameters (e.g. api_base, api_version) in the client setup, update this handler to include them.
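A minimal free-function sketch of the env checks the bullets describe (the PR's `_validate_azure_env()` is a method and may differ in name and behavior):

```python
import os

REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
)


def validate_azure_env() -> None:
    """Raise with a message listing every missing Azure OpenAI variable."""
    missing = [name for name in REQUIRED_AZURE_VARS if not os.environ.get(name)]
    if missing:
        raise ValueError(
            f"Missing Azure OpenAI environment variables: {', '.join(missing)}"
        )
```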
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)

238-243: Factory method usage is clean and composable

Nice decoupling via LLMManager.from_system_config; aligns with other metrics modules.

lsc_eval/src/lsc_eval/core/config_loader.py (2)

185-241: load_system_config typing is good; logging setup first is a plus

The flow—logging first, then metric registry population, then model construction—is clear.


242-256: get_llm_config_dict: add return type and keep dict shape aligned with LLMManager.from_system_config

Mypy flagged missing return type; adding it keeps types consistent. The dict shape matches LLMManager expectations.

-def get_llm_config_dict(self) -> Dict[str, Any]:
+def get_llm_config_dict(self) -> Dict[str, Any]:
     """Get system config as dictionary for LLMManager."""

Note: Signature already matches; ensure mypy config picks it up after earlier fixes.

lsc_eval/src/lsc_eval/core/models.py (3)

8-16: LGTM: TurnData structure and validators are solid

Good constraints on non-empty query/response and positive turn_id; contexts schema check is helpful.


80-88: Metric identifier format validation is clear

Enforcing "framework:metric_name" early prevents downstream confusion.
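A fail-fast parse of that identifier format can be sketched like this (illustrative helper, not the PR's validator):

```python
def parse_metric_identifier(identifier: str) -> tuple[str, str]:
    """Split 'framework:metric_name', rejecting malformed identifiers early."""
    framework, sep, metric_name = identifier.partition(":")
    if not sep or not framework or not metric_name:
        raise ValueError(
            f"Expected 'framework:metric_name', got {identifier!r}"
        )
    return framework, metric_name
```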


148-155: Score bounds are enforced here; ensure upstream parsers normalize

CustomMetrics now clamps scores; this validator will still catch anomalies if any slip through. Good layering.
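The layered check reads roughly like this (plain-dataclass sketch with assumed field names; the PR's models use pydantic validators):

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Illustrative shape only; field names are assumptions."""

    metric: str
    score: float

    def __post_init__(self) -> None:
        # Catch anomalies that slipped past upstream clamping/normalization
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be within [0, 1], got {self.score}")
```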

Comment on lines 6 to 12
Usage:
python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
Contributor

🛠️ Refactor suggestion

Fix usage guidance for src/ layout and console entry point.

python runner.py will not find the package in a src/ layout. Recommend using python -m lsc_eval.runner or the console script.

Apply:

-Usage:
-    python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+Usage:
+    # After installing (or in editable mode):
+    lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+
+    # Or directly from the repo (src/ layout):
+    python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
@@
-Or programmatically:
-    from runner import run_evaluation
-    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
+Or programmatically:
+    from lsc_eval.runner import run_evaluation
+    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Usage:
python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
"""
Usage:
# After installing (or in editable mode):
lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
# Or directly from the repo (src/ layout):
python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from lsc_eval.runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 6 to 12, the usage comment incorrectly
suggests running "python runner.py" which fails in a src/ layout; update the
guidance to recommend using "python -m lsc_eval.runner" (or the provided console
script name if one exists) and adjust the programmatic example to "from
lsc_eval.runner import run_evaluation" to reflect the package import path so
users can run the module or import it correctly.

Comment on lines +93 to +97
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None

Contributor

@coderabbitai coderabbitai bot Aug 26, 2025

🛠️ Refactor suggestion

Catch ImportError to surface missing optional deps cleanly.

Broaden the exception tuple so missing libs (e.g., seaborn/deepeval) don’t crash with a traceback but return a non-zero exit code.

Apply:

-    except (FileNotFoundError, ValueError, RuntimeError) as e:
+    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
📝 Committable suggestion


Suggested change
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None
except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 93-97, the except block currently catches
FileNotFoundError, ValueError and RuntimeError but not ImportError; update the
exception tuple to include ImportError so missing optional dependencies are
handled cleanly, and change the error path to return a non-zero exit signal (for
example return 1 or call sys.exit(1)) instead of returning None — if you use
sys.exit, import sys at the top.

Collaborator Author

Most of the issues will get resolved when we make this the primary eval tool and remove the existing one.

Contributor

@asamal4, understood! That makes sense - consolidating to a single evaluation tool will indeed clean up many of these edge cases naturally. Thanks for the context about the broader architectural plan to make this the primary tool.


✏️ Learnings added
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.603Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +124 to +151
_, ax = plt.subplots(figsize=(12, 8))

# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)

# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)

# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Pass Rate (%)", fontsize=12)
ax.set_ylabel("Metrics", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels

# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")

plt.tight_layout()
Contributor

⚠️ Potential issue

Pass rates chart: wrong axis labels and inconsistent figure sizing.

  • Using hardcoded figsize=(12,8) ignores configured size/DPI.
  • X/Y labels are flipped: x-axis is metrics, y-axis is pass rate.

Apply:

-        _, ax = plt.subplots(figsize=(12, 8))
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
📝 Committable suggestion


Suggested change
_, ax = plt.subplots(figsize=(12, 8))
# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)
# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)
# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Pass Rate (%)", fontsize=12)
ax.set_ylabel("Metrics", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels
# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
_, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)
# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)
# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Metrics", fontsize=12)
ax.set_ylabel("Pass Rate (%)", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels
# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 124-151, the
figure creation and axis labels are incorrect and ignore configured sizing:
replace the hardcoded plt.subplots(figsize=(12, 8)) with
plt.subplots(figsize=fig_size, dpi=dpi) (or the appropriate passed/configured
fig_size and dpi variables), and swap the axis labels so the x-axis reads
"Metrics" and the y-axis reads "Pass Rate (%)"; keep the bar chart as
ax.bar(metrics, pass_rates), maintain ax.set_ylim(0, 105) for label space, and
preserve the xtick rotation logic for long metric lists.

Comment on lines +371 to +387
# Create heatmap
_, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)

sns.heatmap(
df,
annot=True,
fmt=".1f",
cmap="RdYlGn",
vmin=0,
vmax=100,
cbar_kws={"label": "Pass Rate (%)"},
ax=ax,
mask=df.isna(), # Mask NaN values (missing data)
linewidths=0.5,
linecolor="lightgray",
)

Contributor

🛠️ Refactor suggestion

Skip heatmap gracefully when seaborn is unavailable.

Avoid runtime errors when seaborn isn’t installed.

Apply:

-        sns.heatmap(
+        if not HAS_SEABORN:
+            self.logger.warning("seaborn not available; skipping heatmap")
+            return None
+        sns.heatmap(
             df,
📝 Committable suggestion


Suggested change
-        # Create heatmap
-        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
-
-        sns.heatmap(
-            df,
-            annot=True,
-            fmt=".1f",
-            cmap="RdYlGn",
-            vmin=0,
-            vmax=100,
-            cbar_kws={"label": "Pass Rate (%)"},
-            ax=ax,
-            mask=df.isna(),  # Mask NaN values (missing data)
-            linewidths=0.5,
-            linecolor="lightgray",
-        )
+        if not HAS_SEABORN:
+            self.logger.warning("seaborn not available; skipping heatmap")
+            return None
+
+        # Create heatmap (guard runs first so no figure is created when seaborn is missing)
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
+
+        sns.heatmap(
+            df,
+            annot=True,
+            fmt=".1f",
+            cmap="RdYlGn",
+            vmin=0,
+            vmax=100,
+            cbar_kws={"label": "Pass Rate (%)"},
+            ax=ax,
+            mask=df.isna(),  # Mask NaN values (missing data)
+            linewidths=0.5,
+            linecolor="lightgray",
+        )
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 371 to 387, the
heatmap block currently assumes seaborn is available; update it to handle
seaborn absence by first attempting "import seaborn as sns" (or checking an
existing import) in a try/except ImportError, and if seaborn is missing log or
warn that seaborn is unavailable and either (a) fall back to a matplotlib-based
heatmap using imshow + text annotations to replicate annot=True, cmap, vmin/vmax
and mask handling, or (b) gracefully skip drawing the heatmap and
return/continue without raising; ensure the fallback respects figsize/dpi and
the cbar label and does not raise runtime errors when seaborn is not installed.
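The try/except guard the prompt describes generalizes beyond seaborn. A minimal sketch of such an optional-import helper (a hypothetical utility, not part of this PR) looks like:

```python
import importlib
import importlib.util
import warnings


def optional_import(name: str):
    """Return the imported module, or None (with a warning) when it is unavailable."""
    if importlib.util.find_spec(name) is None:
        warnings.warn(f"{name} not available; dependent feature will be skipped")
        return None
    return importlib.import_module(name)


# The visualization code could then branch instead of raising ImportError:
sns = optional_import("seaborn")
if sns is None:
    pass  # fall back to a matplotlib imshow-based heatmap, or skip drawing entirely
```

Either fallback path then respects the configured figsize/dpi and colorbar label without a hard seaborn dependency.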

Comment on lines +389 to +397
ax.set_title(
"Conversation Performance Heatmap\n(Only showing metrics with data)",
fontsize=14,
fontweight="bold",
pad=20,
)
ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")


🛠️ Refactor suggestion

Heatmap axis labels are misleading.

X-axis is metrics; colorbar already conveys pass rate.

Apply:

-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
-        ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
+        ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
📝 Committable suggestion


Suggested change
         ax.set_title(
             "Conversation Performance Heatmap\n(Only showing metrics with data)",
             fontsize=14,
             fontweight="bold",
             pad=20,
         )
-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
         ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 389 to 397, the
x-axis label is incorrect (it currently reads "Pass Rate (%)" while the x-axis
actually lists metric names and the colorbar conveys pass rate); change the
ax.set_xlabel call to something like "Metrics" (or remove the x-axis label) and
ensure the colorbar is explicitly labeled "Pass Rate (%)" so axis and legend
accurately reflect what is being plotted.

Comment on lines 49 to 53
 verify: install-deps-test ## Verify the code using various linters
 	pdm run black . --check
 	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
-	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests
+	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py


💡 Verification agent

🧩 Analysis chain

Pylint fails due to optional metric deps; consider guard or install in CI.

Current CI fails on deepeval import. You can either:

  • keep the code guards (recommended; see code diffs) and
  • also install optional deps in CI when running full linters.

Optionally update CI or Makefile to install extras for metrics:


🏁 Script executed:

#!/bin/bash
# Quick check if optional deps are present; if not, install for CI lint pass.
python - <<'PY'
import importlib, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 182


🏁 Script executed:

#!/bin/bash
# Re-run the missing dependencies check with Python 3
python3 - << 'PY'
import importlib.util, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 44


🏁 Script executed:

#!/bin/bash
set -e
# Search for deepeval imports and usage with context
rg -n 'import deepeval' -C3 .
rg -n 'deepeval' -C3 .
# Search for litellm imports and usage with context
rg -n 'import litellm' -C3 .
rg -n 'litellm' -C3 .

Length of output: 40


🏁 Script executed:

#!/bin/bash
set -e
# Inspect optional dependencies/extras in pyproject.toml
rg -nP '^\[tool\.pdm\.extras\]' -C5 pyproject.toml
rg -nP '^metrics\s*=' -C5 pyproject.toml

Length of output: 61


🏁 Script executed:

#!/bin/bash
# Search for deepeval and litellm references without exiting on first miss
rg -n 'import deepeval' -C3 . || true
rg -n 'deepeval' -C3 . || true
rg -n 'import litellm' -C3 . || true
rg -n 'litellm' -C3 . || true

Length of output: 35588


🏁 Script executed:

#!/bin/bash
# Show the verify target and context in Makefile
rg -n 'verify:' -C3 Makefile

echo '--- First 50 lines of Makefile ---'
head -n 50 Makefile

Length of output: 3073


Review comment

File: Makefile
Lines: 49–53

Snippet showing the final state of code at these lines

verify:	install-deps-test ## Verify the code using various linters
	pdm run black . --check
	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py

Comment

Optional metric dependencies (deepeval, litellm) are not installed in CI, causing Pylint import errors.

These packages are imported unconditionally in your metrics and LLM-manager modules, so the CI lint pass (which runs pdm sync --dev) fails when Pylint encounters missing imports. You can address this in one of two ways:

  • Wrap the imports in try/except ImportError guards (as shown in your code diffs).
  • Install the optional dependencies in CI before running the linters.

To install them via the Makefile, update the install-deps-test (or the verify) target:

 install-deps-test: install-tools pdm-lock-check ## Install all required dev dependencies needed to test the service
 	@for a in 1 2 3 4 5; do pdm sync --dev && break || sleep 15; done
+	# Ensure metric packages are available for full linting
+	pip install deepeval litellm

Alternatively, if you prefer using PDM extras, first define a metrics extra in your pyproject.toml:

[tool.pdm.extras]
metrics = ["deepeval>=1.3.0", "litellm>=1.0.0"]

and then install it in the Makefile:

 install-deps-test: install-tools pdm-lock-check
 	pdm sync --dev
+	pdm install --group metrics

Please apply one of these changes so that Pylint can successfully import the metric modules in CI.

🤖 Prompt for AI Agents
Makefile lines 49–53: CI Pylint fails because optional metric deps (deepeval,
litellm) aren’t installed; update the dependency install step so linters can
import metric modules by adding installation of the metrics extra (or the
specific packages) before running linters — e.g. in the install-deps-test target
run pdm sync --dev then pdm install --group metrics (or pdm install deepeval
litellm) so the verify target’s pdm run pylint step no longer errors on missing
imports.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

]


def setup_environment_variables(config_path: str):

🛠️ Refactor suggestion

Add return type annotation: setup_environment_variables -> None

Mypy flagged missing return type. Also prefer warnings over prints and avoid setting env vars to the literal string "None".

-def setup_environment_variables(config_path: str):
+from warnings import warn
+
+def setup_environment_variables(config_path: str) -> None:
@@
-        for env_var, value in env_config.items():
-            os.environ[env_var] = str(value)
+        for env_var, value in env_config.items():
+            if value is None:
+                continue
+            os.environ[env_var] = str(value)
@@
-        print(f"Warning: Could not load environment config from {config_path}: {e}")
-        print("Using fallback environment settings...")
+        warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")
📝 Committable suggestion


Suggested change
-def setup_environment_variables(config_path: str):
+from warnings import warn
+
+def setup_environment_variables(config_path: str) -> None:
+    try:
+        # … load env_config from config_path …
+        for env_var, value in env_config.items():
+            if value is None:
+                continue
+            os.environ[env_var] = str(value)
+    except Exception as e:
+        warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")
🧰 Tools
🪛 GitHub Actions: Type checks

[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]

🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/core/config_loader.py around line 27, add an explicit
return type annotation to the function signature (def
setup_environment_variables(config_path: str) -> None:), replace any print(...)
calls with warnings.warn(...) imports from the warnings module, and when setting
environment variables avoid assigning the literal string "None" by checking for
None/empty values and either skipping os.environ assignment or deleting the key
instead of setting it to "None"; ensure the function does not return a value.
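The None-skipping logic from the prompt can be exercised in isolation. In this sketch the function takes the already-loaded mapping rather than a config path, purely so the example stays self-contained (the YAML loading step is elided):

```python
import os


def setup_environment_variables(env_config: dict) -> None:
    """Export config values as environment variables, skipping None values."""
    for env_var, value in env_config.items():
        if value is None:
            continue  # never export the literal string "None"
        os.environ[env_var] = str(value)


# Non-string values are stringified; None-valued keys are left untouched.
setup_environment_variables(
    {"EVAL_MODEL": "gpt-4o", "EVAL_TIMEOUT": 30, "UNSET_KEY": None}
)
```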

Comment on lines +67 to +75
try:
response = litellm.completion(
model=self.model_name,
messages=messages,
temperature=self.litellm_params.get("temperature", 0.0),
max_tokens=self.litellm_params.get("max_tokens"),
timeout=self.litellm_params.get("timeout"),
num_retries=self.litellm_params.get("num_retries", 3),
)

🛠️ Refactor suggestion

Don’t pass None-valued args to litellm; build kwargs dynamically

Passing None for max_tokens/timeout can cause API errors on some providers. Build kwargs conditionally.

-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            # Use request_timeout if your LiteLLM version expects that name
+            if (to := self.litellm_params.get("timeout")) is not None:
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)
📝 Committable suggestion


Suggested change
         try:
-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            # Use request_timeout if your LiteLLM version expects that name
+            if (to := self.litellm_params.get("timeout")) is not None:
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 67 to 75, the
call to litellm.completion passes max_tokens and timeout (and potentially other
params) even when their values are None which can cause API errors; change to
build a kwargs dict (e.g., kwargs = {"model": self.model_name, "messages":
messages, "temperature": self.litellm_params.get("temperature", 0.0),
"num_retries": self.litellm_params.get("num_retries", 3)}) and then
conditionally add max_tokens and timeout only if their values are not None (and
omit any other None-valued entries), then call response =
litellm.completion(**kwargs).
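The conditional-kwargs pattern can be tested without calling LiteLLM at all. A sketch of the builder as a standalone function (the completion call itself is omitted; parameter names follow the litellm_params config used above):

```python
from typing import Any, Dict, List


def build_completion_kwargs(
    model_name: str, messages: List[Dict[str, str]], litellm_params: Dict[str, Any]
) -> Dict[str, Any]:
    """Assemble LiteLLM kwargs, dropping any optional parameter whose value is None."""
    kwargs: Dict[str, Any] = {
        "model": model_name,
        "messages": messages,
        "temperature": litellm_params.get("temperature", 0.0),
        "num_retries": litellm_params.get("num_retries", 3),
    }
    if (mt := litellm_params.get("max_tokens")) is not None:
        kwargs["max_tokens"] = mt
    if (to := litellm_params.get("timeout")) is not None:
        kwargs["timeout"] = to
    return kwargs
```

The call site then becomes `litellm.completion(**build_completion_kwargs(...))`, and providers never see explicit None values.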

Comment on lines +118 to +124
if score is not None and score > 1.0:
if score <= 10.0: # Assume 0-10 scale
score = score / 10.0
elif score <= 100.0: # Assume 0-100 scale
score = score / 100.0

return score, reason

🛠️ Refactor suggestion

Normalize and clamp scores to [0,1]; support percentages

Values like “85%” or spurious integers can exceed 1. Clamp and parse percent format to avoid downstream validation errors.

-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        if score is None:
+            # Additional: handle percentages like "85%"
+            pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+        # Final normalization and clamp
+        if score is not None:
+            if score > 1.0 and score <= 10.0:
+                score = score / 10.0
+            elif score > 1.0 and score <= 100.0:
+                score = score / 100.0
+            score = max(0.0, min(1.0, score))
📝 Committable suggestion


Suggested change
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        if score is None:
+            # Additional: handle percentages like "85%"
+            pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+        # Final normalization and clamp
+        if score is not None:
+            if score > 1.0 and score <= 10.0:
+                score = score / 10.0
+            elif score > 1.0 and score <= 100.0:
+                score = score / 100.0
+            score = max(0.0, min(1.0, score))

         return score, reason
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 118 to 124, the
normalization only handles numeric scales >1 but doesn't parse percentage
strings or clamp values into [0,1]; update the logic to: detect and strip a
trailing '%' then parse as a float and divide by 100, otherwise if numeric and
>1 apply the existing 0-10/0-100 heuristics, then clamp the final score into the
[0.0, 1.0] range (also handle negative values by clamping to 0.0); keep the
existing reason value and return the normalized, clamped score and reason.
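Pulled out of the class, the whole parse-normalize-clamp pipeline might look like this standalone helper (a sketch under the same scale heuristics, not the module's actual implementation):

```python
import re
from typing import Optional


def normalize_score(score_text: str) -> Optional[float]:
    """Parse a raw score string ("0.8", "85%", "8.5/10") into a clamped [0, 1] float."""
    score_text = score_text.strip()
    pct = re.match(r"(\d+(?:\.\d+)?)\s*%", score_text)
    frac = re.match(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", score_text)
    if pct:
        score = float(pct.group(1)) / 100.0
    elif frac and float(frac.group(2)) > 0:
        score = float(frac.group(1)) / float(frac.group(2))
    else:
        try:
            score = float(score_text)
        except ValueError:
            return None  # unparseable, e.g. "n/a"
        if score > 1.0:  # heuristics for 0-10 and 0-100 scales
            score = score / 10.0 if score <= 10.0 else score / 100.0
    # Clamp so downstream validation never sees out-of-range values
    return max(0.0, min(1.0, score))
```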

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)

6-6: Fix CI/type-check failures and avoid None-args: lazy-import LiteLLM and build kwargs dynamically

Module-level import of litellm breaks CI/type checks when the dep isn’t installed, and passing None for provider params can cause API errors. Lazy-import inside the call path and construct kwargs conditionally.

Apply:

@@
- import litellm
@@
     def _call_llm(self, prompt: str, system_prompt: Optional[str] = None) -> str:
         """Make a LiteLLM call with the configured parameters."""
         # Prepare messages
         messages = []
         if system_prompt:
             messages.append({"role": "system", "content": system_prompt})
         messages.append({"role": "user", "content": prompt})
 
+        # Lazy-import to avoid import errors in environments without litellm
+        try:
+            import litellm  # type: ignore
+        except Exception as exc:
+            raise RuntimeError(
+                "litellm is required for CustomMetrics. Install with `pip install litellm`."
+            ) from exc
+
         try:
-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            if (to := self.litellm_params.get("timeout")) is not None:
+                # If your LiteLLM version expects 'request_timeout', switch the key here.
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)

Also applies to: 59-76


85-124: Harden score parsing: support “85%”, “Score: 8.5/10”, and clamp to [0,1]

Current logic fails on percent formats, may misinterpret scales, and doesn’t clamp. Normalize robustly and clamp the final score.

@@
-        # Try to find explicit score/reason format
+        # Try to find explicit score/reason format
         for line in lines:
             line = line.strip()
             if line.lower().startswith("score:"):
                 try:
-                    score_text = line.split(":", 1)[1].strip()
-                    score = float(score_text)
+                    score_text = line.split(":", 1)[1].strip()
+                    # Percent e.g., "85%"
+                    pct = re.match(r"(\d+(?:\.\d+)?)\s*%", score_text)
+                    if pct:
+                        score = float(pct.group(1)) / 100.0
+                    # Fraction or "out of" inside Score line, e.g., "8.5/10", "4 out of 5"
+                    elif "/" in score_text or re.search(r"\bout of\b", score_text, re.IGNORECASE):
+                        score = self._extract_score_from_text(score_text)
+                    else:
+                        score = float(score_text)
                 except (ValueError, IndexError):
                     pass
             elif line.lower().startswith("reason:"):
                 try:
                     reason = line.split(":", 1)[1].strip()
                 except IndexError:
                     pass
 
-        # If no explicit score found, try to extract from text
+        # If no explicit score found, try to extract from text (percent first, then numeric)
         if score is None:
-            score = self._extract_score_from_text(response)
+            pct = re.search(r"(\d+(?:\.\d+)?)\s*%", response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+            else:
+                score = self._extract_score_from_text(response)
 
-        # Normalize score to 0-1 range if needed
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        # Normalize and clamp into [0,1]
+        if score is not None:
+            if score > 1.0:
+                if score <= 10.0:  # Assume 0-10 scale
+                    score = score / 10.0
+                elif score <= 100.0:  # Assume 0-100 scale
+                    score = score / 100.0
+            score = max(0.0, min(1.0, score))
lsc_eval/runner.py (2)

6-12: Fix usage examples for src/ layout and import path.

"python -m runner" and "from runner import ..." won’t work in a src/ layout. Use the fully qualified package path.

-Usage:
-    python -m runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+Usage:
+    # Run as a module (src/ layout friendly):
+    python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
@@
-Or programmatically:
-    from runner import run_evaluation
-    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
+Or programmatically:
+    from lsc_eval.runner import run_evaluation
+    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")

93-97: Handle missing optional dependencies cleanly (catch ImportError).

Broaden the except tuple so missing libs (e.g., matplotlib/seaborn) don’t crash with an unhandled traceback.

-    except (FileNotFoundError, ValueError, RuntimeError) as e:
+    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
         print(f"\n❌ Evaluation failed: {e}")
         traceback.print_exc()
         return None
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)

5-5: Guard optional LiteLLM import and fail fast if unavailable.

Unconditional import litellm breaks CI when the package isn’t installed, and runtime should surface a clear error before calling into LiteLLM. This was previously flagged; please implement the guard and runtime check.

Apply this diff:

-from typing import Any, Dict, List, Optional
-
-import litellm
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+import logging
+import asyncio
+try:
+    import litellm  # pyright: ignore[reportMissingImports]
+    _LITELLM_AVAILABLE = True
+except Exception:  # pragma: no cover
+    litellm = None  # type: ignore[assignment]
+    _LITELLM_AVAILABLE = False
+logger = logging.getLogger(__name__)

And fail fast before the LiteLLM call:

-        try:
+        if not _LITELLM_AVAILABLE:
+            raise ImportError(
+                "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+            )
+        try:
             response = litellm.completion(

Also applies to: 34-35, 61-61


25-26: Honor the stop parameter in LiteLLM calls.

The function accepts stop but doesn’t forward it to LiteLLM. Please pass it through. This was noted earlier and is still unresolved.

Apply this diff:

             response = litellm.completion(
                 model=self.model_name,
                 messages=[{"role": "user", "content": prompt_text}],
                 n=n,
                 temperature=temp,
                 max_tokens=self.litellm_params.get("max_tokens"),
                 timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries"),
+                num_retries=self.litellm_params.get("num_retries"),
+                stop=stop,
             )

Also applies to: 34-43

🧹 Nitpick comments (12)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (5)

3-4: Use logging instead of print for library initialization status

Printing from libraries is noisy; prefer the standard logging pipeline.

@@
-import re
+import re
+import logging
@@
-
+logger = logging.getLogger(__name__)
@@
-        print(f"✅ Custom Metrics initialized: {self.model_name}")
+        logger.info("Custom Metrics initialized: %s", self.model_name)

Also applies to: 13-13, 40-40


4-4: Tighten contexts typing to match TurnData and help mypy/pydantic tooling

TurnData.contexts is List[Dict[str, str]]. Align the param model to reduce ambiguity.

-from typing import Any, Dict, Optional, Tuple
+from typing import Any, Dict, Optional, Tuple, List
@@
-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[List[Dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )

Also applies to: 21-21


180-189: Constrain output format to improve parse reliability

Tighten instructions so the model emits exactly two lines, reducing parser ambiguity.

         prompt_parts.extend(
             [
                 "",
-                f"Rate the {params.metric_name} and provide your reasoning.",
+                f"Rate the {params.metric_name} and provide your reasoning.",
                 "",
-                "Format your response as:",
-                f"Score: [your score on {params.scale}]",
-                "Reason: [your detailed explanation]",
+                "Output exactly two lines in this exact format (no extra text before or after):",
+                f"Score: <number on {params.scale}>",
+                "Reason: <brief explanation without additional numeric scores>",
             ]
         )

229-231: Nudge the model with a minimal system prompt to enforce structure

A short system prompt often cuts variance and helps parsing.

-        llm_response = self._call_llm(prompt)
+        llm_response = self._call_llm(
+            prompt,
+            system_prompt=(
+                "You are a strict evaluator. Respond in exactly two lines: "
+                "'Score: <number>' on the first line and 'Reason: <text>' on the second."
+            ),
+        )

158-191: Optional: consider adding guardrails for very large context lists

If contexts can be long, consider truncating or summarizing to keep prompts within provider limits/cost targets.

Would you like a helper that truncates contexts by token budget (e.g., tiktoken/litellm token counting) and appends “+N more items omitted”?
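One possible shape for such a helper, using a plain character budget as a stand-in for real token counting (tiktoken or litellm token counting would slot in where noted; names here are illustrative):

```python
from typing import Dict, List, Tuple


def truncate_contexts(
    contexts: List[Dict[str, str]], budget_chars: int
) -> Tuple[List[Dict[str, str]], int]:
    """Keep contexts until the budget is spent; report how many were dropped."""
    kept: List[Dict[str, str]] = []
    used = 0
    for ctx in contexts:
        # Swap len() for a tokenizer count (e.g. tiktoken) for a true token budget.
        cost = sum(len(v) for v in ctx.values())
        if used + cost > budget_chars:
            break
        kept.append(ctx)
        used += cost
    return kept, len(contexts) - len(kept)


kept, omitted = truncate_contexts(
    [{"content": "a" * 40}, {"content": "b" * 40}, {"content": "c" * 40}],
    budget_chars=90,
)
# A prompt builder could then append f"+{omitted} more items omitted" when omitted > 0.
```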

lsc_eval/runner.py (3)

48-50: Avoid double validation of evaluation data.

DataValidator.load_evaluation_data() validates, and engine.run_evaluation() validates again. Consider trusting the already-validated input or adding a skip_validation flag in the engine to avoid redundant work on large datasets.

Also applies to: 58-61


114-131: Resolve CLI paths to absolute early for predictable behavior.

Aligns with the prior learning to prefer absolute paths internally; improves logs and downstream file handling consistency.

     args = parser.parse_args()
 
-    # CRITICAL: Setup environment variables from system config FIRST
-    setup_environment_variables(args.system_config)
+    # Normalize to absolute paths for consistency
+    system_config_path = str(Path(args.system_config).resolve())
+    eval_data_path = str(Path(args.eval_data).resolve())
+
+    # CRITICAL: Setup environment variables from system config FIRST
+    setup_environment_variables(system_config_path)
 
     # Validate input files exist
-    if not Path(args.system_config).exists():
-        print(f"❌ System config file not found: {args.system_config}")
+    if not Path(system_config_path).exists():
+        print(f"❌ System config file not found: {system_config_path}")
         return 1
 
-    if not Path(args.eval_data).exists():
-        print(f"❌ Evaluation data file not found: {args.eval_data}")
+    if not Path(eval_data_path).exists():
+        print(f"❌ Evaluation data file not found: {eval_data_path}")
         return 1
 
     # Run evaluation
-    summary = run_evaluation(args.system_config, args.eval_data, args.output_dir)
+    summary = run_evaluation(system_config_path, eval_data_path, args.output_dir)

134-136: Add minimal tests for CLI and programmatic runner.

Even if broader tests land later, add a smoke test for:

  • non-existent files → exit code 1 and user-friendly message
  • happy path with stubbed engine/validator → exit code 0 and dict summary shape

I can scaffold pytest tests that monkeypatch ConfigLoader/DataValidator/EvaluationEngine to avoid I/O and assert exit codes/output. Want me to draft them?

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (4)

18-18: Replace print statements with logging.

Library code shouldn’t print; use the module logger to respect caller logging config and avoid noisy stdout.

Apply this diff:

-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        logger.info("Ragas Custom LLM initialized: %s", self.model_name)
@@
-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.exception("Ragas LLM failed")
+            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
@@
-        print("✅ Ragas LLM Manager configured")
+        logger.info("Ragas LLM Manager configured")

Also applies to: 57-59, 95-95


101-106: Expose full model info for consistency with other managers.

Align with deepeval_llm.get_model_info by surfacing max_tokens, timeout, and num_retries.

Apply this diff:

     def get_model_info(self) -> Dict[str, Any]:
         """Get information about the configured model."""
         return {
             "model_name": self.model_name,
-            "temperature": self.litellm_params.get("temperature", 0.0),
+            "temperature": self.litellm_params.get("temperature", 0.0),
+            "max_tokens": self.litellm_params.get("max_tokens"),
+            "timeout": self.litellm_params.get("timeout"),
+            "num_retries": self.litellm_params.get("num_retries", 3),
         }

91-94: Global side effects on metric singletons.

Rebinding answer_relevancy.llm and faithfulness.llm mutates module-level state and may surprise other tests/components if multiple managers are instantiated. If feasible, pass the LLM per-evaluation, or document that this manager globally configures the Ragas metrics.


20-27: Unify temperature handling across sync/async (optional)

I ran the ripgrep check and confirmed that the only occurrences of the 1e-08 sentinel live in lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py; no other modules or call sites depend on that exact value. If you’d like to eliminate this “magic number” and centralize the default-temperature logic, you can:

  • In generate_text(...) (lines 20–27):

    • Change the signature from
      def generate_text(
          …,
          temperature: float = 1e-08,
          …
      ) -> LLMResult:
      to
      def generate_text(
          …,
          temperature: Optional[float] = None,
          …
      ) -> LLMResult:
    • Update the temp-fallback from
      temp = temperature if temperature != 1e-08 else self.litellm_params.get("temperature", 0.0)
      to
      temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)
  • In generate_text_async(...) (lines 68–72):

    • Remove the explicit fallback to 1e-08:
      - temp = temperature if temperature is not None else 1e-08
      - return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
      + return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)

This optional refactor eliminates the numeric sentinel and lets the synchronous path alone handle defaulting to litellm_params["temperature"].
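A minimal sketch of the resulting default-temperature logic, assuming litellm_params carries the configured default (names mirror the snippets above but the function is illustrative):

```python
from typing import Optional

# Illustrative config; in the real manager this comes from system config.
litellm_params = {"temperature": 0.2}

def resolve_temperature(temperature: Optional[float] = None) -> float:
    # None (not a numeric sentinel like 1e-08) signals "use the configured default".
    return temperature if temperature is not None else litellm_params.get("temperature", 0.0)

assert resolve_temperature() == 0.2     # falls back to configured default
assert resolve_temperature(0.7) == 0.7  # explicit value wins
assert resolve_temperature(0.0) == 0.0  # 0.0 is a valid explicit value, not "unset"
```

The last assertion is the point of the refactor: with a numeric sentinel, a caller passing a value numerically close to it would be silently overridden, whereas `None` is unambiguous.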

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eac9f72 and e5fcdee.

📒 Files selected for processing (12)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
  • lsc_eval/src/lsc_eval/core/config_loader.py
  • lsc_eval/src/lsc_eval/__init__.py
  • lsc_eval/src/lsc_eval/core/data_validator.py
  • lsc_eval/pyproject.toml
  • lsc_eval/src/lsc_eval/evaluation_engine.py
  • lsc_eval/src/lsc_eval/output/visualization.py
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
  • lsc_eval/src/lsc_eval/output/output_handler.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_eval/runner.py
🧬 Code graph analysis (3)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (4)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (2)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/runner.py (5)
lsc_eval/src/lsc_eval/core/config_loader.py (3)
  • ConfigLoader (179-259)
  • setup_environment_variables (30-47)
  • load_system_config (188-243)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • load_evaluation_data (19-37)
lsc_eval/src/lsc_eval/evaluation_engine.py (2)
  • EvaluationEngine (106-296)
  • run_evaluation (134-158)
lsc_eval/src/lsc_eval/output/output_handler.py (2)
  • OutputHandler (15-224)
  • generate_reports (32-76)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • calculate_basic_stats (19-45)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • get_llm (31-33)
  • get_model_info (35-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: black
  • GitHub Check: ruff
  • GitHub Check: Pyright
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: Pylinter
  • GitHub Check: pydocstyle
🔇 Additional comments (4)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)

42-48: LGTM: public evaluate signature is typed and consistent

Good call making conv_data typed as Any and returning Tuple[Optional[float], str]. This aligns with deepeval_metrics.evaluate and unblocks mypy.


193-200: Minor: parameters prefixed with “_” are correctly typed and unused

The unused _conv_data and optional _turn_idx are properly annotated; no action needed.

lsc_eval/runner.py (2)

99-101: Good: main() has an explicit return type.

This satisfies type checkers and matches the sys.exit(main()) pattern.
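For reference, a minimal sketch of the sys.exit(main()) pattern the comment refers to (the body is illustrative, not the runner's actual logic):

```python
import sys

def main() -> int:
    """Return an exit status instead of calling sys.exit() internally."""
    try:
        # ... run the evaluation pipeline (omitted in this sketch) ...
        return 0
    except Exception:  # broad catch only at the process boundary
        return 1

if __name__ == "__main__":
    sys.exit(main())
```

Returning an int keeps `main()` directly testable while `sys.exit(main())` still yields the right process exit code.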


20-21: No changes needed: package-level imports are valid.

The root and core packages both re-export the referenced symbols, so the existing imports will resolve correctly at runtime:

  • lsc_eval/src/lsc_eval/__init__.py exports ConfigLoader, DataValidator, EvaluationEngine, and OutputHandler in its __all__.
  • lsc_eval/src/lsc_eval/core/__init__.py exports setup_environment_variables alongside ConfigLoader and DataValidator.

You can safely keep the current imports:

from lsc_eval import ConfigLoader, DataValidator, EvaluationEngine, OutputHandler
from lsc_eval.core import setup_environment_variables

Likely an incorrect or invalid review comment.

Comment on lines 61 to 71
    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
        self,
        prompt: Any,
        n: int = 1,
        temperature: Optional[float] = None,
        stop: Optional[List[str]] = None,
        callbacks: Optional[Any] = None,
    ) -> LLMResult:
        """Async generate."""
        temp = temperature if temperature is not None else 1e-08
        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)

🛠️ Refactor suggestion

Async method currently blocks the event loop; use litellm.acompletion or offload to a thread.

agenerate_text delegates to the sync method, defeating the purpose of async and potentially stalling callers.

Apply this diff:

     async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
         temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
-        """Async generate."""
-        temp = temperature if temperature is not None else 1e-08
-        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+        """Async generate using LiteLLM's acompletion when available; falls back to a thread."""
+        temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)
+        if not _LITELLM_AVAILABLE:
+            raise ImportError(
+                "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+            )
+        try:
+            if hasattr(litellm, "acompletion"):
+                response = await litellm.acompletion(
+                    model=self.model_name,
+                    messages=[{"role": "user", "content": str(prompt)}],
+                    n=n,
+                    temperature=temp,
+                    max_tokens=self.litellm_params.get("max_tokens"),
+                    timeout=self.litellm_params.get("timeout"),
+                    num_retries=self.litellm_params.get("num_retries"),
+                    stop=stop,
+                )
+                generations: List[Generation] = []
+                for choice in response.choices:  # type: ignore[attr-defined]
+                    content = getattr(choice.message, "content", "") or ""
+                    generations.append(Generation(text=content.strip()))
+                return LLMResult(generations=[generations])
+            # Fallback: offload sync call to a thread to avoid blocking
+            return await asyncio.to_thread(
+                self.generate_text,
+                prompt,
+                n,
+                temp,
+                stop,
+                callbacks,
+            )
+        except Exception as e:
+            logger.exception("Async Ragas LLM failed")
+            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py around lines 61–71, the async
method agenerate_text currently calls the synchronous generate_text and blocks
the event loop; replace that blocking call with a true async implementation by
either awaiting litellm.acompletion with the same parameters (ensuring you map
its result into an LLMResult and preserve the default temperature logic) or, if
litellm.acompletion is not available, offload the sync call to a thread using
asyncio.to_thread(self.generate_text, prompt, n=n, temperature=temp, stop=stop,
callbacks=callbacks) and return its LLMResult; ensure exceptions propagate
correctly and the signature/return type remain unchanged.
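A minimal, generic sketch of the asyncio.to_thread fallback mentioned above (blocking_generate is a hypothetical stand-in for the synchronous LLM call):

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for a synchronous LLM call that would otherwise block the loop.
    time.sleep(0.01)
    return f"echo:{prompt}"

async def agenerate(prompt: str) -> str:
    # Offload the blocking call to a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(agenerate("hi"))
assert result == "echo:hi"
```

`asyncio.to_thread` (Python 3.9+) propagates exceptions from the worker thread, so the existing error-handling contract is preserved.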

Comment on lines +52 to +63
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"

🛠️ Refactor suggestion

Prevent hard crashes by catching _call_llm RuntimeError in evaluate()

An unhandled RuntimeError from _call_llm will bubble out and crash the evaluation. Include it in the caught exceptions so the API consistently returns a (None, reason) tuple.

-        except (ValueError, AttributeError, KeyError) as e:
+        except (ValueError, AttributeError, KeyError, RuntimeError) as e:
             return None, f"Custom {metric_name} evaluation failed: {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError, RuntimeError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 52 to 57, the
except block currently catches ValueError, AttributeError, and KeyError but
misses RuntimeError from _call_llm which can bubble up; update the except clause
to also catch RuntimeError so the function returns (None, f"Custom {metric_name}
evaluation failed: {str(e)}") consistently when _call_llm raises a RuntimeError.

@asamal4 asamal4 force-pushed the generic-eval-tool branch from 5b854a6 to d94afec on August 26, 2025 at 11:10
@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

This is still WIP. Most of the issues will be resolved when we make this the primary eval tool and remove the existing one.

@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

@tisnik PTAL
This is standalone code and doesn't impact any of the existing code. Most CodeRabbit issues are related to imports, which will be resolved once we make this the primary tool.
This is a big change, so it is practically impossible to review line by line; I would suggest reviewing it by running/testing it.
I will continue to improve it, fix issues, and add test cases, but meanwhile can we please merge this so that others can contribute in parallel?

If we decide not to use this, we can simply delete the folder in the future, so there is no impact at all.

cc: @VladimirKadlec @Anxhela21

@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

Also, the Bandit issue is not related to this code change; we will have to fix it separately.

Contributor

@tisnik tisnik left a comment

LGTM, very good job!

Contributor

@Anxhela21 Anxhela21 left a comment

/lgtm
/approve

