Conversation

@asamal4
Collaborator

@asamal4 asamal4 commented Aug 26, 2025

Generic & configurable eval tool/framework (no dependency on lightspeed-stack)

Test cases are pending and will be added in follow-up PRs.
This is still a WIP; adding it to the repo so others can test.

Summary by CodeRabbit

  • New Features

    • Adds the LSC Evaluation Framework and CLI (lsc-eval) for running turn- and conversation-level evaluations; programmatic runner also available.
    • Built-in metric support: Ragas, DeepEval, and Custom LLM-based metrics with configurable thresholds.
    • Generates CSV/JSON/TXT reports and optional graphs (pass rates, score distribution, conversation heatmap).
    • Includes sample system and evaluation configuration files.
  • Documentation

    • Adds a comprehensive README with quick start, configuration, and usage guidance.
  • Chores

    • Updates .gitignore and expands tooling targets (type checks, linters, auto-format fix).

@coderabbitai
Contributor

coderabbitai bot commented Aug 26, 2025

Warning

Rate limit exceeded

@asamal4 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 0 minutes and 8 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between e5fcdee and d94afec.

📒 Files selected for processing (17)
  • lsc_eval/README.md (1 hunks)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/models.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/utils.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 45

🧹 Nitpick comments (42)
.gitignore (2)

181-182: Remove redundant ignore: lsc_eval/eval_output*/ is unnecessary

The existing pattern eval_output*/ (Line 180) already matches directories named eval_output* at any depth in the repo. The added lsc_eval/eval_output*/ is redundant and can be dropped to avoid drift and duplication.

Apply this minimal diff:

- lsc_eval/eval_output*/
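A quick way to verify the claim that the slash-free pattern already matches at any depth is `git check-ignore` in a scratch repo (a sketch; the directory names are illustrative):

```shell
# Verify that 'eval_output*/' alone ignores nested eval_output* directories.
# A gitignore pattern with no internal slash matches at any depth; the
# trailing slash only restricts the match to directories.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf 'eval_output*/\n' > .gitignore
mkdir -p lsc_eval/eval_output_run1
# Prints the path, confirming it is ignored by the single root pattern
git check-ignore lsc_eval/eval_output_run1/
```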

183-184: Broaden .deepeval ignore to any location, not just under lsc_eval/

DeepEval may create .deepeval/ at the repo root (or elsewhere). Scoping the ignore to lsc_eval/.deepeval/ could miss those artifacts. Prefer ignoring .deepeval/ globally.

Proposed change:

-# DeepEval telemetry and configuration
-lsc_eval/.deepeval/
+# DeepEval telemetry and configuration
+.deepeval/

If you explicitly want to scope it to any subdir (not only root), you could use **/.deepeval/ instead.

lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)

29-29: Use logging instead of print for library code.

Printing in a library is noisy for CLI/consumers; prefer logging.

-        print(f"✅ DeepEval LLM Manager: {self.model_name}")
+        logger.info("DeepEval LLM Manager initialized for model %s", self.model_name)
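The diffs above reference a `logger` that the snippets never define; a conventional module-level setup (the class name below is a hypothetical stand-in for the real manager) would be:

```python
import logging

# Module-level logger; inherits handlers and level from the application's
# logging configuration, so library code stays quiet by default.
logger = logging.getLogger(__name__)


class DeepEvalLLMManagerSketch:  # hypothetical stand-in for DeepEvalLLMManager
    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        # Lazy %-formatting defers interpolation until a handler emits the record
        logger.info("DeepEval LLM Manager initialized for model %s", self.model_name)
```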

31-33: Stringize or postpone annotations to avoid import-time typing issues.

If DeepEval isn’t installed, exported annotations referencing LiteLLMModel can still cause tooling churn. With the guard above, you’re mostly fine; optionally stringize the return type to be extra-safe.

-    def get_llm(self) -> LiteLLMModel:
+    def get_llm(self) -> "LiteLLMModel":
         """Get the configured DeepEval LiteLLM model."""
         return self.llm_model
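A file-wide alternative to stringizing individual annotations is postponed evaluation via `from __future__ import annotations`, optionally combined with a `TYPE_CHECKING` guard so the dependency is only imported by static type checkers (a sketch; the `LiteLLMModel` import path is assumed from the diff above):

```python
from __future__ import annotations  # all annotations are evaluated lazily as strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; no runtime dependency on deepeval
    from deepeval.models import LiteLLMModel


class ManagerSketch:  # hypothetical stand-in for the LLM manager
    def __init__(self, llm_model) -> None:
        self.llm_model = llm_model

    def get_llm(self) -> LiteLLMModel:
        """Return the configured model; the annotation is never evaluated at runtime."""
        return self.llm_model
```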
lsc_eval/README.md (5)

3-3: Tighten intro grammar.

-A comprehensive framework/tool to evaluate GenAI application.
+A comprehensive tool to evaluate GenAI applications.

19-25: Unify install instructions with repo org and mention optional deps.

Ensure readers install the correct repo and, when using Ragas/DeepEval metrics, pull optional dependencies.

-# From Git
-pdm add git+https://github.com/your-org/lightspeed-evaluation.git#subdirectory=lsc_eval
-# or pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# From Git (PDM)
+pdm add git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# Or via pip
+# pip install "git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval"
+
+# Optional metric backends:
+# Ragas + LiteLLM
+# pdm add ragas litellm
+# DeepEval + LiteLLM
+# pdm add deepeval litellm

32-34: Prefer the CLI entrypoint if provided, keep runner.py as an alternative.

The PR summary mentions a console script. Consider showcasing it first.

-# Run evaluation (Create your own data)
-python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Run evaluation (create your own data)
+# If installed as a package with console script:
+# lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Or directly via module:
+python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml

121-125: Clarify “Actual Reasons” wording.

Minor readability tweak.

-**Actual Reasons**: DeepEval provides LLM-generated explanations, Custom metrics provide detailed reasoning
+**Reasons/Explanations**: DeepEval provides LLM-generated explanations; custom metrics include detailed reasoning

126-133: Pin the dev tools order and add pyright/pylint notes.

Not required, but it helps new contributors reproduce CI locally.

 pdm run black .
 pdm run ruff check .
 pdm run mypy .
-pdm run pyright .
-pdm run pylint .
+pdm run pyright .
+pdm run pylint .
+# Tip: if you don't use DeepEval/Ragas, you can still run linters;
+# optional imports in code are guarded to avoid hard failures.
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (3)

57-59: Use logging for errors instead of print.

Keep libraries quiet unless explicitly configured.

-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.error("Ragas LLM failed: %s", e)
+            raise RuntimeError(f"Ragas LLM evaluation failed: {e}") from e

18-18: Use logging instead of print on initialization.

-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        logger.info("Ragas Custom LLM initialized for model %s", self.model_name)

91-95: Avoid global metric side effects or make them opt-in.

Assigning to answer_relevancy.llm and faithfulness.llm mutates module-level singletons and can surprise callers if multiple managers exist. Prefer per-metric instances or make this opt-in.

-        # Configure Ragas metrics to use our custom LLM
-        answer_relevancy.llm = self.custom_llm
-        faithfulness.llm = self.custom_llm
-        print("✅ Ragas LLM Manager configured")
+        # Optional: configure default Ragas metrics to use our custom LLM
+        if self.litellm_params.get("inject_global_llm", True):
+            answer_relevancy.llm = self.custom_llm
+            faithfulness.llm = self.custom_llm
+        logger.info("Ragas LLM Manager configured (inject_global_llm=%s)", self.litellm_params.get("inject_global_llm", True))

If you keep global assignment, document it in README to set expectations.
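One low-risk way to avoid mutating the module-level singletons is to copy them before attaching the manager's LLM (sketched here with a dummy metric object, since only the `.llm` attribute assignment is known from the code above):

```python
from copy import deepcopy


class DummyMetric:  # stand-in for a module-level Ragas metric singleton
    def __init__(self) -> None:
        self.llm = None


answer_relevancy = DummyMetric()  # module-level "singleton"


def make_local_metric(shared_metric, llm):
    """Return a private copy wired to `llm`, leaving the shared instance untouched."""
    local = deepcopy(shared_metric)
    local.llm = llm
    return local


local_metric = make_local_metric(answer_relevancy, llm="my-custom-llm")
```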

lsc_eval/src/lsc_eval/metrics/__init__.py (1)

3-7: Consider lazy imports to avoid heavyweight import costs

If startup time matters for CLI runs that only use a subset of frameworks, switch to local (on-demand) imports inside the evaluation dispatch or use a lightweight registry to avoid importing ragas/deepeval unless needed.

lsc_eval/src/lsc_eval/llm_managers/__init__.py (1)

3-7: Re-export LLMConfig to avoid deep imports

Consumers often need the config type; re-exporting it keeps imports consistent with this package’s public surface.

Apply:

-from .llm_manager import LLMManager
+from .llm_manager import LLMManager, LLMConfig
@@
-__all__ = ["LLMManager", "RagasLLMManager", "DeepEvalLLMManager"]
+__all__ = ["LLMManager", "LLMConfig", "RagasLLMManager", "DeepEvalLLMManager"]
lsc_eval/pyproject.toml (1)

8-8: Use a valid SPDX license identifier

“Apache-2.0” is the canonical SPDX identifier and improves license detection.

-license = {text = "Apache"}
+license = {text = "Apache-2.0"}
lsc_eval/config/evaluation_data.yaml (1)

46-47: Add missing group description for conv_group_3 (consistency with other groups)

Earlier groups include a description. If your model validation requires it, its absence here will fail validation; add one for consistency.

 - conversation_group_id: "conv_group_3"
- 
+  description: "conversation group description"
lsc_eval/src/lsc_eval/core/__init__.py (3)

3-5: Public API is coherent; one naming collision to consider.

Re-exporting core types is handy. However, exposing LLMConfig from .models can be confused with the different LLMConfig defined in llm_managers/llm_manager.py. Consider dropping LLMConfig from this module’s public surface or renaming one of the two to avoid accidental imports.


7-17: Black reformat failure — run formatter.

CI shows Black would reformat this file. Please run the project’s formatter to unblock CI.


1-1: Black reformat failure — run formatter.

CI flagged Black on this file too. After edits, run Black to satisfy CI.

lsc_eval/src/lsc_eval/output/visualization.py (2)

1-1: Black reformat failure — run formatter.

CI flagged Black on this file. After applying the above changes, please run Black to normalize formatting.


204-213: Duplicate-code warning is acceptable; consider a helper later.

Pylint reports duplicate lines vs. another module. Given this is a new, decoupled tool, we can defer, but extracting shared plotting helpers under output/utils.py would reduce drift long-term.

lsc_eval/runner.py (1)

1-1: Black reformat failure — run formatter.

After the above edits, run Black to satisfy CI.

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (4)

51-61: Type hints for _evaluate_metric.

Apply:

-    def _evaluate_metric(self, metric, test_case) -> Tuple[float, str]:
+    def _evaluate_metric(self, metric: Any, test_case: Any) -> Tuple[float, str]:

79-94: Type hints for conversation_completeness.

Apply:

-    def _evaluate_conversation_completeness(
+    def _evaluate_conversation_completeness(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:

95-114: Type hints for conversation_relevancy.

Apply:

-    def _evaluate_conversation_relevancy(
+    def _evaluate_conversation_relevancy(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:

115-133: Type hints for knowledge_retention.

Apply:

-    def _evaluate_knowledge_retention(
+    def _evaluate_knowledge_retention(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
lsc_eval/src/lsc_eval/core/data_validator.py (1)

35-36: Prefer structured logging over prints

Switch to the project’s logging setup for consistent, filterable output (especially useful in CI). If logging is not wired here yet, keep prints for now and follow up later.

lsc_eval/src/lsc_eval/output/output_handler.py (1)

80-82: Remove stale comment about DataFrame

The CSV writer doesn’t use a DataFrame. Removing this avoids confusion.

-        # Move to dataframe for better aggregation
         csv_file = self.output_dir / f"{base_filename}_detailed.csv"
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)

66-72: Use logger.warning for generic provider path; include remediation hint

Using a generic provider silently can lead to confusing runtime errors when LiteLLM rejects the prefix. Emit a warning via logging and suggest valid providers.

-        print(f"⚠️ Using generic provider format for {provider}")
+        logger.warning("Using generic provider format for '%s'. Verify LiteLLM supports this prefix.", provider)

106-112: Demote noisy Ollama notice to debug and route through logger

Running locally without OLLAMA_HOST is normal; keep logs clean by using debug level.

-        if not os.environ.get("OLLAMA_HOST"):
-            print("ℹ️ OLLAMA_HOST not set, using default localhost:11434")
+        if not os.environ.get("OLLAMA_HOST"):
+            logger.debug("OLLAMA_HOST not set; defaulting to localhost:11434")

12-33: Naming collision with another LLMConfig; consolidate or rename

There’s a second LLMConfig (Pydantic) in core/models.py. Two different LLMConfig types across modules will cause confusion and type mismatches in IDEs and reviews.

Options:

  • Prefer a single canonical config model (e.g., the Pydantic one) and derive the LiteLLM params from it.
  • Or rename this dataclass to LLMManagerConfig to prevent ambiguity.

148-156: Double-check LiteLLM parameter names: timeout/request_timeout, num_retries

LiteLLM’s completion kwargs use specific names. Some versions use request_timeout instead of timeout. Validate and adjust to avoid silent no-ops.

If needed:

-            "timeout": self.config.timeout,
+            "request_timeout": self.config.timeout,

Please verify against the LiteLLM version in your pyproject.

lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)

14-23: Type the contexts field precisely

Use Optional[List[Dict[str, str]]] to align with TurnData.contexts and improve static checks.

-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[list[dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )

If Python <3.9 support is needed, use typing.List/Dict instead.


77-81: Harden choice/content extraction across providers

Some providers return dicts instead of objects. Add a safe fallback; raise if missing.

-            content = response.choices[0].message.content  # type: ignore
+            choice0 = response.choices[0]  # type: ignore[index]
+            msg = getattr(choice0, "message", getattr(choice0, "delta", None))
+            content = None if msg is None else (getattr(msg, "content", None) or getattr(msg, "text", None) or (isinstance(msg, dict) and msg.get("content")))

And keep the existing empty-response check.
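A standalone version of that fallback, written as a hypothetical helper (the dict/object shapes are assumptions about provider variance, not a documented LiteLLM contract):

```python
from types import SimpleNamespace
from typing import Any


def extract_content(response: Any) -> str:
    """Best-effort extraction of message content from object- or dict-style responses."""
    choices = response["choices"] if isinstance(response, dict) else response.choices
    choice = choices[0]
    if isinstance(choice, dict):
        msg = choice.get("message") or choice.get("delta") or {}
        content = msg.get("content") or msg.get("text")
    else:
        msg = getattr(choice, "message", None) or getattr(choice, "delta", None)
        content = getattr(msg, "content", None) or getattr(msg, "text", None)
    if not content:
        raise ValueError("LLM response contained no message content")
    return content


# Object-style response, as LiteLLM typically returns
obj_resp = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))])
```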


1-41: Run Black; minor nits only

Black check is failing. After the above diffs, please run Black to settle whitespace/line-wrapping.

lsc_eval/src/lsc_eval/core/config_loader.py (3)

179-184: Add return type for init and prefer explicit None assignments

Minor typing/readability polish.

-    def __init__(self):
+    def __init__(self) -> None:
         """Initialize Config Loader."""
         self.system_config = None
         self.evaluation_data = None
         self.logger = None

140-176: Duplicate “system config” models across modules

This file defines SystemConfig while core/models.py defines EvaluationSystemConfig. Two overlapping system config models increase drift risk.

Consider keeping only one canonical system config model (e.g., in core/models.py) and have ConfigLoader populate it. If you keep both, add conversion helpers and tests to ensure parity.


1-1: Run Black to fix formatting

CI shows Black would reformat this file.

lsc_eval/src/lsc_eval/core/models.py (4)

89-124: Validate metric requirements only for turn metrics; consider conversation-level too

Today only turn_metrics are checked for context/expected_response. If any conversation-level metrics require contexts/answers, this will miss them.

Add analogous checks for conversation_metrics if needed by your metric set.
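Such a check could look like the following post-init validation (a stdlib dataclass is shown for brevity; in the actual Pydantic models this would live in a `@model_validator(mode="after")` hook, and the field names and metric requirements here are illustrative):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversationSketch:  # hypothetical stand-in for EvaluationData
    conversation_metrics: List[str] = field(default_factory=list)
    turns: List[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Illustrative rule: completeness/retention metrics need multi-turn history
        needs_history = {
            "deepeval:conversation_completeness",
            "deepeval:knowledge_retention",
        }
        if needs_history.intersection(self.conversation_metrics) and len(self.turns) < 2:
            raise ValueError("conversation-level metrics require at least two turns of data")
```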


165-183: Name clash: LLMConfig here vs llm_managers.LLMConfig

Two different LLMConfig classes exist across modules. This is confusing for maintainers and tooling.

  • Prefer a single shared LLMConfig (likely this Pydantic model) and have LLMManager accept it.
  • If you need a lightweight runtime class for LLMManager, rename it to LLMManagerConfig.
  • Add tests to ensure config serialization/deserialization remains consistent.

185-212: Duplication with SystemConfig in config_loader

EvaluationSystemConfig overlaps with SystemConfig. Consolidate to one to avoid divergence in future changes.


1-1: Run Black to fix formatting

Black check is failing on this file as well.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ffe1d6a and eac9f72.

📒 Files selected for processing (25)
  • .gitignore (1 hunks)
  • Makefile (3 hunks)
  • lsc_eval/README.md (1 hunks)
  • lsc_eval/config/evaluation_data.yaml (1 hunks)
  • lsc_eval/config/system.yaml (1 hunks)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/models.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/utils.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_eval/runner.py
🧬 Code graph analysis (19)
lsc_eval/src/lsc_eval/output/__init__.py (2)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
  • OutputHandler (15-224)
lsc_eval/src/lsc_eval/output/visualization.py (1)
  • GraphGenerator (17-412)
lsc_eval/src/lsc_eval/output/utils.py (1)
lsc_eval/src/lsc_eval/core/models.py (2)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/metrics/__init__.py (3)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
  • CustomMetrics (25-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1)
  • DeepEvalMetrics (19-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1)
  • RagasMetrics (23-247)
lsc_eval/src/lsc_eval/core/__init__.py (4)
lsc_eval/src/lsc_eval/core/config_loader.py (4)
  • ConfigLoader (176-256)
  • SystemConfig (140-173)
  • setup_environment_variables (27-44)
  • validate_metrics (123-137)
lsc_eval/src/lsc_eval/core/models.py (5)
  • validate_metrics (82-87)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • LLMConfig (165-182)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
  • DataValidator (11-82)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMConfig (13-33)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
lsc_eval/src/lsc_eval/core/models.py (1)
  • LLMConfig (165-182)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1)
  • from_system_config (244-247)
lsc_eval/src/lsc_eval/output/output_handler.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • EvaluationResult (126-162)
lsc_eval/src/lsc_eval/output/utils.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
lsc_eval/src/lsc_eval/output/visualization.py (2)
  • GraphGenerator (17-412)
  • generate_all_graphs (53-103)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (6)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
  • evaluate (42-57)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (2)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (5)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_dict (24-33)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)
  • RagasLLMManager (78-106)
  • get_llm (97-99)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)
  • get_llm (97-99)
  • get_model_info (101-106)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • DeepEvalLLMManager (8-43)
  • get_llm (31-33)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/visualization.py (2)
lsc_eval/src/lsc_eval/core/models.py (1)
  • EvaluationResult (126-162)
lsc_eval/src/lsc_eval/output/utils.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
lsc_eval/src/lsc_eval/llm_managers/__init__.py (3)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
  • DeepEvalLLMManager (8-43)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMManager (36-167)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
  • RagasLLMManager (78-106)
lsc_eval/src/lsc_eval/core/config_loader.py (1)
lsc_eval/src/lsc_eval/core/models.py (1)
  • validate_metrics (82-87)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/__init__.py (6)
lsc_eval/src/lsc_eval/core/config_loader.py (2)
  • ConfigLoader (176-256)
  • SystemConfig (140-173)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
  • DataValidator (11-82)
lsc_eval/src/lsc_eval/core/models.py (3)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/evaluation_engine.py (1)
  • EvaluationEngine (104-294)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMManager (36-167)
lsc_eval/src/lsc_eval/output/output_handler.py (1)
  • OutputHandler (15-224)
lsc_eval/runner.py (5)
lsc_eval/src/lsc_eval/core/config_loader.py (3)
  • ConfigLoader (176-256)
  • setup_environment_variables (27-44)
  • load_system_config (185-240)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • load_evaluation_data (19-37)
lsc_eval/src/lsc_eval/evaluation_engine.py (1)
  • EvaluationEngine (104-294)
lsc_eval/src/lsc_eval/output/output_handler.py (2)
  • OutputHandler (15-224)
  • generate_reports (32-76)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • calculate_basic_stats (19-45)
lsc_eval/src/lsc_eval/core/data_validator.py (1)
lsc_eval/src/lsc_eval/core/models.py (2)
  • EvaluationData (47-123)
  • validate_metric_requirements (89-123)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • get_llm (31-33)
  • get_model_info (35-43)
lsc_eval/src/lsc_eval/evaluation_engine.py (8)
lsc_eval/src/lsc_eval/core/config_loader.py (2)
  • ConfigLoader (176-256)
  • get_llm_config_dict (242-256)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • validate_evaluation_data (39-54)
lsc_eval/src/lsc_eval/core/models.py (3)
  • EvaluationData (47-123)
  • EvaluationResult (126-162)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)
  • LLMManager (36-167)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
  • CustomMetrics (25-242)
  • evaluate (42-57)
  • from_system_config (239-242)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (3)
  • DeepEvalMetrics (19-138)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (3)
  • RagasMetrics (23-247)
  • evaluate (78-107)
  • from_system_config (244-247)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/core/models.py (2)
lsc_eval/src/lsc_eval/core/config_loader.py (1)
  • validate_metrics (123-137)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1)
  • LLMConfig (13-33)
🪛 YAMLlint (1.37.1)
lsc_eval/config/evaluation_data.yaml

[error] trailing spaces on lines 10, 11, 12, 16, 31, 33, 37, 47, 50, 52, 56, 58, 66 (trailing-spaces)

lsc_eval/config/system.yaml

[error] trailing spaces on lines 16, 23, 26, 30, 51, 57, 64, 70, 76, 82, 89, 98, 101, 104, 120, 124, 138 (trailing-spaces)

🪛 LanguageTool
lsc_eval/README.md

[grammar] QB_NEW_EN ("There might be a mistake here") flagged at lines ~5, ~15, ~36, ~38, ~39, ~48, ~52, ~113, ~115, ~116, ~117, ~118, ~121, ~122, ~135, and twice at ~139.

🪛 GitHub Actions: Black
lsc_eval/src/lsc_eval/output/utils.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/__init__.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/llm_managers/llm_manager.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/output/output_handler.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/output/visualization.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/config_loader.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/runner.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/evaluation_engine.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

lsc_eval/src/lsc_eval/core/models.py

[error] 1-1: Black would reformat this file. Command 'pdm run black --check .' failed with exit code 1.

🪛 GitHub Actions: Type checks
lsc_eval/src/lsc_eval/output/output_handler.py

[error] 18-18: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

[error] 64-64: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 78-78: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 109-109: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 128-128: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 145-145: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 167-167: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 197-197: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 224-224: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 40-40: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 51-51: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 62-62: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 79-79: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 95-95: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 115-115: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/core/config_loader.py

[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 47-47: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 105-105: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 179-179: Mypy: Function is missing a return type annotation. [no-untyped-def]

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 42-42: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 193-193: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/runner.py

[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'ConfigLoader'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'DataValidator'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'EvaluationEngine'. [attr-defined]


[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'OutputHandler'. [attr-defined]


[error] 99-99: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 135-135: Mypy: Call to untyped function 'main' in typed context. [no-untyped-call]

lsc_eval/src/lsc_eval/core/data_validator.py

[error] 14-14: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 56-56: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 74-74: Mypy: Function is missing a return type annotation. [no-untyped-def]

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 20-20: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]


[error] 61-61: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]

lsc_eval/src/lsc_eval/evaluation_engine.py

[error] 83-83: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 118-118: Mypy: Call to untyped function 'DataValidator' in typed context. [no-untyped-call]


[error] 158-158: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 178-178: Mypy: Function is missing a return type annotation. [no-untyped-def]


[error] 186-186: Mypy: Function is missing a return type annotation. [no-untyped-def]

🪛 GitHub Actions: Pyright
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py

[error] 5-5: Import "deepeval.models" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 5-5: Import "deepeval.metrics" could not be resolved (reportMissingImports)


[error] 10-10: Import "deepeval.test_case" could not be resolved (reportMissingImports)


[error] 11-11: Import "deepeval.test_case" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/output/visualization.py

[warning] 10-10: Import "seaborn" could not be resolved from source (reportMissingModuleSource)

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 6-6: Import "litellm" could not be resolved (reportMissingImports)

lsc_eval/runner.py

[error] 20-20: "ConfigLoader" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "DataValidator" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "EvaluationEngine" is unknown import symbol (reportAttributeAccessIssue)


[error] 20-20: "OutputHandler" is unknown import symbol (reportAttributeAccessIssue)


[error] 21-21: Import "lsc_eval.core" could not be resolved (reportMissingImports)


[error] 22-22: Import "lsc_eval.output.utils" could not be resolved (reportMissingImports)

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 5-5: Import "litellm" could not be resolved (reportMissingImports)

🪛 GitHub Actions: Python linter
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py

[error] 5-5: Pylint: E0401: Unable to import 'deepeval.models' (import-error).

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py

[error] 5-5: Pylint: E0401: Unable to import 'deepeval.metrics' (import-error).


[error] 10-10: Pylint: E0401: Unable to import 'deepeval.test_case' (import-error).

lsc_eval/src/lsc_eval/output/visualization.py

[error] 10-10: Pylint: E0401: Unable to import 'seaborn' (import-error).


[warning] 204-213: Pylint: duplicate-code detected between lightspeed_core_evaluation.utils.plot [26:35] and lsc_eval/src/lsc_eval/output/visualization.py [204:213].

lsc_eval/src/lsc_eval/metrics/custom_metrics.py

[error] 6-6: Pylint: E0401: Unable to import 'litellm' (import-error).

lsc_eval/runner.py

[warning] 1-1: Pylint: R0801: Similar lines in 2 files.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

[error] 5-5: Pylint: E0401: Unable to import 'litellm' (import-error).

🔇 Additional comments (32)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)

35-43: Model info helper looks good.

The info map mirrors constructor params and includes sensible defaults.

Makefile (2)

42-44: Type-check scope expansion looks good.

Including lsc_eval/src/ and lsc_eval/runner.py in mypy is appropriate for the new module.


75-80: pyright target expansion is correct; keep optional-import guards aligned.

With the guards added in code, this target should pass even without installing optional deps.

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)

1-7: Format and Lint: Run Black and Re-run Static Analysis

File: lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
Lines: 1–7

Please ensure the file is properly formatted and free of style or type-checking errors:

  • Run Black to normalize formatting:
    black lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
  • Re-run static analysis checks and address any new issues:
    pylint lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
    mypy lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
    pyright lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

Once formatting and lint/type errors are cleared, this can be merged.

lsc_eval/src/lsc_eval/output/__init__.py (1)

1-6: Public surface looks clean and minimal.

Re-exporting OutputHandler and GraphGenerator with all is tidy and conventional.

lsc_eval/src/lsc_eval/metrics/__init__.py (1)

1-7: Clean public API for metrics — LGTM

Concise docstring, clear re-exports, and stable surface for from lsc_eval.metrics import .... Looks good.

lsc_eval/src/lsc_eval/llm_managers/__init__.py (1)

1-7: LLM managers aggregator — LGTM

Good, straightforward re-exports that match usage patterns elsewhere.

lsc_eval/pyproject.toml (2)

35-37: Console script entry point validated

lsc_eval/runner.py defines main() on line 99, so the lsc-eval = "lsc_eval.runner:main" entry point is valid and will install/run successfully.


23-33: Dev dependency versions are valid
All specified dev-tool versions exist on PyPI and none exceed the current latest releases:

  • black >=25.1.0 (latest 25.1.0)
  • mypy >=1.15.0 (latest 1.17.1)
  • ruff >=0.8.0 (latest 0.12.10)
  • pyright >=1.1.401 (latest 1.1.404)
  • pydocstyle >=6.3.0 (latest 6.3.0)
  • pylint >=3.3.2 (latest 3.3.8)
  • pytest >=8.3.2 (latest 8.4.1)
  • pytest-cov >=5.0.0 (latest 6.2.1)

No adjustments needed.

lsc_eval/config/evaluation_data.yaml (1)

1-73: All trailing whitespace removed in lsc_eval/config/evaluation_data.yaml

The sed cleanup has stripped all trailing spaces and a ripgrep check confirms there are none remaining. No further action is needed.

lsc_eval/config/system.yaml (2)

1-145: Trailing spaces removed – no further action needed

All trailing whitespace has been cleaned from lsc_eval/config/system.yaml and a subsequent search found no remaining instances. CI should now pass without yamllint errors.


41-110: It looks like the combined install + inspect script didn’t emit any output, so we can’t confirm whether those classes actually exist. Could you please run just the inspection portion and share what it prints? For example:

python3 - << 'EOF'
import importlib

# Ragas
mod = importlib.import_module('ragas.evaluation.metrics')
print("Ragas exports:",
      [n for n in ('LLMContextPrecisionWithReference',
                   'LLMContextPrecisionWithoutReference')
       if hasattr(mod, n)])

# DeepEval
mod2 = importlib.import_module('deepeval.metrics')
print("DeepEval exports:",
      [n for n in ('ConversationCompletenessMetric',
                   'TurnRelevancyMetric')
       if hasattr(mod2, n)])
EOF

That will confirm definitively whether you can safely reference these metric identifiers against ragas>=0.3.0 and deepeval>=1.3.0.

lsc_eval/runner.py (1)

14-19: Ignore the src/-layout shim suggestion

This repository uses a flat lsc_eval/ package, not a src/ layout, so there is no …/lsc_eval/src directory to insert on sys.path. Rather than hacking sys.path, you should:

  • Install the package in editable/dev mode (e.g. pip install -e .) so that
    python -m lsc_eval.runner and python lsc_eval/runner.py both work out of the box.
  • Or add a proper entrypoint/__main__.py inside lsc_eval/ to support
    python -m lsc_eval without manual path manipulation.

Remove the proposed sys.path shim and adopt one of the above approaches to ensure the runner is discoverable in development.

Likely an incorrect or invalid review comment.

lsc_eval/src/lsc_eval/core/data_validator.py (2)

39-55: LGTM on the validation flow

Clear separation: availability check, then per-metric field requirements. The boolean return plus aggregated error printout is practical.


1-1: ✅ Type-check passed for data_validator.py

  • Installed mypy and its pydantic plugin dependency.
  • Ran mypy lsc_eval/src/lsc_eval/core/data_validator.py and confirmed no type errors.

All checks are green—no further action needed.

lsc_eval/src/lsc_eval/output/utils.py (3)

19-46: Basic stats calculation looks correct and robust to empty input

Guarding empty results and returning 0.0 rates is sensible; integer counts + float rates are fine.


71-116: Detailed stats aggregation is solid; nice touch on stddev guard

Using stdev only when count > 1 avoids StatisticsError. Good structure for consumers.
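The guard described above can be sketched as follows (hypothetical helper name, not the PR's actual function):

```python
from statistics import mean, stdev


def score_stats(scores: list[float]) -> dict[str, float]:
    """Aggregate scores; only compute stdev when there are 2+ samples."""
    if not scores:
        # Empty input: return zeroed stats instead of raising
        return {"count": 0, "mean": 0.0, "stdev": 0.0}
    return {
        "count": len(scores),
        "mean": mean(scores),
        # statistics.stdev raises StatisticsError for fewer than two values,
        # so fall back to 0.0 for a single sample
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }
```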


1-1: Black formatting failure — reformat to pass CI

I attempted to verify formatting but the pdm command wasn’t available in this environment. Please re-format lsc_eval/src/lsc_eval/output/utils.py with Black in your development setup and confirm it passes CI. For example:

  • Install dev dependencies (if not already):
    pdm install --dev
  • Reformat the file:
    pdm run black lsc_eval/src/lsc_eval/output/utils.py
  • Verify no changes are needed:
    pdm run black --check lsc_eval/src/lsc_eval/output/utils.py
lsc_eval/src/lsc_eval/output/output_handler.py (2)

61-77: Graph generation error handling is appropriate

Gracefully catching and logging graph errors keeps the reporting path resilient.


1-1: Please verify formatting and type checks
Since the sandbox couldn’t run pdm, black, or mypy, please re-run your formatter and type checker locally to ensure no regressions were introduced by the recent signature change.

• From your project root, run:

black --check lsc_eval/src/lsc_eval/output/output_handler.py
mypy    lsc_eval/src/lsc_eval/output/output_handler.py

• Address any formatting diffs or type errors before merging.

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (2)

1-1: Action Required: Manually verify Black formatting and mypy checks on ragas_metrics.py

It looks like pdm isn’t available in this environment, so the Black and mypy commands couldn’t run. To ensure there are no lingering formatting or typing errors, please perform the following steps locally:

  • Install Black and mypy (and PDM if you rely on it):
    pip install black mypy            # Install formatting and type-checking tools
    # pip install pdm black mypy     # If you use PDM for dependency management
  • Apply Black in “check” mode:
    black --check lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
  • Re-run mypy for this file to catch any remaining no-untyped-def errors:
    mypy lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

Once you’ve confirmed the file is properly formatted and free of mypy errors, please re-run the CI pipeline to ensure everything passes.


1-248: Verify Ragas Metrics DataFrame Column Names
To prevent run-time key errors, confirm that the columns produced by your locked ragas version match the keys used in these evaluation calls. In most recent versions, the mappings are:

  • ResponseRelevancy → answer_relevancy
  • Faithfulness → faithfulness
  • ContextRelevance → nv_context_relevance
  • LLMContextRecall → context_recall
  • LLMContextPrecisionWithReference → llm_context_precision_with_reference
  • LLMContextPrecisionWithoutReference → llm_context_precision_without_reference

If your pinned version uses different column names, update the result_key arguments (and any downstream logic) accordingly.

lsc_eval/src/lsc_eval/evaluation_engine.py (2)

211-226: Good user-facing progress logging

Clear, concise status lines with emoji are helpful during long runs.


1-1: Ensure Black and mypy checks are installed and passing on the evaluation engine

I attempted to run the formatting and type‐checking pipeline, but neither Black nor mypy was available in this environment. Please install the project’s dev dependencies and verify that this file passes both checks:

  • Install development tools (via PDM or your preferred method):
    pdm install --dev
  • Run Black in check mode:
    pdm run black --check lsc_eval/src/lsc_eval/evaluation_engine.py
  • Run mypy type checks:
    pdm run mypy lsc_eval/src/lsc_eval/evaluation_engine.py

Once installed, confirm there are no remaining formatting or type‐annotation errors in this file.

lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)

52-65: LGTM: provider handler dispatch is clean and extensible

The handler map + small provider-specific validators make this maintainable.


118-123: Please verify LiteLLM Azure model prefix and required environment variables

  • Confirm that the model string format "azure/<deployment_name>" is correct for your target Azure OpenAI API version.
  • Ensure _validate_azure_env() enforces presence of:
    • AZURE_OPENAI_API_KEY
    • AZURE_OPENAI_ENDPOINT
    • AZURE_OPENAI_API_VERSION
  • If LiteLLM requires passing additional parameters (e.g. api_base, api_version) in the client setup, update this handler to include them.
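A minimal free-function sketch of the env checks the bullets describe (the PR's `_validate_azure_env()` is a method and may differ in name and behavior):

```python
import os

REQUIRED_AZURE_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
)


def validate_azure_env() -> None:
    """Raise with a message listing every missing Azure OpenAI variable."""
    missing = [name for name in REQUIRED_AZURE_VARS if not os.environ.get(name)]
    if missing:
        raise ValueError(
            f"Missing Azure OpenAI environment variables: {', '.join(missing)}"
        )
```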
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1)

238-243: Factory method usage is clean and composable

Nice decoupling via LLMManager.from_system_config; aligns with other metrics modules.

lsc_eval/src/lsc_eval/core/config_loader.py (2)

185-241: load_system_config typing is good; logging setup first is a plus

The flow—logging first, then metric registry population, then model construction—is clear.


242-256: get_llm_config_dict: add return type and keep dict shape aligned with LLMManager.from_system_config

Mypy flagged missing return type; adding it keeps types consistent. The dict shape matches LLMManager expectations.

-def get_llm_config_dict(self) -> Dict[str, Any]:
+def get_llm_config_dict(self) -> Dict[str, Any]:
     """Get system config as dictionary for LLMManager."""

Note: Signature already matches; ensure mypy config picks it up after earlier fixes.

lsc_eval/src/lsc_eval/core/models.py (3)

8-16: LGTM: TurnData structure and validators are solid

Good constraints on non-empty query/response and positive turn_id; contexts schema check is helpful.


80-88: Metric identifier format validation is clear

Enforcing "framework:metric_name" early prevents downstream confusion.
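A fail-fast parse of that identifier format can be sketched like this (illustrative helper, not the PR's validator):

```python
def parse_metric_identifier(identifier: str) -> tuple[str, str]:
    """Split 'framework:metric_name', rejecting malformed identifiers early."""
    framework, sep, metric_name = identifier.partition(":")
    if not sep or not framework or not metric_name:
        raise ValueError(
            f"Expected 'framework:metric_name', got {identifier!r}"
        )
    return framework, metric_name
```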


148-155: Score bounds are enforced here; ensure upstream parsers normalize

CustomMetrics now clamps scores; this validator will still catch anomalies if any slip through. Good layering.
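The layered check reads roughly like this (plain-dataclass sketch with assumed field names; the PR's models use pydantic validators):

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Illustrative shape only; field names are assumptions."""

    metric: str
    score: float

    def __post_init__(self) -> None:
        # Catch anomalies that slipped past upstream clamping/normalization
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be within [0, 1], got {self.score}")
```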

Comment on lines 6 to 12
Usage:
python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
Contributor

🛠️ Refactor suggestion

Fix usage guidance for src/ layout and console entry point.

python runner.py will not find the package in a src/ layout. Recommend using python -m lsc_eval.runner or the console script.

Apply:

-Usage:
-    python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+Usage:
+    # After installing (or in editable mode):
+    lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+
+    # Or directly from the repo (src/ layout):
+    python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
@@
-Or programmatically:
-    from runner import run_evaluation
-    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
+Or programmatically:
+    from lsc_eval.runner import run_evaluation
+    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Usage:
python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
"""
Usage:
# After installing (or in editable mode):
lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
# Or directly from the repo (src/ layout):
python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
from lsc_eval.runner import run_evaluation
results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 6 to 12, the usage comment incorrectly
suggests running "python runner.py" which fails in a src/ layout; update the
guidance to recommend using "python -m lsc_eval.runner" (or the provided console
script name if one exists) and adjust the programmatic example to "from
lsc_eval.runner import run_evaluation" to reflect the package import path so
users can run the module or import it correctly.

Comment on lines +93 to +97
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None

Contributor

@coderabbitai coderabbitai bot Aug 26, 2025

🛠️ Refactor suggestion

Catch ImportError to surface missing optional deps cleanly.

Broaden the exception tuple so missing libs (e.g., seaborn/deepeval) don’t crash with a traceback but return a non-zero exit code.

Apply:

-    except (FileNotFoundError, ValueError, RuntimeError) as e:
+    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
📝 Committable suggestion


Suggested change
except (FileNotFoundError, ValueError, RuntimeError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None
except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
print(f"\n❌ Evaluation failed: {e}")
traceback.print_exc()
return None
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 93-97, the except block currently catches
FileNotFoundError, ValueError and RuntimeError but not ImportError; update the
exception tuple to include ImportError so missing optional dependencies are
handled cleanly, and change the error path to return a non-zero exit signal (for
example return 1 or call sys.exit(1)) instead of returning None — if you use
sys.exit, import sys at the top.

Collaborator Author

Most of the issues will get resolved when we make this the primary eval tool and remove the existing one.

Contributor

@asamal4, understood! That makes sense - consolidating to a single evaluation tool will indeed clean up many of these edge cases naturally. Thanks for the context about the broader architectural plan to make this the primary tool.


✏️ Learnings added
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.603Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +124 to +151
_, ax = plt.subplots(figsize=(12, 8))

# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)

# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)

# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Pass Rate (%)", fontsize=12)
ax.set_ylabel("Metrics", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels

# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")

plt.tight_layout()
Contributor

⚠️ Potential issue

Pass rates chart: wrong axis labels and inconsistent figure sizing.

  • Using hardcoded figsize=(12,8) ignores configured size/DPI.
  • X/Y labels are flipped: x-axis is metrics, y-axis is pass rate.

Apply:

-        _, ax = plt.subplots(figsize=(12, 8))
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
📝 Committable suggestion


Suggested change
_, ax = plt.subplots(figsize=(12, 8))
# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)
# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)
# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Pass Rate (%)", fontsize=12)
ax.set_ylabel("Metrics", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels
# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
_, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)
# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
height = chart_bar.get_height()
ax.text(
chart_bar.get_x() + chart_bar.get_width() / 2.0,
height + 1,
f"{height:.1f}%\n({breakdown})",
ha="center",
va="bottom",
fontsize=9,
)
# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Metrics", fontsize=12)
ax.set_ylabel("Pass Rate (%)", fontsize=12)
ax.set_ylim(0, 105) # Give space for labels
# Rotate x-axis labels if needed
if len(metrics) > 5:
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 124-151, the
figure creation and axis labels are incorrect and ignore configured sizing:
replace the hardcoded plt.subplots(figsize=(12, 8)) with
plt.subplots(figsize=fig_size, dpi=dpi) (or the appropriate passed/configured
fig_size and dpi variables), and swap the axis labels so the x-axis reads
"Metrics" and the y-axis reads "Pass Rate (%)"; keep the bar chart as
ax.bar(metrics, pass_rates), maintain ax.set_ylim(0, 105) for label space, and
preserve the xtick rotation logic for long metric lists.

Comment on lines +371 to +387
# Create heatmap
_, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)

sns.heatmap(
df,
annot=True,
fmt=".1f",
cmap="RdYlGn",
vmin=0,
vmax=100,
cbar_kws={"label": "Pass Rate (%)"},
ax=ax,
mask=df.isna(), # Mask NaN values (missing data)
linewidths=0.5,
linecolor="lightgray",
)

Contributor

🛠️ Refactor suggestion

Skip heatmap gracefully when seaborn is unavailable.

Avoid runtime errors when seaborn isn’t installed.

Apply:

-        sns.heatmap(
+        if not HAS_SEABORN:
+            self.logger.warning("seaborn not available; skipping heatmap")
+            return None
+        sns.heatmap(
             df,
📝 Committable suggestion


Suggested change
-        # Create heatmap
-        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
-
-        sns.heatmap(
-            df,
-            annot=True,
-            fmt=".1f",
-            cmap="RdYlGn",
-            vmin=0,
-            vmax=100,
-            cbar_kws={"label": "Pass Rate (%)"},
-            ax=ax,
-            mask=df.isna(),  # Mask NaN values (missing data)
-            linewidths=0.5,
-            linecolor="lightgray",
-        )
+        if not HAS_SEABORN:
+            self.logger.warning("seaborn not available; skipping heatmap")
+            return None
+
+        # Create heatmap (guard runs first so no figure is created when seaborn is missing)
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
+
+        sns.heatmap(
+            df,
+            annot=True,
+            fmt=".1f",
+            cmap="RdYlGn",
+            vmin=0,
+            vmax=100,
+            cbar_kws={"label": "Pass Rate (%)"},
+            ax=ax,
+            mask=df.isna(),  # Mask NaN values (missing data)
+            linewidths=0.5,
+            linecolor="lightgray",
+        )
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 371 to 387, the
heatmap block currently assumes seaborn is available; update it to handle
seaborn absence by first attempting "import seaborn as sns" (or checking an
existing import) in a try/except ImportError, and if seaborn is missing log or
warn that seaborn is unavailable and either (a) fall back to a matplotlib-based
heatmap using imshow + text annotations to replicate annot=True, cmap, vmin/vmax
and mask handling, or (b) gracefully skip drawing the heatmap and
return/continue without raising; ensure the fallback respects figsize/dpi and
the cbar label and does not raise runtime errors when seaborn is not installed.
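The try/except guard the prompt describes generalizes beyond seaborn. A minimal sketch of such an optional-import helper (a hypothetical utility, not part of this PR) looks like:

```python
import importlib
import importlib.util
import warnings


def optional_import(name: str):
    """Return the imported module, or None (with a warning) when it is unavailable."""
    if importlib.util.find_spec(name) is None:
        warnings.warn(f"{name} not available; dependent feature will be skipped")
        return None
    return importlib.import_module(name)


# The visualization code could then branch instead of raising ImportError:
sns = optional_import("seaborn")
if sns is None:
    pass  # fall back to a matplotlib imshow-based heatmap, or skip drawing entirely
```

Either fallback path then respects the configured figsize/dpi and colorbar label without a hard seaborn dependency.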

Comment on lines +389 to +397
ax.set_title(
"Conversation Performance Heatmap\n(Only showing metrics with data)",
fontsize=14,
fontweight="bold",
pad=20,
)
ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")


🛠️ Refactor suggestion

Heatmap axis labels are misleading.

X-axis is metrics; colorbar already conveys pass rate.

Apply:

-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
-        ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
+        ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
📝 Committable suggestion


Suggested change
         ax.set_title(
             "Conversation Performance Heatmap\n(Only showing metrics with data)",
             fontsize=14,
             fontweight="bold",
             pad=20,
         )
-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
         ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 389 to 397, the
x-axis label is incorrect (it currently reads "Pass Rate (%)" while the x-axis
actually lists metric names and the colorbar conveys pass rate); change the
ax.set_xlabel call to something like "Metrics" (or remove the x-axis label) and
ensure the colorbar is explicitly labeled "Pass Rate (%)" so axis and legend
accurately reflect what is being plotted.

Comment on lines 49 to 53
 verify: install-deps-test ## Verify the code using various linters
 	pdm run black . --check
 	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
-	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests
+	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py


💡 Verification agent

🧩 Analysis chain

Pylint fails due to optional metric deps; consider guard or install in CI.

Current CI fails on deepeval import. You can either:

  • keep the code guards (recommended; see code diffs) and
  • also install optional deps in CI when running full linters.

Optionally update CI or Makefile to install extras for metrics:


🏁 Script executed:

#!/bin/bash
# Quick check if optional deps are present; if not, install for CI lint pass.
python - <<'PY'
import importlib, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 182


🏁 Script executed:

#!/bin/bash
# Re-run the missing dependencies check with Python 3
python3 - << 'PY'
import importlib.util, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 44


🏁 Script executed:

#!/bin/bash
set -e
# Search for deepeval imports and usage with context
rg -n 'import deepeval' -C3 .
rg -n 'deepeval' -C3 .
# Search for litellm imports and usage with context
rg -n 'import litellm' -C3 .
rg -n 'litellm' -C3 .

Length of output: 40


🏁 Script executed:

#!/bin/bash
set -e
# Inspect optional dependencies/extras in pyproject.toml
rg -nP '^\[tool\.pdm\.extras\]' -C5 pyproject.toml
rg -nP '^metrics\s*=' -C5 pyproject.toml

Length of output: 61


🏁 Script executed:

#!/bin/bash
# Search for deepeval and litellm references without exiting on first miss
rg -n 'import deepeval' -C3 . || true
rg -n 'deepeval' -C3 . || true
rg -n 'import litellm' -C3 . || true
rg -n 'litellm' -C3 . || true

Length of output: 35588


🏁 Script executed:

#!/bin/bash
# Show the verify target and context in Makefile
rg -n 'verify:' -C3 Makefile

echo '--- First 50 lines of Makefile ---'
head -n 50 Makefile

Length of output: 3073


Review comment

File: Makefile
Lines: 49–53

Snippet showing the final state of code at these lines

verify:	install-deps-test ## Verify the code using various linters
	pdm run black . --check
	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py

Comment

Optional metric dependencies (deepeval, litellm) are not installed in CI, causing Pylint import errors.

These packages are imported unconditionally in your metrics and LLM-manager modules, so the CI lint pass (which runs pdm sync --dev) fails when Pylint encounters missing imports. You can address this in one of two ways:

  • Wrap the imports in try/except ImportError guards (as shown in your code diffs).
  • Install the optional dependencies in CI before running the linters.

To install them via the Makefile, update the install-deps-test (or the verify) target:

 install-deps-test: install-tools pdm-lock-check ## Install all required dev dependencies needed to test the service
 	@for a in 1 2 3 4 5; do pdm sync --dev && break || sleep 15; done
+	# Ensure metric packages are available for full linting
+	pip install deepeval litellm

Alternatively, if you prefer using PDM extras, first define a metrics extra in your pyproject.toml:

[tool.pdm.extras]
metrics = ["deepeval>=1.3.0", "litellm>=1.0.0"]

and then install it in the Makefile:

 install-deps-test: install-tools pdm-lock-check
 	pdm sync --dev
+	pdm install --group metrics

Please apply one of these changes so that Pylint can successfully import the metric modules in CI.

🤖 Prompt for AI Agents
Makefile lines 49–53: CI Pylint fails because optional metric deps (deepeval,
litellm) aren’t installed; update the dependency install step so linters can
import metric modules by adding installation of the metrics extra (or the
specific packages) before running linters — e.g. in the install-deps-test target
run pdm sync --dev then pdm install --group metrics (or pdm install deepeval
litellm) so the verify target’s pdm run pylint step no longer errors on missing
imports.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

]


def setup_environment_variables(config_path: str):

🛠️ Refactor suggestion

Add return type annotation: setup_environment_variables -> None

Mypy flagged missing return type. Also prefer warnings over prints and avoid setting env vars to the literal string "None".

-def setup_environment_variables(config_path: str):
+from warnings import warn
+
+def setup_environment_variables(config_path: str) -> None:
@@
-        for env_var, value in env_config.items():
-            os.environ[env_var] = str(value)
+        for env_var, value in env_config.items():
+            if value is None:
+                continue
+            os.environ[env_var] = str(value)
@@
-        print(f"Warning: Could not load environment config from {config_path}: {e}")
-        print("Using fallback environment settings...")
+        warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")
📝 Committable suggestion


Suggested change
-def setup_environment_variables(config_path: str):
+from warnings import warn
+
+def setup_environment_variables(config_path: str) -> None:
+    try:
+        # … load env_config from config_path …
+        for env_var, value in env_config.items():
+            if value is None:
+                continue
+            os.environ[env_var] = str(value)
+    except Exception as e:
+        warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")
🧰 Tools
🪛 GitHub Actions: Type checks

[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]

🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/core/config_loader.py around line 27, add an explicit
return type annotation to the function signature (def
setup_environment_variables(config_path: str) -> None:), replace any print(...)
calls with warnings.warn(...) imports from the warnings module, and when setting
environment variables avoid assigning the literal string "None" by checking for
None/empty values and either skipping os.environ assignment or deleting the key
instead of setting it to "None"; ensure the function does not return a value.
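The None-skipping logic from the prompt can be exercised in isolation. In this sketch the function takes the already-loaded mapping rather than a config path, purely so the example stays self-contained (the YAML loading step is elided):

```python
import os


def setup_environment_variables(env_config: dict) -> None:
    """Export config values as environment variables, skipping None values."""
    for env_var, value in env_config.items():
        if value is None:
            continue  # never export the literal string "None"
        os.environ[env_var] = str(value)


# Non-string values are stringified; None-valued keys are left untouched.
setup_environment_variables(
    {"EVAL_MODEL": "gpt-4o", "EVAL_TIMEOUT": 30, "UNSET_KEY": None}
)
```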

Comment on lines +67 to +75
try:
response = litellm.completion(
model=self.model_name,
messages=messages,
temperature=self.litellm_params.get("temperature", 0.0),
max_tokens=self.litellm_params.get("max_tokens"),
timeout=self.litellm_params.get("timeout"),
num_retries=self.litellm_params.get("num_retries", 3),
)

🛠️ Refactor suggestion

Don’t pass None-valued args to litellm; build kwargs dynamically

Passing None for max_tokens/timeout can cause API errors on some providers. Build kwargs conditionally.

-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            # Use request_timeout if your LiteLLM version expects that name
+            if (to := self.litellm_params.get("timeout")) is not None:
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)
📝 Committable suggestion


Suggested change
         try:
-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            # Use request_timeout if your LiteLLM version expects that name
+            if (to := self.litellm_params.get("timeout")) is not None:
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 67 to 75, the
call to litellm.completion passes max_tokens and timeout (and potentially other
params) even when their values are None which can cause API errors; change to
build a kwargs dict (e.g., kwargs = {"model": self.model_name, "messages":
messages, "temperature": self.litellm_params.get("temperature", 0.0),
"num_retries": self.litellm_params.get("num_retries", 3)}) and then
conditionally add max_tokens and timeout only if their values are not None (and
omit any other None-valued entries), then call response =
litellm.completion(**kwargs).
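The conditional-kwargs pattern can be tested without calling LiteLLM at all. A sketch of the builder as a standalone function (the completion call itself is omitted; parameter names follow the litellm_params config used above):

```python
from typing import Any, Dict, List


def build_completion_kwargs(
    model_name: str, messages: List[Dict[str, str]], litellm_params: Dict[str, Any]
) -> Dict[str, Any]:
    """Assemble LiteLLM kwargs, dropping any optional parameter whose value is None."""
    kwargs: Dict[str, Any] = {
        "model": model_name,
        "messages": messages,
        "temperature": litellm_params.get("temperature", 0.0),
        "num_retries": litellm_params.get("num_retries", 3),
    }
    if (mt := litellm_params.get("max_tokens")) is not None:
        kwargs["max_tokens"] = mt
    if (to := litellm_params.get("timeout")) is not None:
        kwargs["timeout"] = to
    return kwargs
```

The call site then becomes `litellm.completion(**build_completion_kwargs(...))`, and providers never see explicit None values.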

Comment on lines +118 to +124
if score is not None and score > 1.0:
if score <= 10.0: # Assume 0-10 scale
score = score / 10.0
elif score <= 100.0: # Assume 0-100 scale
score = score / 100.0

return score, reason

🛠️ Refactor suggestion

Normalize and clamp scores to [0,1]; support percentages

Values like “85%” or spurious integers can exceed 1. Clamp and parse percent format to avoid downstream validation errors.

-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        if score is None:
+            # Additional: handle percentages like "85%"
+            pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+        # Final normalization and clamp
+        if score is not None:
+            if score > 1.0 and score <= 10.0:
+                score = score / 10.0
+            elif score > 1.0 and score <= 100.0:
+                score = score / 100.0
+            score = max(0.0, min(1.0, score))
📝 Committable suggestion


Suggested change
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        if score is None:
+            # Additional: handle percentages like "85%"
+            pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+        # Final normalization and clamp
+        if score is not None:
+            if score > 1.0 and score <= 10.0:
+                score = score / 10.0
+            elif score > 1.0 and score <= 100.0:
+                score = score / 100.0
+            score = max(0.0, min(1.0, score))

         return score, reason
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 118 to 124, the
normalization only handles numeric scales >1 but doesn't parse percentage
strings or clamp values into [0,1]; update the logic to: detect and strip a
trailing '%' then parse as a float and divide by 100, otherwise if numeric and
>1 apply the existing 0-10/0-100 heuristics, then clamp the final score into the
[0.0, 1.0] range (also handle negative values by clamping to 0.0); keep the
existing reason value and return the normalized, clamped score and reason.
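Pulled out of the class, the whole parse-normalize-clamp pipeline might look like this standalone helper (a sketch under the same scale heuristics, not the module's actual implementation):

```python
import re
from typing import Optional


def normalize_score(score_text: str) -> Optional[float]:
    """Parse a raw score string ("0.8", "85%", "8.5/10") into a clamped [0, 1] float."""
    score_text = score_text.strip()
    pct = re.match(r"(\d+(?:\.\d+)?)\s*%", score_text)
    frac = re.match(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)", score_text)
    if pct:
        score = float(pct.group(1)) / 100.0
    elif frac and float(frac.group(2)) > 0:
        score = float(frac.group(1)) / float(frac.group(2))
    else:
        try:
            score = float(score_text)
        except ValueError:
            return None  # unparseable, e.g. "n/a"
        if score > 1.0:  # heuristics for 0-10 and 0-100 scales
            score = score / 10.0 if score <= 10.0 else score / 100.0
    # Clamp so downstream validation never sees out-of-range values
    return max(0.0, min(1.0, score))
```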

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (6)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)

6-6: Fix CI/type-check failures and avoid None-args: lazy-import LiteLLM and build kwargs dynamically

Module-level import of litellm breaks CI/type checks when the dep isn’t installed, and passing None for provider params can cause API errors. Lazy-import inside the call path and construct kwargs conditionally.

Apply:

@@
- import litellm
@@
     def _call_llm(self, prompt: str, system_prompt: Optional[str] = None) -> str:
         """Make a LiteLLM call with the configured parameters."""
         # Prepare messages
         messages = []
         if system_prompt:
             messages.append({"role": "system", "content": system_prompt})
         messages.append({"role": "user", "content": prompt})
 
+        # Lazy-import to avoid import errors in environments without litellm
+        try:
+            import litellm  # type: ignore
+        except Exception as exc:
+            raise RuntimeError(
+                "litellm is required for CustomMetrics. Install with `pip install litellm`."
+            ) from exc
+
         try:
-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            if (to := self.litellm_params.get("timeout")) is not None:
+                # If your LiteLLM version expects 'request_timeout', switch the key here.
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)

Also applies to: 59-76


85-124: Harden score parsing: support “85%”, “Score: 8.5/10”, and clamp to [0,1]

Current logic fails on percent formats, may misinterpret scales, and doesn’t clamp. Normalize robustly and clamp the final score.

@@
-        # Try to find explicit score/reason format
+        # Try to find explicit score/reason format
         for line in lines:
             line = line.strip()
             if line.lower().startswith("score:"):
                 try:
-                    score_text = line.split(":", 1)[1].strip()
-                    score = float(score_text)
+                    score_text = line.split(":", 1)[1].strip()
+                    # Percent e.g., "85%"
+                    pct = re.match(r"(\d+(?:\.\d+)?)\s*%", score_text)
+                    if pct:
+                        score = float(pct.group(1)) / 100.0
+                    # Fraction or "out of" inside Score line, e.g., "8.5/10", "4 out of 5"
+                    elif "/" in score_text or re.search(r"\bout of\b", score_text, re.IGNORECASE):
+                        score = self._extract_score_from_text(score_text)
+                    else:
+                        score = float(score_text)
                 except (ValueError, IndexError):
                     pass
             elif line.lower().startswith("reason:"):
                 try:
                     reason = line.split(":", 1)[1].strip()
                 except IndexError:
                     pass
 
-        # If no explicit score found, try to extract from text
+        # If no explicit score found, try to extract from text (percent first, then numeric)
         if score is None:
-            score = self._extract_score_from_text(response)
+            pct = re.search(r"(\d+(?:\.\d+)?)\s*%", response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+            else:
+                score = self._extract_score_from_text(response)
 
-        # Normalize score to 0-1 range if needed
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        # Normalize and clamp into [0,1]
+        if score is not None:
+            if score > 1.0:
+                if score <= 10.0:  # Assume 0-10 scale
+                    score = score / 10.0
+                elif score <= 100.0:  # Assume 0-100 scale
+                    score = score / 100.0
+            score = max(0.0, min(1.0, score))
lsc_eval/runner.py (2)

6-12: Fix usage examples for src/ layout and import path.

"python -m runner" and "from runner import ..." won’t work in a src/ layout. Use the fully qualified package path.

-Usage:
-    python -m runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+Usage:
+    # Run as a module (src/ layout friendly):
+    python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
@@
-Or programmatically:
-    from runner import run_evaluation
-    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
+Or programmatically:
+    from lsc_eval.runner import run_evaluation
+    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")

93-97: Handle missing optional dependencies cleanly (catch ImportError).

Broaden the except tuple so missing libs (e.g., matplotlib/seaborn) don’t crash with an unhandled traceback.

-    except (FileNotFoundError, ValueError, RuntimeError) as e:
+    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
         print(f"\n❌ Evaluation failed: {e}")
         traceback.print_exc()
         return None
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)

5-5: Guard optional LiteLLM import and fail fast if unavailable.

Unconditional import litellm breaks CI when the package isn’t installed, and runtime should surface a clear error before calling into LiteLLM. This was previously flagged; please implement the guard and runtime check.

Apply this diff:

-from typing import Any, Dict, List, Optional
-
-import litellm
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+import logging
+import asyncio
+try:
+    import litellm  # pyright: ignore[reportMissingImports]
+    _LITELLM_AVAILABLE = True
+except Exception:  # pragma: no cover
+    litellm = None  # type: ignore[assignment]
+    _LITELLM_AVAILABLE = False
+logger = logging.getLogger(__name__)

And fail fast before the LiteLLM call:

-        try:
+        if not _LITELLM_AVAILABLE:
+            raise ImportError(
+                "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+            )
+        try:
             response = litellm.completion(

Also applies to: 34-35, 61-61


25-26: Honor the stop parameter in LiteLLM calls.

The function accepts stop but doesn’t forward it to LiteLLM. Please pass it through. This was noted earlier and is still unresolved.

Apply this diff:

             response = litellm.completion(
                 model=self.model_name,
                 messages=[{"role": "user", "content": prompt_text}],
                 n=n,
                 temperature=temp,
                 max_tokens=self.litellm_params.get("max_tokens"),
                 timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries"),
+                num_retries=self.litellm_params.get("num_retries"),
+                stop=stop,
             )

Also applies to: 34-43

🧹 Nitpick comments (12)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (5)

3-4: Use logging instead of print for library initialization status

Printing from libraries is noisy; prefer the standard logging pipeline.

@@
-import re
+import re
+import logging
@@
-
+logger = logging.getLogger(__name__)
@@
-        print(f"✅ Custom Metrics initialized: {self.model_name}")
+        logger.info("Custom Metrics initialized: %s", self.model_name)

Also applies to: 13-13, 40-40


4-4: Tighten contexts typing to match TurnData and help mypy/pydantic tooling

TurnData.contexts is List[Dict[str, str]]. Align the param model to reduce ambiguity.

-from typing import Any, Dict, Optional, Tuple
+from typing import Any, Dict, Optional, Tuple, List
@@
-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[List[Dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )

Also applies to: 21-21


180-189: Constrain output format to improve parse reliability

Tighten instructions so the model emits exactly two lines, reducing parser ambiguity.

         prompt_parts.extend(
             [
                 "",
-                f"Rate the {params.metric_name} and provide your reasoning.",
+                f"Rate the {params.metric_name} and provide your reasoning.",
                 "",
-                "Format your response as:",
-                f"Score: [your score on {params.scale}]",
-                "Reason: [your detailed explanation]",
+                "Output exactly two lines in this exact format (no extra text before or after):",
+                f"Score: <number on {params.scale}>",
+                "Reason: <brief explanation without additional numeric scores>",
             ]
         )

229-231: Nudge the model with a minimal system prompt to enforce structure

A short system prompt often cuts variance and helps parsing.

-        llm_response = self._call_llm(prompt)
+        llm_response = self._call_llm(
+            prompt,
+            system_prompt=(
+                "You are a strict evaluator. Respond in exactly two lines: "
+                "'Score: <number>' on the first line and 'Reason: <text>' on the second."
+            ),
+        )

158-191: Optional: consider adding guardrails for very large context lists

If contexts can be long, consider truncating or summarizing to keep prompts within provider limits/cost targets.

Would you like a helper that truncates contexts by token budget (e.g., tiktoken/litellm token counting) and appends “+N more items omitted”?
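One possible shape for such a helper, using a plain character budget as a stand-in for real token counting (tiktoken or litellm token counting would slot in where noted; names here are illustrative):

```python
from typing import Dict, List, Tuple


def truncate_contexts(
    contexts: List[Dict[str, str]], budget_chars: int
) -> Tuple[List[Dict[str, str]], int]:
    """Keep contexts until the budget is spent; report how many were dropped."""
    kept: List[Dict[str, str]] = []
    used = 0
    for ctx in contexts:
        # Swap len() for a tokenizer count (e.g. tiktoken) for a true token budget.
        cost = sum(len(v) for v in ctx.values())
        if used + cost > budget_chars:
            break
        kept.append(ctx)
        used += cost
    return kept, len(contexts) - len(kept)


kept, omitted = truncate_contexts(
    [{"content": "a" * 40}, {"content": "b" * 40}, {"content": "c" * 40}],
    budget_chars=90,
)
# A prompt builder could then append f"+{omitted} more items omitted" when omitted > 0.
```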

lsc_eval/runner.py (3)

48-50: Avoid double validation of evaluation data.

DataValidator.load_evaluation_data() validates, and engine.run_evaluation() validates again. Consider trusting the already-validated input or adding a skip_validation flag in the engine to avoid redundant work on large datasets.

Also applies to: 58-61


114-131: Resolve CLI paths to absolute early for predictable behavior.

Aligns with the prior learning to prefer absolute paths internally; improves logs and downstream file handling consistency.

     args = parser.parse_args()
 
-    # CRITICAL: Setup environment variables from system config FIRST
-    setup_environment_variables(args.system_config)
+    # Normalize to absolute paths for consistency
+    system_config_path = str(Path(args.system_config).resolve())
+    eval_data_path = str(Path(args.eval_data).resolve())
+
+    # CRITICAL: Setup environment variables from system config FIRST
+    setup_environment_variables(system_config_path)
 
     # Validate input files exist
-    if not Path(args.system_config).exists():
-        print(f"❌ System config file not found: {args.system_config}")
+    if not Path(system_config_path).exists():
+        print(f"❌ System config file not found: {system_config_path}")
         return 1
 
-    if not Path(args.eval_data).exists():
-        print(f"❌ Evaluation data file not found: {args.eval_data}")
+    if not Path(eval_data_path).exists():
+        print(f"❌ Evaluation data file not found: {eval_data_path}")
         return 1
 
     # Run evaluation
-    summary = run_evaluation(args.system_config, args.eval_data, args.output_dir)
+    summary = run_evaluation(system_config_path, eval_data_path, args.output_dir)

134-136: Add minimal tests for CLI and programmatic runner.

Even if broader tests land later, add a smoke test for:

  • non-existent files → exit code 1 and user-friendly message
  • happy path with stubbed engine/validator → exit code 0 and dict summary shape

I can scaffold pytest tests that monkeypatch ConfigLoader/DataValidator/EvaluationEngine to avoid I/O and assert exit codes/output. Want me to draft them?

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (4)

18-18: Replace print statements with logging.

Library code shouldn’t print; use the module logger to respect caller logging config and avoid noisy stdout.

Apply this diff:

-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        logger.info("Ragas Custom LLM initialized: %s", self.model_name)
@@
-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.exception("Ragas LLM failed")
+            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
@@
-        print("✅ Ragas LLM Manager configured")
+        logger.info("Ragas LLM Manager configured")

Also applies to: 57-59, 95-95


101-106: Expose full model info for consistency with other managers.

Align with deepeval_llm.get_model_info by surfacing max_tokens, timeout, and num_retries.

Apply this diff:

     def get_model_info(self) -> Dict[str, Any]:
         """Get information about the configured model."""
         return {
             "model_name": self.model_name,
-            "temperature": self.litellm_params.get("temperature", 0.0),
+            "temperature": self.litellm_params.get("temperature", 0.0),
+            "max_tokens": self.litellm_params.get("max_tokens"),
+            "timeout": self.litellm_params.get("timeout"),
+            "num_retries": self.litellm_params.get("num_retries", 3),
         }

91-94: Global side effects on metric singletons.

Rebinding answer_relevancy.llm and faithfulness.llm mutates module-level state and may surprise other tests/components if multiple managers are instantiated. If feasible, pass the LLM per-evaluation, or document that this manager globally configures the Ragas metrics.


20-27: Unify temperature handling across sync/async (optional)

I ran the ripgrep check and confirmed that the only occurrences of the 1e-08 sentinel live in lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py; no other modules or call sites depend on that exact value. If you’d like to eliminate this “magic number” and centralize the default-temperature logic, you can:

  • In generate_text(...) (lines 20–27):

    • Change the signature from
      def generate_text(
          …,
          temperature: float = 1e-08,
          …
      ) -> LLMResult:
      to
      def generate_text(
          …,
          temperature: Optional[float] = None,
          …
      ) -> LLMResult:
    • Update the temp-fallback from
      temp = temperature if temperature != 1e-08 else self.litellm_params.get("temperature", 0.0)
      to
      temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)
  • In generate_text_async(...) (lines 68–72):

    • Remove the explicit fallback to 1e-08:
      - temp = temperature if temperature is not None else 1e-08
      - return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
      + return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)

This optional refactor eliminates the numeric sentinel and lets the synchronous path alone handle defaulting to litellm_params["temperature"].
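A minimal sketch of the resulting default-temperature logic, assuming litellm_params carries the configured default (names mirror the snippets above but the function is illustrative):

```python
from typing import Optional

# Illustrative config; in the real manager this comes from system config.
litellm_params = {"temperature": 0.2}

def resolve_temperature(temperature: Optional[float] = None) -> float:
    # None (not a numeric sentinel like 1e-08) signals "use the configured default".
    return temperature if temperature is not None else litellm_params.get("temperature", 0.0)

assert resolve_temperature() == 0.2     # falls back to configured default
assert resolve_temperature(0.7) == 0.7  # explicit value wins
assert resolve_temperature(0.0) == 0.0  # 0.0 is a valid explicit value, not "unset"
```

The last assertion is the point of the refactor: with a numeric sentinel, a caller passing a value numerically close to it would be silently overridden, whereas `None` is unambiguous.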

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eac9f72 and e5fcdee.

📒 Files selected for processing (12)
  • lsc_eval/pyproject.toml (1 hunks)
  • lsc_eval/runner.py (1 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
  • lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
  • lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
  • lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
  • lsc_eval/src/lsc_eval/core/config_loader.py
  • lsc_eval/src/lsc_eval/__init__.py
  • lsc_eval/src/lsc_eval/core/data_validator.py
  • lsc_eval/pyproject.toml
  • lsc_eval/src/lsc_eval/evaluation_engine.py
  • lsc_eval/src/lsc_eval/output/visualization.py
  • lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
  • lsc_eval/src/lsc_eval/output/output_handler.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • lsc_eval/runner.py
🧬 Code graph analysis (3)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (4)
lsc_eval/src/lsc_eval/core/models.py (1)
  • TurnData (8-44)
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  • LLMManager (36-167)
  • get_model_name (144-146)
  • get_litellm_params (148-156)
  • from_system_config (163-167)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • EvaluationScope (11-16)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (2)
  • evaluate (62-77)
  • from_system_config (135-138)
lsc_eval/runner.py (5)
lsc_eval/src/lsc_eval/core/config_loader.py (3)
  • ConfigLoader (179-259)
  • setup_environment_variables (30-47)
  • load_system_config (188-243)
lsc_eval/src/lsc_eval/core/data_validator.py (2)
  • DataValidator (11-82)
  • load_evaluation_data (19-37)
lsc_eval/src/lsc_eval/evaluation_engine.py (2)
  • EvaluationEngine (106-296)
  • run_evaluation (134-158)
lsc_eval/src/lsc_eval/output/output_handler.py (2)
  • OutputHandler (15-224)
  • generate_reports (32-76)
lsc_eval/src/lsc_eval/output/utils.py (1)
  • calculate_basic_stats (19-45)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  • get_llm (31-33)
  • get_model_info (35-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: black
  • GitHub Check: ruff
  • GitHub Check: Pyright
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: Pylinter
  • GitHub Check: pydocstyle
🔇 Additional comments (4)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)

42-48: LGTM: public evaluate signature is typed and consistent

Good call making conv_data typed as Any and returning Tuple[Optional[float], str]. This aligns with deepeval_metrics.evaluate and unblocks mypy.


193-200: Minor: parameters prefixed with “_” are correctly typed and unused

The unused _conv_data and optional _turn_idx are properly annotated; no action needed.

lsc_eval/runner.py (2)

99-101: Good: main() has an explicit return type.

This satisfies type checkers and matches the sys.exit(main()) pattern.
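For reference, a minimal sketch of the sys.exit(main()) pattern the comment refers to (the body is illustrative, not the runner's actual logic):

```python
import sys

def main() -> int:
    """Return an exit status instead of calling sys.exit() internally."""
    try:
        # ... run the evaluation pipeline (omitted in this sketch) ...
        return 0
    except Exception:  # broad catch only at the process boundary
        return 1

if __name__ == "__main__":
    sys.exit(main())
```

Returning an int keeps `main()` directly testable while `sys.exit(main())` still yields the right process exit code.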


20-21: No changes needed: package-level imports are valid.

The root and core packages both re-export the referenced symbols, so the existing imports will resolve correctly at runtime:

  • lsc_eval/src/lsc_eval/__init__.py exports ConfigLoader, DataValidator, EvaluationEngine, and OutputHandler in its __all__.
  • lsc_eval/src/lsc_eval/core/__init__.py exports setup_environment_variables alongside ConfigLoader and DataValidator.

You can safely keep the current imports:

from lsc_eval import ConfigLoader, DataValidator, EvaluationEngine, OutputHandler
from lsc_eval.core import setup_environment_variables

Likely an incorrect or invalid review comment.

Comment on lines 61 to 71
    async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
        self,
        prompt: Any,
        n: int = 1,
        temperature: Optional[float] = None,
        stop: Optional[List[str]] = None,
        callbacks: Optional[Any] = None,
    ) -> LLMResult:
        """Async generate."""
        temp = temperature if temperature is not None else 1e-08
        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)

🛠️ Refactor suggestion

Async method currently blocks the event loop; use litellm.acompletion or offload to a thread.

agenerate_text delegates to the sync method, defeating the purpose of async and potentially stalling callers.

Apply this diff:

     async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         prompt: Any,
         n: int = 1,
         temperature: Optional[float] = None,
         stop: Optional[List[str]] = None,
         callbacks: Optional[Any] = None,
     ) -> LLMResult:
-        """Async generate."""
-        temp = temperature if temperature is not None else 1e-08
-        return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+        """Async generate using LiteLLM's acompletion when available; falls back to a thread."""
+        temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)
+        if not _LITELLM_AVAILABLE:
+            raise ImportError(
+                "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+            )
+        try:
+            if hasattr(litellm, "acompletion"):
+                response = await litellm.acompletion(
+                    model=self.model_name,
+                    messages=[{"role": "user", "content": str(prompt)}],
+                    n=n,
+                    temperature=temp,
+                    max_tokens=self.litellm_params.get("max_tokens"),
+                    timeout=self.litellm_params.get("timeout"),
+                    num_retries=self.litellm_params.get("num_retries"),
+                    stop=stop,
+                )
+                generations: List[Generation] = []
+                for choice in response.choices:  # type: ignore[attr-defined]
+                    content = getattr(choice.message, "content", "") or ""
+                    generations.append(Generation(text=content.strip()))
+                return LLMResult(generations=[generations])
+            # Fallback: offload sync call to a thread to avoid blocking
+            return await asyncio.to_thread(
+                self.generate_text,
+                prompt,
+                n,
+                temp,
+                stop,
+                callbacks,
+            )
+        except Exception as e:
+            logger.exception("Async Ragas LLM failed")
+            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py around lines 61–71, the async
method agenerate_text currently calls the synchronous generate_text and blocks
the event loop; replace that blocking call with a true async implementation by
either awaiting litellm.acompletion with the same parameters (ensuring you map
its result into an LLMResult and preserve the default temperature logic) or, if
litellm.acompletion is not available, offload the sync call to a thread using
asyncio.to_thread(self.generate_text, prompt, n=n, temperature=temp, stop=stop,
callbacks=callbacks) and return its LLMResult; ensure exceptions propagate
correctly and the signature/return type remain unchanged.
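A minimal, generic sketch of the asyncio.to_thread fallback mentioned above (blocking_generate is a hypothetical stand-in for the synchronous LLM call):

```python
import asyncio
import time

def blocking_generate(prompt: str) -> str:
    # Stand-in for a synchronous LLM call that would otherwise block the loop.
    time.sleep(0.01)
    return f"echo:{prompt}"

async def agenerate(prompt: str) -> str:
    # Offload the blocking call to a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(agenerate("hi"))
assert result == "echo:hi"
```

`asyncio.to_thread` (Python 3.9+) propagates exceptions from the worker thread, so the existing error-handling contract is preserved.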

Comment on lines +52 to +63
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"

🛠️ Refactor suggestion

Prevent hard crashes by catching _call_llm RuntimeError in evaluate()

An unhandled RuntimeError from _call_llm will bubble out and crash the evaluation. Include it in the caught exceptions so the API consistently returns a (None, reason) tuple.

-        except (ValueError, AttributeError, KeyError) as e:
+        except (ValueError, AttributeError, KeyError, RuntimeError) as e:
             return None, f"Custom {metric_name} evaluation failed: {str(e)}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"
        try:
            return self.supported_metrics[metric_name](
                conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
            )
        except (ValueError, AttributeError, KeyError, RuntimeError) as e:
            return None, f"Custom {metric_name} evaluation failed: {str(e)}"
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 52 to 57, the
except block currently catches ValueError, AttributeError, and KeyError but
misses RuntimeError from _call_llm which can bubble up; update the except clause
to also catch RuntimeError so the function returns (None, f"Custom {metric_name}
evaluation failed: {str(e)}") consistently when _call_llm raises a RuntimeError.

@asamal4 asamal4 force-pushed the generic-eval-tool branch from 5b854a6 to d94afec on August 26, 2025 at 11:10
@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

This is still WIP. Most of the issues will be resolved when we make this the primary eval tool and remove the existing one.

@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

@tisnik PTAL
This is standalone code and doesn't impact any of the existing code. Most CodeRabbit issues are related to imports, which will be resolved once we make this the primary tool.
This is a big change, so it is practically impossible to review line by line; I would suggest reviewing it by running/testing it.
I will continue to improve it, fix issues, and add test cases, but meanwhile can we please merge this so that others can contribute in parallel?

If we decide not to use this, we can simply delete the folder in the future, so there is no impact at all.

cc: @VladimirKadlec @Anxhela21

@asamal4
Collaborator Author

asamal4 commented Aug 26, 2025

Also, the Bandit issue is not related to this code change; we will have to fix it separately.

Contributor

@tisnik tisnik left a comment

LGTM, very good job!

Contributor

@Anxhela21 Anxhela21 left a comment

/lgtm
/approve

