Conversation

@lpiwowar (Contributor) commented Sep 12, 2025

This update makes sure that evaluation_data.yaml adheres to the data models [1][2]. A minimal sketch of the resulting shape follows the references below.

  • Standardize turn_id format to strings
  • Simplify contexts structure from objects to arrays
  • Add missing description field to conversation group 3

Co-Authored-By: Claude [email protected]

[1] contexts: Optional[list[str]] = Field(

[2]
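
For illustration, a minimal sketch (not taken from the repository) of the shape this cleanup targets, assuming PyYAML and illustrative key names such as conversation_groups and turns around the fields mentioned above:

import yaml  # PyYAML

sample = yaml.safe_load("""
conversation_groups:
  - conversation_group_id: conv_group_1
    description: Sample conversation group
    turns:
      - turn_id: "1"            # standardized to a string, not an int
        query: User query
        response: Sample response
        contexts:               # plain strings instead of {content: ...} objects
          - First retrieved context chunk
          - Second retrieved context chunk
""")

turn = sample["conversation_groups"][0]["turns"][0]
assert isinstance(turn["turn_id"], str)
assert all(isinstance(c, str) for c in turn["contexts"])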

Summary by CodeRabbit

  • New Features

    • Added several new evaluation metrics at turn and conversation levels to broaden quality assessment.
    • Expanded CSV export with additional columns (result, reason, query, response, execution_time) for richer reporting.
    • Introduced new visualization options: score distribution, conversation heatmap, and status breakdown.
  • Changes

    • Standardized turn IDs to strings across datasets.
    • Simplified context entries from objects to plain strings.
    • Added descriptions to select conversation groups and streamlined group fields.
    • Reorganized per-turn metric thresholds for clearer configuration.

@coderabbitai coderabbitai bot commented Sep 12, 2025

Walkthrough

Updates two YAML configs: evaluation data schema changes (IDs stringified, context entry structure adjusted, group fields added/removed, and per-turn metric threshold restructured) and system config enhancements (adds turn- and conversation-level metrics, extends CSV output columns, and enables additional visualization graphs).

Changes

Evaluation data schema updates (config/evaluation_data.yaml)
- turn_id values changed from ints to strings across all groups
- contexts entries simplified from objects with content to plain strings
- conv_group_3 gains a description field
- conv_group_2 removes conversation_group_id, adds description
- turn_metrics_metadata reorganized; adds ragas:faithfulness threshold 0.99

System metrics and outputs configuration (config/system.yaml)
- Adds turn-level metrics with thresholds: ragas:response_relevancy, ragas:context_recall, ragas:context_relevance, ragas:context_precision_with_reference, ragas:context_precision_without_reference
- Adds conversation-level metrics with thresholds: deepeval:conversation_relevancy, deepeval:knowledge_retention
- Expands output.csv_columns: result, reason, query, response, execution_time
- Extends visualization.enabled_graphs: score_distribution, conversation_heatmap, status_breakdown

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as Evaluation Runner
  participant Data as evaluation_data.yaml
  participant System as system.yaml
  participant Metrics as Metric Engines
  participant CSV as CSV Writer
  participant Viz as Visualizer

  Runner->>Data: Load conversation groups, turns, contexts
  Note over Data: turn_id: string<br/>contexts: string list<br/>per-turn thresholds (faithfulness)
  Runner->>System: Load metrics metadata and output schema
  Note over System: +Turn metrics (ragas:*)<br/>+Conversation metrics (deepeval:*)<br/>+CSV columns & graphs

  loop For each conversation/turn
    Runner->>Metrics: Evaluate turn-level metrics
    Metrics-->>Runner: Scores (e.g., response_relevancy, context_precision...)
    alt Conversation-level phase
      Runner->>Metrics: Evaluate conversation-level metrics
      Metrics-->>Runner: Scores (conversation_relevancy, knowledge_retention)
    end
    Runner->>CSV: Write row with new columns (result, reason, query, response, execution_time)
  end

  Runner->>Viz: Generate graphs (score_distribution, heatmap, status_breakdown)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks
- Description Check: Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
- Title Check: Passed. The title "[nit] Clean up evaluation_data.yaml" concisely names the primary file changed and correctly signals a minor cleanup; the PR description and raw_summary show the main edits are normalizing turn_id to strings, simplifying contexts, and adding description fields, so the title is related and not misleading, though somewhat generic.
- Docstring Coverage: Passed. No functions found in the changes; docstring coverage check skipped.

Poem

I thump my paws on YAML plains,
Strings for turns, and simpler chains.
New metrics hop—so spry, so keen—
Heatmaps bloom in verdant green.
CSVs grow longer tails—hurray!
My whiskers twitch: ship day today.
Hippity-hop, configs away!


@tisnik (Contributor) left a comment:

LGTM

@coderabbitai coderabbitai bot left a comment:

Actionable comments posted: 0

🧹 Nitpick comments (7)
config/system.yaml (5)

34-37: Clarify default vs per-group override for faithfulness threshold.

System default is 0.8 here, while config/evaluation_data.yaml sets 0.99 for conv_group_1. Please confirm precedence and consider clarifying in-line.

Proposed tweak to the description to reduce ambiguity:

 "ragas:faithfulness":
   threshold: 0.8
-  description: "How faithful the response is to the provided context"
+  description: "How faithful the response is to the provided context (default; can be overridden per conversation group)"
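
For reference, a minimal sketch of one way such precedence could be resolved (per-group metadata winning over the system default); the function and key names are illustrative, not the framework's actual loader:

def resolve_threshold(metric, system_metadata, group_metadata):
    """Per-group turn_metrics_metadata overrides the system.yaml default."""
    group_entry = (group_metadata or {}).get(metric, {})
    system_entry = (system_metadata or {}).get(metric, {})
    return group_entry.get("threshold", system_entry.get("threshold"))

system_meta = {"ragas:faithfulness": {"threshold": 0.8}}
group_meta = {"ragas:faithfulness": {"threshold": 0.99}}  # conv_group_1 override
print(resolve_threshold("ragas:faithfulness", system_meta, group_meta))  # 0.99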

64-66: Missing threshold handling for custom:tool_eval.

If pass/fail is computed, define a threshold; otherwise ensure the pipeline tolerates metrics without thresholds and doesn’t mark them as failures.
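
One option, sketched below under the assumption that threshold-less metrics should be reported without a verdict (this is not the pipeline's confirmed behaviour):

def metric_status(score, threshold):
    """Return PASS/FAIL when a threshold is configured, otherwise mark the score informational."""
    if threshold is None:
        return "INFO"  # e.g. custom:tool_eval without a threshold
    return "PASS" if score >= threshold else "FAIL"

print(metric_status(0.7, None))  # INFO
print(metric_status(0.7, 0.8))   # FAIL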


91-103: CSV schema expanded; ensure writer emits new columns.

Confirm the CSV writer populates result, reason, query, response, and execution_time to avoid empty columns or KeyErrors.
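
A defensive pattern for the writer is sketched below; the column list mirrors the ones named in this PR plus assumed identifier columns, and the project's actual generator may differ:

import csv

CSV_COLUMNS = ["conversation_group_id", "turn_id", "metric", "score",
               "result", "reason", "query", "response", "execution_time"]

def write_results(path, rows):
    """rows is a list of dicts; missing keys become empty cells instead of raising KeyError."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

write_results("results.csv", [{"turn_id": "1", "result": "PASS", "execution_time": 0.42}])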


109-114: New visualization graph types; verify implementation exists.

Check that score_distribution, conversation_heatmap, and status_breakdown are implemented and feature-flagged here, not just documented.
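
One lightweight guard is to dispatch through a registry and fail fast on unknown graph names; a sketch with placeholder renderers:

GRAPH_RENDERERS = {
    "score_distribution": lambda results: None,    # placeholder renderers
    "conversation_heatmap": lambda results: None,
    "status_breakdown": lambda results: None,
}

def render_enabled_graphs(enabled_graphs, results):
    """Render each enabled graph, rejecting names with no implementation."""
    unknown = [name for name in enabled_graphs if name not in GRAPH_RENDERERS]
    if unknown:
        raise ValueError(f"Enabled graphs with no implementation: {unknown}")
    for name in enabled_graphs:
        GRAPH_RENDERERS[name](results)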


136-144: Logger names may be case-sensitive; align with library logger IDs.

Python logger names are exact matches. Consider lowercase for litellm/deepeval if those are the actual logger names.

Proposed tweak (adjust if your code confirms different names):

   package_overrides:
-    httpx: ERROR
-    urllib3: ERROR
-    requests: ERROR
-    matplotlib: ERROR
-    LiteLLM: WARNING
-    DeepEval: WARNING
-    ragas: WARNING
+    httpx: ERROR
+    urllib3: ERROR
+    requests: ERROR
+    matplotlib: ERROR
+    litellm: WARNING
+    deepeval: WARNING
+    ragas: WARNING
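
The case sensitivity is easy to demonstrate, since logging.getLogger returns a separate logger for each exact name:

import logging

logging.getLogger("LiteLLM").setLevel(logging.WARNING)
print(logging.getLogger("litellm").level)  # 0 (NOTSET): the override did not apply
print(logging.getLogger("LiteLLM").level)  # 30 (WARNING)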
config/evaluation_data.yaml (2)

1-2: Consider adding an explicit schema version to lock format changes.

Helps downstream tooling validate shape changes (string IDs, contexts as strings) and ease future migrations.

Proposed header:

 # LightSpeed Evaluation Framework - Sample/Mock Data
+
+schema_version: 2
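
If adopted, the loader could fail fast on unexpected versions; a minimal sketch assuming a top-level mapping with the proposed schema_version key:

import yaml

SUPPORTED_SCHEMA_VERSION = 2

def load_evaluation_data(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    version = data.get("schema_version", 1)  # treat a missing key as the legacy format
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"Unsupported evaluation data schema_version: {version}")
    return data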

17-25: Minor consistency nit: normalize sample text casing.

"User query" vs "User Query" can be standardized to reduce brittle tests that compare raw text.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b629ddc and e92b4c9.

📒 Files selected for processing (2)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (3 hunks)
🔇 Additional comments (8)
config/system.yaml (3)

38-58: New RAGAS turn-level metrics look consistent; verify evaluator support and inputs.

Identifiers and thresholds align with evaluation_data.yaml. Please verify the scoring code recognizes these metric keys and that required inputs (reference/contexts) are present for each metric.
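
One way to make that check explicit is to validate configured identifiers against the set the scoring code supports; the metric names below are the ones listed in this PR, while the validation helper itself is illustrative:

SUPPORTED_TURN_METRICS = {
    "ragas:faithfulness",
    "ragas:response_relevancy",
    "ragas:context_recall",
    "ragas:context_relevance",
    "ragas:context_precision_with_reference",
    "ragas:context_precision_without_reference",
    "custom:tool_eval",
}

def validate_turn_metrics(configured):
    """Reject metric identifiers the evaluator does not recognize."""
    unknown = sorted(set(configured) - SUPPORTED_TURN_METRICS)
    if unknown:
        raise ValueError(f"Unrecognized turn-level metrics: {unknown}")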


74-80: Conversation-level DeepEval metrics added; confirm availability.

Looks good. Please confirm deepeval is installed and these identifiers match the implementation names.


117-122: Env flags for DeepEval/LiteLLM logging/telemetry: LGTM.

config/evaluation_data.yaml (5)

11-16: Per-group threshold override for ragas:faithfulness.

Setting 0.99 here is fine if it overrides system defaults. Please confirm the loader applies per-group turn_metrics_metadata over system.yaml thresholds.


47-48: Added description for conv_group_3: LGTM.


55-57: Conversation metrics listed match system.yaml additions.

Looks consistent with deepeval:conversation_completeness and deepeval:conversation_relevancy.


61-74: Additional turns with string IDs: LGTM.

IDs are unique within the group and fields align with the schema.


18-24: Verified — turn_id is a string and contexts-as-strings are supported (no change required).

TurnData.turn_id is annotated as str (src/lightspeed_evaluation/core/models/data.py); contexts are normalized to accept either dicts with "content" or plain strings (src/lightspeed_evaluation/core/metrics/ragas.py and existing validators/tests); CSV writer emits attribute values as-is (src/lightspeed_evaluation/core/output/generator.py).
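
For illustration, a Pydantic v2 sketch of the kind of normalization described above; the field names follow the snippet in the PR description, but the validator body is illustrative rather than the repository's actual code:

from typing import Any, Optional
from pydantic import BaseModel, field_validator

class TurnData(BaseModel):
    turn_id: str
    contexts: Optional[list[str]] = None

    @field_validator("contexts", mode="before")
    @classmethod
    def normalize_contexts(cls, value: Any) -> Optional[list[str]]:
        """Accept either plain strings or legacy {"content": ...} objects."""
        if value is None:
            return None
        return [item["content"] if isinstance(item, dict) else str(item) for item in value]

print(TurnData(turn_id="1", contexts=[{"content": "ctx A"}, "ctx B"]).contexts)
# ['ctx A', 'ctx B']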

@asamal4 (Collaborator) left a comment:

Thanks.. I forgot to change this..
LGTM

@tisnik tisnik merged commit 4db79e5 into lightspeed-core:main Sep 12, 2025
15 checks passed
@lpiwowar (Contributor, Author) commented:

@asamal4 no problem!

@coderabbitai coderabbitai bot mentioned this pull request Sep 18, 2025