Conversation

@lpiwowar (Contributor) commented Sep 12, 2025

This update makes sure that evaluation_data.yaml adheres to the data models [1][2]. A minimal sketch of the resulting shape follows the references below.

  • Standardize turn_id format to strings
  • Simplify contexts structure from objects to arrays
  • Add missing description field to conversation group 3

Co-Authored-By: Claude [email protected]

[1] contexts: Optional[list[str]] = Field(

[2]
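
For illustration, a minimal sketch (not taken from the repository) of the shape this cleanup targets, assuming PyYAML and illustrative key names such as conversation_groups and turns around the fields mentioned above:

import yaml  # PyYAML

sample = yaml.safe_load("""
conversation_groups:
  - conversation_group_id: conv_group_1
    description: Sample conversation group
    turns:
      - turn_id: "1"            # standardized to a string, not an int
        query: User query
        response: Sample response
        contexts:               # plain strings instead of {content: ...} objects
          - First retrieved context chunk
          - Second retrieved context chunk
""")

turn = sample["conversation_groups"][0]["turns"][0]
assert isinstance(turn["turn_id"], str)
assert all(isinstance(c, str) for c in turn["contexts"])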

Summary by CodeRabbit

  • New Features

    • Added several new evaluation metrics at turn and conversation levels to broaden quality assessment.
    • Expanded CSV export with additional columns (result, reason, query, response, execution_time) for richer reporting.
    • Introduced new visualization options: score distribution, conversation heatmap, and status breakdown.
  • Changes

    • Standardized turn IDs to strings across datasets.
    • Simplified context entries from objects to plain strings.
    • Added descriptions to select conversation groups and streamlined group fields.
    • Reorganized per-turn metric thresholds for clearer configuration.

@coderabbitai coderabbitai bot commented Sep 12, 2025

Walkthrough

Updates two YAML configs: evaluation data schema changes (IDs stringified, context entry structure adjusted, group fields added/removed, and per-turn metric threshold restructured) and system config enhancements (adds turn- and conversation-level metrics, extends CSV output columns, and enables additional visualization graphs).

Changes

Evaluation data schema updates (config/evaluation_data.yaml)
- turn_id values changed from ints to strings across all groups
- contexts entries simplified from objects with content to plain strings
- conv_group_3 gains a description field
- conv_group_2 removes conversation_group_id, adds description
- turn_metrics_metadata reorganized; adds ragas:faithfulness threshold 0.99

System metrics and outputs configuration (config/system.yaml)
- Adds turn-level metrics with thresholds: ragas:response_relevancy, ragas:context_recall, ragas:context_relevance, ragas:context_precision_with_reference, ragas:context_precision_without_reference
- Adds conversation-level metrics with thresholds: deepeval:conversation_relevancy, deepeval:knowledge_retention
- Expands output.csv_columns: result, reason, query, response, execution_time
- Extends visualization.enabled_graphs: score_distribution, conversation_heatmap, status_breakdown

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as Evaluation Runner
  participant Data as evaluation_data.yaml
  participant System as system.yaml
  participant Metrics as Metric Engines
  participant CSV as CSV Writer
  participant Viz as Visualizer

  Runner->>Data: Load conversation groups, turns, contexts
  Note over Data: turn_id: string<br/>contexts: string list<br/>per-turn thresholds (faithfulness)
  Runner->>System: Load metrics metadata and output schema
  Note over System: +Turn metrics (ragas:*)<br/>+Conversation metrics (deepeval:*)<br/>+CSV columns & graphs

  loop For each conversation/turn
    Runner->>Metrics: Evaluate turn-level metrics
    Metrics-->>Runner: Scores (e.g., response_relevancy, context_precision...)
    alt Conversation-level phase
      Runner->>Metrics: Evaluate conversation-level metrics
      Metrics-->>Runner: Scores (conversation_relevancy, knowledge_retention)
    end
    Runner->>CSV: Write row with new columns (result, reason, query, response, execution_time)
  end

  Runner->>Viz: Generate graphs (score_distribution, heatmap, status_breakdown)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks
- Description Check: Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
- Title Check: Passed. The title "[nit] Clean up evaluation_data.yaml" concisely names the primary file changed and correctly signals a minor cleanup; the PR description and raw_summary show the main edits are normalizing turn_id to strings, simplifying contexts, and adding description fields, so the title is related and not misleading, though somewhat generic.
- Docstring Coverage: Passed. No functions found in the changes; docstring coverage check skipped.

Poem

I thump my paws on YAML plains,
Strings for turns, and simpler chains.
New metrics hop—so spry, so keen—
Heatmaps bloom in verdant green.
CSVs grow longer tails—hurray!
My whiskers twitch: ship day today.
Hippity-hop, configs away!


@tisnik (Contributor) left a comment:

LGTM

@coderabbitai coderabbitai bot left a comment:

Actionable comments posted: 0

🧹 Nitpick comments (7)
config/system.yaml (5)

34-37: Clarify default vs per-group override for faithfulness threshold.

System default is 0.8 here, while config/evaluation_data.yaml sets 0.99 for conv_group_1. Please confirm precedence and consider clarifying in-line.

Proposed tweak to the description to reduce ambiguity:

 "ragas:faithfulness":
   threshold: 0.8
-  description: "How faithful the response is to the provided context"
+  description: "How faithful the response is to the provided context (default; can be overridden per conversation group)"
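
For reference, a minimal sketch of one way such precedence could be resolved (per-group metadata winning over the system default); the function and key names are illustrative, not the framework's actual loader:

def resolve_threshold(metric, system_metadata, group_metadata):
    """Per-group turn_metrics_metadata overrides the system.yaml default."""
    group_entry = (group_metadata or {}).get(metric, {})
    system_entry = (system_metadata or {}).get(metric, {})
    return group_entry.get("threshold", system_entry.get("threshold"))

system_meta = {"ragas:faithfulness": {"threshold": 0.8}}
group_meta = {"ragas:faithfulness": {"threshold": 0.99}}  # conv_group_1 override
print(resolve_threshold("ragas:faithfulness", system_meta, group_meta))  # 0.99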

64-66: Missing threshold handling for custom:tool_eval.

If pass/fail is computed, define a threshold; otherwise ensure the pipeline tolerates metrics without thresholds and doesn’t mark them as failures.
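
One option, sketched below under the assumption that threshold-less metrics should be reported without a verdict (this is not the pipeline's confirmed behaviour):

def metric_status(score, threshold):
    """Return PASS/FAIL when a threshold is configured, otherwise mark the score informational."""
    if threshold is None:
        return "INFO"  # e.g. custom:tool_eval without a threshold
    return "PASS" if score >= threshold else "FAIL"

print(metric_status(0.7, None))  # INFO
print(metric_status(0.7, 0.8))   # FAIL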


91-103: CSV schema expanded; ensure writer emits new columns.

Confirm the CSV writer populates result, reason, query, response, and execution_time to avoid empty columns or KeyErrors.
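
A defensive pattern for the writer is sketched below; the column list mirrors the ones named in this PR plus assumed identifier columns, and the project's actual generator may differ:

import csv

CSV_COLUMNS = ["conversation_group_id", "turn_id", "metric", "score",
               "result", "reason", "query", "response", "execution_time"]

def write_results(path, rows):
    """rows is a list of dicts; missing keys become empty cells instead of raising KeyError."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

write_results("results.csv", [{"turn_id": "1", "result": "PASS", "execution_time": 0.42}])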


109-114: New visualization graph types; verify implementation exists.

Check that score_distribution, conversation_heatmap, and status_breakdown are implemented and feature-flagged here, not just documented.
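
One lightweight guard is to dispatch through a registry and fail fast on unknown graph names; a sketch with placeholder renderers:

GRAPH_RENDERERS = {
    "score_distribution": lambda results: None,    # placeholder renderers
    "conversation_heatmap": lambda results: None,
    "status_breakdown": lambda results: None,
}

def render_enabled_graphs(enabled_graphs, results):
    """Render each enabled graph, rejecting names with no implementation."""
    unknown = [name for name in enabled_graphs if name not in GRAPH_RENDERERS]
    if unknown:
        raise ValueError(f"Enabled graphs with no implementation: {unknown}")
    for name in enabled_graphs:
        GRAPH_RENDERERS[name](results)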


136-144: Logger names may be case-sensitive; align with library logger IDs.

Python logger names are exact matches. Consider lowercase for litellm/deepeval if those are the actual logger names.

Proposed tweak (adjust if your code confirms different names):

   package_overrides:
-    httpx: ERROR
-    urllib3: ERROR
-    requests: ERROR
-    matplotlib: ERROR
-    LiteLLM: WARNING
-    DeepEval: WARNING
-    ragas: WARNING
+    httpx: ERROR
+    urllib3: ERROR
+    requests: ERROR
+    matplotlib: ERROR
+    litellm: WARNING
+    deepeval: WARNING
+    ragas: WARNING
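
The case sensitivity is easy to demonstrate, since logging.getLogger returns a separate logger for each exact name:

import logging

logging.getLogger("LiteLLM").setLevel(logging.WARNING)
print(logging.getLogger("litellm").level)  # 0 (NOTSET): the override did not apply
print(logging.getLogger("LiteLLM").level)  # 30 (WARNING)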
config/evaluation_data.yaml (2)

1-2: Consider adding an explicit schema version to lock format changes.

Helps downstream tooling validate shape changes (string IDs, contexts as strings) and ease future migrations.

Proposed header:

 # LightSpeed Evaluation Framework - Sample/Mock Data
+
+schema_version: 2
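
If adopted, the loader could fail fast on unexpected versions; a minimal sketch assuming a top-level mapping with the proposed schema_version key:

import yaml

SUPPORTED_SCHEMA_VERSION = 2

def load_evaluation_data(path):
    with open(path) as f:
        data = yaml.safe_load(f)
    version = data.get("schema_version", 1)  # treat a missing key as the legacy format
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"Unsupported evaluation data schema_version: {version}")
    return data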

17-25: Minor consistency nit: normalize sample text casing.

"User query" vs "User Query" can be standardized to reduce brittle tests that compare raw text.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b629ddc and e92b4c9.

📒 Files selected for processing (2)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (3 hunks)
🔇 Additional comments (8)
config/system.yaml (3)

38-58: New RAGAS turn-level metrics look consistent; verify evaluator support and inputs.

Identifiers and thresholds align with evaluation_data.yaml. Please verify the scoring code recognizes these metric keys and that required inputs (reference/contexts) are present for each metric.
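
One way to make that check explicit is to validate configured identifiers against the set the scoring code supports; the metric names below are the ones listed in this PR, while the validation helper itself is illustrative:

SUPPORTED_TURN_METRICS = {
    "ragas:faithfulness",
    "ragas:response_relevancy",
    "ragas:context_recall",
    "ragas:context_relevance",
    "ragas:context_precision_with_reference",
    "ragas:context_precision_without_reference",
    "custom:tool_eval",
}

def validate_turn_metrics(configured):
    """Reject metric identifiers the evaluator does not recognize."""
    unknown = sorted(set(configured) - SUPPORTED_TURN_METRICS)
    if unknown:
        raise ValueError(f"Unrecognized turn-level metrics: {unknown}")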


74-80: Conversation-level DeepEval metrics added; confirm availability.

Looks good. Please confirm deepeval is installed and these identifiers match the implementation names.


117-122: Env flags for DeepEval/LiteLLM logging/telemetry: LGTM.

config/evaluation_data.yaml (5)

11-16: Per-group threshold override for ragas:faithfulness.

Setting 0.99 here is fine if it overrides system defaults. Please confirm the loader applies per-group turn_metrics_metadata over system.yaml thresholds.


47-48: Added description for conv_group_3: LGTM.


55-57: Conversation metrics listed match system.yaml additions.

Looks consistent with deepeval:conversation_completeness and deepeval:conversation_relevancy.


61-74: Additional turns with string IDs: LGTM.

IDs are unique within the group and fields align with the schema.


18-24: Verified — turn_id is a string and contexts-as-strings are supported (no change required).

TurnData.turn_id is annotated as str (src/lightspeed_evaluation/core/models/data.py); contexts are normalized to accept either dicts with "content" or plain strings (src/lightspeed_evaluation/core/metrics/ragas.py and existing validators/tests); CSV writer emits attribute values as-is (src/lightspeed_evaluation/core/output/generator.py).
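
For illustration, a Pydantic v2 sketch of the kind of normalization described above; the field names follow the snippet in the PR description, but the validator body is illustrative rather than the repository's actual code:

from typing import Any, Optional
from pydantic import BaseModel, field_validator

class TurnData(BaseModel):
    turn_id: str
    contexts: Optional[list[str]] = None

    @field_validator("contexts", mode="before")
    @classmethod
    def normalize_contexts(cls, value: Any) -> Optional[list[str]]:
        """Accept either plain strings or legacy {"content": ...} objects."""
        if value is None:
            return None
        return [item["content"] if isinstance(item, dict) else str(item) for item in value]

print(TurnData(turn_id="1", contexts=[{"content": "ctx A"}, "ctx B"]).contexts)
# ['ctx A', 'ctx B']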

@asamal4 (Collaborator) left a comment:

Thanks.. I forgot to change this..
LGTM

@tisnik tisnik merged commit 4db79e5 into lightspeed-core:main Sep 12, 2025
15 checks passed
@lpiwowar (Contributor, Author) commented:

@asamal4 no problem!

@coderabbitai coderabbitai bot mentioned this pull request Sep 18, 2025