
Conversation


@asamal4 asamal4 commented Nov 19, 2025

Currently we make an API call for every query in the conversation and only then start the evaluation. This PR modifies that logic to run turn evaluation immediately after each API call.

  • Avoids unnecessary evaluation of follow-up turns/queries when the API call has failed.
  • Script evaluation needs to run immediately after the agent action for the turn.

The code can be modularized further, but that is deferred as it would involve unrelated logic changes.
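The change described above can be sketched as a per-turn loop (a minimal illustration only, not the actual implementation; `Turn`, `process_conversation`, and the callbacks are hypothetical stand-ins for the framework's real types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    query: str
    response: Optional[str] = None

def process_conversation(turns, call_api, evaluate):
    """Amend and evaluate each turn immediately; stop on API failure."""
    results = []
    for idx, turn in enumerate(turns):
        error = call_api(turn)  # per-turn API call
        if error is not None:
            # Skip evaluation for this and all follow-up turns instead of
            # evaluating them against missing responses.
            results.extend(("ERROR", t.query) for t in turns[idx:])
            break
        results.append(evaluate(turn))  # evaluate right after the call
    return results
```

The key difference from the old flow: evaluation runs inside the loop, so a failed API call short-circuits everything downstream instead of wasting evaluation effort on turns with no response.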

Summary by CodeRabbit

  • Refactor
    • Evaluation now processes and amends model responses per turn (threads conversation context across turns) for finer-grained handling and clearer per-turn outcomes.
  • Bug Fixes
    • On API errors, remaining turns and conversation-level metrics are automatically marked as failed to avoid misleading results.
  • Tests
    • Added unit tests covering turn-level error marking and cascade-failure behavior.


coderabbitai bot commented Nov 19, 2025

Walkthrough

Refactors the amendment flow from conversation-level to per-turn processing. Replaces amend_conversation_data with amend_single_turn(turn_data, conversation_id=None) that returns (error_message, conversation_id). Adds per-turn cascade error marking methods and updates the conversation processor to call the amender per turn and handle cascade failures.

Changes

Cohort / File(s) Summary
Amender (per-turn API)
src/lightspeed_evaluation/pipeline/evaluation/amender.py
Replaces amend_conversation_data(conv_data) with amend_single_turn(turn_data, conversation_id=None). Accepts a single TurnData, performs the API call, updates turn_data in-place (response, conversation_id, contexts, tool_calls/attachments), and returns (error_message, conversation_id). Logs per-turn actions and returns API errors as (error_message, conversation_id).
Error handling additions
src/lightspeed_evaluation/pipeline/evaluation/errors.py
Adds mark_turn_metrics_as_error(conv_data, turn_idx, turn_data, turn_metrics, error_reason) and mark_remaining_turns_and_conversation_as_error(conv_data, failed_turn_idx, resolved_turn_metrics, resolved_conversation_metrics, error_reason) on EvaluationErrorHandler. These create ERROR EvaluationResult entries for specific turns and for remaining turns + conversation-level metrics, and log warnings. Also exposes TurnData in imports.
Processor: per-turn flow & cascade semantics
src/lightspeed_evaluation/pipeline/evaluation/processor.py
Changes process_conversation() to call amend_single_turn() per turn, thread conversation_id across calls, run per-turn metric evaluation after amendment, and on per-turn API error call error-handler methods to mark current and remaining metrics as ERROR and abort further processing. Reorders finalization/cleanup steps accordingly.
Tests — amender
tests/unit/pipeline/evaluation/test_amender.py
Updates tests to call amend_single_turn() with TurnData and optional conversation_id; asserts tuple return (error_msg, conversation_id). Adds/adjusts cases for missing API client, existing conversation_id, tool_calls, attachments, empty contexts, and API error formatting.
Tests — errors
tests/unit/pipeline/evaluation/test_errors.py
Adds tests test_mark_turn_metrics_as_error() and test_mark_cascade_failure() to validate per-turn ERROR results and cascade marking of remaining turns and conversation-level metrics.
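Based on the summary above, the new amender contract might look roughly like the following (a hedged sketch; the dict-based `turn_data` and the `call_api` callback are simplifications of the real `TurnData` model and API client, and field names are assumptions):

```python
from typing import Callable, Dict, Optional, Tuple

class APIError(Exception):
    """Stand-in for the framework's APIError."""

def amend_single_turn(
    turn_data: Dict, call_api: Callable, conversation_id: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Call the API for one turn, update turn_data in place, and
    return (error_message, conversation_id)."""
    try:
        resp = call_api(turn_data["query"], conversation_id)
    except APIError as exc:
        # Preserve the current conversation_id on failure.
        return (f"API error: {exc}", conversation_id)
    turn_data["response"] = resp["response"]
    # Only set optional fields when non-empty, so empty lists do not
    # overwrite validator-friendly defaults.
    if resp.get("contexts"):
        turn_data["contexts"] = resp["contexts"]
    if resp.get("tool_calls"):
        turn_data["tool_calls"] = resp["tool_calls"]
    return (None, resp.get("conversation_id", conversation_id))
```

The tuple return lets the processor thread `conversation_id` into the next turn's call while treating a non-None first element as the signal to start cascade failure handling.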

Sequence Diagram(s)

sequenceDiagram
    participant Proc as Processor
    participant Am as Amender
    participant API as API Client
    participant EH as ErrorHandler

    Proc->>Proc: For each turn in conversation
    Proc->>Am: amend_single_turn(turn_data, conversation_id)
    Am->>API: Call API with turn payload

    alt API Success
        API-->>Am: APIResponse
        Am->>Am: Update turn_data (response, conversation_id, contexts, tool_calls)
        Am-->>Proc: (None, new_conversation_id)
        Proc->>Proc: Evaluate turn metrics (Step 2b)
        Proc->>Proc: continue next turn (conversation_id threaded)
    else API Error
        API-->>Am: APIError
        Am-->>Proc: (error_message, current_conversation_id)
        rect rgb(255,230,230)
        Note over Proc,EH: Cascade failure handling
        Proc->>EH: mark_turn_metrics_as_error(conv, turn_idx, turn_data, turn_metrics, reason)
        Proc->>EH: mark_remaining_turns_and_conversation_as_error(conv, failed_turn_idx, resolved_turn_metrics, resolved_conversation_metrics, reason)
        end
        Proc->>Proc: Abort further turn processing
    end

    Proc->>Proc: After loop: evaluate conversation-level metrics (if not cascaded)
    Proc->>Proc: Cleanup & return results
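The cascade branch in the diagram could be approximated as follows (illustrative only; the real helpers build `EvaluationResult` objects and append to the handler's state rather than returning plain dicts):

```python
def mark_turn_metrics_as_error(turn_idx, turn_metrics, reason):
    """ERROR results for the turn whose API call failed."""
    return [{"turn": turn_idx, "metric": m, "status": "ERROR", "reason": reason}
            for m in turn_metrics]

def mark_remaining_turns_and_conversation_as_error(
    num_turns, failed_idx, turn_metrics, conversation_metrics, reason
):
    """ERROR results for every later turn plus all conversation-level metrics."""
    results = [{"turn": idx, "metric": m, "status": "ERROR", "reason": reason}
               for idx in range(failed_idx + 1, num_turns)
               for m in turn_metrics]
    results += [{"turn": None, "metric": m, "status": "ERROR", "reason": reason}
                for m in conversation_metrics]
    return results
```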

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py: correctness of conversation_id threading, early-exit/cascade semantics, and that cleanup still runs in failure paths.
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py: ensure EvaluationResult fields for ERROR entries match expected schema and that counts/summaries are accurate.
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py and related tests: verify in-place TurnData updates and exact tuple return semantics are consistent across success/error branches.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: shifting from batch API processing to immediate per-turn evaluation after each API call.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

46-177: Per-turn processing and error cascade look solid; fix the line-too-long lint

The refactor of process_conversation to:

  • run setup once,
  • then, for each turn, do "API amend → evaluate turn metrics,"
  • and only afterwards evaluate conversation-level metrics

matches the stated goal of evaluating turns immediately after their API call. The error path that:

  • marks the current turn's metrics as ERROR, then
  • marks remaining turns and all conversation metrics as ERROR, and
  • exits early while still running cleanup in finally

is consistent with the new EvaluationErrorHandler helpers and avoids wasted work on failed conversations.

The only concrete issue is the pylint line-too-long at Line 141 (remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(...)). You can resolve this by aliasing the method before calling it, e.g.:

-                        cascade_error_reason = (
-                            f"Cascade failure from turn {turn_idx + 1} API error: "
-                            f"{api_error_message}"
-                        )
-                        error_handler = self.components.error_handler
-                        remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(
-                            conv_data,
-                            turn_idx,
-                            resolved_turn_metrics,
-                            resolved_conversation_metrics,
-                            cascade_error_reason,
-                        )
+                        cascade_error_reason = (
+                            f"Cascade failure from turn {turn_idx + 1} API error: "
+                            f"{api_error_message}"
+                        )
+                        error_handler = self.components.error_handler
+                        mark_remaining = (
+                            error_handler.mark_remaining_turns_and_conversation_as_error
+                        )
+                        remaining_errors = mark_remaining(
+                            conv_data,
+                            turn_idx,
+                            resolved_turn_metrics,
+                            resolved_conversation_metrics,
+                            cascade_error_reason,
+                        )

This keeps the behaviour unchanged while satisfying the linter.

🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_errors.py (1)

182-294: New per-turn error handler tests look correct and aligned with implementation

These tests exercise both mark_turn_metrics_as_error and mark_remaining_turns_and_conversation_as_error in realistic scenarios and validate:

  • Field population on EvaluationResult (IDs, status, reason, query/response, execution_time).
  • Correct ordering and counts of turn-level vs conversation-level errors.
  • Aggregation behaviour in get_error_summary.

This gives good coverage for the new error-handling paths in the processor. You might optionally add an assertion on query/response for the remaining-turns case, but it’s not strictly necessary.
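A test along these lines might assert the ERROR-result shape like so (hypothetical field and function names; the real `EvaluationResult` schema and test helpers may differ):

```python
def make_error_result(conversation_id, turn_id, metric, reason):
    """Build the ERROR-shaped record the review describes: empty response,
    zero execution time, reason populated."""
    return {
        "conversation_id": conversation_id,
        "turn_id": turn_id,
        "metric": metric,
        "status": "ERROR",
        "reason": reason,
        "response": "",
        "execution_time": 0.0,
    }

def test_error_result_shape():
    result = make_error_result("conv_1", 2, "answer_correctness", "API error: timeout")
    assert result["status"] == "ERROR"
    assert result["response"] == ""
    assert result["execution_time"] == 0.0
    assert result["reason"].startswith("API error")

test_error_result_shape()
```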

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4518a87 and a1c716c.

📒 Files selected for processing (5)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (3 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
  • EvaluationErrorHandler (10-201)
  • mark_turn_metrics_as_error (79-125)
  • get_error_summary (195-201)
  • mark_remaining_turns_and_conversation_as_error (127-193)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (35-261)
  • EvaluationData (264-311)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (264-311)
  • EvaluationResult (314-353)
  • TurnData (35-261)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • EvaluationResult (314-353)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_single_turn (20-69)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
  • mark_turn_metrics_as_error (79-125)
  • mark_remaining_turns_and_conversation_as_error (127-193)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (2)
  • APIDataAmender (13-80)
  • amend_single_turn (20-69)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/processor.py

[error] 141-141: pylint: C0301 Line too long (104/100) (line-too-long) in processor.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
🔇 Additional comments (3)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

7-69: Per-turn amendment logic is consistent and matches the new contract

amend_single_turn cleanly encapsulates the per-turn API call and in-place mutation of TurnData, returning a simple (error_message, updated_conversation_id) tuple. It:

  • Always performs a fresh API call when a client is present, independent of existing response/tool_calls, in line with the framework’s expectations.
  • Correctly threads conversation_id through calls and preserves it on API failure.
  • Avoids setting empty contexts/tool_calls, which keeps TurnData compatible with its validators and the tests.

No functional issues here from my side.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

5-193: New error-handling helpers are well-factored and consistent with existing patterns

The additions of:

  • mark_turn_metrics_as_error(...) for a single turn, and
  • mark_remaining_turns_and_conversation_as_error(...) for cascading failures

fit cleanly with mark_all_metrics_as_error and get_error_summary:

  • They operate on pre-resolved metrics passed from ConversationProcessor, respecting the existing division of responsibilities.
  • They produce EvaluationResult objects with the same ERROR-shape (empty response, 0.0 execution_time, reason populated) used elsewhere.
  • self.results is updated so summaries aggregate errors across multiple calls and conversations, as covered by the tests.

No changes needed here.

tests/unit/pipeline/evaluation/test_amender.py (1)

3-214: Tests comprehensively cover the new amend_single_turn behaviour

The updated tests validate all key aspects of the per-turn amender:

  • Correct tuple return semantics for success, no-client, and API error cases.
  • Proper propagation and use of conversation_id, including follow-up turns.
  • In-place mutation of TurnData (response, conversation_id, contexts, tool_calls), including the intended behaviour when contexts/tool_calls are empty (fields remain None).
  • Attachment handling via attachments being forwarded into the API client.

This suite gives strong confidence in the new per-turn API amendment flow.

@asamal4 asamal4 changed the title run turn evaluation immediately after api call fix: run turn evaluation immediately after api call Nov 19, 2025

asamal4 commented Nov 21, 2025

@VladimirKadlec @tisnik PTAL


@VladimirKadlec VladimirKadlec left a comment


LGTM.


@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 686ae36 into lightspeed-core:main Nov 21, 2025
15 checks passed
bsatapat-jpg pushed a commit to bsatapat-jpg/lightspeed-evaluation that referenced this pull request Nov 24, 2025
fix: run turn evaluation immediately after api call