Fix(test): Update evaluation data #205

eranco74 · 2025-09-14T17:38:50Z

- Replaced "ClustER-NAme" with a valid cluster name so the eval tests can run locally.
- Removed `response_eval:sub-string` and `expected_keywords` because `response_eval:accuracy` already performs an exact match, making the substring check unnecessary.
- Changed the cluster name in the `cluster_id_from_name` conversation from `eval-test-uniq-cluster-name` to `eval-test2-uniq-cluster-name` to avoid potential name conflicts with other tests.
- Removed `list_clusters` from the expected tool calls in `cluster_id_from_name`, since the cluster name and ID are already part of the current conversation context.
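
A minimal sketch of why the two names stay distinct after substitution: the prow entrypoint replaces the substring `uniq-cluster-name` with `${UNIQUE_ID}`, so a differing prefix survives the rewrite. The `UNIQUE_ID` value and the two-line input are made up for illustration; in CI the entrypoint supplies the real value.

```sh
set -eu

# Hypothetical value; the real one comes from the CI environment.
UNIQUE_ID="abc123"

# The two placeholder-bearing names mentioned in the description.
names=$(printf '%s\n' "eval-test-uniq-cluster-name" "eval-test2-uniq-cluster-name")

# Same substitution pattern as the entrypoint, applied to stdin instead of a file.
substituted=$(printf '%s\n' "$names" | sed "s/uniq-cluster-name/${UNIQUE_ID}/g")

printf '%s\n' "$substituted"
```

Because only the shared suffix is replaced, the `eval-test` and `eval-test2` prefixes keep the resulting cluster names apart within a single run.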

Summary by CodeRabbit

Tests
- Standardized the cluster name placeholder to “uniq-cluster-name” across all eval scenarios, updating queries, expected responses, and keyword checks.
- Simplified specific evaluations by removing substring-level checks and dropping a redundant expected-keyword assertion.
- Adjusted evaluation steps to reflect a reduced set of expected tool-call checks.
Chores
- Updated the test runner substitution to use the new placeholder during test execution.

coderabbitai · 2025-09-14T17:38:57Z

Walkthrough

Renamed the cluster placeholder to "uniq-cluster-name" across eval blocks and updated all corresponding queries, tool-call refs, IDs, and expected responses/keywords; removed a substring-level eval and trimmed expected_tool_calls in one block; updated prow entrypoint sed to target the new placeholder.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Eval data renames and expectation updates: `test/evals/eval_data.yaml` | Replaced all cluster-name variants with `uniq-cluster-name` across eval IDs, eval_query text, tool call args, get_iso/cluster_info references, expected_response and expected_keywords; removed substring eval type and its expected_keywords in `available_operators_conv`; removed `list_clusters` from `expected_tool_calls` in `cluster_id_from_name`. |
| Prow entrypoint substitution: `test/prow/entrypoint.sh` | Updated `sed` target to replace `uniq-cluster-name` with `${UNIQUE_ID}` in `test/evals/eval_data.yaml`; no other script logic changed. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

creating a unique name for the clusters created by the evaluation tests #198 — Also adjusts the cluster-name placeholder and the prow entrypoint substitution for unique IDs.
Increasing eval test coverage #134 — Related changes to eval_data.yaml around cluster creation/ISO tests and expected keywords.
fix eval tests for substring matching #200 — Overlapping edits to cluster_id_from_name and related eval expectations.

Suggested labels

lgtm

Suggested reviewers

omertuc
carbonin

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Fix(test): Update evaluation data" is concise and accurately summarizes the primary purpose of the changeset: modifying evaluation test data and expectations. The PR's edits (replacing an invalid cluster name, adjusting expected responses/keywords, and removing unnecessary eval steps) are all encompassed by "Update evaluation data." This makes the title appropriate for reviewers scanning project history. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faa8c30 and 23221f2.

📒 Files selected for processing (2)

test/evals/eval_data.yaml (6 hunks)
test/prow/entrypoint.sh (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

test/evals/eval_data.yaml
test/prow/entrypoint.sh

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request


openshift-ci · 2025-09-14T17:38:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eranco74

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [eranco74]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

test/evals/eval_data.yaml (1)

91-102: Same brittleness warning for multi-node workflow accuracy checks.

Ensure the agent’s exact phrasing matches expected_response and the keywords after UNIQUE_ID substitution; otherwise accuracy evals will flap.

Use the same script above to verify these blocks too.

Also applies to: 112-120
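
One way to sanity-check the substitution before running the evals is to apply it to a scratch copy and count what changed. This is a hedged sketch, not the script referenced above: the `UNIQUE_ID` value and the two-line YAML stand-in are invented for the example, and only the field names `eval_query` and `expected_keywords` come from this thread.

```sh
set -eu

UNIQUE_ID="abc123"   # hypothetical; CI supplies the real value
workdir=$(mktemp -d)

# Stand-in for test/evals/eval_data.yaml; the contents are illustrative only.
cat > "$workdir/eval_data.yaml" <<'EOF'
- eval_query: create a cluster named eval-test-uniq-cluster-name
  expected_keywords: ["eval-test-uniq-cluster-name"]
EOF

# Substitute into a copy so the original stays untouched.
sed "s/uniq-cluster-name/${UNIQUE_ID}/g" "$workdir/eval_data.yaml" > "$workdir/eval_data.patched"

# grep -c prints a line count and exits non-zero on zero matches, hence "|| true".
leftover=$(grep -c "uniq-cluster-name" "$workdir/eval_data.patched" || true)
replaced=$(grep -c "eval-test-${UNIQUE_ID}" "$workdir/eval_data.patched" || true)

echo "leftover=$leftover replaced=$replaced"
rm -rf "$workdir"
```

A non-zero `leftover` count would mean some placeholder text escaped the substitution, so the query and the expected values could drift apart.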

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de2e8a0 and faa8c30.

📒 Files selected for processing (2)

test/evals/eval_data.yaml (5 hunks)
test/prow/entrypoint.sh (1 hunks)

🔇 Additional comments (2)

test/evals/eval_data.yaml (2)

33-37: Good change: switched to exact-match eval and removed keyword check.

Moving to response_eval:accuracy with tool_eval and dropping expected_keywords aligns with the stated objective and removes redundant substring checks.

64-76: Accuracy eval is brittle — confirm exact agent wording or relax to substring

expected_response is an exact-match; punctuation or casing differences will fail. Confirm the agent returns this identical sentence after UNIQUE_ID templating, or change eval_types to response_eval:sub-string (recommended).

Location: test/evals/eval_data.yaml lines 74–76 (SNO creation expected_response/expected_keywords). Also review other response_eval:accuracy entries in this file (e.g., lines 50, 57, 148, 170, 178).
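
To make the brittleness concern concrete, here is a small sketch contrasting an exact comparison with a substring check. It assumes, as this thread states, that the accuracy eval behaves like an exact match; the real evaluator may work differently, and the strings below are made up.

```sh
set -eu

expected="The cluster ID is 1234."   # made-up expected_response
actual="The cluster ID is 1234"      # same sentence without the trailing period
keyword="1234"                       # made-up expected keyword

# Exact comparison: any punctuation or casing difference fails.
if [ "$actual" = "$expected" ]; then exact=pass; else exact=fail; fi

# Substring comparison: passes as long as the keyword appears anywhere.
case "$actual" in
  *"$keyword"*) substring=pass ;;
  *)            substring=fail ;;
esac

echo "exact=$exact substring=$substring"
```

The exact check fails on a single missing period while the substring check still passes, which is the trade-off between the two eval types raised in this comment.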

test/evals/eval_data.yaml

coderabbitai · 2025-09-14T17:44:45Z

test/prow/entrypoint.sh

 echo "GEMINI_API_KEY=${GEMINI_API_KEY}" > .env

-sed -i "s/ClustER-NAme/${UNIQUE_ID}/g" $TEST_DIR/eval_data.yaml
+sed -i "s/uniq-cluster-name/${UNIQUE_ID}/g" $TEST_DIR/eval_data.yaml


💡 Verification agent

🧩 Analysis chain

Make sed replacement robust to special chars in UNIQUE_ID and quote paths.

If UNIQUE_ID contains “&” or the path has spaces, the current sed may misbehave. Quote variables and escape “&” in the replacement.

Apply:

-sed -i "s/uniq-cluster-name/${UNIQUE_ID}/g" $TEST_DIR/eval_data.yaml
+UNIQUE_ESCAPED=${UNIQUE_ID//&/\\&}
+sed -i "s|uniq-cluster-name|${UNIQUE_ESCAPED}|g" "$TEST_DIR/eval_data.yaml"

Alternatively, use perl for fully safe replacement:

perl -0777 -pe 's/\Quniq-cluster-name\E/$ENV{UNIQUE_ID}/g' "$TEST_DIR/eval_data.yaml" > "$TEMP_DIR/eval_data.patched" && mv "$TEMP_DIR/eval_data.patched" "$TEST_DIR/eval_data.yaml"

🏁 Script executed:

#!/bin/bash
# Dry-run preview of substitutions to ensure only the intended tokens change
rg -n "uniq-cluster-name" test/evals/eval_data.yaml

Length of output: 2417

Fix sed replacement to handle '&' in UNIQUE_ID and quote $TEST_DIR

File: test/prow/entrypoint.sh:23 — rg shows multiple occurrences of "uniq-cluster-name" in test/evals/eval_data.yaml; the current sed will mis-handle '&' in UNIQUE_ID and can break if $TEST_DIR contains spaces.

Apply:

-sed -i "s/uniq-cluster-name/${UNIQUE_ID}/g" $TEST_DIR/eval_data.yaml
+UNIQUE_ESCAPED=${UNIQUE_ID//&/\\&}
+sed -i "s|uniq-cluster-name|${UNIQUE_ESCAPED}|g" "$TEST_DIR/eval_data.yaml"

Alternatively, for a fully robust replacement (handles arbitrary chars), use perl:

perl -0777 -pe 's/\Quniq-cluster-name\E/$ENV{UNIQUE_ID}/g' "$TEST_DIR/eval_data.yaml" > "$TEMP_DIR/eval_data.patched" && mv "$TEMP_DIR/eval_data.patched" "$TEST_DIR/eval_data.yaml"

🤖 Prompt for AI Agents

In test/prow/entrypoint.sh around line 23, the sed invocation isn't robust: $TEST_DIR should be quoted to handle spaces, and sed's replacement will mis-handle '&' characters in $UNIQUE_ID. Fix by quoting "$TEST_DIR/eval_data.yaml" and either (a) escape any '&' (and sed delimiter chars) in $UNIQUE_ID before passing it to sed, or (b) replace the sed step with the suggested perl approach that uses \Q...\E (or $ENV{UNIQUE_ID}) to perform a safe, global replacement that tolerates arbitrary characters; ensure the temp-file->mv pattern is used to atomically update the file.
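
The escaping idea in option (a) can be sketched in portable shell as follows. This is an illustration under assumptions, not the change that was merged: the `UNIQUE_ID` value is a deliberately awkward made-up string, the one-line YAML file is a stand-in, and a newline inside `UNIQUE_ID` is not handled.

```sh
set -eu

UNIQUE_ID='a&b|c\d'   # hypothetical worst-case value with sed-special characters
workdir=$(mktemp -d)
printf 'name: eval-test-uniq-cluster-name\n' > "$workdir/eval_data.yaml"

# Escape backslash, ampersand and the chosen delimiter (|) for the replacement side.
escaped=$(printf '%s' "$UNIQUE_ID" | sed 's/[\\&|]/\\&/g')

# Write to a temp file and move it into place rather than relying on sed -i,
# whose syntax differs between GNU and BSD sed.
sed "s|uniq-cluster-name|${escaped}|g" "$workdir/eval_data.yaml" > "$workdir/eval_data.patched"
mv "$workdir/eval_data.patched" "$workdir/eval_data.yaml"

result=$(cat "$workdir/eval_data.yaml")
printf '%s\n' "$result"
rm -rf "$workdir"
```

With the escaping in place the special characters land in the file literally; without it, `&` would expand to the matched placeholder and `|` would end the replacement early.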

- Replaced "ClustER-NAme" with a valid cluster name to allow local eval tests run
- Removed `response_eval:sub-string` and `expected_keywords` because `response_eval:accuracy` already performs an exact match, making the substring check unnecessary.
- The cluster name in the `cluster_id_from_name` conversation was changed from `eval-test-uniq-cluster-name` to `eval-test2-uniq-cluster-name` to avoid potential name conflicts with other tests.
- Removed list_clusters from the expected tool calls in cluster_id_from_name since the cluster name and ID are part of the current conversation context

Signed-off-by: Eran Cohen <[email protected]>

maorfr · 2025-09-15T07:07:26Z

/lgtm

eranco74 · 2025-09-15T07:12:04Z

/retest required

openshift-ci · 2025-09-15T07:12:08Z

@eranco74: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test images

The following commands are available to trigger optional jobs:

/test eval-test

Use /test all to run all jobs.

In response to this:

/retest required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

eranco74 · 2025-09-15T07:18:07Z

/retest

openshift-ci bot requested review from carbonin and omertuc September 14, 2025 17:38

openshift-ci bot added the approved label Sep 14, 2025

coderabbitai bot reviewed Sep 14, 2025


eranco74 force-pushed the fix_eval branch from faa8c30 to 23221f2 Compare September 14, 2025 18:21

openshift-ci bot assigned maorfr Sep 15, 2025

openshift-ci bot added the lgtm label Sep 15, 2025

openshift-merge-bot bot merged commit ced2c4a into rh-ecosystem-edge:main Sep 15, 2025
7 checks passed

This was referenced Sep 15, 2025

eval tests copy eval_data to temp dir to avoid permission issues #207

Merged

an experiment for more stable eval tests #210

Closed

coderabbitai bot mentioned this pull request Sep 22, 2025

making the evaluation tests more stable #213

Merged

coderabbitai bot mentioned this pull request Oct 1, 2025

changing the prompt so the host_booted_but_not_discovered test would be more consistent #223

Merged

coderabbitai bot mentioned this pull request Oct 28, 2025

Make eval tests more reliable #239

Merged

coderabbitai bot mentioned this pull request Nov 11, 2025

Resolves asking for cluster ID when it's known. #249

Merged
