
an experiment for more stable eval tests #210

Closed
andrej1991 wants to merge 1 commit into rh-ecosystem-edge:main from andrej1991:stable_eval_test

Conversation

@andrej1991
Contributor

@andrej1991 andrej1991 commented Sep 17, 2025

Summary by CodeRabbit

  • Tests
    • Switched from keyword-based checks to exact-response (accuracy) evaluations for stricter assertions.
    • Tightened failure gate so any evaluation failure now fails the suite.
    • Removed flow that relied on cluster lookups by name; tests now use existing cluster listing/info flows.
    • Specified CPU architecture as x86_64 and set SSH key to an explicit empty value in relevant evaluations.
    • Standardized the static-networking message to point users to the assisted-installer web-based wizard.
    • Normalized punctuation in evaluation queries for consistency.

@openshift-ci openshift-ci bot requested review from carbonin and eranco74 September 17, 2025 13:36
@openshift-ci

openshift-ci bot commented Sep 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrej1991
Once this PR has been reviewed and has the lgtm label, please assign omertuc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Sep 17, 2025

Walkthrough

Switched several evals in test/evals/eval_data.yaml from keyword- or tool-call-based checks to exact/accuracy response_eval expectations, adjusted create_cluster tool-call arguments and eval queries (including trailing periods and cluster references), removed the cluster_id_from_name conversation group, and made the test runner in test/evals/eval.py fail on any non-zero count of failed evals.
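A rough sketch of the switch the walkthrough describes, assuming hypothetical field names such as eval_types (the actual eval_data.yaml schema may differ):

```yaml
# Before: pass if the response merely contains these keywords
- eval_id: create_cluster_basic
  eval_query: Create a cluster named 'demo'.
  expected_keywords: ["demo", "cluster"]

# After: judge the full response against an exact expected answer
- eval_id: create_cluster_basic
  eval_query: Create a cluster named 'demo'.
  eval_types:
    - "response_eval:accuracy"
  expected_response: "I have created a cluster with name demo."
```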

Changes

Cohort / File(s) Summary
Eval test data updates
test/evals/eval_data.yaml
Replaced expected_keywords/keyword expectations with exact expected_response and response_eval:accuracy; updated eval_query text (trailing periods, cluster references); changed create_cluster calls to include cpu_architecture: "x86_64" and ssh_public_key: ""; removed cluster_id_from_name conversation group and related tooling expectations; removed some expected_tool_calls and adjusted host-related evals.
Test runner threshold change
test/evals/eval.py
Tightened failure gate: treat any non-zero failed_evals_count as overall test failure (previously allowed up to 2 failures).
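A minimal sketch of the tightened gate, under the assumption (taken from the diffs later in this review) that the summary is a dict with "FAIL" and "ERROR" counts; the real eval.py wires the return value into sys.exit:

```python
def exit_code(result_summary: dict) -> int:
    # Any FAIL or ERROR now fails the suite (previously up to 2 were tolerated).
    failed_evals_count = result_summary.get("FAIL", 0) + result_summary.get("ERROR", 0)
    if failed_evals_count:
        print(f"❌ {failed_evals_count} evaluation(s) failed!")
        return 1
    print("🎉 All evaluations passed!")
    return 0
```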

Sequence Diagram(s)

sequenceDiagram
  participant T as Test Runner
  participant D as Eval Data
  participant R as Eval Executor
  participant Tools as Tooling (create_cluster, list_clusters, cluster_info)
  participant O as Exit

  rect #f3f9ff
    T->>D: Load updated eval definitions (`response_eval:accuracy`, updated queries)
    T->>R: Execute eval (send eval_query)
    R->>Tools: Optional tool calls (e.g., `list_clusters` -> `cluster_info`, `create_cluster` w/ cpu_architecture/ssh_public_key)
    Tools-->>R: Tool responses
    R-->>T: Eval result (pass / fail)
  end

  rect #fff7f0
    alt All evals passed
      T->>O: exit 0
    else Any eval failed
      T->>O: exit 1
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

approved, lgtm

Suggested reviewers

  • eranco74
  • carbonin

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "an experiment for more stable eval tests" is short, a single sentence, and directly related to the primary changes in the diff (updates to the eval YAML and tightening of the eval failure gate). It accurately summarizes the main intent of the changeset, avoids noisy file lists and vague terms, and is clear enough for a teammate scanning history, so it meets the PR title guidelines.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Cluster name mismatch will flake this convo.

The expected_response mentions eval-test-uniq-cluster-name but the created cluster is eval-test2-uniq-cluster-name. Align them.

Apply this diff:

-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)

127-127: Grammar nit: plural agreement.

Use “hosts haven't been discovered yet.”

Apply this diff:

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "hosts haven't been discovered yet."

183-183: Typo in eval_id.

Rename create_single_node_cluser to create_single_node_cluster for consistency.

Apply this diff:

-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cdf477 and 99e418f.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (3 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)

121-121: OK to reference “previously created cluster,” but confirm harness context.

Ensure the eval harness persists the just-created cluster context across turns in this conversation group so the following cluster_info call resolves correctly without an explicit name/ID.


189-196: Name-based query is fine; double-check tool-call path.

Since the query is by name, confirm the agent uses the stored cluster_id from the creation step (vs calling list_clusters to resolve the ID). The current expectation only checks cluster_info with a UUID, which is fine if memory is used.

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 3 times, most recently from 390f056 to b4736e8 Compare September 17, 2025 18:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Name mismatch and typo will break the scenario.

  • eval_id typo: “create_single_node_cluser”.
  • expected_response references eval-test-uniq-cluster-name but the query uses eval-test2-uniq-cluster-name.
-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
@@
-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
♻️ Duplicate comments (1)
test/evals/eval_data.yaml (1)

43-44: Fix capitalization/punctuation to avoid brittle accuracy match (repeat of prior note).

Use “Assisted Installer” (proper noun) and avoid the comma splice; otherwise the accuracy check will fail if the assistant returns the corrected wording used elsewhere.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)

121-128: Grammar in expected_response will cause false negatives with accuracy eval.

Plural subject requires plural verb.

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "Hosts haven't been discovered yet."

104-112: Fix duplicated words in SSH-key confirmation message.

Minor copy edit to remove “for cluster” duplication.

-      expected_response: The SSH public key is set for the cluster for cluster
+      expected_response: The SSH public key is set for the cluster.

189-196: Confirm expected tool-call path when requesting by name.

You expect a direct cluster_info call with cluster_id. If the agent resolves name→id via list_clusters, this test will fail. Either make the tool-call expectation tolerant (allow list_clusters then cluster_info) or confirm the agent caches the ID from the earlier create step.

If the harness supports multi-path expectations, consider documenting both sequences; otherwise, keep as-is but ensure the agent uses cached ID.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 390f056 and b4736e8.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (4 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)

91-91: Confirm create_cluster schema exposes cpu_architecture

Nice: explicitly stating CPU architecture in the eval query. Repo search shows cpu_architecture only in test/evals/eval_data.yaml (expected_tool_calls); I could not find a create_cluster tool schema/handler exposing this parameter. Confirm create_cluster accepts cpu_architecture or add it to the tool schema/handler.


99-100: Verify acceptance of empty ssh_public_key (keep, omit, or null)

test/evals/eval_data.yaml contains ssh_public_key: "" at line 100 while other entries expect an ssh-rsa pattern (lines 74, 110). I found no validation handler in the repo — confirm whether the tool accepts an empty string.

Options:

  • Keep as empty string if schema/handler explicitly allows it (confirm).
  • Remove the field from expected_tool_calls:
-              ssh_public_key: ""
  • Or switch to a nullable representation:
-              ssh_public_key: ""
+              ssh_public_key: null

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 2 times, most recently from 2eb582d to 79fb045 Compare September 18, 2025 11:02

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Cluster name mismatch will cause false negatives.

The eval creates 'eval-test2-uniq-cluster-name' (Line 184) but expected_response mentions 'eval-test-uniq-cluster-name' (Line 187). This will fail an accuracy check.

-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)

43-44: Prefer canonical phrasing and casing for exact-match stability.

Current sentence uses a comma splice and non-canonical casing for "Assisted Installer." If your model responses elsewhere use the canonical phrasing, this test will become brittle. Consider tightening the text.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."

121-127: Anchor the reference by name and fix minor grammar in expected response.

Referring to “the cluster you just created” relies on conversational state; naming the cluster increases determinism. Also, “hosts hasn't” → “hosts haven't”.

-      eval_query: I booted the hosts for the cluster you just created
+      eval_query: I booted the hosts for the cluster 'eval-test-multinode-uniq-cluster-name'
@@
-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "hosts haven't been discovered yet."

183-183: Typo in eval_id.

Rename create_single_node_cluser to create_single_node_cluster for consistency.

Be sure no external references rely on the old ID before renaming.

-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb582d and 79fb045.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (4 hunks)
🔇 Additional comments (3)
test/evals/eval_data.yaml (3)

91-91: LGTM: clearer, punctuated eval_query.

The added period reduces ambiguity for exact-match checks.


189-193: LGTM: prepend list_clusters for name→ID resolution.

This makes the flow robust before calling cluster_info.

Ensure list_clusters returns name and id so the harness can map the correct cluster deterministically.


99-100: Remove empty ssh_public_key from expected create_cluster args (or confirm adapter sends empty string).

test/evals/eval_data.yaml (line 100) contains ssh_public_key: ""; if the adapter omits the field instead of sending an empty string the eval will fail — remove that key from the expected_tool_calls or make the adapter forward "" exactly.

               cpu_architecture: "x86_64"
-              ssh_public_key: ""

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 3 times, most recently from 9db80d6 to 54e062e Compare September 18, 2025 13:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
test/evals/eval.py (3)

91-93: Fail on any non‑zero failures — good; make summary access defensive and print breakdown.

Use .get() with defaults to avoid KeyError if the summary shape changes and include FAIL/ERROR counts for quick debugging.

- failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
- if failed_evals_count:
-     print(f"❌ {failed_evals_count} evaluation(s) failed!")
+ failed = int(result_summary.get("FAIL", 0) or 0)
+ errors = int(result_summary.get("ERROR", 0) or 0)
+ failed_evals_count = failed + errors
+ if failed_evals_count > 0:
+     print(f"❌ {failed_evals_count} evaluation(s) failed (FAIL={failed}, ERROR={errors}).")
      sys.exit(1)

61-64: Help text mismatch for --judge_model default.

The help says “default: gemini-2.0-flash” but the actual default is “gemini-2.5-flash”.

-        help="Judge model for LLM evaluation (default: gemini-2.0-flash)",
+        help="Judge model for LLM evaluation (default: gemini-2.5-flash)",

81-97: Add a main guard to make the module import‑safe and more testable.

Prevents side effects on import and allows reuse from other scripts/tests.

-# Parse command line arguments
-args = parse_args()
-
-evaluator = AgentGoalEval(args)
-# Run Evaluation
-evaluator.run_evaluation()
-# Get result summary
-result_summary = evaluator.get_result_summary()
-
-failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
-if failed_evals_count:
-    print(f"❌ {failed_evals_count} evaluation(s) failed!")
-    sys.exit(1)
-
-print("🎉 All evaluations passed!")
-sys.exit(0)
+def main() -> int:
+    # Parse command line arguments
+    args = parse_args()
+    evaluator = AgentGoalEval(args)
+    # Run Evaluation
+    evaluator.run_evaluation()
+    # Get result summary
+    result_summary = evaluator.get_result_summary()
+    failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
+    if failed_evals_count:
+        print(f"❌ {failed_evals_count} evaluation(s) failed!")
+        return 1
+    print("🎉 All evaluations passed!")
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9db80d6 and 54e062e.

📒 Files selected for processing (2)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/evals/eval_data.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request

@andrej1991
Contributor Author

/test eval-test

1 similar comment
@andrej1991
Contributor Author

/test eval-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)

43-44: Tighten exact-match phrasing; fix capitalization and comma splice for stability.

For accuracy checks, minor wording changes will fail the eval. Use “Assisted Installer” (consistent with the rest of the file) and split into two sentences.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."

121-124: Fix grammar in exact-match response.

Use plural agreement and add article to avoid false negatives in accuracy mode.

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "The hosts have not been discovered yet."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54e062e and d40e7b7.

📒 Files selected for processing (2)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/evals/eval.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request

Comment on lines +91 to 101
      eval_query: Create a multi-node cluster named 'eval-test-multinode-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local.
      expected_tool_calls:
        - tool_name: create_cluster
          arguments:
            name: "eval-test-multinode-uniq-cluster-name"
            version: "4\\.18\\.22"
            base_domain: "test\\.local"
            single_node: "(?i:false)"
-           cpu_architecture: None
-           ssh_public_key: None
+           cpu_architecture: "x86_64"
+           ssh_public_key: ""
      expected_keywords: ["eval-test-multinode-uniq-cluster-name", "ID", "Discovery ISO", "cluster"]


🛠️ Refactor suggestion

🧩 Analysis chain

Make expected tool args tolerant of optional SSH key; current empty-string match is brittle.

The agent may omit ssh_public_key entirely or set it to null. Matching an empty string only will cause flaky failures. Prefer a permissive regex.

-              ssh_public_key: ""
+              ssh_public_key: ".*"

If the harness supports partial argument matching, consider omitting ssh_public_key from expectations entirely.


🏁 Script executed:

#!/usr/bin/env bash
# Check for consistency of SSH key expectations across evals
rg -nC1 'ssh_public_key:' test/evals/eval_data.yaml

Length of output: 687


Make ssh_public_key expectation tolerant and consistent across evals.

Empty-string at test/evals/eval_data.yaml (lines 99-101) is brittle; other entries (lines 73-75 and 109-111) use an ssh-rsa regex. Replace the empty-string with a pattern that accepts either a valid ssh key or nothing:

-              ssh_public_key: ""
+              ssh_public_key: '^(?:ssh-rsa\s+[A-Za-z0-9+/]+[=]{0,3}(?:\s+.+)?\s*)?$'

Or, if the harness supports partial matching, omit ssh_public_key from expectations entirely.
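To sanity-check the tolerant pattern suggested above, a small standalone snippet (the compiled pattern mirrors the proposed regex; how the eval harness actually anchors its matches is an assumption):

```python
import re

# Proposed tolerant pattern: an empty value OR an ssh-rsa key with optional comment.
SSH_KEY_OK = re.compile(r"^(?:ssh-rsa\s+[A-Za-z0-9+/]+={0,3}(?:\s+.+)?\s*)?$")

assert SSH_KEY_OK.match("")  # empty string accepted
assert SSH_KEY_OK.match("ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB user@host")
assert not SSH_KEY_OK.match("not-a-key")  # junk rejected
```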

🤖 Prompt for AI Agents
In test/evals/eval_data.yaml around lines 91 to 101, the expected ssh_public_key
is an empty string which is brittle; replace the empty-string expectation with a
tolerant regex that accepts either nothing or a valid SSH public key (e.g. set
ssh_public_key to something like "(^$|^ssh-(rsa|ed25519) [A-Za-z0-9+/=]+(?:
.*)?$)") so it matches an empty value or common ssh-rsa/ssh-ed25519 keys;
alternatively, if the test harness allows partial matching, remove the
ssh_public_key key from the expected arguments entirely.

@andrej1991
Contributor Author

/test eval-test

5 similar comments
@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@openshift-ci

openshift-ci bot commented Sep 19, 2025

@andrej1991: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          Commit   Details  Required  Rerun command
ci/prow/eval-test  d40e7b7  link     false     /test eval-test

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@andrej1991 andrej1991 closed this Sep 19, 2025
