
an experiment for more stable eval tests #210

Closed
andrej1991 wants to merge 1 commit into rh-ecosystem-edge:main from andrej1991:stable_eval_test

Conversation

@andrej1991
Contributor

@andrej1991 andrej1991 commented Sep 17, 2025

Summary by CodeRabbit

  • Tests
    • Switched from keyword-based checks to exact-response (accuracy) evaluations for stricter assertions.
    • Tightened failure gate so any evaluation failure now fails the suite.
    • Removed flow that relied on cluster lookups by name; tests now use existing cluster listing/info flows.
    • Specified CPU architecture as x86_64 and set SSH key to an explicit empty value in relevant evaluations.
    • Standardized the static-networking message to point users to the assisted-installer web-based wizard.
    • Normalized punctuation in evaluation queries for consistency.

@openshift-ci openshift-ci bot requested review from carbonin and eranco74 September 17, 2025 13:36
@openshift-ci

openshift-ci bot commented Sep 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrej1991
Once this PR has been reviewed and has the lgtm label, please assign omertuc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Sep 17, 2025

Walkthrough

Switched several evals in test/evals/eval_data.yaml from keyword- or tool-call-based checks to exact/accuracy response_eval expectations, adjusted create_cluster tool-call arguments and eval queries (including trailing periods and cluster references), removed the cluster_id_from_name conversation group, and made the test runner in test/evals/eval.py fail on any non-zero count of failed evals.
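A rough sketch of the switch the walkthrough describes, assuming hypothetical field names such as eval_types (the actual eval_data.yaml schema may differ):

```yaml
# Before: pass if the response merely contains these keywords
- eval_id: create_cluster_basic
  eval_query: Create a cluster named 'demo'.
  expected_keywords: ["demo", "cluster"]

# After: judge the full response against an exact expected answer
- eval_id: create_cluster_basic
  eval_query: Create a cluster named 'demo'.
  eval_types:
    - "response_eval:accuracy"
  expected_response: "I have created a cluster with name demo."
```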

Changes

Cohort / File(s) Summary
Eval test data updates
test/evals/eval_data.yaml
Replaced expected_keywords/keyword expectations with exact expected_response and response_eval:accuracy; updated eval_query text (trailing periods, cluster references); changed create_cluster calls to include cpu_architecture: "x86_64" and ssh_public_key: ""; removed cluster_id_from_name conversation group and related tooling expectations; removed some expected_tool_calls and adjusted host-related evals.
Test runner threshold change
test/evals/eval.py
Tightened failure gate: treat any non-zero failed_evals_count as overall test failure (previously allowed up to 2 failures).
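A minimal sketch of the tightened gate, under the assumption (taken from the diffs later in this review) that the summary is a dict with "FAIL" and "ERROR" counts; the real eval.py wires the return value into sys.exit:

```python
def exit_code(result_summary: dict) -> int:
    # Any FAIL or ERROR now fails the suite (previously up to 2 were tolerated).
    failed_evals_count = result_summary.get("FAIL", 0) + result_summary.get("ERROR", 0)
    if failed_evals_count:
        print(f"❌ {failed_evals_count} evaluation(s) failed!")
        return 1
    print("🎉 All evaluations passed!")
    return 0
```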

Sequence Diagram(s)

sequenceDiagram
  participant T as Test Runner
  participant D as Eval Data
  participant R as Eval Executor
  participant Tools as Tooling (create_cluster, list_clusters, cluster_info)
  participant O as Exit

  rect #f3f9ff
    T->>D: Load updated eval definitions (`response_eval:accuracy`, updated queries)
    T->>R: Execute eval (send eval_query)
    R->>Tools: Optional tool calls (e.g., `list_clusters` -> `cluster_info`, `create_cluster` w/ cpu_architecture/ssh_public_key)
    Tools-->>R: Tool responses
    R-->>T: Eval result (pass / fail)
  end

  rect #fff7f0
    alt All evals passed
      T->>O: exit 0
    else Any eval failed
      T->>O: exit 1
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

approved, lgtm

Suggested reviewers

  • eranco74
  • carbonin

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "an experiment for more stable eval tests" is short, a single sentence, and directly related to the primary changes in the diff (updates to the eval YAML and tightening of the eval failure gate). It accurately summarizes the main intent of the changeset, avoids noisy file lists and vague terms, and is clear enough for a teammate scanning history, so it meets the PR title guidelines.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Cluster name mismatch will flake this convo.

The expected_response mentions eval-test-uniq-cluster-name but the created cluster is eval-test2-uniq-cluster-name. Align them.

Apply this diff:

-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)

127-127: Grammar nit: plural agreement.

Use “hosts haven't been discovered yet.”

Apply this diff:

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "hosts haven't been discovered yet."

183-183: Typo in eval_id.

Rename create_single_node_cluser to create_single_node_cluster for consistency.

Apply this diff:

-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cdf477 and 99e418f.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (3 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)

121-121: OK to reference “previously created cluster,” but confirm harness context.

Ensure the eval harness persists the just-created cluster context across turns in this conversation group so the following cluster_info call resolves correctly without an explicit name/ID.


189-196: Name-based query is fine; double-check tool-call path.

Since the query is by name, confirm the agent uses the stored cluster_id from the creation step (vs calling list_clusters to resolve the ID). The current expectation only checks cluster_info with a UUID, which is fine if memory is used.

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 3 times, most recently from 390f056 to b4736e8 Compare September 17, 2025 18:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Name mismatch and typo will break the scenario.

  • eval_id typo: “create_single_node_cluser”.
  • expected_response references eval-test-uniq-cluster-name but the query uses eval-test2-uniq-cluster-name.
-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
@@
-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
♻️ Duplicate comments (1)
test/evals/eval_data.yaml (1)

43-44: Fix capitalization/punctuation to avoid brittle accuracy match (repeat of prior note).

Use “Assisted Installer” (proper noun) and avoid the comma splice; otherwise the accuracy check will fail if the assistant returns the corrected wording used elsewhere.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)

121-128: Grammar in expected_response will cause false negatives with accuracy eval.

Plural subject requires plural verb.

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "Hosts haven't been discovered yet."

104-112: Fix duplicated words in SSH-key confirmation message.

Minor copy edit to remove “for cluster” duplication.

-      expected_response: The SSH public key is set for the cluster for cluster
+      expected_response: The SSH public key is set for the cluster.

189-196: Confirm expected tool-call path when requesting by name.

You expect a direct cluster_info call with cluster_id. If the agent resolves name→id via list_clusters, this test will fail. Either make the tool-call expectation tolerant (allow list_clusters then cluster_info) or confirm the agent caches the ID from the earlier create step.

If the harness supports multi-path expectations, consider documenting both sequences; otherwise, keep as-is but ensure the agent uses cached ID.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 390f056 and b4736e8.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (4 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)

91-91: Confirm create_cluster schema exposes cpu_architecture

Nice: explicitly stating CPU architecture in the eval query. Repo search shows cpu_architecture only in test/evals/eval_data.yaml (expected_tool_calls); I could not find a create_cluster tool schema/handler exposing this parameter. Confirm create_cluster accepts cpu_architecture or add it to the tool schema/handler.


99-100: Verify acceptance of empty ssh_public_key (keep, omit, or null)

test/evals/eval_data.yaml contains ssh_public_key: "" at line 100 while other entries expect an ssh-rsa pattern (lines 74, 110). I found no validation handler in the repo — confirm whether the tool accepts an empty string.

Options:

  • Keep as empty string if schema/handler explicitly allows it (confirm).
  • Remove the field from expected_tool_calls:
-              ssh_public_key: ""
  • Or switch to a nullable representation:
-              ssh_public_key: ""
+              ssh_public_key: null

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 2 times, most recently from 2eb582d to 79fb045 Compare September 18, 2025 11:02

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)

183-188: Cluster name mismatch will cause false negatives.

The eval creates 'eval-test2-uniq-cluster-name' (Line 184) but expected_response mentions 'eval-test-uniq-cluster-name' (Line 187). This will fail an accuracy check.

-      expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
+      expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)

43-44: Prefer canonical phrasing and casing for exact-match stability.

Current sentence uses a comma splice and non-canonical casing for "Assisted Installer." If your model responses elsewhere use the canonical phrasing, this test will become brittle. Consider tightening the text.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."

121-127: Anchor the reference by name and fix minor grammar in expected response.

Referring to “the cluster you just created” relies on conversational state; naming the cluster increases determinism. Also, “hosts hasn't” → “hosts haven't”.

-      eval_query: I booted the hosts for the cluster you just created
+      eval_query: I booted the hosts for the cluster 'eval-test-multinode-uniq-cluster-name'
@@
-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "hosts haven't been discovered yet."

183-183: Typo in eval_id.

Rename create_single_node_cluser to create_single_node_cluster for consistency.

Be sure no external references rely on the old ID before renaming.

-    - eval_id: create_single_node_cluser
+    - eval_id: create_single_node_cluster
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2eb582d and 79fb045.

📒 Files selected for processing (1)
  • test/evals/eval_data.yaml (4 hunks)
🔇 Additional comments (3)
test/evals/eval_data.yaml (3)

91-91: LGTM: clearer, punctuated eval_query.

The added period reduces ambiguity for exact-match checks.


189-193: LGTM: prepend list_clusters for name→ID resolution.

This makes the flow robust before calling cluster_info.

Ensure list_clusters returns name and id so the harness can map the correct cluster deterministically.


99-100: Remove empty ssh_public_key from expected create_cluster args (or confirm adapter sends empty string).

test/evals/eval_data.yaml (line 100) contains ssh_public_key: ""; if the adapter omits the field instead of sending an empty string the eval will fail — remove that key from the expected_tool_calls or make the adapter forward "" exactly.

               cpu_architecture: "x86_64"
-              ssh_public_key: ""

@andrej1991 andrej1991 force-pushed the stable_eval_test branch 3 times, most recently from 9db80d6 to 54e062e Compare September 18, 2025 13:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
test/evals/eval.py (3)

91-93: Fail on any non‑zero failures — good; make summary access defensive and print breakdown.

Use .get() with defaults to avoid KeyError if the summary shape changes and include FAIL/ERROR counts for quick debugging.

- failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
- if failed_evals_count:
-     print(f"❌ {failed_evals_count} evaluation(s) failed!")
+ failed = int(result_summary.get("FAIL", 0) or 0)
+ errors = int(result_summary.get("ERROR", 0) or 0)
+ failed_evals_count = failed + errors
+ if failed_evals_count > 0:
+     print(f"❌ {failed_evals_count} evaluation(s) failed (FAIL={failed}, ERROR={errors}).")
      sys.exit(1)

61-64: Help text mismatch for --judge_model default.

The help says “default: gemini-2.0-flash” but the actual default is “gemini-2.5-flash”.

-        help="Judge model for LLM evaluation (default: gemini-2.0-flash)",
+        help="Judge model for LLM evaluation (default: gemini-2.5-flash)",

81-97: Add a main guard to make the module import‑safe and more testable.

Prevents side effects on import and allows reuse from other scripts/tests.

-# Parse command line arguments
-args = parse_args()
-
-evaluator = AgentGoalEval(args)
-# Run Evaluation
-evaluator.run_evaluation()
-# Get result summary
-result_summary = evaluator.get_result_summary()
-
-failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
-if failed_evals_count:
-    print(f"❌ {failed_evals_count} evaluation(s) failed!")
-    sys.exit(1)
-
-print("🎉 All evaluations passed!")
-sys.exit(0)
+def main() -> int:
+    # Parse command line arguments
+    args = parse_args()
+    evaluator = AgentGoalEval(args)
+    # Run Evaluation
+    evaluator.run_evaluation()
+    # Get result summary
+    result_summary = evaluator.get_result_summary()
+    failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
+    if failed_evals_count:
+        print(f"❌ {failed_evals_count} evaluation(s) failed!")
+        return 1
+    print("🎉 All evaluations passed!")
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9db80d6 and 54e062e.

📒 Files selected for processing (2)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/evals/eval_data.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request

@andrej1991
Contributor Author

/test eval-test

1 similar comment
@andrej1991
Contributor Author

/test eval-test


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)

43-44: Tighten exact-match phrasing; fix capitalization and comma splice for stability.

For accuracy checks, minor wording changes will fail the eval. Use “Assisted Installer” (consistent with the rest of the file) and split into two sentences.

-      expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."
+      expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."

121-124: Fix grammar in exact-match response.

Use plural agreement and add article to avoid false negatives in accuracy mode.

-      expected_response: "hosts hasn't been discovered yet."
+      expected_response: "The hosts have not been discovered yet."
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54e062e and d40e7b7.

📒 Files selected for processing (2)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/evals/eval.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request

Comment on lines +91 to 101
      eval_query: Create a multi-node cluster named 'eval-test-multinode-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local.
      expected_tool_calls:
        - tool_name: create_cluster
          arguments:
            name: "eval-test-multinode-uniq-cluster-name"
            version: "4\\.18\\.22"
            base_domain: "test\\.local"
            single_node: "(?i:false)"
-           cpu_architecture: None
-           ssh_public_key: None
+           cpu_architecture: "x86_64"
+           ssh_public_key: ""
      expected_keywords: ["eval-test-multinode-uniq-cluster-name", "ID", "Discovery ISO", "cluster"]


🛠️ Refactor suggestion

🧩 Analysis chain

Make expected tool args tolerant of optional SSH key; current empty-string match is brittle.

The agent may omit ssh_public_key entirely or set it to null. Matching an empty string only will cause flaky failures. Prefer a permissive regex.

-              ssh_public_key: ""
+              ssh_public_key: ".*"

If the harness supports partial argument matching, consider omitting ssh_public_key from expectations entirely.


🏁 Script executed:

#!/usr/bin/env bash
# Check for consistency of SSH key expectations across evals
rg -nC1 'ssh_public_key:' test/evals/eval_data.yaml

Length of output: 687


Make ssh_public_key expectation tolerant and consistent across evals.

Empty-string at test/evals/eval_data.yaml (lines 99-101) is brittle; other entries (lines 73-75 and 109-111) use an ssh-rsa regex. Replace the empty-string with a pattern that accepts either a valid ssh key or nothing:

-              ssh_public_key: ""
+              ssh_public_key: '^(?:ssh-rsa\s+[A-Za-z0-9+/]+[=]{0,3}(?:\s+.+)?\s*)?$'

Or, if the harness supports partial matching, omit ssh_public_key from expectations entirely.
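To sanity-check the tolerant pattern suggested above, a small standalone snippet (the compiled pattern mirrors the proposed regex; how the eval harness actually anchors its matches is an assumption):

```python
import re

# Proposed tolerant pattern: an empty value OR an ssh-rsa key with optional comment.
SSH_KEY_OK = re.compile(r"^(?:ssh-rsa\s+[A-Za-z0-9+/]+={0,3}(?:\s+.+)?\s*)?$")

assert SSH_KEY_OK.match("")  # empty string accepted
assert SSH_KEY_OK.match("ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB user@host")
assert not SSH_KEY_OK.match("not-a-key")  # junk rejected
```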

🤖 Prompt for AI Agents
In test/evals/eval_data.yaml around lines 91 to 101, the expected ssh_public_key
is an empty string which is brittle; replace the empty-string expectation with a
tolerant regex that accepts either nothing or a valid SSH public key (e.g. set
ssh_public_key to something like "(^$|^ssh-(rsa|ed25519) [A-Za-z0-9+/=]+(?:
.*)?$)") so it matches an empty value or common ssh-rsa/ssh-ed25519 keys;
alternatively, if the test harness allows partial matching, remove the
ssh_public_key key from the expected arguments entirely.

@andrej1991
Contributor Author

/test eval-test

5 similar comments
@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@andrej1991
Contributor Author

/test eval-test

@openshift-ci

openshift-ci bot commented Sep 19, 2025

@andrej1991: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          Commit   Details  Required  Rerun command
ci/prow/eval-test  d40e7b7  link     false     /test eval-test

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@andrej1991 andrej1991 closed this Sep 19, 2025
