an experiment for more stable eval tests#210
an experiment for more stable eval tests#210andrej1991 wants to merge 1 commit intorh-ecosystem-edge:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andrej1991 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughSwitched several evals in Changes
Sequence Diagram(s)sequenceDiagram
participant T as Test Runner
participant D as Eval Data
participant R as Eval Executor
participant Tools as Tooling (create_cluster, list_clusters, cluster_info)
participant O as Exit
rect #f3f9ff
T->>D: Load updated eval definitions (`response_eval:accuracy`, updated queries)
T->>R: Execute eval (send eval_query)
R->>Tools: Optional tool calls (e.g., `list_clusters` -> `cluster_info`, `create_cluster` w/ cpu_architecture/ssh_public_key)
Tools-->>R: Tool responses
R-->>T: Eval result (pass / fail)
end
rect #fff7f0
alt All evals passed
T->>O: exit 0
else Any eval failed
T->>O: exit 1
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)
183-188: Cluster name mismatch will flake this convo.The expected_response mentions
eval-test-uniq-cluster-namebut the created cluster iseval-test2-uniq-cluster-name. Align them.Apply this diff:
- expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL? + expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)
127-127: Grammar nit: plural agreement.Use “hosts haven't been discovered yet.”
Apply this diff:
- expected_response: "hosts hasn't been discovered yet." + expected_response: "hosts haven't been discovered yet."
183-183: Typo in eval_id.Rename
create_single_node_clusertocreate_single_node_clusterfor consistency.Apply this diff:
- - eval_id: create_single_node_cluser + - eval_id: create_single_node_cluster
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test/evals/eval_data.yaml(3 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)
121-121: OK to reference “previously created cluster,” but confirm harness context.Ensure the eval harness persists the just-created cluster context across turns in this conversation group so the following
cluster_infocall resolves correctly without an explicit name/ID.
189-196: Name-based query is fine; double-check tool-call path.Since the query is by name, confirm the agent uses the stored cluster_id from the creation step (vs calling
list_clustersto resolve the ID). The current expectation only checkscluster_infowith a UUID, which is fine if memory is used.
390f056 to
b4736e8
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)
183-188: Name mismatch and typo will break the scenario.
- eval_id typo: “create_single_node_cluser”.
- expected_response references eval-test-uniq-cluster-name but the query uses eval-test2-uniq-cluster-name.
- - eval_id: create_single_node_cluser + - eval_id: create_single_node_cluster @@ - expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL? + expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
♻️ Duplicate comments (1)
test/evals/eval_data.yaml (1)
43-44: Fix capitalization/punctuation to avoid brittle accuracy match (repeat of prior note).Use “Assisted Installer” (proper noun) and avoid the comma splice; otherwise the accuracy check will fail if the assistant returns the corrected wording used elsewhere.
- expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking." + expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)
121-128: Grammar in expected_response will cause false negatives with accuracy eval.Plural subject requires plural verb.
- expected_response: "hosts hasn't been discovered yet." + expected_response: "Hosts haven't been discovered yet."
104-112: Fix duplicated words in SSH-key confirmation message.Minor copy edit to remove “for cluster” duplication.
- expected_response: The SSH public key is set for the cluster for cluster + expected_response: The SSH public key is set for the cluster.
189-196: Confirm expected tool-call path when requesting by name.You expect a direct cluster_info call with cluster_id. If the agent resolves name→id via list_clusters, this test will fail. Either make the tool-call expectation tolerant (allow list_clusters then cluster_info) or confirm the agent caches the ID from the earlier create step.
If the harness supports multi-path expectations, consider documenting both sequences; otherwise, keep as-is but ensure the agent uses cached ID.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test/evals/eval_data.yaml(4 hunks)
🔇 Additional comments (2)
test/evals/eval_data.yaml (2)
91-91: Confirm create_cluster schema exposes cpu_architectureNice: explicitly stating CPU architecture in the eval query. Repo search shows cpu_architecture only in test/evals/eval_data.yaml (expected_tool_calls); I could not find a create_cluster tool schema/handler exposing this parameter. Confirm create_cluster accepts cpu_architecture or add it to the tool schema/handler.
99-100: Verify acceptance of empty ssh_public_key (keep, omit, or null)test/evals/eval_data.yaml contains ssh_public_key: "" at line 100 while other entries expect an ssh-rsa pattern (lines 74, 110). I found no validation handler in the repo — confirm whether the tool accepts an empty string.
Options:
- Keep as empty string if schema/handler explicitly allows it (confirm).
- Remove the field from expected_tool_calls:
- ssh_public_key: ""
- Or switch to a nullable representation:
- ssh_public_key: "" + ssh_public_key: null
2eb582d to
79fb045
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/evals/eval_data.yaml (1)
183-188: Cluster name mismatch will cause false negatives.The eval creates 'eval-test2-uniq-cluster-name' (Line 184) but expected_response mentions 'eval-test-uniq-cluster-name' (Line 187). This will fail an accuracy check.
- expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL? + expected_response: I have created a cluster with name eval-test2-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (3)
43-44: Prefer canonical phrasing and casing for exact-match stability.Current sentence uses a comma splice and non-canonical casing for "Assisted Installer." If your model responses elsewhere use the canonical phrasing, this test will become brittle. Consider tightening the text.
- expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking." + expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."
121-127: Anchor the reference by name and fix minor grammar in expected response.Referring to “the cluster you just created” relies on conversational state; naming the cluster increases determinism. Also, “hosts hasn't” → “hosts haven't”.
- eval_query: I booted the hosts for the cluster you just created + eval_query: I booted the hosts for the cluster 'eval-test-multinode-uniq-cluster-name' @@ - expected_response: "hosts hasn't been discovered yet." + expected_response: "hosts haven't been discovered yet."
183-183: Typo in eval_id.Rename
create_single_node_cluser→create_single_node_clusterfor consistency.Be sure no external references rely on the old ID before renaming.
- - eval_id: create_single_node_cluser + - eval_id: create_single_node_cluster
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
test/evals/eval_data.yaml(4 hunks)
🔇 Additional comments (3)
test/evals/eval_data.yaml (3)
91-91: LGTM: clearer, punctuated eval_query.The added period reduces ambiguity for exact-match checks.
189-193: LGTM: prepend list_clusters for name→ID resolution.This makes the flow robust before calling cluster_info.
Ensure list_clusters returns name and id so the harness can map the correct cluster deterministically.
99-100: Remove empty ssh_public_key from expected create_cluster args (or confirm adapter sends empty string).test/evals/eval_data.yaml (line 100) contains
ssh_public_key: ""; if the adapter omits the field instead of sending an empty string the eval will fail — remove that key from the expected_tool_calls or make the adapter forward""exactly.cpu_architecture: "x86_64" - ssh_public_key: ""
9db80d6 to
54e062e
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
test/evals/eval.py (3)
91-93: Fail on any non‑zero failures — good; make summary access defensive and print breakdown.Use .get() with defaults to avoid KeyError if the summary shape changes and include FAIL/ERROR counts for quick debugging.
- failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"] - if failed_evals_count: - print(f"❌ {failed_evals_count} evaluation(s) failed!") + failed = int(result_summary.get("FAIL", 0) or 0) + errors = int(result_summary.get("ERROR", 0) or 0) + failed_evals_count = failed + errors + if failed_evals_count > 0: + print(f"❌ {failed_evals_count} evaluation(s) failed (FAIL={failed}, ERROR={errors}).") sys.exit(1)
61-64: Help text mismatch for --judge_model default.The help says “default: gemini-2.0-flash” but the actual default is “gemini-2.5-flash”.
- help="Judge model for LLM evaluation (default: gemini-2.0-flash)", + help="Judge model for LLM evaluation (default: gemini-2.5-flash)",
81-97: Add a main guard to make the module import‑safe and more testable.Prevents side effects on import and allows reuse from other scripts/tests.
-# Parse command line arguments -args = parse_args() - -evaluator = AgentGoalEval(args) -# Run Evaluation -evaluator.run_evaluation() -# Get result summary -result_summary = evaluator.get_result_summary() - -failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"] -if failed_evals_count: - print(f"❌ {failed_evals_count} evaluation(s) failed!") - sys.exit(1) - -print("🎉 All evaluations passed!") -sys.exit(0) +def main() -> int: + # Parse command line arguments + args = parse_args() + evaluator = AgentGoalEval(args) + # Run Evaluation + evaluator.run_evaluation() + # Get result summary + result_summary = evaluator.get_result_summary() + failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"] + if failed_evals_count: + print(f"❌ {failed_evals_count} evaluation(s) failed!") + return 1 + print("🎉 All evaluations passed!") + return 0 + +if __name__ == "__main__": + sys.exit(main())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
test/evals/eval.py(1 hunks)test/evals/eval_data.yaml(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/evals/eval_data.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
- GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request
|
/test eval-test |
1 similar comment
|
/test eval-test |
54e062e to
d40e7b7
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/evals/eval_data.yaml (2)
43-44: Tighten exact-match phrasing; fix capitalization and comma splice for stability.For accuracy checks, minor wording changes will fail the eval. Use “Assisted Installer” (consistent with the rest of the file) and split into two sentences.
- expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking." + expected_response: "I do not support creating clusters with static networking. However, you can use the Assisted Installer web-based wizard to configure a cluster with static networking."
121-124: Fix grammar in exact-match response.Use plural agreement and add article to avoid false negatives in accuracy mode.
- expected_response: "hosts hasn't been discovered yet." + expected_response: "The hosts have not been discovered yet."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
test/evals/eval.py(1 hunks)test/evals/eval_data.yaml(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/evals/eval.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Red Hat Konflux / assisted-chat-saas-main-on-pull-request
- GitHub Check: Red Hat Konflux / assisted-chat-test-image-saas-main-on-pull-request
| eval_query: Create a multi-node cluster named 'eval-test-multinode-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local. | ||
| expected_tool_calls: | ||
| - - tool_name: create_cluster | ||
| arguments: | ||
| name: "eval-test-multinode-uniq-cluster-name" | ||
| version: "4\\.18\\.22" | ||
| base_domain: "test\\.local" | ||
| single_node: "(?i:false)" | ||
| cpu_architecture: None | ||
| ssh_public_key: None | ||
| cpu_architecture: "x86_64" | ||
| ssh_public_key: "" | ||
| expected_keywords: ["eval-test-multinode-uniq-cluster-name", "ID", "Discovery ISO", "cluster"] |
There was a problem hiding this comment.
🛠️ Refactor suggestion
🧩 Analysis chain
Make expected tool args tolerant of optional SSH key; current empty-string match is brittle.
The agent may omit ssh_public_key entirely or set it to null. Matching an empty string only will cause flaky failures. Prefer a permissive regex.
- ssh_public_key: ""
+ ssh_public_key: ".*"If the harness supports partial argument matching, consider omitting ssh_public_key from expectations entirely.
🏁 Script executed:
#!/usr/bin/env bash
# Check for consistency of SSH key expectations across evals
rg -nC1 'ssh_public_key:' test/evals/eval_data.yamlLength of output: 687
Make ssh_public_key expectation tolerant and consistent across evals.
Empty-string at test/evals/eval_data.yaml (lines 99-101) is brittle; other entries (lines 73-75 and 109-111) use an ssh-rsa regex. Replace the empty-string with a pattern that accepts either a valid ssh key or nothing:
- ssh_public_key: ""
+ ssh_public_key: '^(?:ssh-rsa\s+[A-Za-z0-9+/]+[=]{0,3}(?:\s+.+)?\s*)?$'Or, if the harness supports partial matching, omit ssh_public_key from expectations entirely.
🤖 Prompt for AI Agents
In test/evals/eval_data.yaml around lines 91 to 101, the expected ssh_public_key
is an empty string which is brittle; replace the empty-string expectation with a
tolerant regex that accepts either nothing or a valid SSH public key (e.g. set
ssh_public_key to something like "(^$|^ssh-(rsa|ed25519) [A-Za-z0-9+/=]+(?:
.*)?$)") so it matches an empty value or common ssh-rsa/ssh-ed25519 keys;
alternatively, if the test harness allows partial matching, remove the
ssh_public_key key from the expected arguments entirely.
|
/test eval-test |
5 similar comments
|
/test eval-test |
|
/test eval-test |
|
/test eval-test |
|
/test eval-test |
|
/test eval-test |
|
@andrej1991: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit