Closed
2 changes: 1 addition & 1 deletion test/evals/eval.py
@@ -88,7 +88,7 @@ def parse_args():
result_summary = evaluator.get_result_summary()

failed_evals_count = result_summary["FAIL"] + result_summary["ERROR"]
- if failed_evals_count > 2:
+ if failed_evals_count:
print(f"❌ {failed_evals_count} evaluation(s) failed!")
sys.exit(1)

34 changes: 7 additions & 27 deletions test/evals/eval_data.yaml
@@ -40,8 +40,8 @@
conversation:
- eval_id: static_networking_support
eval_query: Create a cluster with static networking
- eval_types: [response_eval:sub-string]
- expected_keywords: ["I do not support creating clusters with static networking", "assisted-installer web-based wizard"]
+ eval_types: [response_eval:accuracy]
+ expected_response: "I do not support creating clusters with static networking, however, you can use the assisted-installer web-based wizard to configure a cluster with static networking."

- conversation_group: sno_requirements_conv
conversation:
@@ -88,16 +88,16 @@
conversation:
- eval_id: create_eval_test_multinode
eval_types: [tool_eval, response_eval:accuracy, response_eval:sub-string]
- eval_query: Create a multi-node cluster named 'eval-test-multinode-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local
+ eval_query: Create a multi-node cluster named 'eval-test-multinode-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local.
expected_tool_calls:
- - tool_name: create_cluster
arguments:
name: "eval-test-multinode-uniq-cluster-name"
version: "4\\.18\\.22"
base_domain: "test\\.local"
single_node: "(?i:false)"
- cpu_architecture: None
- ssh_public_key: None
+ cpu_architecture: "x86_64"
+ ssh_public_key: ""
expected_keywords: ["eval-test-multinode-uniq-cluster-name", "ID", "Discovery ISO", "cluster"]
Comment on lines +91 to 101
🛠️ Refactor suggestion

🧩 Analysis chain

Make expected tool args tolerant of optional SSH key; current empty-string match is brittle.

The agent may omit ssh_public_key entirely or set it to null. Matching an empty string only will cause flaky failures. Prefer a permissive regex.

-              ssh_public_key: ""
+              ssh_public_key: ".*"

If the harness supports partial argument matching, consider omitting ssh_public_key from expectations entirely.
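
As a rough illustration of what "tolerant" matching could look like, the sketch below normalizes a missing or null argument to an empty string before applying the expected regex. The helper name `arg_matches` and the use of `re.fullmatch` are assumptions made for this example; the actual harness may compare arguments differently.

```python
import re
from typing import Any, Mapping


def arg_matches(expected_pattern: str, actual_args: Mapping[str, Any], key: str) -> bool:
    """Hypothetical helper: compare one tool-call argument against a regex.

    A missing key or a None value is normalized to "" so that an optional
    argument does not fail the comparison just because it was omitted.
    """
    value = actual_args.get(key)
    text = "" if value is None else str(value)
    return re.fullmatch(expected_pattern, text) is not None


# Permissive pattern from the suggestion above: accepts any value or none at all.
omitted = arg_matches(".*", {"name": "demo"}, "ssh_public_key")
explicit_null = arg_matches(".*", {"ssh_public_key": None}, "ssh_public_key")
with_key = arg_matches(".*", {"ssh_public_key": "ssh-rsa AAAA"}, "ssh_public_key")
assert omitted and explicit_null and with_key
```

Under this normalization a required argument such as `cpu_architecture: "x86_64"` still fails when it is omitted, because the empty string does not match the pattern.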


🏁 Script executed:

#!/usr/bin/env bash
# Check for consistency of SSH key expectations across evals
rg -nC1 'ssh_public_key:' test/evals/eval_data.yaml

Length of output: 687


Make ssh_public_key expectation tolerant and consistent across evals.

Empty-string at test/evals/eval_data.yaml (lines 99-101) is brittle; other entries (lines 73-75 and 109-111) use an ssh-rsa regex. Replace the empty-string with a pattern that accepts either a valid ssh key or nothing:

-              ssh_public_key: ""
+              ssh_public_key: '^(?:ssh-rsa\s+[A-Za-z0-9+/]+[=]{0,3}(?:\s+.+)?\s*)?$'

Or, if the harness supports partial matching, omit ssh_public_key from expectations entirely.
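
The suggested pattern can be checked in isolation before it goes into the YAML. The snippet below is a minimal sketch using Python's `re` module; the sample key strings are made up for illustration, and how the harness itself applies the pattern (search versus full match) is not confirmed here.

```python
import re

# Pattern copied verbatim from the suggestion above.
SSH_KEY_OR_EMPTY = re.compile(r'^(?:ssh-rsa\s+[A-Za-z0-9+/]+[=]{0,3}(?:\s+.+)?\s*)?$')

# Illustrative inputs mapped to whether the pattern is expected to accept them.
samples = {
    "": True,                                     # no key supplied
    "ssh-rsa AAAAB3NzaC1yc2E= user@host": True,   # key with padding and comment
    "ssh-rsa AAAAB3NzaC1yc2E": True,              # key without comment
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5": False,    # other key types are rejected
    "not a key": False,
}

for value, expected in samples.items():
    assert bool(SSH_KEY_OR_EMPTY.match(value)) is expected, value
```

Note that this pattern accepts only `ssh-rsa` keys or an empty value; an `ssh-ed25519` key would fail the expectation.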

🤖 Prompt for AI Agents
In test/evals/eval_data.yaml around lines 91 to 101, the expected ssh_public_key
is an empty string which is brittle; replace the empty-string expectation with a
tolerant regex that accepts either nothing or a valid SSH public key (e.g. set
ssh_public_key to something like
"(^$|^ssh-(rsa|ed25519) [A-Za-z0-9+/=]+(?: .*)?$)"
so it matches an empty value or common ssh-rsa/ssh-ed25519 keys); alternatively,
if the test harness allows partial matching, remove the ssh_public_key key from
the expected arguments entirely.

expected_response: I have created a cluster with name eval-test-multinode-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
- eval_id: set_ssh_key_eval_test_ssh
@@ -118,12 +118,8 @@
cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
expected_keywords: ["Discovery ISO"]
- eval_id: host_booted_but_not_discovered
- eval_query: I booted the hosts
- eval_types: [tool_eval, response_eval:accuracy]
- expected_tool_calls:
- - - tool_name: cluster_info
- arguments:
- cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
+ eval_query: I booted the hosts for the cluster you just created
+ eval_types: [response_eval:accuracy]
expected_response: "hosts hasn't been discovered yet."

- conversation_group: list_clusters_conv
@@ -178,19 +174,3 @@
expected_response: I can help you install OpenShift on-premise using the Assisted Installer, either on bare metal servers or virtual machines (e.g., vSphere, KVM, libvirt). I do not support public cloud platforms like AWS, Azure, or GCP.
description: Ensure the assistant concisely states supported on-prem baremetal/VM scope and references Assisted Installer

- - conversation_group: cluster_id_from_name
- conversation:
- - eval_id: create_single_node_cluser
- eval_query: Create a multi-node cluster named 'eval-test2-uniq-cluster-name' with OpenShift 4.18.22 and domain test.local. I do not have an SSH key to provide.
- eval_types: [response_eval:accuracy, response_eval:sub-string]
- expected_keywords: ["eval-test2-uniq-cluster-name", "ID", "Discovery ISO", "download", "cluster"]
- expected_response: I have created a cluster with name eval-test-uniq-cluster-name. Next, you'll need to download the Discovery ISO, then boot your hosts with it. Would you like me to get the Discovery ISO download URL?
- - eval_id: cluster_name_tool_call
- eval_query: Show me information on cluster eval-test2-uniq-cluster-name
- eval_types: [tool_eval, response_eval:sub-string]
- expected_tool_calls:
- - - tool_name: cluster_info
- arguments:
- cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
- expected_keywords: ["cluster", "eval-test2-uniq-cluster-name", "test.local", "4.18.22"]
- description: Test handling requesting a cluster by name