Removing the openshift version and the base domain from the expected response #202
Conversation
…response of the chat for a successful cluster creation, because the system prompt does not instruct the chat to repeat this information back. Sometimes the model/chat repeats this information and sometimes it does not, which makes the tests unstable.
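For illustration, here is a minimal sketch of the kind of change this makes to an entry in test/evals/eval_data.yaml. The eval_query field name, the query wording, the cluster name, the version string, and the base domain are made up for the example; only the eval_types and expected_keywords field names come from the actual file:

```yaml
# Before: expectations include details the system prompt never asks the model to echo back
- eval_query: "Create a single-node OpenShift cluster"   # hypothetical field name and wording
  eval_types: [tool_eval, response_eval:sub-string]
  expected_keywords: ["eval-test-cluster", "4.18", "example.com", "Discovery ISO"]

# After: only details the assistant reliably repeats in its reply
- eval_query: "Create a single-node OpenShift cluster"
  eval_types: [tool_eval, response_eval:sub-string]
  expected_keywords: ["eval-test-cluster", "Discovery ISO"]
```

The point is that a sub-string check only stays stable if every expected keyword is something the model is actually prompted to say.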
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andrej1991
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Walkthrough

Adjusted evaluation expectations in test/evals/eval_data.yaml by removing OpenShift version and base-domain tokens from expected_keywords and expected_response texts, simplifying cluster-creation responses, and slightly altering keyword sets in one multinode block. Tool call expectations were unchanged.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed
/hold
Actionable comments posted: 0
🧹 Nitpick comments (4)
test/evals/eval_data.yaml (4)
76-77: Good simplification; drop exact-match accuracy to prevent flakiness.

You achieved the PR goal here. To further stabilize this eval, remove the exact-match accuracy check so we don’t couple the test to a specific sentence form.
```diff
- eval_types: [tool_eval, response_eval:sub-string, response_eval:accuracy]
+ eval_types: [tool_eval, response_eval:sub-string]
```

Also applies to: 66-66
102-103: Restore “download” keyword for consistency and de-flake by removing accuracy.

Multinode block omits "download" while SNO includes it; align both. Also drop the accuracy check like above.
```diff
- eval_types: [tool_eval, response_eval:accuracy, response_eval:sub-string]
+ eval_types: [tool_eval, response_eval:sub-string]
```

```diff
- expected_keywords: ["eval-test-multinode-ClustER-NAme", "ID", "Discovery ISO", "cluster"]
+ expected_keywords: ["eval-test-multinode-ClustER-NAme", "ID", "Discovery ISO", "download", "cluster"]
```

Also applies to: 91-91
180-181: LGTM; make keyword-only for stability.

Content matches the new policy. Recommend removing the accuracy check here too.
```diff
- eval_types: [response_eval:accuracy, response_eval:sub-string]
+ eval_types: [response_eval:sub-string]
```

Also applies to: 179-179
112-112: Fix redundant phrase in SSH key confirmation message.

The expected response reads awkwardly and likely mismatches model outputs.
```diff
- expected_response: The SSH public key is set for the cluster for cluster
+ expected_response: The SSH public key is set for the cluster.
```
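Taken together, these suggestions would leave each affected entry keyword-only. A rough sketch of the resulting shape, assuming the same hypothetical eval_query field and wording as in the example earlier; only the eval_types and expected_keywords values below are taken from the review diffs:

```yaml
- eval_query: "Create a multinode cluster named eval-test-multinode-ClustER-NAme"   # hypothetical field name and wording
  eval_types: [tool_eval, response_eval:sub-string]   # exact-match accuracy check dropped, per the review
  expected_keywords: ["eval-test-multinode-ClustER-NAme", "ID", "Discovery ISO", "download", "cluster"]
```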
/test eval-test
1 similar comment
/test eval-test
@andrej1991: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test eval-test
Removing the openshift version and the base domain from the expected response of the chat for a successful cluster creation, because the system prompt does not instruct the chat to repeat this information back. Sometimes the model/chat repeats this information and sometimes it does not, which makes the tests unstable.