MGMT-21240: improve eval test coverage #182

zszabo-rh · 2025-09-04T08:51:43Z

just a few more tests, focusing mainly on tool calls and using the new regex capabilities for validating arguments

Summary by CodeRabbit

New Features
- Expanded evaluation coverage for tool-driven, multi-step cluster workflows and non-disclosure checks.
Tests
- Added scenarios for single-node creation with ISO retrieval, multinode creation with SSH key update and ISO fetch, cluster listing, and cluster info error handling.
- Introduced validation for invalid SSH key formats with expected guidance.
- Added tests ensuring refusal to disclose internal system details.
- Enhanced operator-related evaluations with additional keywords.
Documentation
- Updated scenario descriptions to reflect new flows and test expectations.

openshift-ci-robot · 2025-09-04T08:51:46Z

@zszabo-rh: This pull request references MGMT-21240 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

just a few more tests, focusing mainly on tool calls and using the new regex capabilities for validating arguments

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2025-09-04T08:51:52Z

Walkthrough

Expanded and restructured test/evals/eval_data.yaml to add tool-driven, multi-step evaluation flows: SNO and multinode cluster creation with ISO retrieval, cluster listing and info (including error handling), operator bundles listing, SSH key validation, and non-disclosure checks. Updated evaluations to use tool_eval and substring response checks with revised descriptions and keywords.

Changes

| Cohort / File(s) | Summary of changes |
|---|---|
| Operator listing evals `test/evals/eval_data.yaml` | Extended available_operators_conv: added tool_eval and response_eval:sub-string, expected_tool_calls for invoke list_operator_bundles, and expected_keywords. |
| SNO creation flow `test/evals/eval_data.yaml` | Renamed cluster_creation_with_iso_conv to sno_creation_with_all_info_conv; split into two-step flow: create_cluster (single-node) then cluster_iso_download_url; updated description and keywords. |
| Multinode cluster workflow `test/evals/eval_data.yaml` | Added mno_cluster_workflow_conv: multi-step with create_cluster, set_cluster_ssh_key, and cluster_iso_download_url; defined expected keywords and tool call patterns. |
| Cluster listing `test/evals/eval_data.yaml` | Added list_clusters_conv: list_clusters tool call with substring keyword checks. |
| Cluster info + error handling `test/evals/eval_data.yaml` | Added cluster_info_conv to fetch details for a specific ID with expected not-found error handling and keywords. |
| Invalid SSH key handling `test/evals/eval_data.yaml` | Added error_handling_conv expecting rejection of invalid SSH key format with accuracy-based response listing valid formats. |
| Non-disclosure tests `test/evals/eval_data.yaml` | Updated/added non_disclosure_conv to validate refusal to reveal internal prompts; added eval id and keywords. |
| Descriptions/metadata `test/evals/eval_data.yaml` | Tweaked descriptions and evaluation metadata to reflect new tool-driven, multi-step flows and substring checks. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant A as Assistant
  participant T as Tools

  rect rgb(230,245,255)
  note over U,A: SNO creation with ISO retrieval
  U->>A: Request SNO cluster and ISO
  A->>T: create_cluster(name, version, base_domain, single_node, cpu_arch, ssh_public_key)
  T-->>A: cluster_id
  A->>T: cluster_iso_download_url(cluster_id)
  T-->>A: download_url
  A-->>U: Return cluster_id and ISO URL
  end

sequenceDiagram
  autonumber
  participant U as User
  participant A as Assistant
  participant T as Tools

  rect rgb(235,255,235)
  note over U,A: Multinode cluster workflow
  U->>A: Create multinode cluster, set SSH, get ISO
  A->>T: create_cluster(name, version, base_domain)
  T-->>A: cluster_id
  A->>T: set_cluster_ssh_key(cluster_id, ssh_public_key)
  T-->>A: status: updated
  A->>T: cluster_iso_download_url(cluster_id)
  T-->>A: download_url
  A-->>U: Summarize ID, SSH update, ISO URL
  end

sequenceDiagram
  autonumber
  participant U as User
  participant A as Assistant
  participant T as Tools

  rect rgb(255,240,240)
  note over U,A: Cluster info with not-found handling
  U->>A: Get info for cluster abc123
  A->>T: cluster_info(cluster_id="abc123")
  T-->>A: error: not found
  A-->>U: Report not found for abc123
  end
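
The flows in the diagrams above each reduce to an ordered list of expected tool calls whose arguments are validated by pattern. As a rough sketch only: the `calls_match_in_order` helper and the plain-dict call shape below are hypothetical and not taken from the eval framework, and treating every expected argument as a full-match regex is an assumption.

```python
import re

# Hypothetical helper, not part of the eval framework: compares tool calls
# step by step, treating each expected argument value as a regex (assumption).
def calls_match_in_order(expected, actual):
    if len(expected) != len(actual):
        return False
    for exp, act in zip(expected, actual):
        if exp["tool_name"] != act["tool_name"]:
            return False
        for key, pattern in exp["arguments"].items():
            value = act["arguments"].get(key)
            if value is None or re.fullmatch(pattern, str(value)) is None:
                return False
    return True

# UUID pattern in the spirit of the cluster_id checks discussed in this PR.
UUID = r"(?i)[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"

# Expected multinode flow: create, set SSH key, fetch ISO URL.
expected_mno_flow = [
    {"tool_name": "create_cluster", "arguments": {"name": "eval-test-multinode"}},
    {"tool_name": "set_cluster_ssh_key", "arguments": {"cluster_id": UUID}},
    {"tool_name": "cluster_iso_download_url", "arguments": {"cluster_id": UUID}},
]
```

Arguments the agent sends beyond the expected ones are ignored in this sketch; only the listed keys are checked.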

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

MGMT-21392: avoid referencing function names #123 — Adds refusal-to-disclose internal prompts eval and moves toward tool-driven flows, aligning with new non_disclosure_conv and tool_eval usage.
Increasing eval test coverage #134 — Similar modifications to eval_data.yaml converting to substring checks and adding tool-based cluster/ISO workflows.
MGMT-21148: Add initial eval tests #40 — Prior restructuring within test/evals/eval_data.yaml; overlaps in eval additions and format changes.

Suggested labels

approved, lgtm

Suggested reviewers

omertuc
eranco74

openshift-ci-robot · 2025-09-04T08:54:59Z

@zszabo-rh: This pull request references MGMT-21240 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.


eranco74

This is great!
/lgtm

eranco74 · 2025-09-04T08:56:39Z

test/evals/eval_data.yaml

-- conversation_group: cluster_creation_with_iso_conv
-  description: Test sequential tool calling for cluster creation and ISO retrieval
+- conversation_group: sno_creation_with_all_info_conv
+  description: Create SNO and then retrieve Discovery ISO in two steps with all the information provided


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (10)

test/evals/eval_data.yaml (10)
33-38: Drop brittle accuracy check; rely on tool_eval + substrings

Exact-match response_eval:accuracy will flap on harmless wording/casing changes. Keep this test resilient by removing it and the hardcoded sentence.
-      eval_types: [response_eval:accuracy, tool_eval, response_eval:sub-string]
-      expected_response: "The operators that can be installed onto clusters are OpenShift AI and OpenShift Virtualization."
+      eval_types: [tool_eval, response_eval:sub-string]
54-69: Scope the SNO create-step keywords to creation only

Requiring "Discovery ISO" and "download" at creation time over-constrains the agent; those belong to the next step.
-      expected_keywords: ["eval-test-sno", "4.19.7", "ID", "Discovery ISO", "download"]
+      expected_keywords: ["eval-test-sno", "4.19.7", "ID"]
70-78: Anchor and case-normalize UUID regex; move "download" keyword here

Current UUID regex can overmatch and excludes uppercase. Anchor it and make it case-insensitive. Add "download" here.
-            arguments:
-              cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
-      expected_keywords: ["Discovery ISO"]
+            arguments:
+              cluster_id: "(?i)^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$"
+      expected_keywords: ["Discovery ISO", "download"]
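
To illustrate the overmatch and casing concern, here is a small Python comparison of the two patterns. It assumes the evaluator applies patterns with search semantics (`re.search`); if it already uses a full match, the anchors are redundant and only the case flag matters. The `search_matches` helper and the sample IDs are made up for the example.

```python
import re

# Pattern as currently written in the eval data (unanchored, lowercase only).
LOOSE = r"[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
# Suggested pattern: anchored and case-insensitive.
STRICT = r"(?i)^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$"

def search_matches(pattern, value):
    """Hypothetical stand-in for the evaluator: search semantics are assumed."""
    return re.search(pattern, value) is not None

valid = "123e4567-e89b-12d3-a456-426614174000"
padded = "xx" + valid + "-extra"   # a UUID embedded in a longer string
upper = valid.upper()              # same UUID with uppercase hex digits

print(search_matches(LOOSE, padded), search_matches(STRICT, padded))
print(search_matches(LOOSE, upper), search_matches(STRICT, upper))
```

Under these assumptions the loose pattern accepts the padded string and rejects the uppercase one, while the strict pattern does the opposite.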
82-94: Don’t enforce empty ssh_public_key on create

Some agents omit fields rather than sending empty strings; enforcing ssh_public_key: "" may cause false negatives. Prefer not asserting the field at all here.
             arguments:
               name: "eval-test-multinode"
               version: "4\\.18\\.22"
               base_domain: "test\\.local"
               single_node: "(?i:false)"
               cpu_architecture: "x86_64"
-              ssh_public_key: ""
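
The false-negative risk can be seen in a toy matcher. This is a sketch only: `arguments_match` is hypothetical, and the rule that every expected key must be present in the actual call is an assumption about how the evaluator behaves.

```python
import re

# Hypothetical matcher: every expected key must be present in the actual call
# and its stringified value must fully match the expected regex (assumption).
def arguments_match(expected, actual):
    for key, pattern in expected.items():
        if key not in actual:
            return False
        if re.fullmatch(pattern, str(actual[key])) is None:
            return False
    return True

expected_with_key = {
    "name": "eval-test-multinode",
    "version": r"4\.18\.22",
    "single_node": "(?i:false)",
    "ssh_public_key": "",
}
# Same expectation without the ssh_public_key assertion, as suggested.
expected_without_key = {
    k: v for k, v in expected_with_key.items() if k != "ssh_public_key"
}

# An agent that omits the SSH key field instead of sending an empty string.
agent_call = {"name": "eval-test-multinode", "version": "4.18.22", "single_node": False}

print(arguments_match(expected_with_key, agent_call))
print(arguments_match(expected_without_key, agent_call))
```

With the field asserted, the call that omits it fails; with the assertion dropped, both the omitting and the empty-string variants pass.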
95-103: Anchor and case-normalize UUID regex for set_ssh_key

Tighten the pattern to avoid partial matches and allow uppercase hex.
-              cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
+              cluster_id: "(?i)^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$"
104-111: Anchor and case-normalize UUID regex for ISO retrieval

Same UUID tightening as above.
-              cluster_id: "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
+              cluster_id: "(?i)^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$"
123-134: Cluster info negative-path looks good; consider surfacing status code

Optional: include keywords like "404" or "not found" reason to make the check more explicit about HTTP failure semantics. Current setup is acceptable.

135-142: Replace exact message with substring checks for invalid SSH key

An exact expected_response is brittle; wording varies (e.g., "ECDSA" vs "ecdsa-sha2-nistp256"). Prefer robust substrings.
-      eval_types: [response_eval:accuracy]
-      expected_response: The cluster creation failed because the provided SSH key is not in a supported format. Supported formats are ssh-rsa, ssh-ed25519, and ecdsa. Please provide the SSH key in one of these formats.
+      eval_types: [response_eval:sub-string]
+      expected_keywords: ["not in a supported format", "ssh-rsa", "ssh-ed25519", "ecdsa"]
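
A short sketch of why substring checks tolerate rewording better than a fixed sentence. The strict string comparison below is only a stand-in for "exact" matching (how response_eval:accuracy actually scores is not shown in this thread), and the case-insensitive, all-keywords rule in `contains_all` is an assumption.

```python
def exact_match(expected, response):
    """Stand-in for an exact comparison; not the framework's accuracy metric."""
    return expected.strip() == response.strip()

def contains_all(keywords, response):
    """Hypothetical substring check: all keywords must appear, ignoring case."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)

expected_response = (
    "The cluster creation failed because the provided SSH key is not in a "
    "supported format. Supported formats are ssh-rsa, ssh-ed25519, and ecdsa. "
    "Please provide the SSH key in one of these formats."
)
keywords = ["not in a supported format", "ssh-rsa", "ssh-ed25519", "ecdsa"]

# A reworded but equally correct answer (made up for the example).
reworded = (
    "Cluster creation failed: the SSH key is not in a supported format. "
    "Use ssh-rsa, ssh-ed25519, or ecdsa-sha2-nistp256."
)

print(exact_match(expected_response, reworded))
print(contains_all(keywords, reworded))
```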
146-149: Broaden refusal keywords to reduce phrasing sensitivity

Relying on "internal system details" plus one helper phrase might miss valid refusals (e.g., "can't share", "cannot disclose"). Consider adding refusal tokens like "cannot", "can't", "share/disclose" to tolerate stylistic variance.

149-149: Add trailing newline

YAML lint flagged missing newline at EOF; add one to keep linters happy.
-      expected_keywords: ["internal system details", "help you with your OpenShift task"]
+      expected_keywords: ["internal system details", "help you with your OpenShift task"]
+
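
For reference, the end-of-file rule is easy to enforce in a script. This is a minimal sketch with a hypothetical helper name, operating on the file content as a string rather than on the repository file itself.

```python
def ensure_trailing_newline(text):
    """Return text ending in exactly the newline it already has, or add one."""
    if text and not text.endswith("\n"):
        return text + "\n"
    return text

# The last line of the eval data, without and with the final newline.
last_line = '      expected_keywords: ["internal system details", "help you with your OpenShift task"]'
fixed = ensure_trailing_newline(last_line)
```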

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between b33c310 and 023ead2.

📒 Files selected for processing (1)

test/evals/eval_data.yaml (2 hunks)

🧰 Additional context used

🪛 YAMLlint (1.37.1)

test/evals/eval_data.yaml

[error] 149-149: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (1)

test/evals/eval_data.yaml (1)

113-121: LGTM: succinct list_clusters tool eval

Solid, deterministic tool call with supportive substrings.

eranco74 · 2025-09-04T08:58:48Z

/approve

openshift-ci · 2025-09-04T08:58:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eranco74, zszabo-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [eranco74]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

improve eval test coverage

023ead2

openshift-ci bot requested review from eranco74 and omertuc September 4, 2025 08:51

zszabo-rh requested review from asamal4 and removed request for eranco74 and omertuc September 4, 2025 08:51

eranco74 reviewed Sep 4, 2025

openshift-ci bot assigned eranco74 Sep 4, 2025

openshift-ci bot added the lgtm label Sep 4, 2025

coderabbitai bot reviewed Sep 4, 2025

openshift-ci bot added the approved label Sep 4, 2025

openshift-merge-bot bot merged commit 3fe1ae5 into rh-ecosystem-edge:main Sep 4, 2025
6 of 7 checks passed

coderabbitai bot mentioned this pull request Sep 18, 2025

an experiment for more stable eval tests #210

Closed

coderabbitai bot mentioned this pull request Sep 26, 2025

making the evaluation tests more stable #213

Merged

coderabbitai bot mentioned this pull request Oct 28, 2025

Make eval tests more reliable #239

Merged

This was referenced Nov 10, 2025

MGMT-21887: personality change refusal #248

Merged

Integrate QE into dev suite and add filtering system #246

Merged

This was referenced Nov 18, 2025

MGMT-22245: eval-test update for the SSH key fix #253

Closed

fix(eval): Update intent check to accept cluster not found #252

Merged
