Conversation

@carbonin (Collaborator) commented Jul 16, 2025

This commit replaces the instructions and placeholder tests with actual
eval tests that run and pass.

https://issues.redhat.com/browse/MGMT-21148

cc @eranco74 @omertuc @asamal4

Summary by CodeRabbit

  • Chores

    • Updated ignored files and directories to better support common development environments.
    • Added a Makefile target to run agent evaluation tests and included it in the help output.
  • Documentation

    • Added a new README documenting the agent task completion evaluation process, usage instructions, and example output.
  • Tests

    • Introduced a new evaluation script and YAML-based test data for agent goal evaluation.
    • Removed outdated evaluation scripts, test data, and related documentation.


Currently the eval tests just do a substring check on the model's output, and they only handle read queries (rather than creating a cluster), but this is just a start that can be improved over time.

Additionally, the tests currently pass if _any_ of the listed strings are present in the output; it would be better to have more flexible matching behavior (at least an AND option) for things like the versions test.
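As an illustration of that AND option, here is a minimal sketch of what a more flexible matcher could look like (the `match_mode` flag and `response_matches` helper are hypothetical, not part of the current eval code):

```python
def response_matches(response: str, expected_keywords: list[str], match_mode: str = "any") -> bool:
    """Return True if the response satisfies the keyword check.

    match_mode="any" mirrors the current behavior (pass if any keyword appears);
    match_mode="all" would require every keyword, which is what the versions
    test really needs.
    """
    hits = [kw for kw in expected_keywords if kw.lower() in response.lower()]
    if match_mode == "all":
        return len(hits) == len(expected_keywords)
    return bool(hits)


# The versions test would only pass when every listed version is mentioned.
answer = "The available OpenShift versions include 4.18, 4.19, and 4.20."
print(response_matches(answer, ["4.18", "4.19", "4.20"], match_mode="all"))  # True
```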

coderabbitai bot commented Jul 16, 2025

Walkthrough

The changes update the .gitignore, add a new test-eval Makefile target, and replace the evaluation framework. The old evaluation script and test data are deleted, while new evaluation scripts, configuration, and documentation are introduced in the test/evals directory, streamlining agent task completion evaluation.

Changes

| File(s) | Change Summary |
| --- | --- |
| .gitignore | Added entries for `.vscode/`, `ocm_token.txt`, `.venv/`, `.python-version` to ignore development artifacts. |
| Makefile | Added `test-eval` target to refresh the OCM token and run agent evaluation tests; updated `.PHONY` and help. |
| scripts/eval.py | Deleted placeholder evaluation script with commented-out functions. |
| test/evals/AGENT_E2E_EVAL.md | Deleted documentation for the previous end-to-end agent evaluation mechanism. |
| test/evals/basic_evals.yaml | Deleted previous YAML test data for agent evaluation. |
| test/evals/README.md | Added new documentation for Agent Task Completion Evaluation, including usage and example output. |
| test/evals/eval.py | Added new evaluation script to run agent goal evaluations from YAML config, print results, and summarize. |
| test/evals/eval_data.yaml | Added new YAML test data defining evaluation queries, types, and expected keywords. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Developer
    participant Makefile
    participant OCM Token Script
    participant Eval Script
    participant Agent API

    Developer->>Makefile: make test-eval
    Makefile->>OCM Token Script: source utils/ocm-token.sh & get_ocm_token
    OCM Token Script->>Makefile: Write token to test/evals/ocm_token.txt
    Makefile->>Eval Script: cd test/evals && python eval.py
    Eval Script->>Eval Script: Load eval_data.yaml
    loop For each evaluation
        Eval Script->>Agent API: Send query
        Agent API-->>Eval Script: Return response
        Eval Script->>Eval Script: Check response vs expected keywords
        Eval Script->>Developer: Print pass/fail result
    end
    Eval Script->>Developer: Print summary and exit status
```
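Expressed as plain Python, the flow in the diagram is roughly the following. This is a hedged sketch only: the endpoint path, request payload, and YAML field names are assumptions, and the real eval.py delegates this work to the evaluation package rather than calling the API directly.

```python
import sys

import requests  # third-party
import yaml      # PyYAML

AGENT_ENDPOINT = "http://localhost:8090/v1/query"  # placeholder path, not the real API route
token = open("ocm_token.txt").read().strip()

with open("eval_data.yaml") as f:
    cases = yaml.safe_load(f)

failures = 0
for case in cases:
    resp = requests.post(
        AGENT_ENDPOINT,
        headers={"Authorization": f"Bearer {token}"},
        json={"query": case["eval_query"]},  # field name is an assumption
        timeout=120,
    )
    answer = resp.json().get("response", "")
    passed = any(kw in answer for kw in case.get("expected_key_words", []))
    print(f"{'PASS' if passed else 'FAIL'}: {case.get('eval_id', '<unnamed>')}")
    failures += 0 if passed else 1

print(f"{len(cases) - failures}/{len(cases)} evaluation(s) passed")
sys.exit(1 if failures else 0)
```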

Estimated code review effort

3 (30–60 minutes) — Moderate complexity due to multiple file deletions and additions affecting evaluation framework, scripts, and documentation.

Possibly related PRs

  • rh-ecosystem-edge/assisted-chat#31: Deletes the same end-to-end evaluation markdown documentation, indicating a direct connection through documentation changes.
  • rh-ecosystem-edge/assisted-chat#15: Removes the placeholder scripts/eval.py and deletes test/evals/basic_evals.yaml, showing a direct relationship in evaluation scaffolding and test data.

@carbonin changed the title from "Add initial eval tests" to "MGMT-21148: Add initial eval tests" on Jul 16, 2025
@carbonin (Collaborator, Author)

Also note that running these tests contacts assisted prod and Gemini, so some changes will probably be needed before we put this into some kind of CI. Maybe we want to mock out assisted so we don't have to deploy it ourselves? We also may not want to run it super frequently (every commit on every PR) in case we start running up our Gemini bill.
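To make the "mock out assisted" idea concrete, here is a minimal sketch of the kind of canned-response stub a CI job could point the agent at instead of assisted prod. The route and response body below are placeholders, not the actual assisted-service API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder route and payload; the real assisted-service API shape may differ.
CANNED_RESPONSES = {
    "/api/assisted-install/v2/openshift-versions": {
        "4.18": {"display_name": "4.18.10", "support_level": "production"},
        "4.19": {"display_name": "4.19.1", "support_level": "production"},
    },
}


class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = CANNED_RESPONSES.get(self.path)
        self.send_response(200 if body is not None else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(body or {}).encode())


if __name__ == "__main__":
    # In CI, point the agent's assisted-service URL at http://localhost:9000.
    HTTPServer(("localhost", 9000), StubHandler).serve_forever()
```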

@carbonin (Collaborator, Author) commented Jul 16, 2025

I also considered adding some kind of actual Python dependency management, but figured we might want to wait on that until this isn't just in @asamal4's branch?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
test/evals/eval_data.yaml (1)

15-22: Consider making version checks more flexible.

The hardcoded version numbers (4.19, 4.18, 4.20) may become outdated as new OpenShift versions are released. Consider using a more flexible approach such as:

  • Regex patterns to match version format (e.g., 4.\d+)
  • Dynamic version retrieval from the API
  • More generic version-related keywords
```diff
- expected_key_words:
-   - "4.19"
-   - "4.18"
-   - "4.20"
+ expected_key_words:
+   - "4."
+   - "version"
+   - "OpenShift"
```
Makefile (1)

71-75: Add error handling for token generation and script dependencies.

The target implementation looks correct but lacks error handling. Consider adding checks to ensure dependencies exist and token generation succeeds.

```diff
 test-eval: ## Run agent evaluation tests
 	@echo "Refreshing OCM token..."
+	@if [ ! -f utils/ocm-token.sh ]; then echo "Error: utils/ocm-token.sh not found"; exit 1; fi
 	@. utils/ocm-token.sh && get_ocm_token && echo "$$OCM_TOKEN" > test/evals/ocm_token.txt
+	@if [ ! -s test/evals/ocm_token.txt ]; then echo "Error: Failed to generate OCM token"; exit 1; fi
 	@echo "Running agent evaluation tests..."
+	@if [ ! -f test/evals/eval.py ]; then echo "Error: test/evals/eval.py not found"; exit 1; fi
 	@cd test/evals && python eval.py
```
test/evals/README.md (1)

21-43: Fix markdown formatting issue.

The fenced code block lacks a language specification, which affects rendering and accessibility.

````diff
-```
+```text
 Refreshing OCM token...
 Running agent evaluation tests...
 Running 4 evaluation(s)...
````
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efdb35e and 002635d.

📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • Makefile (3 hunks)
  • scripts/eval.py (0 hunks)
  • test/evals/AGENT_E2E_EVAL.md (0 hunks)
  • test/evals/README.md (1 hunks)
  • test/evals/basic_evals.yaml (0 hunks)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (1 hunks)
💤 Files with no reviewable changes (3)
  • test/evals/basic_evals.yaml
  • test/evals/AGENT_E2E_EVAL.md
  • scripts/eval.py
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
test/evals/README.md

21-21: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (6)
.gitignore (1)

3-6: LGTM! Proper handling of environment-specific and sensitive files.

The additions appropriately exclude development environment files and the sensitive ocm_token.txt authentication token generated by the evaluation tests.

test/evals/eval_data.yaml (1)

1-29: Well-structured evaluation data with clear test cases.

The YAML structure is clean and covers key functionality areas. The substring-based evaluation approach is appropriate for initial testing.

Makefile (1)

7-7: Proper integration with Makefile structure.

The addition to the .PHONY list and help output is correctly implemented.

test/evals/README.md (1)

1-44: Comprehensive documentation with clear prerequisites and usage instructions.

The README provides excellent coverage of the evaluation mechanism, prerequisites, and usage examples. The documentation structure is logical and user-friendly.

test/evals/eval.py (2)

4-25: Well-structured result formatting function.

The print_test_result function handles different evaluation types appropriately and provides clear, readable output formatting.


42-75: Clean evaluation loop with proper exit code handling.

The main evaluation logic is well-structured with immediate result printing and proper exit code handling for CI/CD integration.

Comment on lines 37 to 40
```python
args = EvalArgs()

evaluator = AgentGoalEval(args)
configs = evaluator.config_manager.get_eval_data()
```

🛠️ Refactor suggestion

Add error handling for missing dependencies.

The script assumes the agent_eval module and configuration files exist without validation.

```diff
+import os
+
+# Validate dependencies
+try:
+    from agent_eval import AgentGoalEval
+except ImportError:
+    print("Error: agent_eval module not found. Please install it using:")
+    print("pip install git+https://github.com/asamal4/lightspeed-evaluation.git@agent-goal-eval#subdirectory=agent_eval")
+    sys.exit(1)
+
 args = EvalArgs()
+
+# Validate configuration files
+if not os.path.exists(args.eval_data_yaml):
+    print(f"Error: {args.eval_data_yaml} not found")
+    sys.exit(1)
+
+if not os.path.exists(args.agent_auth_token_file):
+    print(f"Error: {args.agent_auth_token_file} not found")
+    sys.exit(1)
```

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In test/evals/eval.py around lines 37 to 40, add error handling to check if the
agent_eval module and required configuration files are present before using
them. Implement try-except blocks to catch ImportError for the module and
FileNotFoundError or similar exceptions for configuration files, and provide
clear error messages or fallback behavior to handle missing dependencies
gracefully.

Comment on lines 26 to 36
```python
class EvalArgs:
   def __init__(self):
       self.eval_data_yaml = 'eval_data.yaml'
       self.agent_endpoint = 'http://localhost:8090'
       self.agent_provider = 'gemini'
       self.agent_model = 'gemini/gemini-2.5-flash'
       self.judge_provider = None
       self.judge_model = None
       self.agent_auth_token_file = 'ocm_token.txt'
       self.result_dir = 'results'
```


🛠️ Refactor suggestion

Address hardcoded configuration values and indentation inconsistency.

Several issues with the configuration class:

  1. Mixed indentation (line 27 uses spaces instead of tabs)
  2. Hardcoded localhost endpoint may not work in containerized environments
  3. Hardcoded Gemini provider/model limits flexibility
```diff
 class EvalArgs:
-   def __init__(self):
+    def __init__(self):
-       self.eval_data_yaml = 'eval_data.yaml'
-       self.agent_endpoint = 'http://localhost:8090'
-       self.agent_provider = 'gemini'
-       self.agent_model = 'gemini/gemini-2.5-flash'
+        self.eval_data_yaml = 'eval_data.yaml'
+        self.agent_endpoint = os.getenv('AGENT_ENDPOINT', 'http://localhost:8090')
+        self.agent_provider = os.getenv('AGENT_PROVIDER', 'gemini')
+        self.agent_model = os.getenv('AGENT_MODEL', 'gemini/gemini-2.5-flash')
```

Also add import os at the top of the file.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In test/evals/eval.py around lines 26 to 36, fix the mixed indentation by using
consistent tabs or spaces throughout the EvalArgs class. Replace the hardcoded
localhost URL with a configurable environment variable fallback using os.getenv
to support containerized environments. Similarly, make the agent_provider and
agent_model configurable via environment variables instead of hardcoding
'gemini' values. Also, add 'import os' at the top of the file to enable
environment variable usage.

@carbonin marked this pull request as draft on July 17, 2025, 13:25
@carbonin (Collaborator, Author)

Moved this to a draft to see if I can use the judge LLM instead of the string matching (which doesn't really validate what we want particularly well).
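For context, the judge-LLM approach conceptually replaces the substring test with a second model call that grades the answer. A rough sketch of the idea (the prompt wording and the `judge` helper are illustrative; they are not the lsc_agent_eval implementation):

```python
def judge(agent_answer: str, expected_response: str, ask_llm) -> bool:
    """Ask a judge model whether the agent's answer conveys the expected content.

    ask_llm is any callable that sends a prompt to the judge model
    (e.g. gemini-2.5-flash) and returns its text reply.
    """
    prompt = (
        "You are grading an AI assistant's answer.\n\n"
        f"Expected content:\n{expected_response}\n\n"
        f"Actual answer:\n{agent_answer}\n\n"
        "Reply with exactly YES if the actual answer conveys the expected "
        "content, otherwise reply with exactly NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")


# Call shape only; a stand-in judge that always approves:
print(judge("4.19 is the most recent production release.",
            "The most recent production version is 4.19.",
            lambda prompt: "YES"))  # True
```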

@carbonin (Collaborator, Author)

This works well now with the judge, but I'm going to leave it as a draft until we get a package that isn't in @asamal4 's branch

@asamal4 (Contributor) commented Jul 21, 2025

> This works well now with the judge, but I'm going to leave it as a draft until we get a package that isn't in @asamal4 's branch

LSC eval is merged.

Also, I see that you are writing separate print statements. I will add more verbose statements to the original LSC eval logic, so that minimal logic will be required here.

@eranco74 (Collaborator) left a comment


/lgtm

@carbonin (Collaborator, Author)

> Also, I see that you are writing separate print statements. I will add more verbose statements to the original LSC eval logic, so that minimal logic will be required here.

Yeah, IIRC I wasn't getting enough feedback for failed tests so I just stopped using the original API from your example.

@carbonin (Collaborator, Author)

I'll update this to use the merged eval code today, then undraft.

```yaml
eval_type: judge-llm
expected_response: "There are several versions of OpenShift available. The most recent production version is 4.19, 4.20 pre release versions are available as well as several previous verrsions."

- eval_id: available_operators
```


Nit: The naming is a bit confusing to me, since these are Operator bundles rather than operators.

This happened to me as well when I asked the service a question, so I take it the naming was chosen like that?

@carbonin (Collaborator, Author)

> This happened to me as well when I asked the service a question, so I take it the naming was chosen like that?

What happened, exactly?


I will try to reproduce it and update here later.
IIRC, I asked about individual operators and the response was about operator bundles, but without calling them "bundles".

It can also be confusing because for the two bundles, we have similarly named operators.

@carbonin marked this pull request as ready for review on July 21, 2025, 13:29
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
test/evals/eval.py (2)

1-17: Add error handling for missing dependencies.

The script imports lsc_agent_eval without validation. If this dependency is missing, the script will fail with an unclear error message.

```diff
+import os
+
+# Validate dependencies
+try:
+    from lsc_agent_eval import AgentGoalEval
+except ImportError:
+    print("Error: lsc_agent_eval module not found. Please install it using:")
+    print("pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_agent_eval")
+    sys.exit(1)
+
 import sys
 import logging
 import argparse
-from lsc_agent_eval import AgentGoalEval
```

40-51: Make configuration values configurable and add file validation.

The configuration uses hardcoded values that limit flexibility, especially in containerized environments. Additionally, the script should validate that required files exist before proceeding.

```diff
+import os
+
 # Create proper Namespace object for AgentGoalEval
 args = argparse.Namespace()
-args.eval_data_yaml = 'eval_data.yaml'
-args.agent_endpoint = 'http://localhost:8090'
-args.agent_provider = 'gemini'
-args.agent_model = 'gemini/gemini-2.5-flash'
+args.eval_data_yaml = os.getenv('EVAL_DATA_YAML', 'eval_data.yaml')
+args.agent_endpoint = os.getenv('AGENT_ENDPOINT', 'http://localhost:8090')
+args.agent_provider = os.getenv('AGENT_PROVIDER', 'gemini')
+args.agent_model = os.getenv('AGENT_MODEL', 'gemini/gemini-2.5-flash')
 # Set up judge model for LLM evaluation
-args.judge_provider = 'gemini'
-args.judge_model = 'gemini-2.5-flash'
-args.agent_auth_token_file = 'ocm_token.txt'
-args.result_dir = 'results'
+args.judge_provider = os.getenv('JUDGE_PROVIDER', 'gemini')
+args.judge_model = os.getenv('JUDGE_MODEL', 'gemini-2.5-flash')
+args.agent_auth_token_file = os.getenv('AGENT_AUTH_TOKEN_FILE', 'ocm_token.txt')
+args.result_dir = os.getenv('RESULT_DIR', 'results')
+
+# Validate required files exist
+if not os.path.exists(args.eval_data_yaml):
+    print(f"Error: {args.eval_data_yaml} not found")
+    sys.exit(1)
+
+if not os.path.exists(args.agent_auth_token_file):
+    print(f"Error: {args.agent_auth_token_file} not found")
+    sys.exit(1)
```
🧹 Nitpick comments (1)
test/evals/README.md (1)

22-22: Add language specification to fenced code block.

The fenced code block should specify a language for proper syntax highlighting and better documentation quality.

````diff
-```
+```bash
````
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 002635d and 9140c9c.

📒 Files selected for processing (3)
  • test/evals/README.md (1 hunks)
  • test/evals/eval.py (1 hunks)
  • test/evals/eval_data.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/evals/eval_data.yaml
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
test/evals/README.md

22-22: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (6)
test/evals/README.md (2)

1-6: Clear documentation structure and comprehensive overview.

The introduction effectively explains the evaluation mechanism and supported types. The reference to the external documentation provides helpful context for users.


16-21: Concise and clear instructions.

The running tests section provides straightforward guidance for users.

test/evals/eval.py (4)

18-39: Well-structured result display function.

The function effectively handles different evaluation types and provides clear, user-friendly output for both success and failure cases.


52-57: Clean evaluator initialization and setup.

The evaluator initialization and configuration loading is well-structured and provides good user feedback.


58-76: Excellent evaluation loop with clear progress tracking.

The loop provides good user feedback with numbered progress indicators and immediate result display. The counting logic is sound and handles both success and failure cases properly.


77-89: Proper final summary and exit code handling.

The final summary provides clear results and uses appropriate exit codes for automation and CI integration.

Comment on lines +7 to +15
## Prerequisites
- **Python**: Version 3.11.1 to 3.12.9
- **Assisted Chat API**: Must be running (`make build-images run`)
- Install lightspeed-core **agent e2e eval**
```bash
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_agent_eval
```
- `GEMINI_API_KEY` env var is set


💡 Verification agent

🧩 Analysis chain

Verify Python version requirements and repository availability.

The Python version range (3.11.1 to 3.12.9) seems very specific. Consider whether this range is actually required or if it could be more inclusive. Also, ensure the referenced GitHub repository and subdirectory path are correct and accessible.


🏁 Script executed:

```bash
#!/bin/bash
# Verify the GitHub repository and subdirectory exist
curl -s -o /dev/null -w "%{http_code}" https://github.com/lightspeed-core/lightspeed-evaluation/tree/main/lsc_agent_eval

# Check if the repository is accessible
gh repo view lightspeed-core/lightspeed-evaluation --json name,visibility,isPrivate
```

Length of output: 284


🏁 Script executed:

```bash
#!/bin/bash
set -e

echo "Searching for python_requires in project files..."
rg -n "python_requires" .
rg -n "requires-python" .

echo -e "\nSearching GitHub Actions workflows for Python version matrix..."
rg -n "python-version" -g "*.yml" .
rg -n "matrix.*python" -g "*.yml" .
```

Length of output: 143


Ensure Python version constraints are enforced and docs are accurate

It looks like the codebase doesn’t declare any Python version bounds in its packaging metadata (no python_requires or requires-python in setup.py/pyproject.toml) nor in its CI workflows, yet the README pins “Python 3.11.1 to 3.12.9”. Please confirm the actual supported Python versions and either:

  • Add a python_requires=">=3.11.1,<3.13" (or whatever range you support) to your packaging config
  • Update the README to match the true compatibility tested in CI
  • (Optional) Add those versions to your GitHub Actions matrix to enforce coverage

The GitHub repo and subdirectory path (lightspeed-core/lightspeed-evaluation#main/lsc_agent_eval) are public and valid.

🤖 Prompt for AI Agents
In test/evals/README.md around lines 7 to 15, the README specifies Python
version 3.11.1 to 3.12.9 but the packaging metadata and CI workflows do not
enforce this constraint. Verify the actual supported Python versions, then add a
python_requires field with the correct version range to setup.py or
pyproject.toml, update the README to reflect the accurate supported versions,
and optionally include these versions in the GitHub Actions matrix to ensure CI
coverage.
