MGMT-21148: Add initial eval tests #40
Conversation
This commit replaces the instructions and placeholder tests with actual eval tests that run and pass. Currently they just do a substring test on the output of the model, and only handle read queries (rather than creating a cluster), but this is just a start to get something that can be improved over time. Additionally, the tests currently pass if _any_ of the listed strings are present in the output; it would be better to get more flexible matching behavior (at least an AND option) for things like the versions test.
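For illustration, here is a minimal sketch of the difference between the current ANY-of substring check and the stricter ALL-of (AND) mode described above; the function name and signature are made up for this example and are not taken from the eval code:

```python
# Illustrative only: the real matching logic lives in the eval tooling, not here.
def keywords_present(response: str, keywords: list[str], require_all: bool = False) -> bool:
    """Return True if expected keywords appear in the agent response.

    require_all=False mirrors the current behavior (pass if ANY keyword matches);
    require_all=True is the stricter AND mode mentioned above.
    """
    hits = [kw.lower() in response.lower() for kw in keywords]
    return all(hits) if require_all else any(hits)


# The versions test would still pass in ANY mode even if only one version string
# shows up in the output, but would fail in ALL mode.
print(keywords_present("The latest version is 4.19.", ["4.19", "4.18", "4.20"]))        # True
print(keywords_present("The latest version is 4.19.", ["4.19", "4.18", "4.20"], True))  # False
```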
Walkthrough

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Developer
    participant Makefile
    participant OCM Token Script
    participant Eval Script
    participant Agent API
    Developer->>Makefile: make test-eval
    Makefile->>OCM Token Script: source utils/ocm-token.sh & get_ocm_token
    OCM Token Script->>Makefile: Write token to test/evals/ocm_token.txt
    Makefile->>Eval Script: cd test/evals && python eval.py
    Eval Script->>Eval Script: Load eval_data.yaml
    loop For each evaluation
        Eval Script->>Agent API: Send query
        Agent API-->>Eval Script: Return response
        Eval Script->>Eval Script: Check response vs expected keywords
        Eval Script->>Developer: Print pass/fail result
    end
    Eval Script->>Developer: Print summary and exit status
```
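A compact sketch of the flow shown in this diagram, assuming a flat list of cases in eval_data.yaml and a placeholder HTTP endpoint, path, and payload shape (the real client and API are provided by the eval tooling, not reproduced here):

```python
# Sketch of the eval flow from the diagram above. Endpoint path, payload shape,
# and the "eval_query"/"response" field names are assumptions for illustration.
import sys
import requests
import yaml

AGENT_ENDPOINT = "http://localhost:8090"  # assumed default, as in eval.py

def run_evals(eval_data_path: str = "eval_data.yaml", token_path: str = "ocm_token.txt") -> int:
    token = open(token_path).read().strip()
    cases = yaml.safe_load(open(eval_data_path))  # assumed to be a list of cases
    failures = 0
    for case in cases:
        # Hypothetical request shape; the real client handles auth and routing.
        resp = requests.post(
            f"{AGENT_ENDPOINT}/v1/query",
            headers={"Authorization": f"Bearer {token}"},
            json={"query": case["eval_query"]},
        )
        answer = resp.json().get("response", "")
        passed = any(kw in answer for kw in case.get("expected_key_words", []))
        print(f"{'PASS' if passed else 'FAIL'}: {case['eval_id']}")
        failures += 0 if passed else 1
    print(f"{failures} failure(s)")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(run_evals())
```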
Estimated code review effort: 3 (30–60 minutes) — Moderate complexity due to multiple file deletions and additions affecting the evaluation framework, scripts, and documentation.
Also note that running these tests contacts assisted prod and Gemini, so some changes will probably be needed before we put this into some kind of CI. Maybe we want to mock out assisted so we don't have to deploy it ourselves? We also may not want to run it super frequently (every commit on every PR) in case we start running up our Gemini bill.
I also considered adding some kind of actual Python dependency management, but figured we might want to wait on that until this isn't just in @asamal4's branch?
Actionable comments posted: 2
🧹 Nitpick comments (3)
test/evals/eval_data.yaml (1)
15-22: Consider making version checks more flexible. The hardcoded version numbers (4.19, 4.18, 4.20) may become outdated as new OpenShift versions are released. Consider a more flexible approach such as:
- Regex patterns to match the version format (e.g., `4.\d+`)
- Dynamic version retrieval from the API
- More generic version-related keywords
```diff
-expected_key_words:
-  - "4.19"
-  - "4.18"
-  - "4.20"
+expected_key_words:
+  - "4."
+  - "version"
+  - "OpenShift"
```
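As a sketch of the regex option from the list above (the current eval only does plain substring checks, so pattern support would need to be added first; this helper is illustrative, not part of the PR):

```python
# Sketch only: shows how a regex check avoids hardcoding minor versions that go stale.
import re

VERSION_PATTERN = re.compile(r"\b4\.\d+\b")  # matches 4.18, 4.19, 4.20, 4.21, ...

def mentions_openshift_version(response: str) -> bool:
    """True if the response mentions any 4.x OpenShift version."""
    return bool(VERSION_PATTERN.search(response))

print(mentions_openshift_version("The most recent production version is 4.19."))  # True
print(mentions_openshift_version("No versions are currently available."))         # False
```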
Makefile (1)

71-75: Add error handling for token generation and script dependencies. The target implementation looks correct but lacks error handling. Consider adding checks to ensure dependencies exist and token generation succeeds.
```diff
 test-eval: ## Run agent evaluation tests
 	@echo "Refreshing OCM token..."
+	@if [ ! -f utils/ocm-token.sh ]; then echo "Error: utils/ocm-token.sh not found"; exit 1; fi
 	@. utils/ocm-token.sh && get_ocm_token && echo "$$OCM_TOKEN" > test/evals/ocm_token.txt
+	@if [ ! -s test/evals/ocm_token.txt ]; then echo "Error: Failed to generate OCM token"; exit 1; fi
 	@echo "Running agent evaluation tests..."
+	@if [ ! -f test/evals/eval.py ]; then echo "Error: test/evals/eval.py not found"; exit 1; fi
 	@cd test/evals && python eval.py
```

test/evals/README.md (1)
21-43: Fix markdown formatting issue. The fenced code block lacks a language specification, which affects rendering and accessibility.
````diff
-```
+```text
 Refreshing OCM token...
 Running agent evaluation tests...
 Running 4 evaluation(s)...
````
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- .gitignore (1 hunks)
- Makefile (3 hunks)
- scripts/eval.py (0 hunks)
- test/evals/AGENT_E2E_EVAL.md (0 hunks)
- test/evals/README.md (1 hunks)
- test/evals/basic_evals.yaml (0 hunks)
- test/evals/eval.py (1 hunks)
- test/evals/eval_data.yaml (1 hunks)
💤 Files with no reviewable changes (3)
- test/evals/basic_evals.yaml
- test/evals/AGENT_E2E_EVAL.md
- scripts/eval.py
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
test/evals/README.md
21-21: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (6)
.gitignore (1)
3-6: LGTM! Proper handling of environment-specific and sensitive files. The additions appropriately exclude development environment files and the sensitive `ocm_token.txt` authentication token generated by the evaluation tests.

test/evals/eval_data.yaml (1)
1-29: Well-structured evaluation data with clear test cases. The YAML structure is clean and covers key functionality areas. The substring-based evaluation approach is appropriate for initial testing.
Makefile (1)
7-7: Proper integration with Makefile structure. The addition to the .PHONY list and help output is correctly implemented.
test/evals/README.md (1)
1-44: Comprehensive documentation with clear prerequisites and usage instructions. The README provides excellent coverage of the evaluation mechanism, prerequisites, and usage examples. The documentation structure is logical and user-friendly.
test/evals/eval.py (2)
4-25: Well-structured result formatting function. The `print_test_result` function handles different evaluation types appropriately and provides clear, readable output formatting.
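For readers who don't have the file open, here is a hypothetical reconstruction of what such a formatter might look like; the attribute names and output format are guesses, not the actual eval.py code:

```python
# Hypothetical reconstruction of a result formatter; field names are assumptions.
def print_test_result(index: int, total: int, result) -> None:
    """Print a one-line pass/fail summary, with extra detail on failures."""
    status = "✅ PASS" if result.result == "PASS" else "❌ FAIL"
    print(f"[{index}/{total}] {status} {result.eval_id} ({result.eval_type})")
    if result.result != "PASS":
        # Show what was compared so failures are actionable.
        print(f"    query:    {result.query}")
        print(f"    response: {result.response[:200]}")
```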
42-75: Clean evaluation loop with proper exit code handling. The main evaluation logic is well-structured with immediate result printing and proper exit code handling for CI/CD integration.
test/evals/eval.py
Outdated
```python
args = EvalArgs()

evaluator = AgentGoalEval(args)
configs = evaluator.config_manager.get_eval_data()
```
🛠️ Refactor suggestion
Add error handling for missing dependencies.
The script assumes the agent_eval module and configuration files exist without validation.
```diff
+import os
+
+# Validate dependencies
+try:
+    from agent_eval import AgentGoalEval
+except ImportError:
+    print("Error: agent_eval module not found. Please install it using:")
+    print("pip install git+https://github.com/asamal4/lightspeed-evaluation.git@agent-goal-eval#subdirectory=agent_eval")
+    sys.exit(1)
+
 args = EvalArgs()
+
+# Validate configuration files
+if not os.path.exists(args.eval_data_yaml):
+    print(f"Error: {args.eval_data_yaml} not found")
+    sys.exit(1)
+
+if not os.path.exists(args.agent_auth_token_file):
+    print(f"Error: {args.agent_auth_token_file} not found")
+    sys.exit(1)
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In test/evals/eval.py around lines 37 to 40, add error handling to check if the
agent_eval module and required configuration files are present before using
them. Implement try-except blocks to catch ImportError for the module and
FileNotFoundError or similar exceptions for configuration files, and provide
clear error messages or fallback behavior to handle missing dependencies
gracefully.
test/evals/eval.py
Outdated
```python
class EvalArgs:
    def __init__(self):
        self.eval_data_yaml = 'eval_data.yaml'
        self.agent_endpoint = 'http://localhost:8090'
        self.agent_provider = 'gemini'
        self.agent_model = 'gemini/gemini-2.5-flash'
        self.judge_provider = None
        self.judge_model = None
        self.agent_auth_token_file = 'ocm_token.txt'
        self.result_dir = 'results'
```
🛠️ Refactor suggestion
Address hardcoded configuration values and indentation inconsistency.
Several issues with the configuration class:
- Mixed indentation (line 27 uses spaces instead of tabs)
- Hardcoded localhost endpoint may not work in containerized environments
- Hardcoded Gemini provider/model limits flexibility
```diff
 class EvalArgs:
-    def __init__(self):
+    def __init__(self):
-        self.eval_data_yaml = 'eval_data.yaml'
-        self.agent_endpoint = 'http://localhost:8090'
-        self.agent_provider = 'gemini'
-        self.agent_model = 'gemini/gemini-2.5-flash'
+        self.eval_data_yaml = 'eval_data.yaml'
+        self.agent_endpoint = os.getenv('AGENT_ENDPOINT', 'http://localhost:8090')
+        self.agent_provider = os.getenv('AGENT_PROVIDER', 'gemini')
+        self.agent_model = os.getenv('AGENT_MODEL', 'gemini/gemini-2.5-flash')
```

Also add `import os` at the top of the file.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In test/evals/eval.py around lines 26 to 36, fix the mixed indentation by using
consistent tabs or spaces throughout the EvalArgs class. Replace the hardcoded
localhost URL with a configurable environment variable fallback using os.getenv
to support containerized environments. Similarly, make the agent_provider and
agent_model configurable via environment variables instead of hardcoding
'gemini' values. Also, add 'import os' at the top of the file to enable
environment variable usage.
Moved this to a draft to see if I can use the judge LLM instead of the string matching (which doesn't really validate what we want particularly well).
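For context, the judge-LLM idea is to have a second model grade the agent's answer against the expected response instead of grepping for substrings. A minimal sketch under that assumption follows; the prompt wording and the `complete` callable are placeholders, and the real logic lives in lsc_agent_eval:

```python
# Conceptual sketch of judge-LLM evaluation; not the lsc_agent_eval implementation.
def judge_verdict(complete, agent_response: str, expected_response: str) -> bool:
    """Ask a judge model whether the agent response matches the expectation.

    `complete` is any callable that takes a prompt string and returns the judge
    model's text output (e.g. a thin wrapper around your LLM client).
    """
    prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Expected answer: {expected_response}\n"
        f"Actual answer: {agent_response}\n"
        "Reply with exactly PASS or FAIL."
    )
    return complete(prompt).strip().upper().startswith("PASS")
```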
This works well now with the judge, but I'm going to leave it as a draft until we get a package that isn't in @asamal4's branch.
LSC eval is merged. Also, I see that you are writing separate print statements. I will add more verbose statements to the original LSC eval logic, so that minimal logic would be required here.
eranco74
left a comment
/lgtm
Yeah, IIRC I wasn't getting enough feedback for failed tests, so I just stopped using the original API from your example.
I'll update this to use the merged eval code today, then undraft.
```yaml
    eval_type: judge-llm
    expected_response: "There are several versions of OpenShift available. The most recent production version is 4.19, 4.20 pre release versions are available as well as several previous verrsions."

  - eval_id: available_operators
```
Nit: The naming is a bit confusing to me, since these are Operator bundles rather than operators.
This happened to me also when I asked a question to the service, so I understand the naming was chosen like that?
> This happened to me also when I asked a question to the service, so I understand the naming was chosen like that?
What happened, exactly?
I will try to reproduce it and update here later.
IIRC, I asked about individual operators and the response was about operator bundles, but without calling them "bundles".
It can also be confusing because for the two bundles, we have similarly named operators.
Actionable comments posted: 1
♻️ Duplicate comments (2)
test/evals/eval.py (2)
1-17: Add error handling for missing dependencies. The script imports `lsc_agent_eval` without validation. If this dependency is missing, the script will fail with an unclear error message.

```diff
+import os
+
+# Validate dependencies
+try:
+    from lsc_agent_eval import AgentGoalEval
+except ImportError:
+    print("Error: lsc_agent_eval module not found. Please install it using:")
+    print("pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_agent_eval")
+    sys.exit(1)
+
 import sys
 import logging
 import argparse
-from lsc_agent_eval import AgentGoalEval
```
40-51: Make configuration values configurable and add file validation. The configuration uses hardcoded values that limit flexibility, especially in containerized environments. Additionally, the script should validate that required files exist before proceeding.

```diff
+import os
+
 # Create proper Namespace object for AgentGoalEval
 args = argparse.Namespace()
-args.eval_data_yaml = 'eval_data.yaml'
-args.agent_endpoint = 'http://localhost:8090'
-args.agent_provider = 'gemini'
-args.agent_model = 'gemini/gemini-2.5-flash'
+args.eval_data_yaml = os.getenv('EVAL_DATA_YAML', 'eval_data.yaml')
+args.agent_endpoint = os.getenv('AGENT_ENDPOINT', 'http://localhost:8090')
+args.agent_provider = os.getenv('AGENT_PROVIDER', 'gemini')
+args.agent_model = os.getenv('AGENT_MODEL', 'gemini/gemini-2.5-flash')
 # Set up judge model for LLM evaluation
-args.judge_provider = 'gemini'
-args.judge_model = 'gemini-2.5-flash'
-args.agent_auth_token_file = 'ocm_token.txt'
-args.result_dir = 'results'
+args.judge_provider = os.getenv('JUDGE_PROVIDER', 'gemini')
+args.judge_model = os.getenv('JUDGE_MODEL', 'gemini-2.5-flash')
+args.agent_auth_token_file = os.getenv('AGENT_AUTH_TOKEN_FILE', 'ocm_token.txt')
+args.result_dir = os.getenv('RESULT_DIR', 'results')
+
+# Validate required files exist
+if not os.path.exists(args.eval_data_yaml):
+    print(f"Error: {args.eval_data_yaml} not found")
+    sys.exit(1)
+
+if not os.path.exists(args.agent_auth_token_file):
+    print(f"Error: {args.agent_auth_token_file} not found")
+    sys.exit(1)
```
🧹 Nitpick comments (1)
test/evals/README.md (1)
22-22: Add language specification to fenced code block. The fenced code block should specify a language for proper syntax highlighting and better documentation quality.
````diff
-```
+```bash
````
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- test/evals/README.md (1 hunks)
- test/evals/eval.py (1 hunks)
- test/evals/eval_data.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- test/evals/eval_data.yaml
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
test/evals/README.md
22-22: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (6)
test/evals/README.md (2)
1-6: Clear documentation structure and comprehensive overview. The introduction effectively explains the evaluation mechanism and supported types. The reference to the external documentation provides helpful context for users.
16-21: Concise and clear instructions. The running tests section provides straightforward guidance for users.
test/evals/eval.py (4)
18-39: Well-structured result display function. The function effectively handles different evaluation types and provides clear, user-friendly output for both success and failure cases.
52-57: Clean evaluator initialization and setup. The evaluator initialization and configuration loading is well-structured and provides good user feedback.
58-76: Excellent evaluation loop with clear progress tracking. The loop provides good user feedback with numbered progress indicators and immediate result display. The counting logic is sound and handles both success and failure cases properly.
77-89: Proper final summary and exit code handling. The final summary provides clear results and uses appropriate exit codes for automation and CI integration.
````markdown
## Prerequisites
- **Python**: Version 3.11.1 to 3.12.9
- **Assisted Chat API**: Must be running (`make build-images run`)
- Install lightspeed-core **agent e2e eval**
  ```bash
  pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_agent_eval
  ```
- `GEMINI_API_KEY` env var is set
````
💡 Verification agent
🧩 Analysis chain
Verify Python version requirements and repository availability.
The Python version range (3.11.1 to 3.12.9) seems very specific. Consider whether this range is actually required or if it could be more inclusive. Also, ensure the referenced GitHub repository and subdirectory path are correct and accessible.
🏁 Script executed:
```bash
#!/bin/bash
# Verify the GitHub repository and subdirectory exist
curl -s -o /dev/null -w "%{http_code}" https://github.com/lightspeed-core/lightspeed-evaluation/tree/main/lsc_agent_eval
# Check if the repository is accessible
gh repo view lightspeed-core/lightspeed-evaluation --json name,visibility,isPrivate
```

Length of output: 284
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "Searching for python_requires in project files..."
rg -n "python_requires" .
rg -n "requires-python" .
echo -e "\nSearching GitHub Actions workflows for Python version matrix..."
rg -n "python-version" -g "*.yml" .
rg -n "matrix.*python" -g "*.yml" .
```

Length of output: 143
Ensure Python version constraints are enforced and docs are accurate
It looks like the codebase doesn’t declare any Python version bounds in its packaging metadata (no python_requires or requires-python in setup.py/pyproject.toml) nor in its CI workflows, yet the README pins “Python 3.11.1 to 3.12.9”. Please confirm the actual supported Python versions and either:
- Add a `python_requires=">=3.11.1,<3.13"` (or whatever range you support) to your packaging config
- Update the README to match the true compatibility tested in CI
- (Optional) Add those versions to your GitHub Actions matrix to enforce coverage
The GitHub repo and subdirectory path (lightspeed-core/lightspeed-evaluation#main/lsc_agent_eval) are public and valid.
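If the range is real, declaring it in packaging metadata would look roughly like the following. This is a hypothetical setup.py sketch with placeholder package metadata; the repo may not ship a setup.py at all, in which case pyproject.toml's `requires-python` is the equivalent:

```python
# Hypothetical setup.py sketch showing how the documented Python range could be
# enforced; adjust the bounds to whatever is actually tested in CI.
from setuptools import setup, find_packages

setup(
    name="assisted-chat-evals",   # placeholder name, not the real package
    version="0.0.1",
    packages=find_packages(),
    python_requires=">=3.11.1,<3.13",
)
```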
🤖 Prompt for AI Agents
In test/evals/README.md around lines 7 to 15, the README specifies Python
version 3.11.1 to 3.12.9 but the packaging metadata and CI workflows do not
enforce this constraint. Verify the actual supported Python versions, then add a
python_requires field with the correct version range to setup.py or
pyproject.toml, update the README to reflect the accurate supported versions,
and optionally include these versions in the GitHub Actions matrix to ensure CI
coverage.
This commit replaces the instructions and placeholder tests with actual
eval tests that run and pass.
https://issues.redhat.com/browse/MGMT-21148
cc @eranco74 @omertuc @asamal4
Summary by CodeRabbit
- Chores
- Documentation
- Tests