feat: add azd CLI evaluation and testing framework #7202
Conversation
Pull request overview
Adds a new cli/azd/test/eval/ evaluation framework intended to measure how well GitHub Copilot CLI (and humans) can discover and use azd commands, plus scheduled GitHub Actions workflows to run these evals and publish artifacts/reports.
Changes:
- Introduces a Node/TypeScript Jest test harness for unit-style CLI surface validation (help text, flags, sequencing).
- Adds Waza task YAMLs (deploy/troubleshoot/environment/lifecycle/negative scenarios) and Python grader scripts for infra/app validation.
- Adds GitHub Actions workflows to run unit tests on PRs and scheduled Waza/E2E/report jobs.
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| cli/azd/test/eval/tsconfig.json | TypeScript build configuration for eval tests/tools |
| cli/azd/test/eval/package.json | Node package + scripts for running Jest/Waza/reporting |
| cli/azd/test/eval/package-lock.json | Locked dependency tree for reproducible installs |
| cli/azd/test/eval/jest.config.ts | Jest configuration (ts-jest + junit in CI) |
| cli/azd/test/eval/.gitignore | Ignores build outputs and generated reports |
| cli/azd/test/eval/reports/.gitkeep | Keeps reports/ directory in git |
| cli/azd/test/eval/eval.yaml | Waza eval configuration (executor/model/metrics/task globs) |
| cli/azd/test/eval/README.md | Documentation for running/extending eval framework |
| cli/azd/test/eval/tests/unit/command-registry.test.ts | Verifies core commands exist and respond to --help |
| cli/azd/test/eval/tests/unit/help-text-quality.test.ts | Checks help output contains expected sections/descriptions |
| cli/azd/test/eval/tests/unit/flag-validation.test.ts | Validates key flags appear/behave as expected |
| cli/azd/test/eval/tests/unit/command-sequencing.test.ts | Ensures commands fail with guidance in empty dirs |
| cli/azd/test/eval/tests/human/cli-workflow.test.ts | “Human baseline” tests for responsiveness/basic UX expectations |
| cli/azd/test/eval/tests/human/command-discovery.test.ts | “Human baseline” tests focused on discovering commands/flags |
| cli/azd/test/eval/tests/human/error-recovery.test.ts | “Human baseline” tests for actionable errors and recovery hints |
| cli/azd/test/eval/tasks/deploy/deploy-python-webapp.yaml | Waza task: deploy Python app guidance |
| cli/azd/test/eval/tasks/deploy/deploy-node-api.yaml | Waza task: deploy Node API guidance |
| cli/azd/test/eval/tasks/deploy/deploy-existing-project.yaml | Waza task: deploy existing azd project (avoid init) |
| cli/azd/test/eval/tasks/environment/create-staging.yaml | Waza task: create staging environment workflow |
| cli/azd/test/eval/tasks/environment/switch-env.yaml | Waza task: switch environments |
| cli/azd/test/eval/tasks/environment/delete-env.yaml | Waza task: teardown + delete environment workflow |
| cli/azd/test/eval/tasks/lifecycle/full-lifecycle.yaml | Waza task: init→provision→deploy→down sequence |
| cli/azd/test/eval/tasks/lifecycle/teardown-only.yaml | Waza task: down/cleanup guidance |
| cli/azd/test/eval/tasks/troubleshoot/auth-error.yaml | Waza task: troubleshoot auth error guidance |
| cli/azd/test/eval/tasks/troubleshoot/config-error.yaml | Waza task: troubleshoot malformed azure.yaml |
| cli/azd/test/eval/tasks/troubleshoot/quota-error.yaml | Waza task: troubleshoot quota error |
| cli/azd/test/eval/tasks/troubleshoot/provision-role-conflict.yaml | Waza task: troubleshoot RBAC role assignment conflict |
| cli/azd/test/eval/tasks/negative/raw-azure-cli.yaml | Waza negative task: use az not azd |
| cli/azd/test/eval/tasks/negative/not-azure.yaml | Waza negative task: non-Azure question should avoid azd |
| cli/azd/test/eval/tasks/negative/general-coding.yaml | Waza negative task: general coding response without azd |
| cli/azd/test/eval/graders/infra_validator.py | Python grader stub for ARM resource existence validation |
| cli/azd/test/eval/graders/cleanup_validator.py | Python grader stub for post-azd down cleanup validation |
| cli/azd/test/eval/graders/app_health.py | Python grader stub for HTTP endpoint health validation |
| cli/azd/.vscode/cspell.yaml | Adds spelling dictionary overrides for eval docs |
| .github/workflows/eval-unit.yml | PR workflow to build azd + run Jest unit suite |
| .github/workflows/eval-waza.yml | Scheduled workflow to run Waza evaluations |
| .github/workflows/eval-e2e.yml | Scheduled workflow intended for E2E lifecycle evals with Azure login |
| .github/workflows/eval-report.yml | Scheduled workflow intended to generate weekly comparison/regression issues |
Good initiative - adding eval coverage for Copilot CLI interactions with azd fills a real gap. The Waza task definitions are well-structured, grader weights are mathematically correct across all 14 tasks, and the CI workflow design (unit on PR, Waza 3x/day, E2E weekly) is sensible.
However, there are structural and reliability issues that should be addressed before merge:
- The `azd()` test helper is copy-pasted across 7 files with subtle inconsistencies (`NO_COLOR` vs `AZD_DEBUG_FORCE_NO_TTY`, `e: any` vs `e: unknown`) - this is already causing bugs and will make maintenance painful
- Human test files don't set `NO_COLOR: "1"`, so regex assertions against help text will be flaky when ANSI escape codes are present
- The `eval.yaml` system prompt omits `azd env delete`, but `delete-env.yaml` expects the LLM to suggest it - this task will score poorly by design
- `app_health.py` has inconsistent retry logic: status mismatches retry, but body-content mismatches return failure immediately
- Two npm devDependencies (`@azure/arm-resources`, `@azure/identity`) are never imported anywhere
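The retry asymmetry called out for `app_health.py` can be sketched as follows. This is a hypothetical `check_health` shape, not the actual grader code; the fix is to treat a body-content mismatch as retryable, exactly like a status mismatch:

```python
import time
import urllib.request
from urllib.error import HTTPError, URLError


def check_health(url, expected_status=200, expected_body=None,
                 retries=5, delay=2.0):
    """Poll an endpoint until both status and body match, or retries run out."""
    last_error = None
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
                body_ok = expected_body is None or expected_body in body
                if resp.status == expected_status and body_ok:
                    return {"score": 1.0, "reason": "healthy"}
                # A body mismatch retries just like a status mismatch: a
                # freshly deployed app may serve placeholder content at first.
                last_error = f"status={resp.status}, body_ok={body_ok}"
        except HTTPError as e:
            last_error = f"status={e.code}"
        except URLError as e:
            last_error = f"connection failed: {e.reason}"
        time.sleep(delay)
    return {"score": 0.0,
            "reason": f"unhealthy after {retries} attempts ({last_error})"}
```

With this shape, both failure modes fall through to the same retry loop instead of only one of them returning immediately.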
I've excluded items already covered by the existing review.
Resolves all issues raised in PR #7202 review:
- Extract shared azd() test helper to tests/test-utils.ts (eliminates duplication across 7 files, consistent NO_COLOR + AZD_FORCE_TTY)
- Fix AZD_DEBUG_FORCE_NO_TTY → AZD_FORCE_TTY=false in all test files
- Add NO_COLOR=1 to human tests (prevents ANSI flakiness)
- Use catch(e: unknown) with proper type narrowing everywhere
- Add azd env delete to eval.yaml system prompt (fixes delete-env task)
- Fix app_health.py retry logic for body-content mismatches
- Remove unused @azure/arm-resources and @azure/identity deps
- Remove missing scripts/ references from package.json and tsconfig
- Reduce jest timeout from 5min to 30s
- eval-unit.yml: add permissions block and waza:validate step
- eval-waza.yml: fix PATH via GITHUB_PATH instead of env.PATH
- eval-e2e.yml: align waza install, fix cleanup step working directory
- eval-report.yml: use gh CLI for cross-run artifact download
- Remove non-existent eval-human.yml from README CI table
- Add cspell overrides for grader/task/test files

All 7 suites pass (125 tests + 4 skipped E2E).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 59c2cb2 to ef33cf4
@jongio - thanks for the feedback. Everything has been addressed here. Ready for another review. @rajeshkamal5050 / @wbreza - we'll need someone with admin access to set up the CI portions, tokens, and subscription. See the PR description for how.
@jongio All review feedback addressed and CI is fully green. Changes: shared test-utils.ts helper, 15 pytest grader tests, app_health.py retry fix, azd env delete in eval.yaml, removed unused deps, jest timeout reduction, workflow fixes (permissions, PATH, artifacts, cleanup), cspell overrides. PR body updated with full setup instructions. Ready for re-review!
jongio
left a comment
Solid foundation for measuring Copilot CLI + azd interactions. The task YAML structure is well-designed, grader weight math is correct, and the CI pipeline layout (unit on PR, Waza scheduled, E2E weekly) makes sense. I've skipped items already raised in the existing reviews and focused on issues I haven't seen mentioned.
The graders have a logic gap in how urlopen handles non-2xx responses - it throws before your status comparison runs, so expected_status only works for 2xx codes. The get_access_token() function is copy-pasted across two grader files. The report workflow is entirely non-functional (placeholder echo + missing dependency file). A couple of the task YAML graders are either redundant or too strict in what they require from the LLM response.
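The `urlopen` gap is worth spelling out: `urllib.request.urlopen` raises `HTTPError` for any non-2xx response, so a naive `resp.status == expected_status` comparison never runs when the expected status is, say, 404. A minimal sketch of the uniform fix (function names are illustrative, not the grader's actual code):

```python
import urllib.request
from urllib.error import HTTPError


def fetch_status(url: str) -> int:
    """Return the HTTP status code whether or not it is 2xx.

    urlopen() raises HTTPError for 4xx/5xx, so for those responses the
    status must be read off the exception; comparing resp.status alone
    only ever sees 2xx codes.
    """
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except HTTPError as e:
        return e.code  # non-2xx: status lives on the exception


def status_matches(url: str, expected_status: int) -> bool:
    """An expected_status check that works uniformly for any code."""
    return fetch_status(url) == expected_status
```

With `fetch_status` in place, an `expected_status: 404` task grades the same way an `expected_status: 200` task does.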
Force-pushed from ef33cf4 to 24d2af3
@jongio Round 2 feedback addressed, rebased on main, and all CI passing locally. Changes: HTTPError non-2xx handling in app_health.py, shared azure_auth.py module, cleaned up eval-report.yml, relaxed teardown-only.yaml grader, fixed duplicate grader in deploy-existing-project.yaml, Windows .exe support in test-utils.ts. Replied to each comment individually. Ready for re-review!
Force-pushed from 24d2af3 to ebef1ca
jongio
left a comment
Previous comments addressed. One gap from the new error-handling change:
- `graders/infra_validator.py:85` - `list_resources()` call unprotected after it was changed to re-raise non-404 HTTPErrors

Same pattern exists in `cleanup_validator.py` for `list_remaining_resources()`.
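The division of labor being asked for can be sketched as follows; `fetch` and the grade shape are hypothetical stand-ins for the graders' actual ARM calls:

```python
from urllib.error import HTTPError


def list_resources(fetch):
    """List ARM resources via a fetch callable.

    A 404 means the resource group no longer exists, which is a valid
    'nothing there' answer; any other HTTP error is a real failure and
    is re-raised so the caller can score it explicitly.
    """
    try:
        return fetch()
    except HTTPError as e:
        if e.code == 404:
            return []
        raise  # non-404: surface to the caller


def grade(fetch):
    """Caller wraps the listing call so an unexpected error becomes score 0."""
    try:
        resources = list_resources(fetch)
    except HTTPError as e:
        return {"score": 0.0, "reason": f"ARM API error: HTTP {e.code}"}
    return {"score": 1.0 if resources == [] else 0.0,
            "reason": f"{len(resources)} resources remaining"}
```

Once `list_resources()` re-raises, every caller needs the `try/except` wrapper above, otherwise a transient 503 crashes the grader instead of producing a scored failure.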
jongio
left a comment
Issues to address:
- `graders/cleanup_validator.py:97` - `list_remaining_resources()` call isn't wrapped in try/except (same class as the open thread on `infra_validator.py:85`)
- `.github/workflows/eval-waza.yml:40` - missing `timeout-minutes` on Waza run step

Minor:
- `tasks/troubleshoot/quota-error.yaml:20` - `must_not_match: "contact support"` penalizes valid advice
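The quota-error concern could be addressed along these lines. This is a hypothetical fragment - only the `must_not_match` narrowing comes from the review thread; the surrounding task schema and match strings are assumptions:

```yaml
# tasks/troubleshoot/quota-error.yaml (sketch, not the merged file)
graders:
  - type: text
    must_match_any:
      - "quota increase"
      - "different region"
    # Narrowed from "contact support": only penalize a response whose sole
    # advice is contacting support, not any mention of it alongside
    # actionable steps.
    must_not_match: "only contact support"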
Force-pushed from cb4aa11 to 2bac0dc
Add a comprehensive eval/test framework for measuring how GitHub Copilot CLI interacts with azd. Includes:
- 75 unit tests across 4 suites: command registry, help text quality, command sequencing, and flag validation
- Human-scenario test stubs for CLI workflow, command discovery, and error recovery evaluation
- Waza-compatible task YAML definitions for LLM eval (deploy, lifecycle, environment, troubleshoot, negative scenarios)
- Custom graders for infrastructure validation, app health, and cleanup
- CI workflows for unit tests, E2E tests, Waza runs, and report generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d tests

Covers:
- Step-by-step Waza LLM eval task creation with full YAML reference
- Grader reference table (text, action_sequence, behavior, code)
- Custom Python grader authoring with Azure ARM API example
- Jest unit test and human scenario test templates
- Directory structure reference
- Regex tips and common patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- command-registry: exclude beta-gated commands (build) from root --help assertion
- command-sequencing: accept auth-related errors in CI where no Azure login exists
- cspell: add waza/urlopen to eval README overrides

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removing AZD_CONFIG_DIR override prevents tests from wiping auth state and triggering browser-based Azure login. Tests only need an empty cwd to verify azd fails gracefully without a project. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntee Adds test tier table showing which tests need auth, step-by-step service principal setup for CI, and local subscription configuration. Clarifies that no test should ever open a browser. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolves all issues raised in PR #7202 review: - Extract shared azd() test helper to tests/test-utils.ts (eliminates duplication across 7 files, consistent NO_COLOR + AZD_FORCE_TTY) - Fix AZD_DEBUG_FORCE_NO_TTY → AZD_FORCE_TTY=false in all test files - Add NO_COLOR=1 to human tests (prevents ANSI flakiness) - Use catch(e: unknown) with proper type narrowing everywhere - Add azd env delete to eval.yaml system prompt (fixes delete-env task) - Fix app_health.py retry logic for body-content mismatches - Remove unused @azure/arm-resources and @azure/identity deps - Remove missing scripts/ references from package.json and tsconfig - Reduce jest timeout from 5min to 30s - eval-unit.yml: add permissions block and waza:validate step - eval-waza.yml: fix PATH via GITHUB_PATH instead of env.PATH - eval-e2e.yml: align waza install, fix cleanup step working directory - eval-report.yml: use gh CLI for cross-run artifact download - Remove non-existent eval-human.yml from README CI table - Add cspell overrides for grader/task/test files All 7 suites pass (125 tests + 4 skipped E2E). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- app_health.py: handle non-2xx expected_status via HTTPError.code comparison
- Extract shared get_access_token() to graders/azure_auth.py (was duplicated in cleanup_validator.py and infra_validator.py)
- eval-report.yml: remove non-functional regression issue step, drop issues:write permission, add TODO for future report generation script
- teardown-only.yaml: relax --purge from must_match to must_match_any (--force without --purge is a valid response)
- deploy-existing-project.yaml: replace duplicate grader with check for --no-prompt, azure.yaml, service, or --all
- test-utils.ts: add .exe extension on Windows for cross-platform support
- Add 2 new pytest tests for HTTPError expected_status matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ling

- Fix mock patch targets: patch cleanup_validator/infra_validator instead of azure_auth to correctly intercept imported names
- Replace azd env delete with azd env remove (correct CLI command)
- Add pytest step to eval-unit.yml for grader tests
- Remove continue-on-error from waza validate step
- Catch HTTPError in grade() functions to return score 0 gracefully
- Fix README grader signature: grade(context) not grade(inputs, params)
- Filter eval-report.yml artifact downloads to main branch only

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
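The mock-target fix follows the standard `unittest.mock` rule: patch the name where it is looked up, not where it is defined. A self-contained illustration (module names mirror the graders, but the function bodies are invented for the demo):

```python
import sys
import types
from unittest import mock

# Simulate azure_auth defining get_access_token, and infra_validator
# importing it by name (from azure_auth import get_access_token).
azure_auth = types.ModuleType("azure_auth")
azure_auth.get_access_token = lambda: "real-token"
sys.modules["azure_auth"] = azure_auth

infra_validator = types.ModuleType("infra_validator")
exec("from azure_auth import get_access_token\n"
     "def check(): return get_access_token()", infra_validator.__dict__)
sys.modules["infra_validator"] = infra_validator

# Patching the defining module does NOT affect the already-imported name:
with mock.patch("azure_auth.get_access_token", return_value="fake"):
    assert infra_validator.check() == "real-token"

# Patching where the name is used does:
with mock.patch("infra_validator.get_access_token", return_value="fake"):
    assert infra_validator.check() == "fake"
```

This is why the pytest patch targets had to move from `azure_auth` to `cleanup_validator`/`infra_validator`: the `from azure_auth import ...` statement copies the function reference into each importing module's namespace.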
The waza CLI is not available in the CI runner environment. Make the validation step conditional to unblock the workflow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com>
Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com>
- README: update code examples to use shared test-utils.ts import (removes catch(e:any), AZD_DEBUG_FORCE_NO_TTY, inline helpers)
- README: remove nonexistent scripts/ from directory structure
- README: fix CI table eval-report description (removed auto-issue ref)
- README: align Waza install method with CI (npm install -g waza)
- README: clarify code graders are for E2E lifecycle only
- eval-e2e.yml: fix cleanup step to use correct working directory
- deploy-existing-project.yaml: make grader 4 check for deployment explanation instead of duplicating command/flag checks from grader 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix shell injection in test-utils.ts (use execFileSync)
- Fix app_health.py HTTPError body check
- Tighten command-sequencing test assertions
- Remove continue-on-error masking in eval-waza.yml (staged separately)
- Fix action_sequence grader to accept azd up shorthand

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eval failures should fail the workflow. Results are still uploaded via if: always() on the artifact upload step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Narrow azure_auth.py exception handling to expected failures
- Fix cleanup_validator.py to only swallow 404, not all HTTPErrors
- Fix eval-e2e.yml cleanup step to not depend on azure.yaml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix list_resources() to re-raise non-404 HTTPError instead of silently returning empty list. Matches cleanup_validator.py pattern.
- Fix README build path: cd ../../ (not ../../../) from test/eval/.
- Remove unused tsx devDependency from package.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap list_resources/list_remaining_resources callers in try/except to handle non-404 HTTPError after upstream fix.
- Add timeout-minutes: 30 to eval-waza.yml workflow step.
- Narrow 'contact support' negative match to 'only contact support' so mentioning support alongside actionable steps isn't penalized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per danieljurek's feedback, tag resource groups with DeleteAfter before deleting so the cleanup script can detect and remove resources that resist deletion. Switched to PowerShell for consistency with the cleanup script. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 2f0bc69 to 47e1491
jongio
left a comment
Issues to address:
- eval-unit.yml - no timeout-minutes on the job (eval-waza and eval-e2e both have one)
- graders/test_graders.py:12 - bare module imports break when pytest runs from outside graders/
Minor:
- eval.yaml:17 - system prompt omits azd pipeline (tested as CORE_COMMAND in command-registry.test.ts)
Existing review threads look well-addressed. Grader error handling fixes, shared helper extraction, and cleanup script improvements are all solid.
```yaml
jobs:
  unit-tests:
    runs-on: ubuntu-latest
```
[MEDIUM] No timeout-minutes on this job.
eval-waza.yml has timeout-minutes: 30 and eval-e2e.yml has timeout-minutes: 40. This workflow has none - if go build hangs or a test waits for input, it runs for the 6-hour GitHub Actions default.
Suggested change:

```diff
     runs-on: ubuntu-latest
+    timeout-minutes: 15
```
```python
from urllib.error import HTTPError, URLError

import app_health
import cleanup_validator
```
[MEDIUM] These bare module imports only work when pytest runs from the graders/ directory.
CI does this (working-directory: cli/azd/test/eval/graders), but pytest from the eval root or project root fails with ModuleNotFoundError. A conftest.py in this directory that does sys.path.insert(0, os.path.dirname(__file__)) would fix it.
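A minimal `conftest.py` along those lines (a sketch of the suggested fix, not the merged file):

```python
# conftest.py -- placed in graders/ so pytest can resolve the bare
# `import app_health` style imports regardless of invocation directory.
# pytest imports conftest.py before collecting tests, so this runs first.
import os
import sys

# Prepend this file's directory so sibling grader modules resolve first.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
```

With this in place, `pytest cli/azd/test/eval/graders` from the repository root collects and imports the grader tests the same way the CI `working-directory` invocation does.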
```yaml
- azd config set <key> <value>: Set configuration
- azd restore: Restore project dependencies
- azd build: Build application services
- azd package: Package application for deployment
```
[LOW] System prompt lists 19 azd commands but omits azd pipeline.
command-registry.test.ts tests pipeline as a CORE_COMMAND. If a future eval task expects the LLM to suggest azd pipeline config, the LLM won't know the command exists.
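One possible remedy is a one-line addition to the system prompt's command list. This is a hypothetical fragment - the command is real (`azd pipeline config`), but the description wording is assumed:

```yaml
# Sketch of an eval.yaml system prompt addition, matching the existing
# "- <command>: <description>" list style:
- azd pipeline config: Configure a CI/CD pipeline for the project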
Problem

We have no visibility into how GitHub Copilot CLI interacts with `azd`. Unlike the microsoft/github-copilot-for-azure skills repo - which has a comprehensive test, eval, and CI setup - the azd CLI has zero coverage for measuring LLM interactions, command discoverability, or human usability patterns. We have no idea whether Copilot CLI can successfully discover commands, interpret help text, handle errors gracefully, or guide users through common workflows.

Solution

This PR adds a comprehensive evaluation and testing framework at `cli/azd/test/eval/`, inspired by the GHCP4A setup. It covers both LLM eval (how well an AI agent uses azd) and non-LLM unit tests (how well azd surfaces information for human and AI consumption).
What's included

125 passing tests across 7 suites, covering core commands responding to `--help` and the `--output json`, `--no-prompt`, and `-e`/`--environment` flags.

15 Python grader unit tests (pytest):
- `app_health.py` - HTTP endpoint health checks with retry logic
- `cleanup_validator.py` - ARM API validation for post-`azd down` cleanup
- `infra_validator.py` - ARM API validation for post-`azd provision` resources

14 Waza LLM eval task definitions (YAML).

4 CI workflows:
- `eval-unit.yml` - runs unit tests + waza validate on PR
- `eval-waza.yml` - Waza LLM evals 3x/day (Tue-Sat)
- `eval-e2e.yml` - weekly E2E with Azure resource validation
- `eval-report.yml` - weekly report generation + auto-issue creation

Setup Required Before Going Live
Secrets to configure (Settings → Secrets and variables → Actions):
- `AZURE_CLIENT_ID` - `eval-e2e.yml`
- `AZURE_TENANT_ID` - `eval-e2e.yml`
- `AZURE_SUBSCRIPTION_ID` - `eval-e2e.yml` + graders
- `COPILOT_CLI_TOKEN` - `eval-waza.yml`, `eval-e2e.yml`
- `GITHUB_TOKEN` - `eval-report.yml`

Service principal setup

What works without any setup:
- `npm run test:unit`
- `npm run test:human`
- `npm run waza:run:mock`
- `eval-unit.yml` CI (triggered on `cli/azd/test/eval/**`)

What needs secrets:
- `eval-waza.yml` - `COPILOT_CLI_TOKEN`
- `eval-e2e.yml` - `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_SUBSCRIPTION_ID`, `COPILOT_CLI_TOKEN`
- `eval-report.yml` - `GITHUB_TOKEN` (auto)

Testing

All tests pass locally: