Python: Flaky test report #5342

Merged
giles17 merged 9 commits into main from flaky-test-report on Apr 22, 2026

Conversation

giles17 (Contributor) commented Apr 17, 2026

Motivation and Context

As part of CI/CD hardening, we need better visibility into flaky integration tests to reduce noise and improve signal quality. Currently, there's no easy way to see which tests are intermittently failing across CI runs, which makes it hard to prioritize fixes or identify regressions.

This PR adds automated flaky test trend reporting and enables additional integration tests in CI.

Description

Flaky Test Trend Report

Adds a new CI job (python-flaky-test-report) that runs after all integration test jobs complete. It follows the same artifact-upload → aggregate → Job Summary pattern used by scripts/sample_validation/.

How it works:

  1. Each of the 6 integration test jobs (OpenAI, Azure OpenAI, Misc, Functions, Foundry, Cosmos) now produces a JUnit XML file (pytest.xml) and uploads it as an artifact
  2. The report job downloads all artifacts, parses the XML, and merges results into a single run
  3. Results are combined with cached history (up to 5 previous runs) to generate a markdown trend table
  4. The trend table is posted to the GitHub Actions Job Summary
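The parse-and-merge steps above can be sketched roughly as follows. This is an illustrative outline, not the actual contents of `aggregate.py` — the function names and the `classname::name` key format are assumptions:

```python
# Illustrative sketch of the parse-and-merge step; the real logic lives in
# python/scripts/flaky_report/aggregate.py and may differ in detail.
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_junit_xml(path: Path) -> dict[str, str]:
    """Map 'classname::name' -> status for one pytest JUnit XML file."""
    results: dict[str, str] = {}
    root = ET.parse(path).getroot()
    for case in root.iter("testcase"):
        key = f"{case.get('classname')}::{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            results[key] = "failed"
        elif case.find("skipped") is not None:
            results[key] = "skipped"
        else:
            results[key] = "passed"
    return results


def merge_runs(per_job: list[dict[str, str]]) -> dict[str, str]:
    """Combine all per-job results into a single run's view."""
    merged: dict[str, str] = {}
    for job in per_job:
        merged.update(job)  # later jobs win on key collisions
    return merged
```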

Report features:

  • Overall pass/fail/skip counts per run
  • Per-test status grid with emoji indicators (✅ passed, ❌ failed, ⏭️ skipped, ⚠️ xfail)
  • File column showing the test module for easy identification
  • Provider column showing which integration suite the test belongs to
  • Robust XML parsing — corrupt/truncated files are skipped with a warning, not a crash
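A status grid like the one described can be rendered as a markdown table along these lines. The column layout and helper names here are illustrative, not the report's exact output:

```python
# Minimal sketch of the emoji trend-table rendering; the actual report also
# includes File and Provider columns.
EMOJI = {"passed": "✅", "failed": "❌", "skipped": "⏭️", "xfail": "⚠️"}


def render_trend_table(history: list[dict[str, str]]) -> str:
    """history[0] is the oldest run; each dict maps test key -> status."""
    tests = sorted({t for run in history for t in run})
    header = "| Test | " + " | ".join(f"Run {i + 1}" for i in range(len(history))) + " |"
    sep = "|" + " --- |" * (len(history) + 1)
    rows = [header, sep]
    for test in tests:
        cells = [EMOJI.get(run.get(test, ""), "–") for run in history]
        rows.append(f"| `{test}` | " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

A test absent from an older run renders as a dash, which keeps the grid rectangular when new tests are added.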

CI Workflow Changes

  • Added --junitxml=pytest.xml to all integration test jobs in both workflow files
  • Added actions/upload-artifact steps to upload JUnit XML from each job
  • Added python-flaky-test-report aggregation job to both python-merge-tests.yml and python-integration-tests.yml
  • Fixed Cosmos job --junitxml path (pre-existing bug: uv run --directory wrote XML to the wrong location)
  • Added Foundry embedding env vars (FOUNDRY_MODELS_ENDPOINT, FOUNDRY_MODELS_API_KEY, FOUNDRY_EMBEDDING_MODEL, FOUNDRY_EMBEDDING_DIMENSIONS) to python-merge-tests.yml to enable the Foundry embedding integration test in CI

Files Changed

  • python/scripts/flaky_report/__init__.py — New package
  • python/scripts/flaky_report/__main__.py — CLI entry point
  • python/scripts/flaky_report/aggregate.py — JUnit XML parser, history management, trend report generator
  • .github/workflows/python-merge-tests.yml — Artifact uploads, report job, Cosmos path fix, Foundry embedding env vars
  • .github/workflows/python-integration-tests.yml — Artifact uploads, report job, --junitxml flag

Note: The report job is intentionally NOT added to the merge gate's needs list — it runs as an informational side-job and cannot block merges.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

giles17 and others added 3 commits April 17, 2026 12:33
Parse JUnit XML (pytest.xml) from each integration test job and
aggregate results into a markdown trend report showing per-test
pass/fail/skip status across the last 5 runs.

Changes:
- Add python/scripts/flaky_report/ package (JUnit XML parser + trend
  report generator following the sample_validation pattern)
- Add upload-artifact steps to all 6 integration test jobs in both
  python-merge-tests.yml and python-integration-tests.yml
- Add python-flaky-test-report aggregation job with history caching
- Add --junitxml=pytest.xml to integration-tests.yml jobs (already
  present in merge-tests.yml)
- Fix Cosmos job --junitxml path (use absolute path since uv run
  --directory changes cwd)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
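The history-caching step described in this commit amounts to appending the new run to the cached list and capping its length. A hedged sketch, assuming a JSON history file and a cap of the current run plus 5 previous runs (the file name and function are hypothetical):

```python
# Hypothetical sketch of the history cap: merge the new run into cached
# history, keeping at most 5 previous runs plus the current one.
import json
from pathlib import Path

MAX_PREVIOUS_RUNS = 5


def update_history(history_file: Path, new_run: dict) -> list[dict]:
    previous: list[dict] = []
    if history_file.exists():
        previous = json.loads(history_file.read_text())
    history = (previous + [new_run])[-(MAX_PREVIOUS_RUNS + 1):]
    history_file.write_text(json.dumps(history))
    return history
```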
- Guard against missing reports directory in load_current_run()
- Only run report job when at least one integration test job completed
  (skip when all jobs are skipped, e.g. on pull_request events)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use explicit provider name mapping in _derive_provider() so OpenAI
  renders correctly instead of 'Openai'
- Fix operator precedence in workflow if-expressions by wrapping
  success/failure checks in parentheses

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
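The provider-name fix in this commit can be sketched as an explicit lookup table. The function name `_derive_provider` comes from the commit message; the key format is an assumption:

```python
# Sketch of the explicit provider-name mapping: "openai".title() would render
# as "Openai", so known providers are spelled out and .title() is only a
# fallback for unknown names.
_PROVIDER_NAMES = {
    "openai": "OpenAI",
    "azure_openai": "Azure OpenAI",
    "misc": "Misc",
    "functions": "Functions",
    "foundry": "Foundry",
    "cosmos": "Cosmos",
}


def derive_provider(artifact_name: str) -> str:
    key = artifact_name.lower().replace("-", "_")
    return _PROVIDER_NAMES.get(key, artifact_name.title())
```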
Copilot AI review requested due to automatic review settings April 17, 2026 22:17
@giles17 giles17 marked this pull request as draft April 17, 2026 22:18
@github-actions github-actions Bot changed the title Flaky test report Python: Flaky test report Apr 17, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds a Python-based “flaky test trend” report that aggregates per-job JUnit XML outputs from CI, persists a short history via GitHub Actions cache, and posts a consolidated markdown report to the workflow job summary.

Changes:

  • Introduces python/scripts/flaky_report to parse multiple pytest.xml artifacts, merge them, and generate a markdown trend report.
  • Updates Python CI workflows to produce --junitxml=pytest.xml per provider job and upload those XML files as artifacts.
  • Adds a new “Flaky Test Report” job to download artifacts, restore/save history cache, and publish the unified report.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| --- | --- |
| python/scripts/flaky_report/aggregate.py | Implements JUnit XML aggregation, history persistence, and markdown trend report generation. |
| python/scripts/flaky_report/__main__.py | Adds the `python -m scripts.flaky_report ...` entry point that dispatches to the aggregator CLI. |
| python/scripts/flaky_report/__init__.py | Documents the flaky report package purpose and usage. |
| .github/workflows/python-merge-tests.yml | Uploads per-job pytest.xml artifacts and adds a downstream aggregation/report job with caching. |
| .github/workflows/python-integration-tests.yml | Uploads per-job pytest.xml artifacts and adds the same downstream aggregation/report job with caching. |

- Add File column showing module name (e.g., test_openai_chat_client)
  to disambiguate tests with the same function name across files
- Detect pytest xfail tests in JUnit XML (type=pytest.xfail) and
  show them with a distinct warning emoji instead of skip emoji
- Update legend to include xfail explanation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
giles17 and others added 2 commits April 21, 2026 10:39
When a test is inside a class, pytest writes the classname as e.g.
'pkg.test_file.TestClass'. The previous rsplit logic extracted
'TestClass' instead of 'test_file'. Now detect uppercase-starting
segments as class names and use the preceding segment instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
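The classname fix described in this commit can be sketched as follows. The helper name is illustrative; the real code sits in `aggregate.py`:

```python
# Sketch of the module-name extraction fix: pytest writes classnames like
# 'pkg.test_file.TestClass' for tests inside classes, so walk the dotted
# segments from the right and skip uppercase-starting (class) segments.
def module_from_classname(classname: str) -> str:
    segments = classname.split(".")
    for segment in reversed(segments):
        if segment and not segment[0].isupper():
            return segment
    return segments[-1] if segments else classname
```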
…ocstring

- Use datetime.now(timezone.utc) for accurate UTC timestamps
- Catch ET.ParseError per-file so corrupt XML doesn't crash the report
- Remove separate 'error' key from summary (errors folded into 'failed')
- Fix _short_name docstring to show actual dotted classname::name format

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
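The per-file `ET.ParseError` guard from this commit can be sketched like so (function name and warning format are illustrative; `::warning::` is the GitHub Actions workflow-command syntax for surfacing a warning annotation):

```python
# Sketch of the per-file ParseError guard: a corrupt or truncated pytest.xml
# is logged and skipped instead of aborting the whole report.
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_all_reports(paths: list[Path]) -> list[ET.Element]:
    roots: list[ET.Element] = []
    for path in paths:
        try:
            roots.append(ET.parse(path).getroot())
        except ET.ParseError as exc:
            print(f"::warning:: skipping corrupt JUnit XML {path}: {exc}")
    return roots
```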
@giles17 giles17 marked this pull request as ready for review April 21, 2026 21:21