
Refresh perf-gate baselines from CI benchmark results#5328

Merged
tig merged 3 commits into develop from copilot/update-placeholder-values-baseline-json
May 17, 2026
Contributor

Copilot AI commented May 17, 2026

Fixes

Proposed Changes/Todos

  • Update Tests/Benchmarks/baseline.json with rounded mean values from the BenchmarkDotNet artifact generated by GitHub Actions run 25992789200, job 76401866541
  • Remove stale placeholder comment fields from benchmark entries
  • Update the top-level _comment to document the exact CI run/job used as the refresh source
  • Validate baseline.json syntax
  • Validate the refreshed baselines against the downloaded artifact with no missing entries, regressions, or improvements under the perf-gate thresholds
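The syntax and placeholder checks in the list above could be sketched roughly as follows. This is only an illustration: the `benchmarks` key, the per-entry layout, and the `find_problems` helper are assumptions made for the example, not the actual schema of Tests/Benchmarks/baseline.json or part of the repository, and the sample values are approximate figures taken from this PR.

```python
import json

# Hypothetical baseline layout; the real baseline.json may be structured differently.
SAMPLE = """
{
  "_comment": "Refreshed from CI run 25992789200, job 76401866541",
  "benchmarks": {
    "TableViewScrollBenchmark/PageDown_OneStep/Rows=1000": {"meanNs": 25000000},
    "ListViewScrollBenchmark/PageDown_OneStep/Items=1000": {
      "meanNs": 23000000,
      "comment": "Placeholder"
    }
  }
}
"""


def find_problems(text):
    """Return a list of problems found; an empty list means the text looks clean."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    for name, entry in data.get("benchmarks", {}).items():
        mean = entry.get("meanNs")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(mean, bool) or not isinstance(mean, (int, float)) or mean <= 0:
            problems.append(f"{name}: meanNs must be a positive number")
        if "comment" in entry:
            problems.append(f"{name}: stale comment field")
    return problems


for problem in find_problems(SAMPLE):
    print(problem)
```

In this sample the second entry is flagged because it still carries a comment field; a refreshed file would be expected to produce no output.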

Pull Request checklist:

  • I've named my PR in the form of "Fixes #issue. Terse description."
  • My code follows the style guidelines of Terminal.Gui - if you use Visual Studio, hit CTRL-K-D to automatically reformat your files before committing.
  • My code follows the Terminal.Gui library design guidelines
  • I ran dotnet test before commit
  • I have made corresponding changes to the API documentation (using /// style comments)
  • My changes generate no new warnings
  • I have checked my code and corrected any poor grammar or misspellings
  • I conducted basic QA to assure all features are working
Original prompt

Problem

The perf-benchmarks job in .github/workflows/perf-gate.yml is failing on the develop branch (job link) after a back-merge from main. The failure is not a real performance regression — it is caused by stale placeholder values in Tests/Benchmarks/baseline.json.

All scroll benchmark entries in baseline.json contain placeholder meanNs values (200,000–500,000 ns = 0.2–0.5 ms) with "comment": "Placeholder". These were never replaced with real measurements. The actual benchmark timings on CI are orders of magnitude higher (e.g., TableView/PageDown/Rows=1000 takes ~25 ms = 25,000,000 ns vs the placeholder of 300,000 ns), causing the 3× regression gate to trip for 21 benchmarks.

The ConfigurationManagerLoadBenchmark/LoadAndApply baseline is also stale — it was measured at 3,185,090 ns but the current run shows ~11,567,200 ns (3.63×), just over the threshold.
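The two ratios quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and treating the gate as a strict "greater than 3×" comparison is an assumption about how the workflow's script behaves.

```python
THRESHOLD = 3.0  # regression gate cited in this PR


def ratio(current_ns, baseline_ns):
    """How many times slower the current mean is than the baseline."""
    return current_ns / baseline_ns


def is_regression(current_ns, baseline_ns, threshold=THRESHOLD):
    # Assumption: the gate trips only when the ratio strictly exceeds the threshold.
    return ratio(current_ns, baseline_ns) > threshold


# TableView/PageDown/Rows=1000: ~25 ms measured vs the 300,000 ns placeholder.
table_ratio = ratio(25_000_000, 300_000)
# ConfigurationManagerLoadBenchmark/LoadAndApply: current run vs the stale baseline.
config_ratio = ratio(11_567_200, 3_185_090)

print(f"TableView:  {table_ratio:.1f}x, regression={is_regression(25_000_000, 300_000)}")
print(f"ConfigLoad: {config_ratio:.2f}x, regression={is_regression(11_567_200, 3_185_090)}")
```

The placeholder case is off by nearly two orders of magnitude, while the configuration benchmark only just crosses the gate.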

The workflow did not fail on main because the benchmark job runs only on push, and the PR to main was not a push to those branches at the time the baselines were committed.

Fix Required

Update Tests/Benchmarks/baseline.json with realistic meanNs values based on actual CI measurements visible in the failing job logs.

From the job logs, the actual measured means are (read from the BDN output tables):

  • ListViewScrollBenchmark (baseline=ScrollDown_OneStep): PageDown_OneStep/Items=1000 ~23 ms. The ScrollDown_OneStep baseline is similar.
  • TableViewScrollBenchmark (baseline=ScrollDown_OneStep): PageDown_OneStep/Rows=100 ~73 ms, Rows=1000 ~25 ms. ScrollDown_OneStep is the baseline.
  • TextViewScrollBenchmark (baseline=ScrollDown_OneStep): PageDown_OneStep/Lines=1000 ~196 ms, Lines=5000 much larger.
  • ConfigurationManagerLoadBenchmark/LoadAndApply: ~11,567,200 ns measured.

Specific changes needed in Tests/Benchmarks/baseline.json

Replace all entries that have "comment": "Placeholder" with realistic values. Use the actual measured values from the CI logs as a guide, and set the meanNs to approximately 3× the observed mean (the targets below range from about 3× to 4×) to give headroom for CI variance without masking real regressions. Remove the "comment": "Placeholder" fields.
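As a rough illustration of the headroom rule, a baseline can be derived from an observed mean with a simple multiplier. The helper names below are hypothetical, and the factor is left as a parameter because the right amount of headroom is a judgment call; the 23 ms input is the approximate ListView figure quoted from the job logs.

```python
def with_headroom(observed_ns, factor):
    """Scale an observed mean (in ns) by a headroom factor and round to whole ns."""
    return round(observed_ns * factor)


def ns_to_ms(ns):
    """Convert nanoseconds to milliseconds for easier reading."""
    return ns / 1_000_000


observed_ns = 23_000_000  # ~23 ms, ListView PageDown_OneStep/Items=1000

for factor in (2, 3):
    baseline_ns = with_headroom(observed_ns, factor)
    print(f"{factor}x headroom: {baseline_ns:,} ns ({ns_to_ms(baseline_ns):.0f} ms)")
```

With these inputs, a factor of 2 gives 46 ms and a factor of 3 gives 69 ms.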

Here are the target values to use (in nanoseconds):

BaselineScrollBenchmark (no real data in logs — use generous placeholders that won't false-fire)

  • ViewportScroll_Down/ContentHeight=1000: 50,000,000 (50 ms)
  • ViewportScroll_Down/ContentHeight=10000: 50,000,000
  • ViewportScroll_Up/ContentHeight=1000: 50,000,000
  • ViewportScroll_Up/ContentHeight=10000: 50,000,000
  • ViewportScroll_PageDown/ContentHeight=1000: 50,000,000
  • ViewportScroll_PageDown/ContentHeight=10000: 50,000,000

ListViewScrollBenchmark (observed ~23 ms; ceiling set at ~70 ms, about 3× the observed mean)

  • ScrollDown_OneStep/Items=1000: 70,000,000
  • ScrollDown_OneStep/Items=10000: 70,000,000
  • PageDown_OneStep/Items=1000: 70,000,000
  • PageDown_OneStep/Items=10000: 70,000,000

TableViewScrollBenchmark (observed ScrollDown ~45 ms, PageDown ~73 ms Rows=100, ~25 ms Rows=1000)

  • ScrollDown_OneStep/Rows=100: 150,000,000
  • ScrollDown_OneStep/Rows=1000: 150,000,000
  • PageDown_OneStep/Rows=100: 250,000,000
  • PageDown_OneStep/Rows=1000: 100,000,000

TextViewScrollBenchmark (observed PageDown ~196 ms Lines=1000, ScrollDown similar)

  • ScrollDown_OneStep/Lines=1000: 600,000,000
  • ScrollDown_OneStep/Lines=5000: 2,000,000,000
  • ScrollUp_OneStep/Lines=1000: 600,000,000
  • ScrollUp_OneStep/Lines=5000: 2,000,000,000
  • PageDown_OneStep/Lines=1000: 600,000,000
  • PageDown_OneStep/Lines=5000: 2,000,000,000

ConfigurationManagerLoadBenchmark (observed ~11.6 ms, set ceiling at ~35 ms)

  • LoadAndApply/: 35,000,000
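One way to read the target values above is to work out where the gate would actually trip once the 3× threshold is applied on top of a padded baseline. The sketch below assumes the gate compares the current mean against meanNs × 3; the helper names are illustrative, and the observed figures are the approximate ones quoted in this prompt.

```python
GATE = 3.0  # regression threshold cited in this PR


def trip_point_ns(baseline_ns, gate=GATE):
    """Mean (in ns) above which the gate would report a regression."""
    return baseline_ns * gate


def tolerated_slowdown(observed_ns, baseline_ns, gate=GATE):
    """How many times slower than today's observed mean a run can be before tripping."""
    return trip_point_ns(baseline_ns, gate) / observed_ns


# (observed mean, proposed baseline), both in ns, taken from the lists above.
targets = {
    "ListView PageDown Items=1000": (23_000_000, 70_000_000),
    "TableView PageDown Rows=100": (73_000_000, 250_000_000),
    "TableView PageDown Rows=1000": (25_000_000, 100_000_000),
    "TextView PageDown Lines=1000": (196_000_000, 600_000_000),
    "ConfigurationManager LoadAndApply": (11_567_200, 35_000_000),
}

for name, (observed_ns, baseline_ns) in targets.items():
    slowdown = tolerated_slowdown(observed_ns, baseline_ns)
    trip_ms = trip_point_ns(baseline_ns) / 1_000_000
    print(f"{name}: trips above {trip_ms:.0f} ms, about {slowdown:.1f}x the observed mean")
```

Under that assumption, each of these benchmarks would need to slow down roughly 9× to 12× from its observed mean before the gate reports a regression.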

Also update the _comment field to note that baselines were updated from actual CI measurements in the back-merge PR #5326 run.

Keep all other entries (ThemeSwitchBenchmark, SchemeAttributeBenchmark, SchemeSerializationBenchmark) unchanged — those had real measured values and are passing.

File to edit

Tests/Benchmarks/baseline.json

No other files need to change.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: I'll grab the logs for that failing job right away!

Assistant: The job is failing due to 21 performance regressions detected by the Performance Gate workflow. The Python comparison script checked benchmark results against baseline.json and found multiple benchmarks exceeding the 3.0× threshold.
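The comparison described here might look something like the sketch below. It is not the repository's script: the `compare` function, the flat name-to-nanoseconds dictionaries, and the strict "greater than" test are all assumptions made for illustration, and the input values are the approximate ones mentioned earlier in this prompt.

```python
def compare(baseline, current, threshold=3.0):
    """Compare current means against baselines (both map name -> mean in ns).

    Returns (regressions, missing): regressions is a list of (name, ratio)
    for benchmarks slower than threshold x baseline, and missing lists
    baseline entries that have no current result.
    """
    regressions, missing = [], []
    for name, baseline_ns in sorted(baseline.items()):
        if name not in current:
            missing.append(name)
            continue
        ratio = current[name] / baseline_ns
        if ratio > threshold:
            regressions.append((name, ratio))
    return regressions, missing


baseline = {
    "TableViewScrollBenchmark/PageDown_OneStep/Rows=1000": 300_000,
    "ConfigurationManagerLoadBenchmark/LoadAndApply": 3_185_090,
    "ExampleBenchmark/NotRunThisTime": 1_000_000,  # hypothetical entry
}
current = {
    "TableViewScrollBenchmark/PageDown_OneStep/Rows=1000": 25_000_000,
    "ConfigurationManagerLoadBenchmark/LoadAndApply": 11_567_200,
}

regressions, missing = compare(baseline, current)
for name, ratio in regressions:
    print(f"REGRESSION {name}: {ratio:.2f}x")
for name in missing:
    print(f"MISSING {name}")
```

A gate built this way would fail the job whenever either list is non-empty.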

Key Regressions

Benchmark Baseline Current Ratio
TableViewScrollBenchmark/PageDown_OneStep/Rows=1000 300.0 µs 25,0...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Update stale placeholder values in baseline.json" to "Refresh perf-gate baselines after stale placeholder regressions" May 17, 2026
Copilot finished work on behalf of tig May 17, 2026 14:05
Copilot AI requested a review from tig May 17, 2026 14:05
Copilot stopped work on behalf of tig due to an error May 17, 2026 14:18
Copilot finished work on behalf of tig May 17, 2026 14:20
Member

tig commented May 17, 2026

@copilot I remain confused. Why does baseline.json have any placeholder values in it at this point? The benchmarks all ran in this workflow: https://github.com/gui-cs/Terminal.Gui/actions/runs/25992789200/job/76401866541

Thus there are real results to be used as the baseline.

Am I missing something?

Contributor Author

Copilot AI commented May 17, 2026

@copilot I remain confused. Why does baseline.json have any placeholder values in it at this point? The benchmarks all ran in this workflow: https://github.com/gui-cs/Terminal.Gui/actions/runs/259927...

You’re right — I replaced the headroom-adjusted values with the actual rounded means from the uploaded BenchmarkDotNet artifact for run 25992789200/job 76401866541. Addressed in f4bf4f9.


2 participants