Refresh perf-gate baselines from CI benchmark results #5328
Merged
Agent-Logs-Url: https://github.com/gui-cs/Terminal.Gui/sessions/c394411c-09a0-4f49-9083-3bdb5fb32571 Co-authored-by: tig <585482+tig@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update stale placeholder values in baseline.json" to "Refresh perf-gate baselines after stale placeholder regressions" on May 17, 2026
Copilot stopped work on behalf of tig due to an error on May 17, 2026 14:18
Member
@copilot I remain confused. Why does baseline.json have any placeholder values in it at this point? The benchmarks all ran in this workflow: https://github.com/gui-cs/Terminal.Gui/actions/runs/25992789200/job/76401866541 Thus there are real results to be used as the baseline. Am I missing something?
Agent-Logs-Url: https://github.com/gui-cs/Terminal.Gui/sessions/7e30b537-f067-4ab4-b7de-c30beccbb9a7 Co-authored-by: tig <585482+tig@users.noreply.github.com>
Contributor
Author
You’re right: I replaced the headroom-adjusted values with the actual rounded means from the uploaded BenchmarkDotNet artifact for run 25992789200, job 76401866541. Addressed in f4bf4f9.
Copilot AI changed the title from "Refresh perf-gate baselines after stale placeholder regressions" to "Refresh perf-gate baselines from CI benchmark results" on May 17, 2026
This was referenced May 21, 2026
This was referenced May 28, 2026
Fixes
Proposed Changes/Todos
- Refresh `Tests/Benchmarks/baseline.json` with rounded mean values from the BenchmarkDotNet artifact generated by GitHub Actions run `25992789200`, job `76401866541`
- Update `_comment` to document the exact CI run/job used as the refresh source
- Validate `baseline.json` syntax

Original prompt
Problem
The `perf-benchmarks` job in `.github/workflows/perf-gate.yml` is failing on the `develop` branch (job link) after a back-merge from `main`. The failure is not a real performance regression; it is caused by stale placeholder values in `Tests/Benchmarks/baseline.json`.

All scroll benchmark entries in `baseline.json` contain placeholder `meanNs` values (200,000–500,000 ns = 0.2–0.5 ms) with `"comment": "Placeholder"`. These were never replaced with real measurements. The actual benchmark timings on CI are orders of magnitude higher (e.g., `TableView/PageDown/Rows=1000` takes ~25 ms = 25,000,000 ns vs the placeholder of 300,000 ns), causing the 3× regression gate to trip for 21 benchmarks.

The `ConfigurationManagerLoadBenchmark/LoadAndApply` baseline is also stale: it was measured at 3,185,090 ns but the current run shows ~11,567,200 ns (3.63×), just over the threshold.

The workflow ran fine on `main` because the benchmark job only runs on `push` and the PR to main was not a push to those branches at the time the baselines were committed.

Fix Required
Update `Tests/Benchmarks/baseline.json` with realistic `meanNs` values based on actual CI measurements visible in the failing job logs.

From the job logs, the actual measured means are (read from the BDN output tables):
- `ListViewScrollBenchmark` (baseline=`ScrollDown_OneStep`): `PageDown_OneStep/Items=1000` ~23 ms. The `ScrollDown_OneStep` baseline is similar.
- `TableViewScrollBenchmark` (baseline=`ScrollDown_OneStep`): `PageDown_OneStep/Rows=100` ~73 ms, `Rows=1000` ~25 ms. `ScrollDown_OneStep` is the baseline.
- `TextViewScrollBenchmark` (baseline=`ScrollDown_OneStep`): `PageDown_OneStep/Lines=1000` ~196 ms, `Lines=5000` much larger.
- `ConfigurationManagerLoadBenchmark/LoadAndApply`: ~11,567,200 ns measured.

Specific changes needed in `Tests/Benchmarks/baseline.json`
Replace all entries that have `"comment": "Placeholder"` with realistic values. Use the actual measured values from the CI logs as a guide, and set the `meanNs` to approximately 2× the observed mean to give headroom for CI variance without masking real regressions. Remove the `"comment": "Placeholder"` fields.

Here are the target values to use (in nanoseconds):
BaselineScrollBenchmark (no real data in logs; use generous placeholders that won't false-fire)
- `ViewportScroll_Down/ContentHeight=1000`: 50,000,000 (50 ms)
- `ViewportScroll_Down/ContentHeight=10000`: 50,000,000
- `ViewportScroll_Up/ContentHeight=1000`: 50,000,000
- `ViewportScroll_Up/ContentHeight=10000`: 50,000,000
- `ViewportScroll_PageDown/ContentHeight=1000`: 50,000,000
- `ViewportScroll_PageDown/ContentHeight=10000`: 50,000,000

ListViewScrollBenchmark (observed ~23 ms, set ceiling at ~70 ms = 3× headroom-friendly)
- `ScrollDown_OneStep/Items=1000`: 70,000,000
- `ScrollDown_OneStep/Items=10000`: 70,000,000
- `PageDown_OneStep/Items=1000`: 70,000,000
- `PageDown_OneStep/Items=10000`: 70,000,000

TableViewScrollBenchmark (observed ScrollDown ~45 ms, PageDown ~73 ms Rows=100, ~25 ms Rows=1000)
- `ScrollDown_OneStep/Rows=100`: 150,000,000
- `ScrollDown_OneStep/Rows=1000`: 150,000,000
- `PageDown_OneStep/Rows=100`: 250,000,000
- `PageDown_OneStep/Rows=1000`: 100,000,000

TextViewScrollBenchmark (observed PageDown ~196 ms Lines=1000, ScrollDown similar)
- `ScrollDown_OneStep/Lines=1000`: 600,000,000
- `ScrollDown_OneStep/Lines=5000`: 2,000,000,000
- `ScrollUp_OneStep/Lines=1000`: 600,000,000
- `ScrollUp_OneStep/Lines=5000`: 2,000,000,000
- `PageDown_OneStep/Lines=1000`: 600,000,000
- `PageDown_OneStep/Lines=5000`: 2,000,000,000

ConfigurationManagerLoadBenchmark (observed ~11.6 ms, set ceiling at ~35 ms)
- `LoadAndApply/`: 35,000,000

Also update the `_comment` field to note that baselines were updated from actual CI measurements in the back-merge PR #5326 run.

Keep all other entries (`ThemeSwitchBenchmark`, `SchemeAttributeBenchmark`, `SchemeSerializationBenchmark`) unchanged; those had real measured values and are passing.
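The ceilings listed above can be sanity-checked against the 3× gate. The following is a minimal sketch, assuming the gate fails when the measured mean divided by the baseline `meanNs` exceeds 3.0. The helper name `max_tolerated_slowdown` and the dictionary labels are illustrative and are not taken from the repository; the numbers are the observed means and target values quoted above.

```python
# Hypothetical helper, not part of the repository's tooling.
THRESHOLD = 3.0  # gate threshold quoted in the problem description


def max_tolerated_slowdown(observed_ns: int, baseline_ns: int,
                           threshold: float = THRESHOLD) -> float:
    """Factor by which the observed mean may grow before the gate would fail."""
    return baseline_ns * threshold / observed_ns


# (observed mean, proposed ceiling) in nanoseconds, copied from the prompt.
# The labels are for readability only and may not match the real JSON keys.
targets = {
    "ListViewScrollBenchmark PageDown_OneStep/Items=1000": (23_000_000, 70_000_000),
    "TableViewScrollBenchmark PageDown_OneStep/Rows=100": (73_000_000, 250_000_000),
    "TableViewScrollBenchmark PageDown_OneStep/Rows=1000": (25_000_000, 100_000_000),
    "TextViewScrollBenchmark PageDown_OneStep/Lines=1000": (196_000_000, 600_000_000),
    "ConfigurationManagerLoadBenchmark LoadAndApply": (11_567_200, 35_000_000),
}

for name, (observed, ceiling) in targets.items():
    headroom = ceiling / observed
    slack = max_tolerated_slowdown(observed, ceiling)
    print(f"{name}: ceiling = {headroom:.2f}x observed, gate trips at {slack:.2f}x observed")
```

Under that assumption, a ceiling of roughly 3× the observed mean combined with a 3× gate lets a run become about 9× slower before it fails, which is the trade-off to weigh against using the measured means directly.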
File to edit
`Tests/Benchmarks/baseline.json`

No other files need to change.
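After editing the file, a quick check that it still parses and that no `"comment": "Placeholder"` fields remain could look like the sketch below. The layout of `baseline.json` is not shown in this prompt, so the walker makes no assumption about nesting and simply searches every object; the function names and the sample document are hypothetical.

```python
import json


def find_placeholders(node, path=""):
    """Recursively yield the path of every object whose "comment" is "Placeholder"."""
    if isinstance(node, dict):
        if node.get("comment") == "Placeholder":
            yield path or "<root>"
        for key, value in node.items():
            yield from find_placeholders(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_placeholders(value, f"{path}[{index}]")


def check_baseline(text: str) -> list[str]:
    """Parse JSON text (raises json.JSONDecodeError if malformed) and list leftover placeholders."""
    return list(find_placeholders(json.loads(text)))


# Illustrative document only; the real baseline.json layout may differ.
sample = """
{
  "_comment": "example",
  "benchmarks": {
    "A": {"meanNs": 300000, "comment": "Placeholder"},
    "B": {"meanNs": 25000000}
  }
}
"""
print(check_baseline(sample))  # ['/benchmarks/A']
```

To run it against the real file, pass the file contents instead of `sample`, for example `check_baseline(open("Tests/Benchmarks/baseline.json").read())`; an empty list means no placeholder entries are left.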
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: I'll grab the logs for that failing job right away!
Assistant: The job is failing due to 21 performance regressions detected by the `Performance Gate` workflow. The Python comparison script checked benchmark results against `baseline.json` and found multiple benchmarks exceeding the 3.0× threshold.

Key Regressions
`TableViewScrollBenchmark/PageDown_OneStep/Rows=1000`

This pull request was created from Copilot chat.
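The kind of comparison the assistant message describes can be sketched as follows. The repository's actual Python script is not shown in this conversation, so the function name, the flat dictionary layout, and the exact rule (a ratio strictly greater than the threshold counts as a regression) are assumptions. The numbers are the ones quoted in the prompt.

```python
THRESHOLD = 3.0  # threshold quoted in the assistant message


def find_regressions(measured_ns, baseline_ns, threshold=THRESHOLD):
    """Return {name: ratio} for benchmarks whose measured mean exceeds threshold x baseline."""
    regressions = {}
    for name, measured in measured_ns.items():
        baseline = baseline_ns.get(name)
        if not baseline:  # skip benchmarks with a missing or zero baseline
            continue
        ratio = measured / baseline
        if ratio > threshold:
            regressions[name] = round(ratio, 2)
    return regressions


# Stale baselines and CI measurements quoted in the prompt (nanoseconds).
baseline = {
    "TableViewScrollBenchmark/PageDown_OneStep/Rows=1000": 300_000,
    "ConfigurationManagerLoadBenchmark/LoadAndApply": 3_185_090,
}
measured = {
    "TableViewScrollBenchmark/PageDown_OneStep/Rows=1000": 25_000_000,
    "ConfigurationManagerLoadBenchmark/LoadAndApply": 11_567_200,
}

for name, ratio in find_regressions(measured, baseline).items():
    print(f"{name}: {ratio}x baseline")
```

With these inputs both entries are flagged: the placeholder-based baseline by a ratio of about 83×, and the stale `LoadAndApply` baseline by about 3.63×, just over the threshold.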