ci: Releasing benchmarks and benchmarking PR #2432
Apologies for opening a new PR; the old one got cluttered during testing, so here it is. Two major comments on the previous PR, for context:
Codecov Report
Base: 53.35% // Head: 53.30% // Decreases project coverage by -0.06%
@@ Coverage Diff @@
## master #2432 +/- ##
==========================================
- Coverage 53.35% 53.30% -0.06%
==========================================
Files 116 116
Lines 10270 10270
==========================================
- Hits 5480 5474 -6
- Misses 4370 4374 +4
- Partials 420 422 +2
@sozercan The GitHub API returns JSON data with unescaped character sequences. Hence, a parser like
| export GOPATH="$HOME/go" | ||
| PATH="$GOPATH/bin:$PATH" | ||
| go install golang.org/x/perf/cmd/benchstat@latest | ||
| benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt |
There was a problem hiding this comment.
Do we know if the release benchmark was performed on an identical machine/load profile? If not, how much does that impact the accuracy of the comparison? Are we measuring differences in code performance or machine performance?
There was a problem hiding this comment.
Yes. Since the release job runs on ubuntu-latest, the same standard GitHub runner used by the benchmark comparison job, we can say that we are comparing the code performance of the update (the PR in question) against what was released in the latest version.
There was a problem hiding this comment.
I'm not seeing anything about CPU architecture. Are they guaranteed to run on a fixed generation?
Easiest way to eliminate most infra variables would be to run both PRs next to each other on the same machine (maybe pulling the current build from Dockerhub to save time), though I'm open to other methods.
Maybe we could run the benchmark multiple times (separate runs so on different machines) to get a sense of the variance and storing that?
There was a problem hiding this comment.
> Maybe we could run the benchmark multiple times (separate runs so on different machines) to get a sense of the variance and storing that?
We are doing this with count=5, but it's on the same machine. Running on different machines would require multiple runners, and then we'd need another runner to combine the results, which seems a bit too complex?
https://github.com/open-policy-agent/gatekeeper/pull/2432/files#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R110
There was a problem hiding this comment.
We need to have some kind of signal that the performance numbers are apples-to-apples, otherwise we'll spend time trying to diagnose a regression that's really just an architecture difference.
I'm open to whatever gives us confidence this is the case. Unfortunately thread count alone probably isn't sufficient, since not all threads are created equal.
TBH, my preferred order (of alternatives I can think of) would be:
- Running old/new code on the same machine for each test run (I think OPA does this)
- Running old/new code on different machines, with both run several times to get some measure of per-machine variance (an attempt to quantify the likelihood of processor skew)
- Assurances that the hardware/software stack is identical across runs (might be impossible to guarantee this, since upgrades are likely/expected, and so there will always be the possibility that maintenance has been performed between runs)
I might be missing some alternatives?
Or are we not looking to make conclusions by doing comparative analysis? If not, what are we hoping to use this data for?
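A rough sketch of the first alternative above (old and new code benchmarked back to back on one machine) might look like the following. Nothing here is from the PR itself: `bench_ref`, `compare_refs`, `BENCH_CMD`, and the two output file names are hypothetical, and the sketch assumes `git`, `go`, and `benchstat` are on the PATH and that the working tree is clean.

```shell
#!/bin/sh
# Sketch only: benchmark two git refs on the same machine, then compare them.

# Benchmark command (assumption: mirrors the Makefile target discussed in this PR).
# It can be overridden by setting BENCH_CMD before sourcing this file.
: "${BENCH_CMD:=go test ./pkg/... -bench . -run=^# -count 10}"

# bench_ref REF OUTFILE: check out REF and write its benchmark results to OUTFILE.
bench_ref() {
  git checkout --quiet "$1" || return 1
  # Word splitting of BENCH_CMD is intentional here.
  $BENCH_CMD > "$2"
}

# compare_refs BASE HEAD: benchmark both refs on this machine, then run benchstat.
compare_refs() {
  bench_ref "$1" base_benchmarks.txt || return 1
  bench_ref "$2" head_benchmarks.txt || return 1
  benchstat base_benchmarks.txt head_benchmarks.txt
}
```

Calling `compare_refs origin/master HEAD`, for instance, would put both sides on the same runner, so a CPU generation difference between runners affects both equally. The cost is that each run takes roughly twice as long.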
There was a problem hiding this comment.
One drawback, though, is that we are now running two benchmarking processes, which increases the time to complete the workflow by about 10 minutes on average.
There was a problem hiding this comment.
tests seem to be taking 20mins+ now which makes e2e take over twice as long
@acpana should we decrease count to 5? anything else we can do?
perhaps this can be a manual command (like /benchmark)?
There was a problem hiding this comment.
+1, we could hide the benchmark job behind a manual command
There was a problem hiding this comment.
Definitely running benchmarks less frequently (or manually) could be a good solution here. It will at least let us know if something went wrong.
There was a problem hiding this comment.
Done. After this PR is merged, we can run benchmarks on a PR by commenting /benchmark.
acpana
left a comment
There was a problem hiding this comment.
i really like this PR! I think it's important to know how our changes impact perf.
i left a couple of call outs and questions 💯
| with: | ||
| issue-number: ${{ github.event.pull_request.number }} | ||
| body: | | ||
| This PR compares its performance to the latest released version. If it performs significantly lower, consider optimizing your changes to improve the performance. |
There was a problem hiding this comment.
can we define what significantly lower means? e.g. 5% ns/op delta, etc
There was a problem hiding this comment.
And on further thought, could we actually enhance this action to do some processing of the results? Like show which benches have changed meaningfully?
As I read this right now, do we post the whole benchmark output as a comment? That may be a bit too much
There was a problem hiding this comment.
Let me look into this and the capabilities of benchstat to see if this is possible or not.
There was a problem hiding this comment.
Good point on defining delta. Never gave that a thought! Good catch
There was a problem hiding this comment.
What do you think the delta should be? Do you know of any tools that could process the benchstat output to prettify it? As of now, we are posting the complete result of benchstat on the PR.
There was a problem hiding this comment.
We can pick a value to start with and then adjust as needed. But if we are only running this job thru the comment cmd /benchmark, then maybe we don't need to set a threshold and just rely on one off analyses from the pr authors & reviewers
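If a threshold were wanted later, one possible way to post-process the benchstat output is sketched below. This is an illustration, not something the PR implements: `flag_regressions` is a hypothetical helper, the 5% value is just the example figure mentioned earlier in this thread, and the script assumes each result row contains a signed percentage delta (rows that benchstat marks with `~` carry no such delta and are skipped). The exact benchstat layout differs between versions, so the matching is deliberately loose.

```shell
#!/bin/sh
# flag_regressions THRESHOLD: read benchstat output on stdin and print the
# benchmark name and delta for every row whose delta exceeds +THRESHOLD percent.
# Assumption: a positive delta means "slower", which holds for time/op style
# metrics but not for throughput metrics.
flag_regressions() {
  awk -v limit="$1" '
    {
      for (i = 2; i <= NF; i++) {
        # Treat the first signed percentage (e.g. +10.00% or -3.2%) as the delta.
        if ($i ~ /^[+-][0-9]+(\.[0-9]+)?%$/) {
          delta = $i
          sub(/%$/, "", delta)
          sub(/^[+]/, "", delta)
          if (delta + 0 > limit + 0) print $1, $i
          break
        }
      }
    }'
}
```

It could be used as `flag_regressions 5 < benchstat.txt`, so that the PR comment lists only the benchmarks that moved by more than the chosen delta instead of the full output.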
| .PHONY: benchmark-test | ||
| benchmark-test: | ||
| go test ./pkg/... -bench . -run="^#" -count 5 > ${BENCHMARK_FILE_NAME} |
There was a problem hiding this comment.
3 callouts here:
- From my experience, benchmarking with count=5 can yield statistically insignificant results; I usually use count=10, 20 (some multiple of 10).
- The default timeout here is 10min but the GH action sets a timeout of 30 minutes; for debugging this may cause some confusion. Let's use the same timeout value in both places if that makes sense.
- I think it's best practice when benchmarking to set GOMAXPROCS here to decrease hardware variance. This is along the lines of Max's thoughts above. In particular for this cmd, let's set this to GOMAXPROCS=1 go test ... or some env var that we control, but let's set it to have more consistent results across hardware.
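Put together, the three callouts could translate into something like the sketch below. The variable names `BENCH_COUNT` and `BENCH_TIMEOUT` and the default output file name are assumptions made for illustration; only `GOMAXPROCS`, `-count`, `-timeout`, and the `go test` invocation itself come from the discussion.

```shell
#!/bin/sh
# run_benchmarks: run the package benchmarks with a fixed GOMAXPROCS, a higher
# count, and an explicit timeout that matches the 30 minutes set in the GH action.
# BENCH_COUNT, BENCH_TIMEOUT and the default file name are hypothetical knobs.
run_benchmarks() {
  GOMAXPROCS="${GOMAXPROCS:-1}" go test ./pkg/... \
    -bench . -run='^#' \
    -count "${BENCH_COUNT:-10}" \
    -timeout "${BENCH_TIMEOUT:-30m}" \
    > "${BENCHMARK_FILE_NAME:-benchmarks.txt}"
}
```

Keeping the values in environment variables would let the Makefile target and the workflow share one definition, so the timeout cannot drift between the two places.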
There was a problem hiding this comment.
Ohh thanks for this input!
| files: | | ||
| _dist/sha256sums.txt | ||
| _dist/*.tar.gz | ||
| _dist/release_benchmarks.txt |
There was a problem hiding this comment.
do we really want to publish benchmark results for the release now? I think this could be useful, but per Max's comments above, if we only compare the benchmark run against tip of main/ base of branch then would this be actually used?
There was a problem hiding this comment.
Yeah, if we are considering running a benchmark for head every time, then we don't need to run/publish benchmark in the release.
| export GOPATH="$HOME/go" | ||
| PATH="$GOPATH/bin:$PATH" | ||
| go install golang.org/x/perf/cmd/benchstat@latest | ||
| benchstat release_benchmarks.txt pr_benchmarks.txt > benchstat.txt |
There was a problem hiding this comment.
( chiming in to throw my 2cents ) I think benchmarking against the head of the branch would be more incrementally useful as we iterate over a change.
i.e. "does my latest commit change the performance in this branch" vs "does my latest commit OR something that was checked into main change the performance in this branch" ?
| run: | | ||
| export GOPATH="$HOME/go" | ||
| PATH="$GOPATH/bin:$PATH" | ||
| go install golang.org/x/perf/cmd/benchstat@latest |
There was a problem hiding this comment.
can we pin the benchstat version and/or expose it as an env var?
There was a problem hiding this comment.
i don't think there are tagged versions. can we specify a digest here? https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
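One way to do this without tagged releases, sketched under assumptions: `go install` accepts a commit hash (or pseudo-version) after the `@`, so the revision could be supplied through an env var. `BENCHSTAT_VERSION` and `install_benchstat` are hypothetical names, and no particular revision is suggested here; whoever sets the variable would pick a commit of golang.org/x/perf.

```shell
#!/bin/sh
# install_benchstat: install benchstat at the revision named by BENCHSTAT_VERSION.
# BENCHSTAT_VERSION is a hypothetical variable and intentionally has no default,
# so that an unpinned install fails loudly instead of silently using "latest".
install_benchstat() {
  if [ -z "${BENCHSTAT_VERSION:-}" ] || [ "$BENCHSTAT_VERSION" = "latest" ]; then
    echo "BENCHSTAT_VERSION must be set to a pinned revision, not 'latest'" >&2
    return 1
  fi
  go install "golang.org/x/perf/cmd/benchstat@${BENCHSTAT_VERSION}"
}
```

Pinning this way trades automatic updates for reproducibility: two benchmark runs weeks apart would at least be compared by the same benchstat build.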
| done | ||
| done | ||
| benchmark: |
There was a problem hiding this comment.
how did you test this end to end? (i ask bc I am writing a GH wf too hehe)
There was a problem hiding this comment.
I tested this job on my fork by manually creating releases and generating the needed artifacts. I then tested the job against two branches in my fork.
There was a problem hiding this comment.
i've been using act to test locally too. it helps w the development iteration: https://github.com/nektos/act
| - name: Download release benchmark file | ||
| uses: robinraju/release-downloader@v1.6 | ||
| id: get-latest-benchmark | ||
| continue-on-error: true |
There was a problem hiding this comment.
also could we make this wf fail actually? what were the arguments for this again?
I don't want us to be tricked by the ✅ in the checks summary only to find out that we can't actually download the data that we need.
At least an ❌ signals to the author to investigate. Wdyt?
There was a problem hiding this comment.
The argument was that missing release benchmark data alone is not enough reason to fail the workflow. But yeah, I think it would be useful to somehow output the error data, or to indicate that there was an error and ask the author to look at the result of the job. I think we could post a message on the PR, the same way we post the benchstat data, to flag the error in retrieving past data.
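A minimal sketch of that idea, assuming the comment text is produced by a shell step: if the release file is missing or empty, the step emits a visible warning instead of benchstat output. `comment_body` and the wording of the message are hypothetical, and whether the non-zero return should turn the check red is left to the workflow.

```shell
#!/bin/sh
# comment_body RELEASE_FILE PR_FILE: print the text to post on the PR.
# Returns non-zero when the release benchmark data is missing or empty, so the
# caller can decide whether to fail the job or only surface the warning.
comment_body() {
  if [ ! -s "$1" ]; then
    echo "Benchmark comparison skipped: release benchmark data could not be retrieved. Please check the job logs."
    return 1
  fi
  benchstat "$1" "$2"
}
```

For example, `comment_body release_benchmarks.txt pr_benchmarks.txt > benchstat.txt || true` would keep the job green while still putting the warning in front of the author.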
| - name: Run benchmark with incoming changes | ||
| run: | | ||
| curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v${KUBEBUILDER_VERSION}/kubebuilder_${KUBEBUILDER_VERSION}_linux_amd64.tar.gz" &&\ |
There was a problem hiding this comment.
we can combine these i think and install once
| .PHONY: benchmark-test | ||
| benchmark-test: | ||
| go test ./pkg/... -bench . -run="^#" -count 10 > ${BENCHMARK_FILE_NAME} |
There was a problem hiding this comment.
looks like this is not used anymore?
There was a problem hiding this comment.
Yes! I kept it since it's still useful if dev wants to run it locally.
There was a problem hiding this comment.
can we add GOMAXPROCS=1 here and use this in the github action
acpana
left a comment
There was a problem hiding this comment.
Thanks for making all the changes and for your patience. This overall LGTM w a few questions/ open threads!
| contents: write | ||
| pull-requests: write | ||
| steps: | ||
| - uses: izhangzhihao/delete-comment@master |
There was a problem hiding this comment.
Ah I can't recall, why are we removing the comments from the bot here?
There was a problem hiding this comment.
Consider this situation: I run the benchmark and get the result on the PR, then I change something, run the benchmark again, and get another result on the PR. Since the benchmark result is really long, the PR can get cluttered easily. In any case, we only care about the most recent benchmarks, so we aren't losing anything by deleting previous comments from the bot.
There was a problem hiding this comment.
Would it be good to have the history to see the impact of various changes?
Also, are older comments collapsed-by-default once the threads get large?
There was a problem hiding this comment.
I don't have any strong opinions about keeping or not keeping the older stats in the comments.
While testing multiple benchmarks on my fork, I got frustrated by the pile of results and /benchmark comments, so I added the step that deletes previous comments from the bot.
I am fine with either merging this PR as is and then following up after if we feel that we want all the benchmarks results on PR, or updating this PR to keep all the bot comments and then following up with deleting comments if need be in the future.
How do folks feel about this?
There was a problem hiding this comment.
Okay then! Let's merge this and follow up if needed.
maxsmythe
left a comment
There was a problem hiding this comment.
LGTM, thanks for doing this!
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Signed-off-by: Jaydip Gabani <gabanijaydip@gmail.com>
What this PR does / why we need it:
Releasing benchmark stats with each release and comparing benchmark data from each PR against latest released benchmark stats.
Which issue(s) this PR fixes (optional, using fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when the PR gets merged):
Fixes #692
Special notes for your reviewer: