
evaluate2.sh: Check output of warmup run and abort early if failed #333

Merged: 14 commits, merged into gunnarmorling:main on Jan 13, 2024

Conversation

@hundredwatt (Contributor) commented Jan 11, 2024

Closes #313

  • Change instructions that create out_expected.txt to create measurements_1B.out
  • Run ./test.sh before hyperfine
  • Replace hyperfine warmup with ./test.sh <fork> measurements_1B.txt and compare its result to measurements_1B.out
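The early-abort idea in these bullets could be sketched like this; `check_warmup` and the file handling are illustrative, not the PR's exact code:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compare the warmup run's output file against the
# precomputed expected file; a non-zero return signals that hyperfine
# should be skipped for this fork.
check_warmup() {
    local actual="$1" expected="$2"
    if diff -q "$actual" "$expected" >/dev/null; then
        echo "warmup OK"
    else
        echo "warmup FAILED, skipping benchmark" >&2
        return 1
    fi
}
```

In the real script the "actual" file would come from running `./test.sh <fork> measurements_1B.txt` and the "expected" file would be measurements_1B.out.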

New Output

In screenshot below:

  • the hundredwatt fork passes the tests
  • all_bad fails all tests
  • bad_1b passes the default test.sh suite, but fails on the 1B test file

NOTE: for development purposes, I used a 100m row file and only 3 runs. This is reflected in the screenshot below, but not in this PR's code.

[screenshot]

function print_and_execute() {
    echo "+ $@" >&2
    "$@"
}
@hundredwatt (Contributor, Author):

Some of the changes in this PR are refactoring to use this method instead of xtrace or echo statements
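For illustration, the helper in use (restated so the snippet runs standalone; the echoed command is arbitrary):

```shell
#!/usr/bin/env bash
# Same helper as above: print the command line to stderr (prefixed with
# "+", like xtrace would), then execute it.
function print_and_execute() {
    echo "+ $@" >&2
    "$@"
}

# stderr gets "+ echo hello"; stdout gets "hello".
print_and_execute echo hello
```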

Resolved review threads: evaluate2.sh (3), test.sh (1)
@hundredwatt (Contributor, Author) commented Jan 12, 2024

I pushed changes based on @AlexanderYastrebov's review. I had some questions earlier, but I closed them after (I believe) resolving the issues, so that you two can re-review and merge this before I wake up if it looks good to you 😄

My resolution was: instead of silencing ./test.sh with redirects, I added a --quiet flag to that script to be used only by evaluate2.sh.
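A minimal sketch of what such a --quiet flag could look like; this is hypothetical, not the actual test.sh code:

```shell
#!/usr/bin/env bash
# Hypothetical --quiet handling: strip the flag from the arguments and
# gate per-test progress logging on it. Failure diffs would still print
# unconditionally elsewhere in the script.
QUIET=false
if [ "${1:-}" = "--quiet" ]; then
    QUIET=true
    shift
fi

log() {
    if [ "$QUIET" = false ]; then
        echo "$@"
    fi
}

log "Validating calculate_average.sh..."
```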

This way, it's no longer noisy in the happy path but we still see the diffs printed on test failures :)

I replaced the git diff diff printer with diff plus tocsv.sh, see discussion: #333 (comment)
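The diff-plus-tocsv.sh approach can be approximated as follows; the `to_csv` function here is an assumption about what tocsv.sh does (one station per line), not the actual script:

```shell
#!/usr/bin/env bash
# Stand-in for tocsv.sh (an assumption about its behavior): flatten the
# single-line "{A=1/2/3, B=4/5/6}" result into one station per line so
# diff pinpoints the offending station.
to_csv() {
    tr -d '{}' < "$1" | tr ',' '\n' | sed 's/^ *//'
}

printf '{Aba=1.0/2.0/3.0, Bee=4.0/5.0/6.0}\n' > expected.txt
printf '{Aba=1.0/2.0/3.0, Bee=4.0/5.0/6.1}\n' > actual.txt

# Process substitution feeds both flattened forms to diff; only the
# mismatching station line appears in the output.
diff <(to_csv expected.txt) <(to_csv actual.txt) || true
```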

Happy Path

[screenshot]

(sending the output from test.sh's invocation of prepare_$fork.sh to /dev/null still didn't silence sdkman's output, even with 2>&1 added... not sure why)

On Error

[screenshot]

@hundredwatt hundredwatt force-pushed the evaluate2-tests branch 6 times, most recently from 7c5c3ed to cdd2117 Compare January 12, 2024 05:55
Resolved review threads: test.sh (2)
@hundredwatt (Contributor, Author) commented Jan 12, 2024

@AlexanderYastrebov @gunnarmorling I removed the ./test.sh --quiet flag and instead am capturing the test output to a file, then printing it upon failure. Much better now 👍
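The capture-then-replay approach could look roughly like this; `run_and_report` is a hypothetical name, and this simplified variant uses plain redirection where the actual commit uses tee:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with combined output captured to a
# temp file; stay quiet on success, replay the log on stderr on failure,
# and preserve the command's exit status.
run_and_report() {
    local log rc=0
    log=$(mktemp)
    "$@" >"$log" 2>&1 || rc=$?
    if [ "$rc" -ne 0 ]; then
        cat "$log" >&2
    fi
    rm -f "$log"
    return "$rc"
}
```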

[screenshot]

@hundredwatt (Contributor, Author):

Added colors 😄
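Color output in a bash script typically boils down to ANSI escape codes; these helpers are illustrative, not the script's actual code:

```shell
#!/usr/bin/env bash
# Hypothetical pass/fail color helpers using ANSI escape sequences:
# \033[0;32m = green, \033[0;31m = red, \033[0m = reset.
green() { printf '\033[0;32m%s\033[0m\n' "$*"; }
red()   { printf '\033[0;31m%s\033[0m\n' "$*"; }

green "PASS hundredwatt"
red   "FAIL all_bad"
```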

[screenshot]

@gunnarmorling (Owner):

> Added colors 😄

Nice. On that diff, any chance we could get a word diff? That's why I used git diff --no-index --word-diff at some point, as it easily highlights the diff even for the 10K file.
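The mode mentioned here can be tried standalone (file names and contents are illustrative):

```shell
#!/usr/bin/env bash
# --no-index lets git diff two plain files outside any repository;
# --word-diff marks changed tokens inline as [-old-]{+new+}, which is
# much easier to scan on long lines such as the 10K-station output.
printf 'Hamburg=12.0/18.2/23.1 Bulawayo=8.9/22.1/35.2\n' > expected_sample.txt
printf 'Hamburg=12.0/18.3/23.1 Bulawayo=8.9/22.1/35.2\n' > actual_sample.txt

git diff --no-index --word-diff expected_sample.txt actual_sample.txt || true
```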

@AlexanderYastrebov (Contributor):

@gunnarmorling git word diff has some issues working with process substitution, see #333 (comment)

IMO the diff from #333 (comment) is quite clear (and shows the problem #49)

@gunnarmorling (Owner):

> IMO the diff from #333 (comment) is quite clear

I'm not so sure :) How would you find that difference in the 10K test case? Same for the expected output of the challenge. That's where word diff helps a lot.

@gunnarmorling (Owner):

Ah, hold on. This is now diffing one station per line due to tocsv.sh, right? Yeah, in that case I'm on board.

@hundredwatt (Contributor, Author):

> Ah, hold on. This is diffing now one station per line due to to-csv, right? Yeah, in that case I'm on board.

Yes... so this should be ready to merge 👍

@gunnarmorling (Owner):

Tested this some more, all looking good. Merging now. Thanks a lot, @hundredwatt! It's a very nice improvement, all in one go now, sweet!

@gunnarmorling gunnarmorling merged commit eff73db into gunnarmorling:main Jan 13, 2024
1 check passed
@gunnarmorling (Owner):

I'm also gonna delete the old evaluate script, it's not needed any more.

@gunnarmorling (Owner):

@hundredwatt, so after some more consideration: can we bring back the logging of the test output? I.e. the "Validating..." lines, and also the output of GraalVM native binary builds. Right now one is flying a bit blind with regard to progress, in particular when the latter takes a few seconds longer.

@hundredwatt (Contributor, Author):

@gunnarmorling Sure: #377

@hundredwatt hundredwatt deleted the evaluate2-tests branch January 13, 2024 16:29
dmitry-midokura pushed a commit to dmitry-midokura/1brc that referenced this pull request Jan 13, 2024
…unnarmorling#333)

* refactor: replace xtrace with "print_and_execute" function

* nit: stylize error messages

* replace out_expected.txt with measurements_1B.out

* print

* prevent errors on cleanup

* run tests and check warmup run output before running benchmark

* move "git diff" pretty diff output to test.sh

* Ensure "set -e" is re-enabled if we followed a "continue" branch

* add timeouts to test.sh invocations

* use diff with tocsv.sh to show differences on failed test

* add --quiet mode to test.sh

* move prepare_$fork.sh invocation to right below hyperfine since test.sh also invokes it

* Revert "add --quiet mode to test.sh"

This reverts commit 13e9fb7.

* use tee to capture test output to a temp file and print contents on failure

---------

Co-authored-by: Jason Nochlin <[email protected]>
Closed issue: Eagerly abort hyperfine run when output differs