Skip to content

roachtest: report timeout failures accordingly#96743

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_timeouts
Feb 9, 2023
Merged

roachtest: report timeout failures accordingly#96743
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_timeouts

Conversation

@smg260
Copy link
Copy Markdown
Contributor

@smg260 smg260 commented Feb 7, 2023

Previously, a timeout failure would be deferred until after artifacts were collected, which sometimes resulted in subsequent failures being attributed as the primary cause.

  • Timeout failures are now recorded at actual timeout, with subsequent failures secondary. Context cancellation occurs at the end of test teardown
  • addFailure accepts a depth parameter and no longer includes context cancellation, which is done separately.

Epic: none
Fixes: #91237
Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@smg260 smg260 force-pushed the roachtest_timeouts branch from d4e9d18 to 4bcf740 Compare February 7, 2023 20:50
@smg260 smg260 requested a review from herkolategan February 8, 2023 01:40
Timeout failures are recorded at actual timeout, with
subsequent failures secondary.

`addFailure` accepts a depth parameter and no longer
includes context cancellation, which is done separately.

Epic: none
Fixes: cockroachdb#91237
Release note: None
@smg260 smg260 force-pushed the roachtest_timeouts branch from 4bcf740 to 25045db Compare February 8, 2023 13:57
@smg260 smg260 marked this pull request as ready for review February 8, 2023 14:17
@smg260 smg260 requested a review from a team as a code owner February 8, 2023 14:17
@smg260 smg260 requested review from srosenberg and removed request for a team February 8, 2023 14:17
Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A couple of comments/question:

  • Commit messages typically follow the format packageName: description. I think this also applies to merge commits, so I'd suggest updating the PR title here.

  • Could you explain with a few more words in the commit message how this change fixes the linked issue? How is artifact collection related to the timeouts being reported as SSH flakes?

I'm also curious to learn how to reproduce these SSH flake errors on timeout: were you able to find a way to reliably produce this behavior?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/cmd/roachtest/test_runner.go line 1110 at r1 (raw file):

			t.mu.cancel()
		}
		t.L().Printf("test timed out; check __stacks.log and CRDB logs for goroutine dumps")

Could we move this message back to the block where we set timedOut = true? With this change, the message here will now be logged on teardown.log, which I assume is much less frequently inspected than test.log. Also, IMO this is crucial information on the reason for the failure that should be on test.log.

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The commit message is currently roachtest: report timeout failures accordingly which I believe is OK. I can update the title accordingly though - not sure why its different

  • I will add some more detail. There is an existing comment at the artifact collection code site which explains.

Though I could not repro the exact issue initially seen, I was able time out a test, in which the underlying ssh command exited with a 255 soon after. This resulted in an ssh flake being reported, even though the timeout occurred first.

e.g. A roachtest with a 20s timeout, simply running sleep 25; exit 255

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/test_runner.go line 1110 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Could we move this message back to the block where we set timedOut = true? With this change, the message here will now be logged on teardown.log, which I assume is much less frequently inspected than test.log. Also, IMO this is crucial information on the reason for the failure that should be on test.log.

t.addFailure(0, "test timed out (%s)", timeout) already logs that information to the test.log, and is located where you mentioned.

e.g. test.log
21:16:28 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:989).runTest: test timed out (20s)

@smg260 smg260 changed the title Roachtest timeouts roachtest: report timeout failures accordingly Feb 8, 2023
Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for explaining!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Feb 9, 2023

tftr

bors r=renatolabs,herkolategan

@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Feb 9, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: don't report SSH flake if test times out

4 participants