roachtest: capture verbose SSH logging on failures#92667
roachtest: capture verbose SSH logging on failures#92667craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
6b3ab03 to
7aa1aee
Compare
| format = strings.Repeat(" %v", len(args))[1:] | ||
| } | ||
| reportFailure := newFailure(errors.NewWithDepthf(1, format, args...), collectErrors(args)) | ||
| reportFailure := newFailure(errors.NewWithDepthf(2, format, args...), collectErrors(args)) |
There was a problem hiding this comment.
It's not immediately obvious why 2 is better than 1. Presumably, this is so that runtime.Callers will skip this frame (and the ones below it)? It might be worth explaining this in a comment; personally, I find the depth arithmetic error-prone.
There was a problem hiding this comment.
The depth here tells us how many of the most recent frames to ignore. That should give us the, in the logs, where t.Fatalf was called from (which can be from roachtest internal, or an actual roachtest).
Previously, we were creating this error from t.Fatalf, and the depth was 1. During more recent refactoring, the error creation was moved to this shared addFailure, which is 1 more call away, but the depth was not incremented to account for that.
9decf06 to
58a92bd
Compare
srosenberg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
pkg/cmd/roachtest/test_impl.go line 336 at r1 (raw file):
Previously, smg260 (Miral Gadani) wrote…
The depth here tells us how many of the most recent frames to ignore. That should give us the, in the logs, where
t.Fatalfwas called from (which can be from roachtest internal, or an actual roachtest).Previously, we were creating this error from
t.Fatalf, and the depth was 1. During more recent refactoring, the error creation was moved to this sharedaddFailure, which is 1 more call away, but the depth was not incremented to account for that.
Right. It might still be worth a single-line comment since in majority of the case you'd expect to see 1. Btw, just saw the same fix was also applied in this PR: https://github.com/cockroachdb/cockroach/pull/94361/files#diff-c0e178682162fce5de4707f6c5cd4fb00efecb1b57f9b7d3922b5938877b719cL336-R336
srosenberg
left a comment
There was a problem hiding this comment.
Looks good overall! Are there any open items? I noticed one TBD that may have been already resolved? Meanwhile, I'd like to take a look at the CI run to see how the ssh debug looks with the rest of the logs.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
pkg/roachprod/install/cluster_synced.go line 145 at r2 (raw file):
// number and a wrapper error. func runWithMaybeRetry( l *logger.Logger, runRetryOpts *RunRetryOpts, f func() (*RunResultDetails, error),
Nit: runRetryOpts seems redundant; I actually liked the previous name; curious why you thought it needed to be more explicit.
pkg/roachprod/install/session.go line 51 at r2 (raw file):
} // TODO: MG investigate - ssh log is not produced for the 3rd and final attempt
Still unresolved?
pkg/roachprod/install/session.go line 55 at r2 (raw file):
var logfile string var loggingArgs []string cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339)))
Since these will be appended to l.File, maybe prefix with something like ssh_debug to make it easy to filter; i.e., grep -v ssh_debug_?
Also, I wonder if interleaving could pose a problem; it might be better to redirect all ssh debug to a separate file.
pkg/roachprod/install/session.go line 56 at r2 (raw file):
var loggingArgs []string cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339))) // running roachprod from the cli will result in a fileless logger
Nit: comment is a not a full sentence :)
pkg/roachprod/install/session.go line 57 at r2 (raw file):
cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339))) // running roachprod from the cli will result in a fileless logger if err == nil && l.File != nil {
Might be worth adding a debug flag to disable verbose (ssh) logging (quickly) in case we run into an unwanted side-effect.
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/roachprod/install/cluster_synced.go line 145 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit:
runRetryOptsseems redundant; I actually liked the previous name; curious why you thought it needed to be more explicit.
Agreed, and I can't remember why - may have happened as part of a rename/refactor by the ide.
pkg/roachprod/install/session.go line 51 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Still unresolved?
Seems like you may have looked at an older revision in reviewable - that TODO is gone.
pkg/roachprod/install/session.go line 55 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Since these will be appended to
l.File, maybe prefix with something likessh_debugto make it easy to filter; i.e.,grep -v ssh_debug_?Also, I wonder if interleaving could pose a problem; it might be better to redirect all ssh debug to a separate file.
They are currently both prefixed with ssh_ and stored in a separate ssh/ directory next to the other logs. The ssh log names are also encoded in a similar manner to the run logs, which includes a nano precision ts, node number and command.
Here is a nightly run with ssh debugging for failures:
pkg/roachprod/install/session.go line 57 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Might be worth adding a debug flag to disable verbose (ssh) logging (quickly) in case we run into an unwanted side-effect.
Sure we can do that - as it stands, we are only preserving the logs for SSH commands that return a non-zero exit code. Perhaps the flag can be used to specify an ssh debug mode None, Failed, All?
58a92bd to
488306e
Compare
srosenberg
left a comment
There was a problem hiding this comment.
Reviewed 1 of 2 files at r2, 1 of 4 files at r6, 3 of 5 files at r9, 3 of 3 files at r10, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
pkg/roachprod/install/session.go line 55 at r2 (raw file):
They are currently both prefixed with ssh_ and stored in a separate ssh/ directory next to the other logs.
Nice, I didn't realize they were being stored in a separate dir. One downside of using ssh debug is the lack of timestamps. E.g., looking at [1], we mostly care about commands executed, the rest is mostly noise,
debug1: Sending command: export ROACHPROD=1 GOTRACEBACK=crash && bash -c "cd /mnt/data1/activerecord-cockroachdb-adapter/ && sudo RUBYOPT=\\"-W0\\" TESTOPTS=\\"-v\\" bundle exec rake test"
This reminded me of a bastion that my former colleague open-sourced. He is using a tty-recorder [2] which timestamps all commands and as a bonus makes it easy to replay the whole session. The setup is pretty simple [3]. I wonder if we should incorporate it instead of ssh debug?
[1] https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_RoachtestNightlyGceBazel/7901775:id/activerecord/run_1/artifacts.zip!/ssh/ssh_145107.952226743_n1_cd-mntdata1activerec.log
[2] https://pkg.go.dev/maze.io/x/ttyrec
[3] https://github.com/Inpher/sb/blob/main/cmd/ttyrec.go#L113-L137
488306e to
49be9d7
Compare
|
Adding timestamps to SSH debug statements in stderr works but relies on redirecting to a file, thus remove stderr from the This PR's use of Will need to revisit adding timestamps at a later stage |
srosenberg
left a comment
There was a problem hiding this comment.
Will need to revisit adding timestamps at a later stage
Any ideas on how we'd recover timestamps without refactoring CombinedOutput?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
|
Nothing simple comes to mind since the only solutions involve omitting We could write a timestamp at the beginning and end of a command. Not ideal, but better than nothing? |
srosenberg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/roachprod/install/cluster_synced.go line 132 at r11 (raw file):
// defaultSCPRetry assumes any error is retryable var defaultSCPRetry = newRunRetryOpts(defaultRetryOpt, nil)
Nit: a more readable version of this would be to define a alwaysRetry function and pass that here. nil doesn't have an obvious interpretation, meaning you have to resort to comments (which can get outdated), or reading the code of the retrying function to understand the semantics.
pkg/roachprod/install/cluster_synced.go line 147 at r11 (raw file):
l *logger.Logger, retryOpts *RunRetryOpts, f func() (*RunResultDetails, error), ) (*RunResultDetails, error) { if retryOpts == nil {
Nit: could we have a default config when none is passed? We'd avoid the special case and it would make the errors more uniform.
pkg/roachprod/install/session.go line 63 at r11 (raw file):
if command.debugName == "" { command.debugName = GenFilenameFromArgs(20, strings.Fields(command.cmd)...)
Nit: we don't have to use strings.Fields()..., do we?
pkg/roachprod/install/session.go line 78 at r11 (raw file):
logfile = cl.File.Name() loggingArgs = []string{ // Extra verbosity is unlikely to give us meaningful information when commands fail.
I don't understand: why is it unlikely? Also, why only -v and not -vvv (as the commit message says)?
pkg/roachprod/install/session.go line 86 at r11 (raw file):
loggingArgs = []string{"-q"} } //const logfile = ""
🔥
pkg/roachprod/install/session.go line 112 at r11 (raw file):
func (s *remoteSession) errWithDebug(err error) error { if err != nil && s.logfile != "" { err = errors.Wrapf(err, "ssh verbose log retained in %s", filepath.Base(s.logfile))
Wouldn't this remove the ssh/ directory?
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
a discussion (no related file):
Looking forward to this! 🎉
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/cluster_synced.go line 132 at r11 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: a more readable version of this would be to define a
alwaysRetryfunction and pass that here.nildoesn't have an obvious interpretation, meaning you have to resort to comments (which can get outdated), or reading the code of the retrying function to understand the semantics.
Good idea
pkg/roachprod/install/session.go line 63 at r11 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: we don't have to use
strings.Fields()..., do we?
Good catch - we don't. A previous iteration of GenFilenameFromArgs did need it.
pkg/roachprod/install/session.go line 78 at r11 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I don't understand: why is it unlikely? Also, why only
-vand not-vvv(as the commit message says)?
I will change back to vvv - since we only keep erroneous command logs. I find the extra verbosity obfuscates any useful information. This will result in client/server/extra debugging.
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/session.go line 112 at r11 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Wouldn't this remove the
ssh/directory?
This just ensures that when we record an error, we don't delete the file on Close()
if s.logfile != "" {
_ = os.Remove(s.logfile)
}
Currently we do not capture SSH logs in the event of a command failure, which can be useful in debugging issues, transient or otherwise. This commit enables logging via the ssh switch -vvv and specifying a log filename, to be stored under an ssh/ directory in the test log root. The debug file is deleted upon successful (zero) exit of the command, but preserved for non-zero exits for further inspection. Additionally, - The name of the log is consistent with the corresponding run log and encodes a node number and timestamp. - SSH sessions must now be initialised with the command itself to re-inforce its single use nature. - Debug friendly command names can optionally be specified to influence the name of the run/ssh logs. - Retry options can optionally be omitted from any call to ParallelE to disable retries Release note: None Epic: CRDB-21386
49be9d7 to
1de6c82
Compare
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/roachprod/install/session.go line 112 at r11 (raw file):
Previously, smg260 (Miral Gadani) wrote…
This just ensures that when we record an error, we don't delete the file on
Close()if s.logfile != "" { _ = os.Remove(s.logfile) }
I meant the filepath.Base part -- the filename in the logs will omit the ssh/ directory which could be confusing.
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/session.go line 112 at r11 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I meant the
filepath.Basepart -- the filename in the logs will omit thessh/directory which could be confusing.
Ah, it will but I'd think that the ssh directory itself would be enough of a clue. Will prefix with ssh in a future PR if it there is any confusion.
|
TFTR bors r=srosenberg |
|
Build succeeded: |
This change captures verbose SSH log output to the same directory as other test logs. The output file is deleted if no errors were encountered.
Also included in this PR:
c.ParallelEthus making retries optional. This will later be exposed via API in roachtests.ssh|run_timestamp_node_cmd)Release note: None
Epic: CRDB-21386