xtrabackup: Add a timeout on closing backup files. #5193
enisoc merged 2 commits into vitessio:master
We've seen backup attempts that apparently stalled while waiting for Close() on the file returned by AddFile() to return. We've only seen this on xtrabackup backups, likely because we perform a small number of long-running file uploads, instead of uploading each file individually.
This adds a timeout to the Close() step. If it times out, the backup will be aborted and will need to be retried from scratch. However, that's better than getting stuck forever.
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
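For readers following along, here is a minimal sketch of the pattern described above: a watcher goroutine cancels a context if Close() has not returned within the timeout, while the caller still waits for Close() itself. This is not the code from this PR. `slowCloser`, `closeWithTimeout`, and `tryClose` are hypothetical names, and the sketch assumes the closer actually reacts to context cancellation, which a real backup storage backend may not.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"
)

// slowCloser is a hypothetical stand-in for the file returned by AddFile().
// Its Close blocks for delay, unless ctx is cancelled first (an assumption:
// a real storage backend may not honor cancellation at all).
type slowCloser struct {
	ctx   context.Context
	delay time.Duration
}

func (s *slowCloser) Close() error {
	select {
	case <-time.After(s.delay):
		return nil
	case <-s.ctx.Done():
		return s.ctx.Err()
	}
}

// closeWithTimeout calls c.Close() and, if it has not returned within
// timeout, calls cancel to ask the operation to abort. It always waits
// for Close to return, so it can still hang if Close ignores the cancel.
func closeWithTimeout(c io.Closer, cancel context.CancelFunc, timeout time.Duration) error {
	done := make(chan struct{})
	go func() {
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		select {
		case <-done:
			// Close returned in time; nothing to do.
		case <-timer.C:
			// Close is taking too long; request an abort.
			cancel()
		}
	}()
	err := c.Close()
	close(done)
	return err
}

// tryClose wires a slowCloser to a cancellable context for demonstration.
func tryClose(delay, timeout time.Duration) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return closeWithTimeout(&slowCloser{ctx: ctx, delay: delay}, cancel, timeout)
}

func main() {
	fmt.Println(tryClose(time.Millisecond, time.Second))      // Close finishes before the timeout
	fmt.Println(tryClose(5*time.Second, 20*time.Millisecond)) // timeout fires and Close is aborted
}
```

In this sketch a timed-out Close() surfaces as an error, so the caller can fail the backup and retry it from scratch.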
            // returned to abort.
            cancelAddFiles()
        }
    }()
I was about to merge, but... I'm not sure this solves the problem as described. The goroutine above will cancel the context and exit on timeout, but the outer deferred func can still hang, because closeFile doesn't respond to cancelAddFiles.
Did you mean to do it the other way round, where the closeFile runs in a goroutine but the defer func should exit on timeout without waiting for closeFile to finish?
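As an illustration of the alternative raised in this comment, the sketch below runs Close() in its own goroutine and returns on timeout without waiting for it. It is a hypothetical example, not code from this PR: `closeOrAbandon`, `sleepCloser`, `abandonDemo`, and `errCloseTimedOut` are invented names. Note that if Close() never returns, the goroutine running it is never cleaned up.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// errCloseTimedOut is a hypothetical sentinel error for this sketch.
var errCloseTimedOut = errors.New("timed out waiting for Close()")

// sleepCloser is a hypothetical io.Closer whose Close takes delay to return.
type sleepCloser struct {
	delay time.Duration
}

func (s sleepCloser) Close() error {
	time.Sleep(s.delay)
	return nil
}

// closeOrAbandon runs c.Close() in a goroutine and returns as soon as
// either Close finishes or timeout elapses. On timeout the goroutine is
// abandoned: it keeps running until Close returns, if it ever does.
func closeOrAbandon(c io.Closer, timeout time.Duration) error {
	// Buffered so the abandoned goroutine can still send and exit.
	errc := make(chan error, 1)
	go func() { errc <- c.Close() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case err := <-errc:
		return err
	case <-timer.C:
		return errCloseTimedOut
	}
}

// abandonDemo closes a sleepCloser with the given delay and timeout.
func abandonDemo(delay, timeout time.Duration) error {
	return closeOrAbandon(sleepCloser{delay: delay}, timeout)
}

func main() {
	fmt.Println(abandonDemo(time.Millisecond, time.Second))   // Close finishes in time
	fmt.Println(abandonDemo(2*time.Second, 20*time.Millisecond)) // caller gives up first
}
```

The buffered channel lets the abandoned goroutine exit once Close() eventually returns, but it does nothing for a Close() that blocks forever.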
If we fail to make Close() return, then abandoning its goroutine will leak it, which could have unpredictable side effects down the line. It seems safer to me to hang if Close() hangs. We could also force a hard crash, but I'd rather give this a try first and add the crash only if we really need it. WDYT?
I added a comment to document the above caveat.