Skip to content

Gracefully handle corrupted private keys#61126

Merged
atburke merged 1 commit intomasterfrom
atburke/corrupted-pem
Nov 14, 2025
Merged

Gracefully handle corrupted private keys#61126
atburke merged 1 commit intomasterfrom
atburke/corrupted-pem

Conversation

@atburke
Copy link
Copy Markdown
Contributor

@atburke atburke commented Nov 6, 2025

This change updates tsh to gracefully handle corrupted private key files. Previously, if the key was corrupted, every tsh command would fail (including logout), and the only way to fix it would be to manually remove ~/.tsh. Now, if the key is corrupted tsh logout will successfully log out, and any command that calls RetryWithRelogin will also work properly.

Part of #59549.

Changelog: Fixed corrupted private keys breaking tsh

Test Plan

Assume each test case starts logged in with a manually corrupted key (this can be done by running echo "some data" > ~/.tsh/keys/<proxy-addr>/<teleport-user>.

  • tsh status reports that the session is no longer valid.
  • tsh logout successfully logs out the user.
  • A command with RetryWithRelogin (e.g. tsh ls) successfully logs in when the key is corrupted.

@zmb3
Copy link
Copy Markdown
Collaborator

zmb3 commented Nov 7, 2025

Do we also plan to address the cause of the corruption?

@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Nov 7, 2025

Do we also plan to address the cause of the corruption?

Not in this change. My understanding is that #59549 only requests handling the key once corrupted, as we don't have a way to properly reproduce the corruption.

@zmb3
Copy link
Copy Markdown
Collaborator

zmb3 commented Nov 7, 2025

Nic suspects the corruption comes from here:

teleport/lib/utils/fs.go

Lines 339 to 352 in 8489b48

if runtime.GOOS == "windows" {
// Windows can't unlink the file before overwriting.
if f != nil {
for range 3 {
if err := overwriteFile(f, fi); err != nil {
break
}
}
}
// The file should be closed before removing it on Windows.
closeErr := trace.ConvertSystemError(f.Close())
removeErr := trace.ConvertSystemError(os.Remove(filePath))
return trace.NewAggregate(closeErr, removeErr)
} else {

@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Nov 11, 2025

I agree that the corruption likely happens there, due to os.Remove() failing when there's still an open handle to the file. The trouble is finding that open handle; I haven't been able to find anywhere where we either forget to close or are concurrently operating on the key file that would trigger that scenario.

This change allows tsh to treat corrupted private keys as though
they were not present (and trigger relogin flows) It also attempts
to prevent said corruption in the first place.
@atburke atburke force-pushed the atburke/corrupted-pem branch from f57219f to b664a9e Compare November 14, 2025 16:41
@atburke atburke enabled auto-merge November 14, 2025 16:41
@atburke atburke added this pull request to the merge queue Nov 14, 2025
Merged via the queue into master with commit 10bfff2 Nov 14, 2025
44 checks passed
@atburke atburke deleted the atburke/corrupted-pem branch November 14, 2025 17:25
atburke added a commit that referenced this pull request Nov 14, 2025
This change allows tsh to treat corrupted private keys as though
they were not present (and trigger relogin flows) It also attempts
to prevent said corruption in the first place.
atburke added a commit that referenced this pull request Nov 14, 2025
This change allows tsh to treat corrupted private keys as though
they were not present (and trigger relogin flows) It also attempts
to prevent said corruption in the first place.
github-merge-queue bot pushed a commit that referenced this pull request Nov 15, 2025
This change allows tsh to treat corrupted private keys as though
they were not present (and trigger relogin flows) It also attempts
to prevent said corruption in the first place.
github-merge-queue bot pushed a commit that referenced this pull request Nov 15, 2025
This change allows tsh to treat corrupted private keys as though
they were not present (and trigger relogin flows) It also attempts
to prevent said corruption in the first place.
@nklaassen
Copy link
Copy Markdown
Contributor

I agree that the corruption likely happens there, due to os.Remove() failing when there's still an open handle to the file. The trouble is finding that open handle; I haven't been able to find anywhere where we either forget to close or are concurrently operating on the key file that would trigger that scenario.

os.Remove will fail even if another process has the file open. It could be Connect or another invocation of tsh or tsh proxy or a tctl command or really any other process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants