MWI: Fix flaky test in SPIFFE Workload APIs#60668
Merged
Conversation
The `workload-identity-api` and `spiffe-workload-api` tests frequently fail the
Flake Detector by timing out after 10 minutes.
Looking at the logs, the following stack trace stood out to me:
```
2025-10-28T12:42:46.4097768Z goroutine 9965029 [runnable]:
2025-10-28T12:42:46.4097973Z internal/poll.(*FD).Read(0xc0333a8cc0, {0xc029ff0000, 0x8000, 0x8000})
...
github.com/gravitational/teleport/lib/tbot/workloadidentity/workloadattest.copyAtMost({0x7fe928bbdbf0, 0xc09d285900}, {0x1ade5660, 0xc011dfecf0}, 0x40000000)
2025-10-28T12:42:46.4100772Z /__w/teleport/teleport/lib/tbot/workloadidentity/workloadattest/unix.go:178 +0x8f
```
This is the code path that reads a workload's executable via the `/proc/<pid>/exe`
symlink in order to attest its SHA256 checksum. This symlink is special in that
it points to an *inode* rather than a regular path, so if you replace the
executable (e.g. during a rolling deploy) it still points to the *original*
file.
My current theory is that the combination of overlayfs, debugfs, and any other
layers of indirection in our GitHub Actions environment breaks inode stability,
causing reads to stall indefinitely. Lowering `binary_hash_max_size_bytes`, and
therefore the number of reads we perform, seems to fix this!
I think it's unlikely we would see this in a real production deployment (given
the odd filesystem shenanigans required), but as it's not impossible, I've added
a 15-second timeout to the read.
strideynet
approved these changes
Oct 28, 2025
GavinFrazar
approved these changes
Oct 28, 2025
boxofrad
added a commit
that referenced
this pull request
Oct 29, 2025
Backport #60668 to branch/v18
boxofrad
added a commit
that referenced
this pull request
Oct 29, 2025
Backport #60668 to branch/v17
github-merge-queue bot
pushed a commit
that referenced
this pull request
Oct 29, 2025
* [v17] MWI: Automatically report service statuses in oneshot mode (Backport #60148 to branch/v17)
* [v17] MWI: Add `AllServicesReported` method to `readyz.Register` (Backport #60059 to branch/v17)
* [v17] MWI: Wait for service health before sending first heartbeat (Backport #60087 to branch/v17)
* [v17] MWI: Add service health to bot heartbeats (Backport #60093 to branch/v17)
* [v17] MWI: Simpler auto-generated `tbot` service names (Backport #60052 to branch/v17)
* Fix `testing/synctest` on CI
* Fix linting of synctest files on CI
* [v17] MWI: Fix flaky test in SPIFFE Workload APIs (Backport #60668 to branch/v17)
github-merge-queue bot
pushed a commit
that referenced
this pull request
Oct 29, 2025
* [v18] MWI: Automatically report service statuses in oneshot mode (Backport #60148 to branch/v18)
* [v18] MWI: Add `AllServicesReported` method to `readyz.Register` (Backport #60059 to branch/v18)
* [v18] MWI: Wait for service health before sending first heartbeat (Backport #60087 to branch/v18)
* [v18] MWI: Add service health to bot heartbeats (Backport #60093 to branch/v18)
* [v18] MWI: Simpler auto-generated `tbot` service names (Backport #60052 to branch/v18)
* Fix linting of synctest files on CI
* [v18] MWI: Fix flaky test in SPIFFE Workload APIs (Backport #60668 to branch/v18)
mmcallister
pushed a commit
that referenced
this pull request
Nov 6, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs
* Extract binary hashing into separate function
mmcallister
pushed a commit
that referenced
this pull request
Nov 19, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs
* Extract binary hashing into separate function
mmcallister
pushed a commit
that referenced
this pull request
Nov 20, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs
* Extract binary hashing into separate function