MWI: Fix flaky test in SPIFFE Workload APIs#60668

Merged
boxofrad merged 2 commits into master from boxofrad/fix-tbot-test-flake
Oct 29, 2025
Conversation

@boxofrad
Contributor


The `workload-identity-api` and `spiffe-workload-api` tests frequently fail the
Flake Detector by timing out after 10 minutes.

Looking at the logs, the following stack trace stood out to me:

    2025-10-28T12:42:46.4097768Z goroutine 9965029 [runnable]:
    2025-10-28T12:42:46.4097973Z internal/poll.(*FD).Read(0xc0333a8cc0, {0xc029ff0000, 0x8000, 0x8000})

    ...

    github.com/gravitational/teleport/lib/tbot/workloadidentity/workloadattest.copyAtMost({0x7fe928bbdbf0, 0xc09d285900}, {0x1ade5660, 0xc011dfecf0}, 0x40000000)
    2025-10-28T12:42:46.4100772Z 	/__w/teleport/teleport/lib/tbot/workloadidentity/workloadattest/unix.go:178 +0x8f

This is the codepath that reads a workload's executable using the `/proc/<pid>/exe`
symlink in order to attest its SHA256 checksum. This symlink is special in that
it points to an *inode* rather than a regular path, so if you replace the
executable (e.g. during a rolling deploy) it will still point to the *original*
file.

My current theory is that the combination of overlayfs, debugfs, and other
layers of indirection in our GitHub Actions environment breaks inode stability,
causing reads to stall indefinitely. Lowering `binary_hash_max_size_bytes`, and
therefore the number of reads we perform, seems to fix this!

I think it's unlikely we would see this in a real production deployment (due to
the odd filesystem shenanigans required), but as it's not impossible, I've added
a 15-second timeout to the read.
@boxofrad boxofrad added the `machine-id` and `no-changelog` (indicates that a PR does not require a changelog entry) labels Oct 28, 2025
@github-actions github-actions bot requested a review from GavinFrazar October 28, 2025 16:46
Comment thread lib/tbot/workloadidentity/workloadattest/unix.go
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from timothyb89 October 28, 2025 17:11
boxofrad added a commit that referenced this pull request Oct 29, 2025
@boxofrad boxofrad added this pull request to the merge queue Oct 29, 2025
boxofrad added a commit that referenced this pull request Oct 29, 2025
Merged via the queue into master with commit 5ff3b2c Oct 29, 2025
41 checks passed
@boxofrad boxofrad deleted the boxofrad/fix-tbot-test-flake branch October 29, 2025 11:03
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v17] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v17

* [v17] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v17

* [v17] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v17

* [v17] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v17

* [v17] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v17

* Fix `testing/synctest` on CI

* Fix linting of synctest files on CI

* [v17] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v17
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v18] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v18

* [v18] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v18

* [v18] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v18

* [v18] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v18

* [v18] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v18

* Fix linting of synctest files on CI

* [v18] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v18
mmcallister pushed a commit that referenced this pull request Nov 6, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function
mmcallister pushed a commit that referenced this pull request Nov 19, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function
mmcallister pushed a commit that referenced this pull request Nov 20, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function