MWI: Fix flaky test in SPIFFE Workload APIs#60668

Merged
boxofrad merged 2 commits into master from boxofrad/fix-tbot-test-flake
Oct 29, 2025
Conversation

@boxofrad
Contributor


The `workload-identity-api` and `spiffe-workload-api` tests frequently fail the
Flake Detector by timing out after 10 minutes.

Looking at the logs, the following stack trace stood out to me:

    2025-10-28T12:42:46.4097768Z goroutine 9965029 [runnable]:
    2025-10-28T12:42:46.4097973Z internal/poll.(*FD).Read(0xc0333a8cc0, {0xc029ff0000, 0x8000, 0x8000})

    ...

    github.com/gravitational/teleport/lib/tbot/workloadidentity/workloadattest.copyAtMost({0x7fe928bbdbf0, 0xc09d285900}, {0x1ade5660, 0xc011dfecf0}, 0x40000000)
    2025-10-28T12:42:46.4100772Z 	/__w/teleport/teleport/lib/tbot/workloadidentity/workloadattest/unix.go:178 +0x8f

This is the codepath that reads a workload's executable using the `/proc/<pid>/exe`
symlink in order to attest its SHA256 checksum. This symlink is special in that
it points to an *inode* rather than a regular path, so if you replace the
executable (e.g. during a rolling deploy) it will still point to the *original*
file.

My current theory is that the combination of overlayfs, debugfs, and other
layers of indirection in our GitHub Actions environment breaks inode stability,
causing reads to stall indefinitely. Lowering `binary_hash_max_size_bytes`, and
therefore the number of reads we perform, seems to fix this!

I think it's unlikely we would see this in a real production deployment (due to
the odd filesystem shenanigans required), but as it's not impossible, I've added
a 15-second timeout to the read.
@boxofrad boxofrad added the `machine-id` and `no-changelog` (indicates that a PR does not require a changelog entry) labels Oct 28, 2025
@github-actions github-actions bot requested a review from GavinFrazar October 28, 2025 16:46
Comment thread lib/tbot/workloadidentity/workloadattest/unix.go
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from timothyb89 October 28, 2025 17:11
boxofrad added a commit that referenced this pull request Oct 29, 2025
@boxofrad boxofrad added this pull request to the merge queue Oct 29, 2025
boxofrad added a commit that referenced this pull request Oct 29, 2025
Merged via the queue into master with commit 5ff3b2c Oct 29, 2025
41 checks passed
@boxofrad boxofrad deleted the boxofrad/fix-tbot-test-flake branch October 29, 2025 11:03
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v17] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v17

* [v17] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v17

* [v17] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v17

* [v17] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v17

* [v17] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v17

* Fix `testing/synctest` on CI

* Fix linting of synctest files on CI

* [v17] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v17
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v18] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v18

* [v18] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v18

* [v18] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v18

* [v18] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v18

* [v18] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v18

* Fix linting of synctest files on CI

* [v18] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v18
mmcallister pushed a commit that referenced this pull request Nov 6, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function
mmcallister pushed a commit that referenced this pull request Nov 19, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function
mmcallister pushed a commit that referenced this pull request Nov 20, 2025
* MWI: Fix flaky test in SPIFFE Workload APIs

* Extract binary hashing into separate function