Skip to content

MWI: Wait for service health before sending first heartbeat#60087

Merged
boxofrad merged 5 commits intomasterfrom
boxofrad/tbot-heartbeat-wait-for-services
Oct 21, 2025
Merged

MWI: Wait for service health before sending first heartbeat#60087
boxofrad merged 5 commits intomasterfrom
boxofrad/tbot-heartbeat-wait-for-services

Conversation

@boxofrad
Copy link
Copy Markdown
Contributor

@boxofrad boxofrad commented Oct 9, 2025

Uses the new AllServicesReported method to wait for all services to report their initial status before sending the "startup" heartbeat, with a timeout of 30 seconds.

In one-shot mode, if the outer context is canceled (typically caused by another service returning an error) we'll hang on for up to 5 seconds to try to send the heartbeat before shutting down.

@boxofrad boxofrad added machine-id no-changelog Indicates that a PR does not require a changelog entry labels Oct 9, 2025
Comment thread lib/tbot/internal/heartbeat/service_test.go Outdated
@boxofrad boxofrad force-pushed the boxofrad/tbot-heartbeat-wait-for-services branch from ed5530d to d011f22 Compare October 9, 2025 15:34
@boxofrad boxofrad force-pushed the boxofrad/tbot-readyz-block branch from 2a25403 to 48b0203 Compare October 9, 2025 15:34
@boxofrad
Copy link
Copy Markdown
Contributor Author

boxofrad commented Oct 9, 2025

This doesn't quite work in one-shot mode at the moment, because services typically don't report their status. I'm working on making it happen automatically (and to not report statuses for services like ca-rotation which don't run in one-shot mode) but it requires a bit of refactoring.

Update: Fixed in #60148

@boxofrad boxofrad force-pushed the boxofrad/tbot-readyz-block branch from 48b0203 to ec9ad91 Compare October 13, 2025 10:42
@boxofrad boxofrad force-pushed the boxofrad/tbot-heartbeat-wait-for-services branch 2 times, most recently from a9b7123 to 022c3ae Compare October 13, 2025 11:24
Comment thread lib/tbot/internal/heartbeat/service.go Outdated
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(
context.WithoutCancel(ctx),
5*time.Second,
Copy link
Copy Markdown
Contributor

@strideynet strideynet Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Extract this to a constant. Also - I can't recall off the top of my head - but do we wait for everything else to finish up before closing the bot identity client? I could see that ending up racing with this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the identity service's Close method gets called in a defer, after all of the services have shut down.

Comment thread lib/tbot/internal/heartbeat/service.go Outdated
Comment thread lib/tbot/internal/heartbeat/service_test.go Outdated
@boxofrad boxofrad force-pushed the boxofrad/tbot-heartbeat-wait-for-services branch from 022c3ae to 5878e1e Compare October 20, 2025 11:41
@boxofrad boxofrad force-pushed the boxofrad/tbot-readyz-block branch from ec9ad91 to 2afeaa2 Compare October 20, 2025 11:41
@boxofrad boxofrad force-pushed the boxofrad/tbot-readyz-block branch from 2afeaa2 to f763e57 Compare October 20, 2025 14:39
@boxofrad boxofrad force-pushed the boxofrad/tbot-heartbeat-wait-for-services branch from 5878e1e to 16e56e8 Compare October 20, 2025 14:39
Base automatically changed from boxofrad/tbot-readyz-block to master October 20, 2025 15:31
Uses the new `AllServicesReported` method to wait for all services to report
their initial status before sending the "startup" heartbeat, with a timeout of
30 seconds.

In one-shot mode, if the outer context is canceled (typically caused by another
service returning an error) we'll hang on for up to 5 seconds to try to send the
heartbeat before shutting down.
@boxofrad boxofrad force-pushed the boxofrad/tbot-heartbeat-wait-for-services branch from 16e56e8 to d9aef19 Compare October 20, 2025 15:41
@boxofrad boxofrad added this pull request to the merge queue Oct 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 20, 2025
@boxofrad boxofrad added this pull request to the merge queue Oct 21, 2025
Merged via the queue into master with commit 7f54876 Oct 21, 2025
41 checks passed
@boxofrad boxofrad deleted the boxofrad/tbot-heartbeat-wait-for-services branch October 21, 2025 10:22
boxofrad added a commit that referenced this pull request Oct 22, 2025
boxofrad added a commit that referenced this pull request Oct 22, 2025
boxofrad added a commit that referenced this pull request Oct 29, 2025
boxofrad added a commit that referenced this pull request Oct 29, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v17] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v17

* [v17] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v17

* [v17] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v17

* [v17] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v17

* [v17] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v17

* Fix `testing/synctest` on CI

* Fix linting of synctest files on CI

* [v17] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v17
github-merge-queue bot pushed a commit that referenced this pull request Oct 29, 2025
* [v18] MWI: Automatically report service statuses in oneshot mode

Backport #60148 to branch/v18

* [v18] MWI: Add `AllServicesReported` method to `readyz.Register`

Backport #60059 to branch/v18

* [v18] MWI: Wait for service health before sending first heartbeat

Backport #60087 to branch/v18

* [v18] MWI: Add service health to bot heartbeats

Backport #60093 to branch/v18

* [v18] MWI: Simpler auto-generated `tbot` service names

Backport #60052 to branch/v18

* Fix linting of synctest files on CI

* [v18] MWI: Fix flaky test in SPIFFE Workload APIs

Backport #60668 to branch/v18
mmcallister pushed a commit that referenced this pull request Nov 6, 2025
* MWI: Wait for service health before sending first heartbeat

Uses the new `AllServicesReported` method to wait for all services to report
their initial status before sending the "startup" heartbeat, with a timeout of
30 seconds.

In one-shot mode, if the outer context is canceled (typically caused by another
service returning an error) we'll hang on for up to 5 seconds to try to send the
heartbeat before shutting down.

* Swap clockwork with testing/synctest

* Prevent identity service from reporting its status too early

* Use test context

* Extract durations into constants
mmcallister pushed a commit that referenced this pull request Nov 19, 2025
* MWI: Wait for service health before sending first heartbeat

Uses the new `AllServicesReported` method to wait for all services to report
their initial status before sending the "startup" heartbeat, with a timeout of
30 seconds.

In one-shot mode, if the outer context is canceled (typically caused by another
service returning an error) we'll hang on for up to 5 seconds to try to send the
heartbeat before shutting down.

* Swap clockwork with testing/synctest

* Prevent identity service from reporting its status too early

* Use test context

* Extract durations into constants
mmcallister pushed a commit that referenced this pull request Nov 20, 2025
* MWI: Wait for service health before sending first heartbeat

Uses the new `AllServicesReported` method to wait for all services to report
their initial status before sending the "startup" heartbeat, with a timeout of
30 seconds.

In one-shot mode, if the outer context is canceled (typically caused by another
service returning an error) we'll hang on for up to 5 seconds to try to send the
heartbeat before shutting down.

* Swap clockwork with testing/synctest

* Prevent identity service from reporting its status too early

* Use test context

* Extract durations into constants
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

machine-id no-changelog Indicates that a PR does not require a changelog entry size/sm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants