Skip to content

algod: Add static EnableTelemetry retry#6183

Merged
gmalouf merged 8 commits intoalgorand:masterfrom
AlgoNode:urtho-telemetry-cfg
Feb 24, 2025
Merged

algod: Add static EnableTelemetry retry#6183
gmalouf merged 8 commits intoalgorand:masterfrom
AlgoNode:urtho-telemetry-cfg

Conversation

@urtho
Copy link
Copy Markdown
Contributor

@urtho urtho commented Nov 28, 2024

Remote telemetry with a static URI in config never gets enabled past the initial, single try at algod startup.
Remote logging to static URI never gets enabled in the event the Internet or remote service is not available during Algod startup.

There is no such issue with dynamic remote telemetry (DNS based discovery) as it retries the connection with TelemetryURIUpdateService.

This PR adds a loop that retries the static remote service every minute until it succeeds.

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Nov 28, 2024

This one does not solve the HeartBeat race where HB sometimes is not being sent via telemetry.

Copy link
Copy Markdown
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with adding the go routine loop, but if we're doing that, let's only do that. We can remove the initial attempt to enable, and only enable in the loop.

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Dec 2, 2024

This change will make static telemetry init fully async instead of sync.
Initial Start and HeartBeat events will be lost in most startup cases with static URI - same as it is now with DNS telemetry.

@codecov
Copy link
Copy Markdown

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 18.18182% with 18 lines in your changes missing coverage. Please review.

Project coverage is 51.78%. Comparing base (5f12e1a) to head (5d93b6b).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
cmd/algod/main.go 0.00% 10 Missing ⚠️
logging/telemetryhook.go 40.00% 3 Missing ⚠️
logging/log.go 0.00% 2 Missing ⚠️
logging/telemetry.go 50.00% 2 Missing ⚠️
cmd/algoh/main.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6183      +/-   ##
==========================================
- Coverage   51.80%   51.78%   -0.02%     
==========================================
  Files         644      644              
  Lines       86505    86514       +9     
==========================================
- Hits        44816    44804      -12     
- Misses      38822    38843      +21     
  Partials     2867     2867              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gmalouf
Copy link
Copy Markdown
Contributor

gmalouf commented Jan 29, 2025

@urtho want to merge in master here and we'll get this pulled in?

@jannotti
Copy link
Copy Markdown
Contributor

jannotti commented Feb 6, 2025

This change will make static telemetry init fully async instead of sync.
Initial Start and HeartBeat events will be lost in most startup cases with static URI - same as it is now with DNS telemetry.

To keep those, how about a short sleep, say 2 seconds, after starting the go routine?

If you want to get very fancy, you could have the init code write to a channel when it completes, and the sleep could be a select on the channel or until a few seconds timer expires. But I would be happy with a short also and a comment explaining the reasoning.

jannotti
jannotti previously approved these changes Feb 6, 2025
Copy link
Copy Markdown
Contributor

@jannotti jannotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would accept this, but I'd prefer a little delay to try to get telemetry initialized before start events.

@algorandskiy
Copy link
Copy Markdown
Contributor

To keep those, how about a short sleep, say 2 seconds, after starting the go routine?

Maybe just to try a synchronous init attempt and then run a goroutine in case of failure?

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Feb 9, 2025

@algorandskiy this was in original PR - try sync one time and fallback to async loop on error.

gmalouf
gmalouf previously approved these changes Feb 19, 2025
@gmalouf gmalouf self-requested a review February 19, 2025 20:08
@algorandskiy
Copy link
Copy Markdown
Contributor

algorandskiy commented Feb 19, 2025

To me an ideal solution would be something like this.

  1. Try a first time synchronously
    1. Modify createTelemetryHook (and createElasticHook) to take cancellable context and call elastic.DialContext(ctx, options...) instead of elastic.NewClient
    2. Create that context with timeout (say 5s timeout?) in EnableTelemetry or let its callers to provide one.
  2. Run a goroutine trying with a minute delay (as this PR sugggests) or with backoff.

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Feb 20, 2025

@algorandskiy

ctx in Dial elastic.NewClient already has a set of built-in timeouts but there are cases where cluster interrogation might make the initial process too long. Adding a timeout context is a good precaution.

Testing the patch with timeout context now.

@urtho urtho dismissed stale reviews from gmalouf and jannotti via 7669e58 February 20, 2025 18:02
Copy link
Copy Markdown
Contributor

@algorandskiy algorandskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Missing ctx usage in createElasticHookContext, other than that looks great!

@algorandskiy
Copy link
Copy Markdown
Contributor

@urtho could you fix linter error (make DefaultStaticTelemetryStartupTimeout unexported is fine) and my feedback?

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Feb 21, 2025

@algorandskiy
linter does not like it when I ignore the cancel fn

cmd/algod/main.go:242:9: lostcancel: the cancel function returned by context.WithTimeout should be called, not discarded, to avoid a context leak (govet)
                        ctx, _ := context.WithTimeout(context.Background(), defaultStaticTelemetryStartupTimeout)
                             ^
make: *** [Makefile:115: lint] Error 1

@algorandskiy
Copy link
Copy Markdown
Contributor

Indeed

Canceling this context releases resources associated with it, so code should call cancel as soon as the operations running in this Context complete.

@urtho
Copy link
Copy Markdown
Contributor Author

urtho commented Feb 22, 2025

OK, going to non-deferred cancel like in

stepCtx, stepCancelFunction := context.WithTimeout(dbCtx, chunkExecutionDuration)
writeStepStartTime := time.Now()
more, err = catchpointWriter.FileWriteStep(stepCtx)
// accumulate the actual time we've spent writing in this step.
catchpointGenerationStats.CPUTime += uint64(time.Since(writeStepStartTime).Nanoseconds())
stepCancelFunction()
if more && err == nil {

@gmalouf gmalouf merged commit fd2f520 into algorand:master Feb 24, 2025
f.l.SetLevel(Debug) // Ensure logging doesn't filter anything out

f.telem, _ = makeTelemetryState(lcfg, func(cfg TelemetryConfig) (hook logrus.Hook, err error) {
f.telem, _ = makeTelemetryStateContext(context.Background(), lcfg, func(ctx context.Context, cfg TelemetryConfig) (hook logrus.Hook, err error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was following all the Context plumbing, but when it gets to here I don't understand. It seems like the context is being dropped now, so how can the original context.WithTimeout actually work?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean in this test?

in the code the context plumbed down to
createTelemetryHookContext -> hookFactory -> createElasticHookContext -> elastic.DialContext

@urtho urtho deleted the urtho-telemetry-cfg branch November 29, 2025 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants