algod: Add static EnableTelemetry retry #6183
This PR does not solve the HeartBeat race, in which the HeartBeat (HB) event is sometimes not sent via telemetry.
jannotti
left a comment
I'm OK with adding the goroutine loop, but if we're doing that, let's do only that. We can remove the initial attempt to enable, and enable only in the loop.
This change will make static telemetry initialization fully asynchronous instead of synchronous.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6183      +/-   ##
==========================================
- Coverage   51.80%   51.78%   -0.02%
==========================================
  Files         644      644
  Lines       86505    86514       +9
==========================================
- Hits        44816    44804      -12
- Misses      38822    38843      +21
  Partials     2867     2867
@urtho want to merge master in here, and we'll get this pulled in?
To keep those, how about a short sleep, say 2 seconds, after starting the goroutine? If you want to get very fancy, you could have the init code write to a channel when it completes, and the sleep could be a select on the channel or on a timer that expires after a few seconds. But I would also be happy with a short sleep and a comment explaining the reasoning.
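The "fancy" variant suggested above can be sketched roughly as follows. This is only an illustration, not code from the PR: `waitForInit` and the simulated init goroutine are hypothetical names, and the 2-second bound is taken from the suggestion in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// waitForInit blocks until done is closed or timeout elapses.
// It reports whether initialization finished in time.
// (Hypothetical helper, not part of go-algorand.)
func waitForInit(done <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		// Closing the channel signals completion to any waiter.
		defer close(done)
		// Stand-in for the real telemetry setup (assumption).
		time.Sleep(50 * time.Millisecond)
	}()

	// Give telemetry a bounded head start before startup events are logged.
	if waitForInit(done, 2*time.Second) {
		fmt.Println("telemetry ready before startup events")
	} else {
		fmt.Println("timed out; continuing startup without waiting")
	}
}
```

Compared with a fixed sleep, the select returns as soon as initialization completes, so a healthy node does not pay the full delay.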
jannotti
left a comment
I would accept this, but I'd prefer a little delay to try to get telemetry initialized before start events.
Maybe just try a synchronous init attempt and then run a goroutine in case of failure?
@algorandskiy this was in the original PR: try synchronously once, and fall back to the async loop on error.
To me an ideal solution would be something like this.
ctx in Dial. Testing the patch with a timeout context now.
algorandskiy
left a comment
Thank you! Missing ctx usage in createElasticHookContext, other than that looks great!
@urtho could you fix the linter error (making DefaultStaticTelemetryStartupTimeout unexported is fine) and address my feedback?
@algorandskiy
Indeed
OK, going with a non-deferred cancel, as in go-algorand/ledger/catchpointtracker.go, lines 1273 to 1279 in 113b0c3.
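For readers unfamiliar with the pattern, a non-deferred cancel can look roughly like the sketch below. It does not reproduce the referenced catchpointtracker.go lines; `attemptAll` and `flaky` are hypothetical helpers. One common reason to call `cancel()` directly is that a `defer` inside a loop would keep every attempt's timer alive until the function returns; whether that is the motivation here is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// attemptAll calls try up to attempts times, each under its own timeout.
// cancel is invoked right after each attempt instead of being deferred,
// so each attempt's timer is released before the next one starts.
// (Hypothetical helper for illustration.)
func attemptAll(attempts int, timeout time.Duration, try func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err = try(ctx)
		cancel() // non-deferred: free this attempt's resources now
		if err == nil {
			return nil
		}
	}
	return err
}

// flaky returns a function that fails its first `failures` calls.
// It stands in for a connection attempt to a remote service (assumption).
func flaky(failures int) func(context.Context) error {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("remote unavailable")
		}
		return ctx.Err() // nil unless the timeout has already expired
	}
}

func main() {
	err := attemptAll(3, time.Second, flaky(2))
	fmt.Println("result after up to 3 attempts:", err)
}
```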
  f.l.SetLevel(Debug) // Ensure logging doesn't filter anything out

- f.telem, _ = makeTelemetryState(lcfg, func(cfg TelemetryConfig) (hook logrus.Hook, err error) {
+ f.telem, _ = makeTelemetryStateContext(context.Background(), lcfg, func(ctx context.Context, cfg TelemetryConfig) (hook logrus.Hook, err error) {
I thought I was following all the Context plumbing, but when it gets to here I don't understand. It seems like the context is being dropped now, so how can the original context.WithTimeout actually work?
You mean in this test?
In the code, the context is plumbed down through
createTelemetryHookContext -> hookFactory -> createElasticHookContext -> elastic.DialContext
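A simplified sketch of that plumbing, to show why a timeout set at the top still takes effect at the dial. The function names and signatures here (`createHook`, `hookFactory`, `dial`, `connect`) are hypothetical stand-ins rather than the go-algorand ones, and the dial only simulates latency instead of opening a connection.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hookFactory is a context-aware factory, loosely modeled on the
// callback shape in the diff above (simplified; assumption).
type hookFactory func(ctx context.Context, uri string) (string, error)

// dial stands in for a network dial that honors ctx.
func dial(ctx context.Context, uri string, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return "hook:" + uri, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// createHook forwards ctx unchanged to the factory.
func createHook(ctx context.Context, uri string, factory hookFactory) (string, error) {
	return factory(ctx, uri)
}

// connect sets the timeout once at the top; every layer below passes
// the same ctx along, so the deadline reaches dial.
func connect(uri string, latency, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return createHook(ctx, uri, func(ctx context.Context, uri string) (string, error) {
		return dial(ctx, uri, latency)
	})
}

// timedOut reports whether err is a context deadline error.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	_, err := connect("https://telemetry.example", 500*time.Millisecond, 20*time.Millisecond)
	fmt.Println("slow dial timed out:", timedOut(err))

	hook, err := connect("https://telemetry.example", time.Millisecond, time.Second)
	fmt.Println(hook, err)
}
```

A factory that ignores its `ctx` argument, as a test double might, would not observe the deadline; that only affects the double, not the production path.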
Remote telemetry with a static URI in the config is never enabled if the initial, single attempt at algod startup fails.
Remote logging to a static URI is therefore never enabled if the Internet or the remote service is unavailable during algod startup.
Dynamic remote telemetry (DNS-based discovery) has no such issue, because it retries the connection with TelemetryURIUpdateService.
This PR adds a loop that retries the static remote service every minute until it succeeds.
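A minimal sketch of such a retry loop, under stated assumptions: `retryUntilEnabled`, `retryFor`, and `failFirst` are hypothetical names, the `enable` callback stands in for the real telemetry setup, and the demo uses a millisecond interval where the PR describes one minute.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryUntilEnabled calls enable immediately, then once per interval,
// until it succeeds or ctx is done. It returns the number of attempts
// made and the final error (nil on success).
func retryUntilEnabled(ctx context.Context, interval time.Duration, enable func() error) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	attempts := 0
	for {
		attempts++
		if err := enable(); err == nil {
			return attempts, nil
		}
		select {
		case <-ticker.C:
			// Interval elapsed; try again.
		case <-ctx.Done():
			return attempts, ctx.Err()
		}
	}
}

// retryFor bounds the loop with an overall deadline (demo convenience).
func retryFor(maxWait, interval time.Duration, enable func() error) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), maxWait)
	defer cancel()
	return retryUntilEnabled(ctx, interval, enable)
}

// failFirst returns an enable function that fails its first n calls,
// simulating a remote service that is down at startup (assumption).
func failFirst(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("remote service unavailable")
		}
		return nil
	}
}

func main() {
	attempts, err := retryFor(time.Second, time.Millisecond, failFirst(2))
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In a long-running daemon the loop would typically run in its own goroutine with a context tied to node shutdown, so that it stops when the node does.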