[native] Randomize retry time with jitter after worker announcement failure.#19906
Conversation
7565433 to
7e87029
Compare
|
@amitkdutta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
There should also be a jitter If announcement is always successful. In that case, we can do something like timeToSleepMs + rand(2000) - 1000
There was a problem hiding this comment.
In high level, when a worker makes a successful announcement, it should not schedule the next announcement very quickly, but after some specific period of time which should ideally include some jitter (as you suggested). However, when the announcement fails, worker should try the next attempt quickly with exponential back-off. And also, this wait can't be infinitely long, so capped by max delay (which is today's hard-coded frequency ms). Since worker does not know when coordinator will come back, it will need to talk indefinitely. The issue is indefinite talking is - if there are too many failed attempts then all workers will get synchronized towards the frequencyMsMax (somewhat today's behavior). On the other side, if frequencyMsMax is too big, then if coordinator comes back after a number of failed attempts, worker registration will be late due to exponential back off. That will hurt routing and caching etc.
Hence a reasonable trade off is:
- Add some jitter in success
- Back-off the retry exponentially where initial failed attempts try to talk quickly hopoing coordinator is back soon and registration can be done quickly (this is improvement from today's hard coded timing, because today worker need to wait a lot regardless of the failed attempts)
- We need to keep the announcement indefinitely (for both success and failure) to keep giving heartbead to coordiantor.
There was a problem hiding this comment.
I think it might be good to add a small amount of jitter as @pranjalssh suggested (not exponentially backoff as the failure case does but a simple a small random value in tens of milliseconds with cap to the max or on top of max) for successful case to avoid any possible synchronized announcements from all the workers especially in batch cluster with a lot of workers. thanks!
There was a problem hiding this comment.
Retrying infinitely doesn't look clean. We can cap number of failedAttempts - so there is only one codepath which does the inifinite retries.
We can also skip jitter here and use the manually added jitter in below comment. I'll leave this to your judgment
xiaoxmeng
left a comment
There was a problem hiding this comment.
@amitkdutta thanks for the back-off retry improvement % some minors.
There was a problem hiding this comment.
Do we want to make this a pair of node level configs? Also name constexpr variable with prefix k -> s/frequencyMsMin/kFrequencyMsMin/
There was a problem hiding this comment.
nit: s/// In ms/// 100 ms/
There was a problem hiding this comment.
nit: minFrequenceMs, maxFrequenceMs
There was a problem hiding this comment.
Can you comment the pair of parameters in ctor?
There was a problem hiding this comment.
NYC: could rearrange the members by putting the const members first? Thanks!
Can you comment on jitterParam_?
s/jitterParam/jitterParam_/
There was a problem hiding this comment.
How about we put this logic into a function
Announcer::nextScheduleDelay() {
if (failedAttempts_ == 0) {
return ...
}
...
}
There was a problem hiding this comment.
I think it might be good to add a small amount of jitter as @pranjalssh suggested (not exponentially backoff as the failure case does but a simple a small random value in tens of milliseconds with cap to the max or on top of max) for successful case to avoid any possible synchronized announcements from all the workers especially in batch cluster with a lot of workers. thanks!
There was a problem hiding this comment.
Nit : Rename variables kMinFrequencyMs
There was a problem hiding this comment.
Nit : Rename min(max)FrequencyMs_
7e87029 to
0543476
Compare
xiaoxmeng
left a comment
There was a problem hiding this comment.
@amitkdutta thanks for the update!
There was a problem hiding this comment.
coordinator with max back off time cap at 'maxFrequencyMs_'.
0543476 to
00506f1
Compare
|
@amitkdutta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
This PR randomizes announcement scheduled time after announcement fails. Announcement failure can happen when coordinator is not available. For example, if coordinator goes to a old gc cycle, it can be non responsive for certain amount of time. After its back, all workers tries to register at the same time making its recover harder. Some randomization with jitter helps the coordinator to come back to a healthy status quickly as the workers will register with a variable speed with the upper bound of regular announcement time. Note that, if there is no failure, announcement scheduling time is unchanged.
Test plan