Add new field to health status #12942
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@deepthi @GuptaManan100, should we add this to the release notes?
}, 1, "")
vtorc := clusterInfo.ClusterInstance.VTOrcProcesses[0]
// Call API with retry to ensure health service is up
status, resp := utils.MakeAPICallRetry(t, vtorc, "/debug/health", func(code int, response string) bool {
I found that the VTOrc health endpoint is not always ready right away, so I am calling the API with retry here. The retry helper had an assert inside its implementation, which forced me to pull the assert out of that method.
go/vt/vtorc/logic/orchestrator.go
Outdated
})
// we turn on HitAtLeastOneDiscovery first time
// con: this will result in extra memory hit for every recovery cycle.
process.HitAtLeastOneDiscovery.CompareAndSwap(false, true)
This is set only after the first successful discovery. Should we instead set it after the first discoveryInstance call, regardless of success or failure?
I thought successful was better.
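To make the trade-off concrete, here is a minimal sketch of the "flip only on first success" behaviour using `atomic.Bool` from Go's `sync/atomic`. The `tracker` type and its `discover` method are hypothetical stand-ins, not the VTOrc code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// tracker is a hypothetical holder for the flag discussed above.
type tracker struct {
	hitAtLeastOnce atomic.Bool
}

// discover simulates one discovery attempt. It touches the flag only when the
// attempt succeeds, and reports whether this call was the one that flipped it.
func (t *tracker) discover(succeed bool) (flipped bool, err error) {
	if !succeed {
		return false, errors.New("discovery failed")
	}
	// CompareAndSwap stores true only if the current value is false, so every
	// successful call after the first one leaves the flag as it is.
	return t.hitAtLeastOnce.CompareAndSwap(false, true), nil
}

func main() {
	var t tracker
	for _, ok := range []bool{false, true, true} {
		flipped, err := t.discover(ok)
		fmt.Println(flipped, err, t.hitAtLeastOnce.Load())
	}
}
```

The cost mentioned in the code comment is the atomic operation on every successful call; only the first one changes the value.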
Not needed
Under what conditions does
GuptaManan100
left a comment
There are 2 changes that are required -
- We actually want to only mark VTOrc as healthy after it has read all the instances for the first time. I would mark the boolean true in `case <-tabletTopoTick:` of `ContinuousDiscovery` after waiting for both the refreshes to complete.
- The return status of the API is very important. Its code should be 200 only after the discovery has completed.
Signed-off-by: Manan Gupta <manan@planetscale.com>
GuptaManan100
left a comment
I have made the requested changes 👍
I have updated the tests and description too. Please take a look @deepthi.
deepthi
left a comment
A couple of naming suggestions, the rest LGTM
* add new field to health status
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
* fix test bug
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
* bug fix
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
* feat: fix the issues pointed out in reviews
Signed-off-by: Manan Gupta <manan@planetscale.com>
* feat: fix the test
Signed-off-by: Manan Gupta <manan@planetscale.com>
* feat: fix naming of variables
Signed-off-by: Manan Gupta <manan@planetscale.com>
* feat: fix tests after the changes
Signed-off-by: Manan Gupta <manan@planetscale.com>
---------
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Co-authored-by: Manan Gupta <manan@planetscale.com>
Description
This PR introduces a new boolean in the `HealthStatus` structure. The idea is to distinguish between VTOrc being up and running as a service and VTOrc being ready to run recoveries after having run the discovery check. With this change, when you hit `/debug/health`:
- The `Healthy` flag tells you whether the service is up and running.
- The `HasDiscoveredOnce` flag tells you whether VTOrc has completed the initial round of discovery.
This ensures that we don't run into situations like the one pointed out in #12268 (Feature Request: VTOrc should be "ready" only when its ready to remediate).
We only return a `200` status when VTOrc is Healthy and has run the first round of discovery.
Solution
While resolving this issue, I considered multiple ways to implement it:
1. We already expose `HealthStatus` through the health endpoint, so why not just add an extra field there.
2. Inside `ContinuousDiscovery`, before the `for {...}`, we can set the boolean `process.HitAtLeastOneDiscovery = true` in a separate goroutine by just waiting for `TopoInformationRefreshSeconds`. This will run only once.
3. Inside `DiscoverInstance` we set `process.HitAtLeastOneDiscovery = true`. This ensures that we set the flag only once we have had at least one discovery. The problem with this approach is that it does a memory read of the boolean (atomic, so very fast and light) during every successful `DiscoverInstance` call. Also, doing just one discovery isn't enough; we need to wait for all the discoveries to complete.
4. Add a wait group in the topo information tick and wait for the refreshes to complete. Set a boolean value when the refreshes are complete. Use this boolean in the health field.
I went with option 4, but I am open to discussing the solutions mentioned above or any other one not listed here.
Related Issue(s)
closes #12268
Checklist
Deployment Notes