Skip to content

Add new field to health status#12942

Merged
GuptaManan100 merged 7 commits intovitessio:mainfrom
planetscale:issue-12268
May 8, 2023
Merged

Add new field to health status#12942
GuptaManan100 merged 7 commits intovitessio:mainfrom
planetscale:issue-12268

Conversation

@rsajwani
Copy link
Contributor

@rsajwani rsajwani commented Apr 20, 2023

Description

This PR introduce a new boolean in HealthStatus structure. The idea is to distinguish between Vtorc being up and running as service and Vtorc being ready to run recoveries after having run the discovery check. The way code has been setup when you hit /debug/health,

  1. Healthy flag will tell you if the service is up and running.
  2. HasDiscoveredOnce flag will tell you if Vtorc has completed the initial round of discovery. This will ensure that we don't run into situations as pointed out in Feature Request: VTOrc should be "ready" only when its ready to remediate. #12268.

We only return a 200 status when VTOrc is Healthy and has run the first round of discovery.

Solution

While resolving this issue, I thought there can be multiple ways we can implement it.

  1. We could have created a separate endpoint e.g /debug/discoveryHealth. But I thought it would be overkill and given we already are exposing HealthStatus through health endpoint, why not just add an extra field there.
  2. Inside ContinuousDiscovery, before the for {...} we can set boolean process.HitAtLeastOneDiscovery =true in a separate go routine by just waiting for TopoInformationRefreshSeconds. E.g
go func() {
		time.Sleep(config.Config.TopoInformationRefreshSeconds * time.Second)
		process.HitAtLeastOneDiscovery.CompareAndSwap(false, true)
	}()

This will run only once.

  1. Inside DiscoverInstance we set process.HitAtLeastOneDiscovery =true. This will ensure that we are flagging only when we have at-least one discovery. But problem with this approach is it will do memory read of boolean (atomic so will be very fast and light) during every successful DiscoverInstance call. Also, doing just one discovery isn't enough, we need to wait for all the discoveries to complete.

  2. Add a wait group in the topoinformation tick and wait for the refreshes to complete. Set a boolean value when the refreshes are complete. Use this boolean in the health field.

I went with option 4 but I am open to discuss above mentioned solution or any other one not listed here.

Related Issue(s)

closes #12268

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Apr 20, 2023
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Apr 20, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be a documentation on top of the test to explain what the expected behavior is what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@github-actions github-actions bot added this to the v17.0.0 milestone Apr 20, 2023
@rsajwani
Copy link
Contributor Author

@deepthi @GuptaManan100 , should we add this in release notes?

}, 1, "")
vtorc := clusterInfo.ClusterInstance.VTOrcProcesses[0]
// Call API with retry to ensure health service is up
status, resp := utils.MakeAPICallRetry(t, vtorc, "/debug/health", func(code int, response string) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find out not every time Vtorc health is ready right away. So I am calling retry here. But retry was having assert inside it implementation. This force me to pull assert out of that method.

})
// we turn on HitAtLeastOneDiscovery first time
// con: this will result in extra memory hit for every recovery cycle.
process.HitAtLeastOneDiscovery.CompareAndSwap(false, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only after first successful discovery. Should we instead make it only after first discoveryInstance call regardless of success or failure?
I though successful is better.

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
@deepthi
Copy link
Collaborator

deepthi commented Apr 21, 2023

@deepthi @GuptaManan100 , should we add this in release notes?

Not needed

@deepthi
Copy link
Collaborator

deepthi commented Apr 21, 2023

Under what conditions does /debug/health now return a 200 (OK) response? That should be stated clearly in the PR description itself.

@deepthi deepthi added Component: VTOrc Vitess Orchestrator integration Type: Bug and removed NeedsWebsiteDocsUpdate What it says labels Apr 21, 2023
Copy link
Contributor

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 changes that are required -

  1. We actually want to only mark VTOrc as healthy after it has read all the instances for the first time. I would mark the boolean true in case <-tabletTopoTick: of ContinuousDiscovery after waiting for both the refreshes to complete.
  2. The return status of the API is very important. Its code should be 200 only after the discovery has completed.

rsajwani and others added 3 commits April 21, 2023 12:11
Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link
Contributor

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the requested changes 👍

@GuptaManan100 GuptaManan100 marked this pull request as ready for review May 3, 2023 10:41
@GuptaManan100 GuptaManan100 removed the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label May 3, 2023
@GuptaManan100
Copy link
Contributor

I have updated the tests and description too. Please take a look @deepthi.

Copy link
Collaborator

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of naming suggestions, the rest LGTM

Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit 70ac857 into vitessio:main May 8, 2023
@GuptaManan100 GuptaManan100 deleted the issue-12268 branch May 8, 2023 12:31
GuptaManan100 added a commit to planetscale/vitess that referenced this pull request May 9, 2023
* add new field to health status

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

* fix test bug

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

* bug fix

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>

* feat: fix the issues pointed out in reviews

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix the test

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix naming of variables

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix tests after the changes

Signed-off-by: Manan Gupta <manan@planetscale.com>

---------

Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Co-authored-by: Manan Gupta <manan@planetscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTOrc Vitess Orchestrator integration Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: VTOrc should be "ready" only when its ready to remediate.

4 participants