Skip to content

Implement the readiness endpoint for health checking#2015

Merged
azdagron merged 9 commits intospiffe:masterfrom
ryysud:impl-readiness-endpoint
Jan 16, 2021
Merged

Implement the readiness endpoint for health checking#2015
azdagron merged 9 commits intospiffe:masterfrom
ryysud:impl-readiness-endpoint

Conversation

@ryysud
Copy link
Contributor

@ryysud ryysud commented Dec 10, 2020

Signed-off-by: Ryuma Yoshida ryuma.y1117@gmail.com

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality

Health check.

Description of change

I implemented the readiness endpoint for health checking.

Which issue this PR fixes

This PR fixes #1980.

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
@ryysud ryysud force-pushed the impl-readiness-endpoint branch from c3ae644 to ee89782 Compare December 10, 2020 06:34
@ryysud
Copy link
Contributor Author

ryysud commented Dec 10, 2020

Using the following commands, I made sure the integration test was successful.

make images
SUITES='suites/k8s suites/k8s-reconcile' make integration

NOTE: "SUITES" variable is added in #2013

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
@ryysud ryysud requested a review from rturner3 as a code owner January 5, 2021 01:51
Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
@ryysud
Copy link
Contributor Author

ryysud commented Jan 8, 2021

Could you review this?

@azdagron
Copy link
Member

azdagron commented Jan 8, 2021

Hi @ryysud! Thank you for your patience and I apologize that we have waited this long to provide feedback. Turns out that this PR has been the catalyst for some discussion among the maintainers about the overall design of the health system for SPIRE. We've limped long for quite some time with the current approach but SPIRE deserves a more holistic and complete design for the health subsystem. I'm on point to put some initial designs together, but will likely not be able to get to it until next week sometime.

Once we have a vision on the direction we want to move in, we can then reevaluate this PR. I hope to have additional feedback soon.

@azdagron
Copy link
Member

The maintainers got together and talked about this PR at some length. The outcome of that discussion was that this PR represents an incremental step forward, providing equivalent functionality to the CLI health checks and can be merged without hampering any long term efforts implementing a robust health system inside of SPIRE.

As such, I think we can take this. I'll give it another pass to make sure I didn't miss anything. I'll be opening an issue to track the planning and proposal of the health system.

Thanks again for putting this together @ryysud. I appreciate your patience!

Comment on lines +701 to +703

# # checking_readiness_interval: Interval for checking server readiness. Default: 1m.
# # checking_readiness_interval = "1m"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts around making this configurable? Is this something we can punt on right now? The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this.

If we do keep this configurable, I wonder if we need a separate configurable for liveness checks or if we should just use the same interval for both. If separate, I'd suggest naming this ready_check_interval, but if combined it could just be check_interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this.

That makes sense, so I removed the checking_readiness_interval parameter with 703c624.

client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
if err != nil {
return nil, errors.New("cannot create registration client")
}
Copy link
Member

@azdagron azdagron Jan 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to defer a call to Release() on the client or we will leak a gRPC connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it with 4a3c745.

health_checks {
listener_enabled = true
bind_address = "0.0.0.0"
bind_port = "80"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a different port other than 80? We're actively exploring ways to run SPIRE as a non-root user in containers and binding to port 80 will require elevated capabilities, so this will end up breaking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it with 358825e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I updated the k8s-workload-registrar.conf in suites/k8s-reconcile with a79c88c because the default value of metrics_addr is ":8080", which conflicts with the health check port.

health_checks {
listener_enabled = true
bind_address = "0.0.0.0"
bind_port = "80"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a different port other than 80? We're actively exploring ways to run SPIRE as a non-root user in containers and binding to port 80 will require elevated capabilities, so this will end up breaking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Copy link
Member

@azdagron azdagron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some very small comments and then I think we can take this. Thank you @ryysud !


# # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
# # bind_port = "80"
# # bind_port = "8080"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably leave this as is, since the default is currently 80. We can update the documentation if we ever migrate away from it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I fixed that with 1fb4a0d.


# # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
# # bind_port = "80"
# # bind_port = "8080"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

@ryysud ryysud Jan 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed that with the above commit.


// Currently using the ability to fetch a bundle as the health check. This
// **could** be problematic if the Upstream CA signing process is lengthy.
// As currently coded however, the registration API isn't served until after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I know this comment was copied, but it is now out of date since we aren't using the (now deprecated) Registration API, but the Bundle API. Maybe we should just say API...

Suggested change
// As currently coded however, the registration API isn't served until after
// As currently coded however, the API isn't served until after

Copy link
Contributor Author

@ryysud ryysud Jan 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed that with the above commit.

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Copy link
Member

@azdagron azdagron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\o/

@azdagron azdagron merged commit 277aabf into spiffe:master Jan 16, 2021
@ryysud ryysud deleted the impl-readiness-endpoint branch January 16, 2021 17:25
@ryysud
Copy link
Contributor Author

ryysud commented Jan 16, 2021

Thank you for your review, @azdagron!

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that health checks errors may be shown at startup due to a race. Please see my comments below.

// Status is used as a top-level health check for the Server.
func (s *Server) Status() (interface{}, error) {
return nil, nil
client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm realizing that there could be a race here at startup time, where this can be executed before the server is serving. In that case, throwing this error could be confusing.

return nil, nil
client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
if err != nil {
return nil, errors.New("cannot create registration client")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be good to expose the error captured in err.

// As currently coded however, the API isn't served until after
// the server CA has been signed by upstream.
if _, err := bundleClient.GetBundle(context.Background(), &bundle.GetBundleRequest{}); err != nil {
return nil, errors.New("unable to fetch bundle")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be good to expose the error captured in err.

// Status is used as a top-level health check for the Agent.
func (a *Agent) Status() (interface{}, error) {
return nil, nil
client := api_workload.NewX509Client(&api_workload.X509ClientConfig{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same situation as the server applies here also.


err := <-errCh
if status.Code(err) == codes.Unavailable {
return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be good to expose the error captured in err.

@ryysud
Copy link
Contributor Author

ryysud commented Jan 21, 2021

Thank you for your review, @amartinezfayo!
I will fix them and create a new pull-request.

@amartinezfayo
Copy link
Member

I filed #2063 for the issue at startup.

@amartinezfayo
Copy link
Member

Hi @ryysud, did you have a chance to look at #2063?
We decided that this fix needs to be included in the 1.0.0 release. Please let us know if you need any help. Thanks!

@ryysud
Copy link
Contributor Author

ryysud commented Jan 26, 2021

Sorry for the delay. I created the pull-request #2079.

Comment on lines +267 to +285
client := api_workload.NewX509Client(&api_workload.X509ClientConfig{
Addr: a.c.BindAddress,
FailOnError: true,
})
defer client.Stop()

errCh := make(chan error, 1)
go func() {
errCh <- client.Start()
}()

err := <-errCh
if status.Code(err) == codes.Unavailable {
return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output
}

return health.Details{
Message: "successfully created a workload api client to fetch x509 svid",
}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had something very similar in mind. I think it needs a context.WithTimeout and a select or we risk blocking on the client indefinitely:
https://play.golang.org/p/_R0PeOF3tU9
vs.
https://play.golang.org/p/KOOZuJNTELv

Obviously, we would want to add a test for the race detector on that select as well.

@@ -1,5 +1,5 @@
# This is the SPIRE Agent configuration file including all possible configuration
# options.
# options.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: All the minor whitespace corrections in the doc comments distract from the main purpose of the pull request. I agree the corrections should be made, it would be easier to review in a separate commit.

// getCheckingReadinessInterval returns the configured value or a default
func (c *Config) getCheckingReadinessInterval() string {
if c.CheckingReadinessInterval == "" {
return "1m"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use a string constant for the default value

Is 1m an appropriate default? I would think 10s.

}

// getAddress returns an address suitable for use as http.Server.Addr.
func (c *Config) getAddress() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this method above getReadyPath makes review a little more difficult. This is not actually a change to getAddress(), but I have to work to confirm that.

if c.BindAddress != "" {
host = c.BindAddress
// getCheckingReadinessInterval returns the configured value or a default
func (c *Config) getCheckingReadinessInterval() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please initially leave method order the same, add this new method (above or below doesn't matter) in one commit, and then do any move as a separate commit so it is easier to confirm the move doesn't change anything.

@zmt
Copy link
Contributor

zmt commented Jul 12, 2021

Oops - I had an unsubmitted review just hanging about so I clicked submit even though this is already merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The readiness endpoint for health checking is not implemented

5 participants