Implement the readiness endpoint for health checking #2015
azdagron merged 9 commits into spiffe:master
Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
Using the following commands, I made sure the integration tests were successful:
    make images
    SUITES='suites/k8s suites/k8s-reconcile' make integration
NOTE: The "SUITES" variable was added in #2013.
Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>
Could you review this?
Hi @ryysud! Thank you for your patience, and I apologize that we have waited this long to provide feedback. It turns out that this PR has been the catalyst for some discussion among the maintainers about the overall design of the health system for SPIRE. We've limped along for quite some time with the current approach, but SPIRE deserves a more holistic and complete design for the health subsystem. I'm on point to put some initial designs together, but will likely not be able to get to it until sometime next week. Once we have a vision for the direction we want to move in, we can reevaluate this PR. I hope to have additional feedback soon.
The maintainers got together and talked about this PR at some length. The outcome of that discussion was that this PR represents an incremental step forward: it provides functionality equivalent to the CLI health checks and can be merged without hampering any long-term effort to implement a robust health system inside SPIRE. As such, I think we can take this. I'll give it another pass to make sure I didn't miss anything. I'll be opening an issue to track the planning and proposal of the health system. Thanks again for putting this together, @ryysud. I appreciate your patience!
conf/server/server_full.conf
Outdated
    # # checking_readiness_interval: Interval for checking server readiness. Default: 1m.
    # # checking_readiness_interval = "1m"
What are your thoughts around making this configurable? Is this something we can punt on right now? The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this.
If we do keep this configurable, I wonder if we need a separate configurable for liveness checks or if we should just use the same interval for both. If separate, I'd suggest naming this ready_check_interval, but if combined it could just be check_interval.
> The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this.
That makes sense, so I removed the checking_readiness_interval parameter in 703c624.
    client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
    if err != nil {
        return nil, errors.New("cannot create registration client")
    }
We need to defer a call to Release() on the client or we will leak a gRPC connection.
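To illustrate the point about Release(), here is a minimal sketch of the pattern. `fakeClient` and `checkStatus` are hypothetical stand-ins invented for this example, not the real `server_util` client; the only thing being shown is that a deferred Release() runs on every return path, including the error path.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeClient is a hypothetical stand-in for the client returned by
// server_util.NewServerClient; it only records whether Release was called.
type fakeClient struct {
	released bool
}

// Release simulates freeing the underlying gRPC connection.
func (c *fakeClient) Release() {
	c.released = true
}

// checkStatus mimics the shape of a Status method. Because Release is
// deferred right after the client is obtained, it runs whether the check
// succeeds or fails.
func checkStatus(c *fakeClient, fail bool) error {
	defer c.Release()

	if fail {
		return errors.New("unable to fetch bundle")
	}
	return nil
}

func main() {
	ok := &fakeClient{}
	fmt.Println(checkStatus(ok, false), ok.released)

	bad := &fakeClient{}
	fmt.Println(checkStatus(bad, true), bad.released)
}
```

Placing the defer immediately after the error check on client creation keeps the cleanup next to the acquisition, so later early returns cannot skip it.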
    health_checks {
        listener_enabled = true
        bind_address = "0.0.0.0"
        bind_port = "80"
Can we use a port other than 80? We're actively exploring ways to run SPIRE as a non-root user in containers, and binding to port 80 requires elevated capabilities, so this will end up breaking.
Also, I updated k8s-workload-registrar.conf in suites/k8s-reconcile (a79c88c) because the default value of metrics_addr is ":8080", which conflicts with the health check port.
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
conf/agent/agent_full.conf
Outdated
      # # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
    - # # bind_port = "80"
    + # # bind_port = "8080"
We should probably leave this as is, since the default is currently 80. We can update the documentation if we ever migrate away from it.
conf/server/server_full.conf
Outdated
      # # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
    - # # bind_port = "80"
    + # # bind_port = "8080"
I fixed that with the above commit.
pkg/server/server.go
Outdated
    // Currently using the ability to fetch a bundle as the health check. This
    // **could** be problematic if the Upstream CA signing process is lengthy.
    // As currently coded however, the registration API isn't served until after
nitpick: I know this comment was copied, but it is now out of date since we aren't using the (now deprecated) Registration API, but the Bundle API. Maybe we should just say API...
    - // As currently coded however, the registration API isn't served until after
    + // As currently coded however, the API isn't served until after
I fixed that with the above commit.
Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>
Thank you for your review, @azdagron!
amartinezfayo left a comment
I've noticed that health check errors may be shown at startup due to a race. Please see my comments below.
      // Status is used as a top-level health check for the Server.
      func (s *Server) Status() (interface{}, error) {
    -     return nil, nil
    +     client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
I'm realizing that there could be a race here at startup time, where this can be executed before the server is serving. In that case, returning this error could be confusing.
    -     return nil, nil
    +     client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
    +     if err != nil {
    +         return nil, errors.New("cannot create registration client")
I think that it would be good to expose the error captured in err.
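As a sketch of what exposing `err` could look like, the example below wraps the cause with `%w` instead of discarding it. `newClient` and `errDial` are hypothetical stand-ins for the real client constructor and its failure; only the error messages "cannot create registration client" come from the code under review.

```go
package main

import (
	"errors"
	"fmt"
)

// errDial is a hypothetical low-level failure standing in for whatever the
// real client constructor might return.
var errDial = errors.New("connection refused")

// newClient is a hypothetical constructor that always fails, for illustration.
func newClient() error {
	return errDial
}

// statusOpaque mirrors the code under review: the cause is discarded.
func statusOpaque() error {
	if err := newClient(); err != nil {
		return errors.New("cannot create registration client")
	}
	return nil
}

// statusWrapped keeps the same summary but appends the cause with %w, so the
// message says what went wrong and errors.Is can still match the cause.
func statusWrapped() error {
	if err := newClient(); err != nil {
		return fmt.Errorf("cannot create registration client: %w", err)
	}
	return nil
}

// wrapsCause reports whether err still carries errDial in its chain.
func wrapsCause(err error) bool {
	return errors.Is(err, errDial)
}

func main() {
	fmt.Println(statusOpaque())
	fmt.Println(statusWrapped())
	fmt.Println(wrapsCause(statusOpaque()), wrapsCause(statusWrapped()))
}
```

With wrapping, the health check output would include the underlying reason, which is usually what an operator needs when the check fails.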
    // As currently coded however, the API isn't served until after
    // the server CA has been signed by upstream.
    if _, err := bundleClient.GetBundle(context.Background(), &bundle.GetBundleRequest{}); err != nil {
        return nil, errors.New("unable to fetch bundle")
I think that it would be good to expose the error captured in err.
      // Status is used as a top-level health check for the Agent.
      func (a *Agent) Status() (interface{}, error) {
    -     return nil, nil
    +     client := api_workload.NewX509Client(&api_workload.X509ClientConfig{
The same situation as the server applies here also.
    err := <-errCh
    if status.Code(err) == codes.Unavailable {
        return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output
I think that it would be good to expose the error captured in err.
Thank you for your review, @amartinezfayo!
I filed #2063 for the issue at startup.
Sorry for the delay. I created pull request #2079.
    client := api_workload.NewX509Client(&api_workload.X509ClientConfig{
        Addr:        a.c.BindAddress,
        FailOnError: true,
    })
    defer client.Stop()

    errCh := make(chan error, 1)
    go func() {
        errCh <- client.Start()
    }()

    err := <-errCh
    if status.Code(err) == codes.Unavailable {
        return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output
    }

    return health.Details{
        Message: "successfully created a workload api client to fetch x509 svid",
    }, nil
I had something very similar in mind. I think it needs a context.WithTimeout and a select or we risk blocking on the client indefinitely:
https://play.golang.org/p/_R0PeOF3tU9
vs.
https://play.golang.org/p/KOOZuJNTELv
Obviously, we would want to add a test for the race detector on that select as well.
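For readers who cannot open the playground links, the general shape of the context.WithTimeout plus select approach might look like the sketch below. This is not the code from those links; `startWithTimeout`, `probe`, and `timedOut` are hypothetical names, and `start` stands in for a blocking call such as client.Start().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// startWithTimeout runs start in a goroutine and waits for its result or for
// the timeout, whichever comes first. start is a hypothetical stand-in for a
// blocking call such as client.Start().
func startWithTimeout(parent context.Context, timeout time.Duration, start func() error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	// Buffered so the goroutine can always send and exit, even after the
	// caller has stopped waiting.
	errCh := make(chan error, 1)
	go func() {
		errCh <- start()
	}()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return fmt.Errorf("health check timed out: %w", ctx.Err())
	}
}

// probe is a hypothetical helper: it simulates a start call that takes
// delayMs milliseconds and checks it with a timeout of timeoutMs milliseconds.
func probe(delayMs, timeoutMs int) error {
	start := func() error {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		return nil
	}
	return startWithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond, start)
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(probe(0, 200))            // start returns well before the deadline
	fmt.Println(timedOut(probe(500, 20))) // start outlives the deadline
}
```

Note that the timeout only stops the caller from waiting; it does not stop the underlying call. In the agent code above, the deferred client.Stop() would presumably be what unblocks client.Start() after the check gives up.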
    @@ -1,5 +1,5 @@
      # This is the SPIRE Agent configuration file including all possible configuration
    - # options.
    + # options.
nit: All the minor whitespace corrections in the doc comments distract from the main purpose of the pull request. I agree the corrections should be made, but they would be easier to review in a separate commit.
pkg/common/health/config.go
Outdated
    // getCheckingReadinessInterval returns the configured value or a default
    func (c *Config) getCheckingReadinessInterval() string {
        if c.CheckingReadinessInterval == "" {
            return "1m"
nit: use a string constant for the default value
Is 1m an appropriate default? I would think 10s.
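A small sketch of the string-constant suggestion follows. The constant name `defaultCheckingReadinessInterval` and the trimmed-down `Config` type are hypothetical; the "1m" value is the default from the code under review, and whether 10s would be a better value is left open here.

```go
package main

import "fmt"

// defaultCheckingReadinessInterval is a hypothetical constant name; "1m" is
// the default used in the code under review.
const defaultCheckingReadinessInterval = "1m"

// Config is a trimmed-down, hypothetical version of the health config that
// keeps only the field relevant to this example.
type Config struct {
	CheckingReadinessInterval string
}

// getCheckingReadinessInterval returns the configured value or the default.
func (c *Config) getCheckingReadinessInterval() string {
	if c.CheckingReadinessInterval == "" {
		return defaultCheckingReadinessInterval
	}
	return c.CheckingReadinessInterval
}

func main() {
	fmt.Println((&Config{}).getCheckingReadinessInterval())
	fmt.Println((&Config{CheckingReadinessInterval: "10s"}).getCheckingReadinessInterval())
}
```

Naming the default once means that changing it later touches a single line rather than every place the literal appears.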
    }

    // getAddress returns an address suitable for use as http.Server.Addr.
    func (c *Config) getAddress() string {
Moving this method above getReadyPath makes review a little more difficult. This is not actually a change to getAddress(), but I have to work to confirm that.
pkg/common/health/config.go
Outdated
    if c.BindAddress != "" {
        host = c.BindAddress
    // getCheckingReadinessInterval returns the configured value or a default
    func (c *Config) getCheckingReadinessInterval() string {
Please initially leave method order the same, add this new method (above or below doesn't matter) in one commit, and then do any move as a separate commit so it is easier to confirm the move doesn't change anything.
Oops - I had an unsubmitted review just hanging about so I clicked submit even though this is already merged.
Signed-off-by: Ryuma Yoshida ryuma.y1117@gmail.com
Pull Request check list
Affected functionality
Health check.
Description of change
I implemented the readiness endpoint for health checking.
Which issue this PR fixes
This PR fixes #1980.