Implement the readiness endpoint for health checking by ryysud · Pull Request #2015 · spiffe/spire

ryysud · 2020-12-10T06:16:21Z

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

Pull Request check list

Commit conforms to CONTRIBUTING.md?
Proper tests/regressions included?
Documentation updated?

Affected functionality

Health check.

Description of change

I implemented the readiness endpoint for health checking.

Which issue this PR fixes

This PR fixes #1980.

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

ryysud · 2020-12-10T06:57:05Z

Using the following commands, I made sure the integration test was successful.

make images
SUITES='suites/k8s suites/k8s-reconcile' make integration

NOTE: "SUITES" variable is added in #2013

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

ryysud · 2021-01-08T09:37:12Z

Could you review this?

azdagron · 2021-01-08T21:17:14Z

Hi @ryysud! Thank you for your patience and I apologize that we have waited this long to provide feedback. Turns out that this PR has been the catalyst for some discussion among the maintainers about the overall design of the health system for SPIRE. We've limped along for quite some time with the current approach but SPIRE deserves a more holistic and complete design for the health subsystem. I'm on point to put some initial designs together, but will likely not be able to get to it until next week sometime.

Once we have a vision on the direction we want to move in, we can then reevaluate this PR. I hope to have additional feedback soon.

azdagron · 2021-01-12T21:36:50Z

The maintainers got together and talked about this PR at some length. The outcome of that discussion was that this PR represents an incremental step forward: it provides functionality equivalent to the CLI health checks and can be merged without hampering any long-term effort to implement a robust health system inside of SPIRE.

As such, I think we can take this. I'll give it another pass to make sure I didn't miss anything. I'll be opening an issue to track the planning and proposal of the health system.

Thanks again for putting this together @ryysud. I appreciate your patience!

azdagron · 2021-01-12T21:56:27Z

conf/server/server_full.conf

+
+#     # checking_readiness_interval: Interval for checking server readiness. Default: 1m.
+#     # checking_readiness_interval = "1m"


What are your thoughts around making this configurable? Is this something we can punt on right now? The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this.

If we do keep this configurable, I wonder if we need a separate configurable for liveness checks or if we should just use the same interval for both. If separate, I'd suggest naming this ready_check_interval, but if combined it could just be check_interval.

"The current checks don't seem costly so as an operator, it isn't clear to me when I'd want to change this."

That makes sense, so I removed the checking_readiness_interval parameter with 703c624.

azdagron · 2021-01-12T21:58:34Z

pkg/server/server.go

+	client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
+	if err != nil {
+		return nil, errors.New("cannot create registration client")
+	}


We need to defer a call to Release() on the client or we will leak a gRPC connection.

I fixed it with 4a3c745.
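The leak described above can be avoided by deferring the release immediately after the client is created, so that every return path closes the connection. The following is only a minimal sketch of that pattern: `fakeClient`, `FetchBundle`, and `checkStatus` are hypothetical stand-ins and not the actual SPIRE types; only the `Release()` method name comes from the review comment.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeClient is a hypothetical stand-in for the server client. It only
// records whether Release was called so the pattern can be observed.
type fakeClient struct {
	released  bool
	failFetch bool
}

// Release mimics closing the underlying gRPC connection.
func (c *fakeClient) Release() { c.released = true }

// FetchBundle is a hypothetical call that may fail.
func (c *fakeClient) FetchBundle() error {
	if c.failFetch {
		return errors.New("bundle unavailable")
	}
	return nil
}

// checkStatus defers Release right away, so the client is released on
// both the success path and the error path.
func checkStatus(c *fakeClient) error {
	defer c.Release()

	if err := c.FetchBundle(); err != nil {
		return err
	}
	return nil
}

func main() {
	c := &fakeClient{failFetch: true}
	err := checkStatus(c)
	fmt.Println("error:", err, "released:", c.released)
}
```

Placing the `defer` directly after construction keeps the cleanup next to the acquisition, which makes it harder to forget when new early returns are added later.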

azdagron · 2021-01-12T22:00:31Z

test/integration/suites/k8s/conf/server/spire-server.yaml

+    health_checks {
+      listener_enabled = true
+      bind_address = "0.0.0.0"
+      bind_port = "80"


Can we use a port other than 80? We're actively exploring ways to run SPIRE as a non-root user in containers and binding to port 80 will require elevated capabilities, so this will end up breaking.

I fixed it with 358825e.

Also, I updated the k8s-workload-registrar.conf in suites/k8s-reconcile with a79c88c because the default value of metrics_addr is ":8080", which conflicts with the health check port.

azdagron · 2021-01-12T22:00:38Z

test/integration/suites/k8s/conf/agent/spire-agent.yaml

+    health_checks {
+      listener_enabled = true
+      bind_address = "0.0.0.0"
+      bind_port = "80"


Can we use a port other than 80? We're actively exploring ways to run SPIRE as a non-root user in containers and binding to port 80 will require elevated capabilities, so this will end up breaking.

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

azdagron

Just some very small comments and then I think we can take this. Thank you @ryysud !

azdagron · 2021-01-15T18:24:04Z

conf/agent/agent_full.conf


 #     # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
-#     # bind_port = "80"
+#     # bind_port = "8080"


We should probably leave this as is, since the default is currently 80. We can update the documentation if we ever migrate away from it.

OK! I fixed that with 1fb4a0d.

azdagron · 2021-01-15T18:24:17Z

conf/server/server_full.conf


 #     # bind_port: HTTP Port number of the health checks endpoint. Default: 80.
-#     # bind_port = "80"
+#     # bind_port = "8080"


I fixed that with the above commit.

azdagron · 2021-01-15T18:26:16Z

pkg/server/server.go

+
+	// Currently using the ability to fetch a bundle as the health check. This
+	// **could** be problematic if the Upstream CA signing process is lengthy.
+	// As currently coded however, the registration API isn't served until after


nitpick: I know this comment was copied, but it is now out of date since we aren't using the (now deprecated) Registration API, but the Bundle API. Maybe we should just say API...

Suggested change

-	// As currently coded however, the registration API isn't served until after
+	// As currently coded however, the API isn't served until after

I fixed that with the above commit.

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

azdagron

\o/

ryysud · 2021-01-16T17:28:06Z

Thank you for your review, @azdagron!

amartinezfayo

I've noticed that health check errors may be shown at startup due to a race. Please see my comments below.

amartinezfayo · 2021-01-21T00:28:24Z

pkg/server/server.go

 // Status is used as a top-level health check for the Server.
 func (s *Server) Status() (interface{}, error) {
-	return nil, nil
+	client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)


I'm realizing that there could be a race here at startup time, where this can be executed before the server is serving. In that case, throwing this error could be confusing.
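One hypothetical way to avoid a confusing error during this startup window is to have the status check consult a "serving" flag that is set only once the endpoints are up. The sketch below illustrates that idea only; it is not taken from this PR or from the follow-up fix, and `server`, `markServing`, `status`, and `errNotStarted` are all assumed names.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNotStarted is a hypothetical sentinel that distinguishes "not serving
// yet" from a genuine failure.
var errNotStarted = errors.New("server has not started serving yet")

// server is a hypothetical stand-in holding only the serving flag.
type server struct {
	serving atomic.Bool
}

// markServing would be called once the API listeners are accepting requests.
func (s *server) markServing() { s.serving.Store(true) }

// status returns errNotStarted until markServing has been called, so the
// caller can treat the startup window differently from a real error.
func (s *server) status() (string, error) {
	if !s.serving.Load() {
		return "", errNotStarted
	}
	return "ready", nil
}

func main() {
	s := &server{}
	_, err := s.status()
	fmt.Println("before start:", err)

	s.markServing()
	msg, _ := s.status()
	fmt.Println("after start:", msg)
}
```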

amartinezfayo · 2021-01-21T00:29:41Z

pkg/server/server.go

-	return nil, nil
+	client, err := server_util.NewServerClient(s.config.BindUDSAddress.Name)
+	if err != nil {
+		return nil, errors.New("cannot create registration client")


I think that it would be good to expose the error captured in err.
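Exposing the captured error is commonly done in Go by wrapping it with `fmt.Errorf` and the `%w` verb, which keeps the original cause in the message and available to `errors.Is`. The following is a minimal sketch under assumptions: `newClient`, `status`, `causedBy`, and `errDial` are hypothetical names used for illustration, and only the message text "cannot create registration client" comes from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// errDial is a hypothetical underlying failure.
var errDial = errors.New("dial failed: socket not found")

// newClient is a hypothetical constructor that always fails, standing in
// for the real client constructor.
func newClient(addr string) error { return errDial }

// status wraps the underlying error instead of discarding it, so the
// operator sees why the client could not be created.
func status(addr string) error {
	if err := newClient(addr); err != nil {
		return fmt.Errorf("cannot create registration client: %w", err)
	}
	return nil
}

// causedBy reports whether err wraps target somewhere in its chain.
func causedBy(err, target error) bool { return errors.Is(err, target) }

func main() {
	err := status("/tmp/example.sock")
	fmt.Println(err)
	fmt.Println("caused by dial failure:", causedBy(err, errDial))
}
```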

amartinezfayo · 2021-01-21T00:29:56Z

pkg/server/server.go

+	// As currently coded however, the API isn't served until after
+	// the server CA has been signed by upstream.
+	if _, err := bundleClient.GetBundle(context.Background(), &bundle.GetBundleRequest{}); err != nil {
+		return nil, errors.New("unable to fetch bundle")


I think that it would be good to expose the error captured in err.

amartinezfayo · 2021-01-21T00:30:40Z

pkg/agent/agent.go

 // Status is used as a top-level health check for the Agent.
 func (a *Agent) Status() (interface{}, error) {
-	return nil, nil
+	client := api_workload.NewX509Client(&api_workload.X509ClientConfig{


The same situation as the server applies here also.

amartinezfayo · 2021-01-21T00:33:58Z

pkg/agent/agent.go

+
+	err := <-errCh
+	if status.Code(err) == codes.Unavailable {
+		return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output


I think that it would be good to expose the error captured in err.

ryysud · 2021-01-21T10:16:07Z

Thank you for your review, @amartinezfayo!
I will fix them and create a new pull-request.

amartinezfayo · 2021-01-21T20:49:27Z

I filed #2063 for the issue at startup.

amartinezfayo · 2021-01-25T20:35:21Z

Hi @ryysud, did you have a chance to look at #2063?
We decided that this fix needs to be included in the 1.0.0 release. Please let us know if you need any help. Thanks!

ryysud · 2021-01-26T06:07:12Z

Sorry for the delay. I created the pull-request #2079.

zmt · 2020-12-10T18:19:11Z

pkg/agent/agent.go

+	client := api_workload.NewX509Client(&api_workload.X509ClientConfig{
+		Addr:        a.c.BindAddress,
+		FailOnError: true,
+	})
+	defer client.Stop()
+
+	errCh := make(chan error, 1)
+	go func() {
+		errCh <- client.Start()
+	}()
+
+	err := <-errCh
+	if status.Code(err) == codes.Unavailable {
+		return nil, errors.New("workload api is unavailable") //nolint: golint // error is (ab)used for CLI output
+	}
+
+	return health.Details{
+		Message: "successfully created a workload api client to fetch x509 svid",
+	}, nil


I had something very similar in mind. I think it needs a context.WithTimeout and a select or we risk blocking on the client indefinitely:
https://play.golang.org/p/_R0PeOF3tU9
vs.
https://play.golang.org/p/KOOZuJNTELv

Obviously, we would want to add a test for the race detector on that select as well.
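The timeout-and-select suggestion above could look roughly like the sketch below. It is not a reproduction of the linked playground snippets. `startWithTimeout`, `fastStart`, `slowStart`, `timedOut`, and `demoTimeout` are hypothetical names, and the `start` function stands in for a blocking call such as the client's `Start()`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// demoTimeout is an arbitrary value chosen for this sketch.
const demoTimeout = 50 * time.Millisecond

// startWithTimeout runs start in a goroutine and waits for either its
// result or the timeout, whichever comes first.
func startWithTimeout(timeout time.Duration, start func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Buffered so the goroutine can still send and exit after a timeout.
	errCh := make(chan error, 1)
	go func() { errCh <- start() }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		// start may still be running; a real caller would also stop the
		// client here so the goroutine can finish.
		return ctx.Err()
	}
}

// fastStart returns immediately; slowStart blocks longer than demoTimeout.
func fastStart() error { return nil }
func slowStart() error { time.Sleep(time.Second); return nil }

// timedOut reports whether err is a deadline expiry.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

func main() {
	fmt.Println("fast:", startWithTimeout(demoTimeout, fastStart))
	fmt.Println("slow:", startWithTimeout(demoTimeout, slowStart))
}
```

The buffered channel matters here: with an unbuffered channel, a goroutine that finishes after the timeout would block forever on the send.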

zmt · 2020-12-10T18:28:48Z

conf/agent/agent_full.conf

@@ -1,5 +1,5 @@
 # This is the SPIRE Agent configuration file including all possible configuration
-# options. 
+# options.


nit: All the minor whitespace corrections in the doc comments distract from the main purpose of the pull request. I agree the corrections should be made, but they would be easier to review in a separate commit.

zmt · 2020-12-10T18:30:44Z

pkg/common/health/config.go

+// getCheckingReadinessInterval returns the configured value or a default
+func (c *Config) getCheckingReadinessInterval() string {
+	if c.CheckingReadinessInterval == "" {
+		return "1m"


nit: use a string constant for the default value

Is 1m an appropriate default? I would think 10s.
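The string-constant suggestion might be sketched as follows. This is illustrative only, since the parameter was later removed from the PR in 703c624. The constant name `defaultCheckingReadinessInterval` is an assumption, and the `Config` type here is a reduced stand-in containing just the one field from the diff.

```go
package main

import "fmt"

// defaultCheckingReadinessInterval is a hypothetical constant name; "1m" is
// the default value shown in the diff above.
const defaultCheckingReadinessInterval = "1m"

// Config is a reduced stand-in for the health check configuration.
type Config struct {
	CheckingReadinessInterval string
}

// getCheckingReadinessInterval returns the configured value or the default.
func (c *Config) getCheckingReadinessInterval() string {
	if c.CheckingReadinessInterval == "" {
		return defaultCheckingReadinessInterval
	}
	return c.CheckingReadinessInterval
}

func main() {
	unset := &Config{}
	custom := &Config{CheckingReadinessInterval: "10s"}
	fmt.Println(unset.getCheckingReadinessInterval())
	fmt.Println(custom.getCheckingReadinessInterval())
}
```

With the default in one named constant, changing it (for example to the 10s proposed above) touches a single line and keeps documentation references easy to find.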

zmt · 2020-12-10T18:32:40Z

pkg/common/health/config.go

 }

+// getAddress returns an address suitable for use as http.Server.Addr.
+func (c *Config) getAddress() string {


Moving this method above getReadyPath makes review a little more difficult. This is not actually a change to getAddress(), but I have to work to confirm that.

zmt · 2020-12-10T18:44:35Z

pkg/common/health/config.go

-	if c.BindAddress != "" {
-		host = c.BindAddress
+// getCheckingReadinessInterval returns the configured value or a default
+func (c *Config) getCheckingReadinessInterval() string {


Please initially leave method order the same, add this new method (above or below doesn't matter) in one commit, and then do any move as a separate commit so it is easier to confirm the move doesn't change anything.

zmt · 2021-07-12T22:42:04Z

Oops - I had an unsubmitted review just hanging about so I clicked submit even though this is already merged.

ryysud requested review from APTy, amartinezfayo, azdagron, evan2645 and mcpherrinm as code owners December 10, 2020 06:16

Implement the readiness endpoint for health checking

ee89782

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

ryysud force-pushed the impl-readiness-endpoint branch from c3ae644 to ee89782 Compare December 10, 2020 06:34

azdagron assigned evan2645 and azdagron Dec 15, 2020

Merge branch 'master' into impl-readiness-endpoint

65ed547

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

ryysud requested a review from rturner3 as a code owner January 5, 2021 01:51

Add missing package

fd5512c

Signed-off-by: Ryuma Yoshida <ryuma.y1117@gmail.com>

azdagron mentioned this pull request Jan 12, 2021

Design a robust health system for SPIRE #2047

Closed

azdagron reviewed Jan 12, 2021

ryysud added 4 commits January 13, 2021 14:29

Remove the checking_readiness_interval parameter

703c624

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

Avoid the gRPC connection leaks in the health system

4a3c745

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

Update the bind_port in the health_checks block example

358825e

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

Fix suites/k8s-reconcile

a79c88c

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

ryysud mentioned this pull request Jan 13, 2021

Fix the default value of metrics_addr in the k8s-workload-registrar doc #2050

Merged

3 tasks

azdagron reviewed Jan 15, 2021

Address PR comments

1fb4a0d

Signed-off-by: Ryuma Yoshida <ryumyosh@zlab.co.jp>

azdagron approved these changes Jan 16, 2021

Merge branch 'master' into impl-readiness-endpoint

7759eb5

azdagron merged commit 277aabf into spiffe:master Jan 16, 2021

ryysud deleted the impl-readiness-endpoint branch January 16, 2021 17:25

amartinezfayo reviewed Jan 21, 2021

ryysud mentioned this pull request Jan 26, 2021

Fix the health checks not to be executed before the server and agent are running #2079

Closed

3 tasks

amartinezfayo mentioned this pull request Jan 27, 2021

Health checks errors may be shown at startup #2063

Closed

azdagron added this to the 0.12.2 milestone Mar 23, 2021

azdagron mentioned this pull request Mar 30, 2021

Prepare the v0.12 release branch for v0.12.2 #2183

Merged

zmt reviewed Jul 12, 2021

This was referenced Feb 27, 2026

🌱 CNCF mission generation 2026-02-27 kubestellar/console-kb#6

Closed

🌱 CNCF mission generation 2026-02-27 kubestellar/console-kb#11

Merged

