Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics: Differentiate between restricted and unrestricted certificates #13214

Merged
merged 15 commits into from
Mar 28, 2024

Conversation

roosterfish
Copy link
Contributor

@roosterfish roosterfish commented Mar 26, 2024

Fixes #13200.

Overview

This PR adds two new identity types IdentityTypeCertificateMetricsRestricted and IdentityTypeCertificateMetricsUnrestricted. They effectively replace IdentityTypeCertificateMetrics which doesn't track if a metrics certificate is restricted or not.

Additionally the respective DB patch is modified to map restricted certificates to their right identity type and vice versa.

Furthermore the logic for checking metrics samples is now modified to require looping the set of samples only once which should improve the performance.

Permissions

The following permissions checking is now in place when accessing the /1.0/metrics endpoint:

Method URL Auth Entitlement Identity Type Outcome
TLS /1.0/metrics can_view_metrics server Return all the projects metrics including internal server metrics
TLS /1.0/metrics?project=xyz can_view_metrics project Return the project specific metrics. *1
OIDC /1.0/metrics can_view_metrics server Return all the project metrics including internal server metrics
OIDC /1.0/metrics?project=xyz can_view_metrics project Return the project specific metrics. *2

*1: If the certificate is unrestricted also return the internal server metrics. This is too ensure backwards compatibility with what is in place currently (5.20).

*2: This doesn't specifically check if there is another permission to allow viewing the projects instances. Having can_view_metrics on the project level supersedes any more fine grained entitlements on instances. If the user also has can_view_metrics on server return the internal server metrics too.

@github-actions github-actions bot added the API Changes to the REST API label Mar 26, 2024
@roosterfish roosterfish force-pushed the fix_metrics_perms branch 5 times, most recently from 4ccf8b6 to fced0e7 Compare March 26, 2024 11:58
@tomponline tomponline requested a review from markylaing March 26, 2024 14:58
@roosterfish roosterfish marked this pull request as ready for review March 26, 2024 16:06
@roosterfish roosterfish requested a review from tomponline as a code owner March 26, 2024 16:06
@roosterfish
Copy link
Contributor Author

By the way, I was trying to see what is the current behavior when using OIDC on the metrics endpoint but it doesn't return any metric samples, only their descriptions.

For me it only works when using certificates, so OIDC metrics looks broken.

@tomponline
Copy link
Member

so OIDC metrics looks broken

please open separate issue and see if @markylaing can help with this on his return.

@roosterfish
Copy link
Contributor Author

so OIDC metrics looks broken

please open separate issue and see if @markylaing can help with this on his return.

With the changes in this PR it's working again. Somehow the filtering of samples got corrupted, but it looks to be fixed with 691b57f.

@roosterfish
Copy link
Contributor Author

I have updated the table to see which permissions are now checked when accessing the metrics endpoint.

@tomponline
Copy link
Member

so OIDC metrics looks broken

please open separate issue and see if @markylaing can help with this on his return.

With the changes in this PR it's working again. Somehow the filtering of samples got corrupted, but it looks to be fixed with 691b57f.

Cool! So Does this PR fix #13217 ?

lxd/metrics/metrics.go Outdated Show resolved Hide resolved
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good thanks!

Will await @markylaing thoughts too.

Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I missed one thing, please can you add a test for checking TLS cert access to metrics (restricted/unrestricted) to catch any future regressions like this

@roosterfish
Copy link
Contributor Author

Cool! So Does this PR fix #13217 ?

No, it's just another problem with returning metrics. But it's fixed in this PR.

And there seems to be one more. The internal server metrics don't show up in the response if the cache interval of 8 seconds expired.
This can be easily reproduced when querying the endpoint once (first response never contains the internal server metrics), then any subsequent request will contain the metrics until 8 seconds have passed and then the next request won't contain the metrics. Every subsequent request will again contain the metrics.

@gabrielmougard have you seen the same behavior and maybe fixed it with the changes in #13222?

@gabrielmougard
Copy link
Contributor

Cool! So Does this PR fix #13217 ?

No, it's just another problem with returning metrics. But it's fixed in this PR.

And there seems to be one more. The internal server metrics don't show up in the response if the cache interval of 8 seconds expired. This can be easily reproduced when querying the endpoint once (first response never contains the internal server metrics), then any subsequent request will contain the metrics until 8 seconds have passed and then the next request won't contain the metrics. Every subsequent request will again contain the metrics.

@gabrielmougard have you seen the same behavior and maybe fixed it with the changes in #13222?

Yes I see it too. I'm trying to come up with a fix on #13222

@gabrielmougard
Copy link
Contributor

gabrielmougard commented Mar 27, 2024

@roosterfish regarding the internal server metrics, I made them appear in the cache but I had to tie them to a project (I chose default as this one will always exist):

diff --git a/lxd/api_metrics.go b/lxd/api_metrics.go
index 66999f11f7..2f28e911b7 100644
--- a/lxd/api_metrics.go
+++ b/lxd/api_metrics.go
@@ -101,7 +101,7 @@ func metricsGet(d *Daemon, r *http.Request) response.Response {
 	metricSet := metrics.NewMetricSet(nil)
 
 	var projectNames []string
-
+	var intMetrics *metrics.MetricSet
 	err := s.DB.Cluster.Transaction(r.Context(), func(ctx context.Context, tx *db.ClusterTx) error {
 		// Figure out the projects to retrieve.
 		if projectName != "" {
@@ -119,9 +119,8 @@ func metricsGet(d *Daemon, r *http.Request) response.Response {
 			}
 		}
 
-		// Add internal metrics.
-		metricSet.Merge(internalMetrics(ctx, s.StartTime, tx))
-
+		// register internal metrics.
+		intMetrics = internalMetrics(ctx, s.StartTime, tx)
 		return nil
 	})
 	if err != nil {
@@ -304,6 +303,10 @@ func metricsGet(d *Daemon, r *http.Request) response.Response {
 
 	updatedProjects := []string{}
 	for project, entries := range newMetrics {
+		if project == "default" {
+			entries.Merge(intMetrics) // internal metrics are always considered new and associated with the default project.
+		}
+
 		metricsCache[project] = metricsCacheEntry{
 			expiry:  time.Now().Add(cacheDuration),
 			metrics: entries,
@@ -402,28 +405,28 @@ func internalMetrics(ctx context.Context, daemonStartTime time.Time, tx *db.Clus
 
 	runtime.ReadMemStats(&ms)
 
-	out.AddSamples(metrics.GoAllocBytes, metrics.Sample{Value: float64(ms.Alloc)})
-	out.AddSamples(metrics.GoAllocBytesTotal, metrics.Sample{Value: float64(ms.TotalAlloc)})
-	out.AddSamples(metrics.GoBuckHashSysBytes, metrics.Sample{Value: float64(ms.BuckHashSys)})
-	out.AddSamples(metrics.GoFreesTotal, metrics.Sample{Value: float64(ms.Frees)})
-	out.AddSamples(metrics.GoGCSysBytes, metrics.Sample{Value: float64(ms.GCSys)})
-	out.AddSamples(metrics.GoHeapAllocBytes, metrics.Sample{Value: float64(ms.HeapAlloc)})
-	out.AddSamples(metrics.GoHeapIdleBytes, metrics.Sample{Value: float64(ms.HeapIdle)})
-	out.AddSamples(metrics.GoHeapInuseBytes, metrics.Sample{Value: float64(ms.HeapInuse)})
-	out.AddSamples(metrics.GoHeapObjects, metrics.Sample{Value: float64(ms.HeapObjects)})
-	out.AddSamples(metrics.GoHeapReleasedBytes, metrics.Sample{Value: float64(ms.HeapReleased)})
-	out.AddSamples(metrics.GoHeapSysBytes, metrics.Sample{Value: float64(ms.HeapSys)})
-	out.AddSamples(metrics.GoLookupsTotal, metrics.Sample{Value: float64(ms.Lookups)})
-	out.AddSamples(metrics.GoMallocsTotal, metrics.Sample{Value: float64(ms.Mallocs)})
-	out.AddSamples(metrics.GoMCacheInuseBytes, metrics.Sample{Value: float64(ms.MCacheInuse)})
-	out.AddSamples(metrics.GoMCacheSysBytes, metrics.Sample{Value: float64(ms.MCacheSys)})
-	out.AddSamples(metrics.GoMSpanInuseBytes, metrics.Sample{Value: float64(ms.MSpanInuse)})
-	out.AddSamples(metrics.GoMSpanSysBytes, metrics.Sample{Value: float64(ms.MSpanSys)})
-	out.AddSamples(metrics.GoNextGCBytes, metrics.Sample{Value: float64(ms.NextGC)})
-	out.AddSamples(metrics.GoOtherSysBytes, metrics.Sample{Value: float64(ms.OtherSys)})
-	out.AddSamples(metrics.GoStackInuseBytes, metrics.Sample{Value: float64(ms.StackInuse)})
-	out.AddSamples(metrics.GoStackSysBytes, metrics.Sample{Value: float64(ms.StackSys)})
-	out.AddSamples(metrics.GoSysBytes, metrics.Sample{Value: float64(ms.Sys)})
+	out.AddSamples(metrics.GoAllocBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.Alloc)})
+	out.AddSamples(metrics.GoAllocBytesTotal, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.TotalAlloc)})
+	out.AddSamples(metrics.GoBuckHashSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.BuckHashSys)})
+	out.AddSamples(metrics.GoFreesTotal, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.Frees)})
+	out.AddSamples(metrics.GoGCSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.GCSys)})
+	out.AddSamples(metrics.GoHeapAllocBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapAlloc)})
+	out.AddSamples(metrics.GoHeapIdleBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapIdle)})
+	out.AddSamples(metrics.GoHeapInuseBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapInuse)})
+	out.AddSamples(metrics.GoHeapObjects, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapObjects)})
+	out.AddSamples(metrics.GoHeapReleasedBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapReleased)})
+	out.AddSamples(metrics.GoHeapSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.HeapSys)})
+	out.AddSamples(metrics.GoLookupsTotal, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.Lookups)})
+	out.AddSamples(metrics.GoMallocsTotal, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.Mallocs)})
+	out.AddSamples(metrics.GoMCacheInuseBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.MCacheInuse)})
+	out.AddSamples(metrics.GoMCacheSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.MCacheSys)})
+	out.AddSamples(metrics.GoMSpanInuseBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.MSpanInuse)})
+	out.AddSamples(metrics.GoMSpanSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.MSpanSys)})
+	out.AddSamples(metrics.GoNextGCBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.NextGC)})
+	out.AddSamples(metrics.GoOtherSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.OtherSys)})
+	out.AddSamples(metrics.GoStackInuseBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.StackInuse)})
+	out.AddSamples(metrics.GoStackSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.StackSys)})
+	out.AddSamples(metrics.GoSysBytes, metrics.Sample{Labels: map[string]string{"project": "default"}, Value: float64(ms.Sys)})
 
 	return out
 }

Right now the cache is keyed by project and these internal metrics are 'global' so this solution is a bit misleading, but it shows the right data.. What do you thiink about this approach?

This way we can apply multiple filters on internal server and project metrics at once.
Additionally this ensures that entity URLs are always called with the right type.
Otherwise this will raise a warning for every sample.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish
Copy link
Contributor Author

also @roosterfish I was thinking as our plan here depends on converting already upgraded installation's unrestricted metrics keys to restricted metrics keys, have we confirmed that a user can convert specific keys back to unrestricted ones if thats what they had originally?

Jup, tested this now a few times with installation going from 5.20/stable to latest/edge. Actually I found (contrary to the discussions), if there was a restricted metrics certificate before it stayed restricted. Metrics certificates that were unrestricted got moved to being restricted. So the DB patch didn't give out more access, instead it took away access:

julian@thinkpad:~$ snap install lxd --channel 5.20/stable
lxd (5.20/stable) 5.20-f3dd836 from Canonical✓ installed
julian@thinkpad:~$ lxc config trust add metrics.crt --type metrics 
julian@thinkpad:~$ lxc config trust show a80f90615040 | grep restricted
restricted: false
julian@thinkpad:~$ snap refresh lxd --channel latest/edge
2024-03-28T15:55:45+01:00 INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd (edge) git-e6f9af1 from Canonical✓ refreshed
julian@thinkpad:~$ lxc config trust show a80f90615040 | grep restricted
restricted: true

The PR is now ready for a final review.

// Filter all metrics for view permissions.
// Internal server metrics without labels are compared against the server entity.
metricSet.FilterSamples(func(labels map[string]string) bool {
if labels["project"] != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? What if the label does have a non-empty project, but the user doesn't have a project permission, but does have server permission, this would return false?

Doesn't having the server permission mean they can see all metrics (including the project ones)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay tricky one, thanks for rising this as I have found another potential security bug in my changes. The unrestricted metrics certificate got just too much permission.

But the way the checking is done here is right.
In case of the TLS driver if the certificate is unrestricted, the driver permits can_view_metrics by default (see latest push where I went back to the original changes).
In case of OIDC this is allowed as the can_view_metrics on the server type supersedes the can_view_metrics on the project type. This way the check still returns true even if there is no explicit can_view_metrics entitlement set for the project (see the changed OpenFGA model).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so its resolved internally in each driver that the user has implicit project level permissions if its assigned at the server level. Makes sense thanks

…tityTypeCertificateMetricsUnrestricted

This effectively replaces identityTypeCertificateMetrics with identityTypeCertificateMetricsRestricted
and adds the new type identityTypeCertificateMetricsUnrestricted.

By replacing identityTypeCertificateMetrics with identityTypeCertificateMetricsRestricted
we ensure that restricted metrics certificates that were falsely moved to identityTypeCertificateMetrics
are still considered to be secure.

On the other side this marks unrestricted certificates as restricted by default.
This has to be manually modified by unrestricting the respective metrics certficiate.

Signed-off-by: Julian Pelizäus <[email protected]>
tomponline
tomponline previously approved these changes Mar 28, 2024
Copy link
Member

@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@roosterfish roosterfish marked this pull request as draft March 28, 2024 16:45
@roosterfish
Copy link
Contributor Author

Mhhh, I just saw when adding new metrics certs they are restricted by default. Checking what has now caused this.
I am marking the PR as draft for now just to be sure.

@tomponline
Copy link
Member

Mhhh, I just saw when adding new metrics certs they are restricted by default. Checking what has now caused this. I am marking the PR as draft for now just to be sure.

Seems like we need more tests too.

…cted and unrestricted type

Signed-off-by: Julian Pelizäus <[email protected]>
…TypeCertificateMetricsUnrestricted

Signed-off-by: Julian Pelizäus <[email protected]>
…cted and IdentityTypeCertificateMetricsUnrestricted

Signed-off-by: Julian Pelizäus <[email protected]>
…tyTypeCertificateMetricsUnrestricted

Signed-off-by: Julian Pelizäus <[email protected]>
This allows granting the can_view_metrics entitlement on specific projects
so that requests to the metrics endpoint containing the project query parameter
only return the respective project's metrics.

Signed-off-by: Julian Pelizäus <[email protected]>
Ensure that restricted certificates are mapped to IdentityTypeCertificateMetricsRestricted and vice versa.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish roosterfish marked this pull request as ready for review March 28, 2024 17:08
@roosterfish
Copy link
Contributor Author

Ok it was a display bug, it's still the right type in the database.

Additionally check the responses if the endpoint gets queried using the project parameter
as this is producing different results for restricted and unrestricted metrics certificates.

Signed-off-by: Julian Pelizäus <[email protected]>
@roosterfish
Copy link
Contributor Author

The test is now extended to check for the restriction level too.

@tomponline tomponline merged commit aab7941 into canonical:main Mar 28, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Changes to the REST API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metrics certificates are not restricted by project
4 participants