Send aggregated resource presence usage data in Teleport Enterprise #34954
Envek wants to merge 20 commits into gravitational:master
Conversation
Force-pushed from f1d69cc to 972d9ea
zmb3 left a comment
Let's make sure @espadolini gives a review too.
```go
	return nil, trace.LimitExceeded("failed to marshal resource counts report within size limit (this is a bug)")
}
// ...
report.Records = report.Records[:len(report.Records)/2]
```
Not sure I understand exactly what this code is doing. Can you clarify?
It is copied intact from the userActivities reporting, where it throws away some "excess" data until the report becomes small enough to be stored.
It doesn't have any meaning when storing resource counts per kind, since the report size is constant in that case.
But if we change it to storing anonymized resource names, as @espadolini suggested, then this size restriction becomes valid again:
teleport/lib/usagereporter/teleport/aggregating/service.go
Lines 31 to 34 in 96ed71b
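The shrinking loop being discussed can be sketched as follows. This is a hedged illustration, not Teleport's actual code: `shrinkToFit`, the injected size function, and the 64-byte limit are all assumptions standing in for the real serialization and backend item-size limit.

```go
package main

import "fmt"

// maxSize stands in for the backend item-size limit; the real constant
// lives in the aggregating service (assumption for this sketch).
const maxSize = 64

// shrinkToFit repeatedly drops the tail half of the records until the
// (mock) serialized size fits, mirroring the userActivities loop quoted above.
func shrinkToFit(records []string, size func([]string) int) []string {
	for len(records) > 0 && size(records) > maxSize {
		records = records[:len(records)/2]
	}
	return records
}

func main() {
	// Mock size model: 10 bytes per record.
	size := func(rs []string) int { return 10 * len(rs) }
	records := make([]string, 100)
	fit := shrinkToFit(records, size)
	fmt.Println(len(fit)) // 100 -> 50 -> 25 -> 12 -> 6
}
```

Note that, unlike splitting, this loop discards data: the dropped half of the records is never reported.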
Probably, for resource activity reports, we can split each report by resource kind; then one report for one resource kind can contain up to around 4k anonymized resource names. Will that be enough?
If it's not enough we're gonna have to split each report as well, unfortunately.
As far as the anonymized resource names go, we should also truncate them, like we seem to have agreed upon in the yet-to-be-merged cloud RFD 88: https://github.com/gravitational/cloud/pull/6352/files#diff-6859f430e4c996afca151a4a59ffb7ad78613b952cc128a0bea1d4823fb3fa24R185-R222
OK, fixed64 is 8 bytes instead of the 32 bytes of HMAC-SHA-256, so that is around 16k resources per kind per report. We probably still need to split, given that RFD 88 suggests that 100k resources of all kinds should be expected…
The question is how and when to re-assemble split reports. Is it possible at the PostHog level? Should split reports share some identifier (a report id and sequence number, maybe)? Or is there already some mechanism for doing that?
I can see that there is a note about merging in RFD 88:
Then a Clickhouse job will query for these, merge them and send appropriate statistics to Salesforce and Stripe.
But can't see the full picture yet 😵
We have to handle multiple reports with potentially overlapping and potentially disjoint sets of resources for the same time period anyway, since different agents will heartbeat through different auth servers (hopefully without losing connectivity to the one auth they're connected to, which means that only that one auth server will see that the resource exists). Handling multiple disjoint reports from the same auth server is no different than that.
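The deduplication described here, merging potentially overlapping reports from different auth servers, amounts to a set union over anonymized resource ids. A minimal sketch of the receiving side (hypothetical helper names; not Teleport's or the cloud pipeline's actual code):

```go
package main

import "fmt"

// countDistinct merges several reports' anonymized resource ids by set
// union, so a resource heartbeating through two auth servers counts once.
func countDistinct(reports [][]uint64) int {
	seen := make(map[uint64]struct{})
	for _, r := range reports {
		for _, id := range r {
			seen[id] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	// Two auth servers saw overlapping subsets of the same four nodes.
	a := []uint64{1, 2, 3}
	b := []uint64{2, 3, 4}
	fmt.Println(countDistinct([][]uint64{a, b})) // 4 distinct resources
}
```

Because the merge is a union, disjoint split reports from one auth server and overlapping reports from different auth servers are handled by the same logic.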
Got it. Implemented naive report splitting; you can take a look at the last commit for now.
I still need to figure out a test that checks it actually splits and doesn't lose any resource names along the way; will do that tomorrow.
```proto
}

// the kind of a "resource" (e.g. a node, a pod, a service, etc.)
enum ResourceKind {
```
I would add a comment here advising devs to keep this in sync with prehog/v1alpha/teleport.proto
We should add a comment in v1alpha/teleport.proto as well, or we will totally end up desyncing them.
Added (and also added to https://github.com/gravitational/cloud/pull/6823)
espadolini left a comment
Each auth server will get an independent, potentially overlapping set of resources, so just sending a count for each resource type won't really cut it, we need to send over identifiers for each resource so we can deduplicate them from all the reports. (cc @bl-nero)
I think we also want two different time granularities for user activity and resource presence, so we need to duplicate the time window handling too.
Changed to sending lists of anonymized resource names instead of counts
Done.
espadolini left a comment
LGTM, needs the potential mid-report split in prepareResourceActivityReport (prepareResourcePresenceReport ideally).
@zmb3 how do we want to deal with non-heartbeating resources, like statically configured nodes and desktops?
```go
record.ResourceKind = kind
record.ResourceName = make([][]byte, 0, len(set))
for name := range set {
	record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
```
Check in with @bl-nero on what sort of truncation we want here. If we go for a repeated fixed64 (i.e. a []uint64 on the Go side), this might be as simple as the following.
```diff
-record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
+record.ResourceName = append(record.ResourceName, binary.LittleEndian.Uint64(r.anonymizer.AnonymizeNonEmpty(name)))
```
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
```go
freeSize := maxItemSize - proto.Size(report)
elementsToFit := (freeSize - emptyKindReportSize) / singleItemSize
if elementsToFit <= 0 {
	// This kind report is too big to fit even if we remove all elements,
	break // try to insert it into next report
}
```
It kinda works here because a repeated fixed64 in protobuf wire format is just packed together, but honestly I'd make the logic simpler:
- put KindNodes last in the order, as a special case
- group per-kind reports together, either skipping ones that don't fit (if you go for this we don't really need to put nodes last) or stopping when the next one doesn't fit; at this point you should have some reports ready and some kinds that don't fit even if just by themselves in a report
- for each of those kinds (likely only nodes, but maybe out there there's some cluster with 8000 databases) repeatedly split the list of resources in half every time (rather than trying to get an exact calculation right) until the report fits the limit.
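The halving strategy from the last bullet can be sketched like this. It is a hedged illustration, not the PR's actual implementation: `splitUntilFits` and the injected `fits` predicate are assumptions, but unlike the shrinking loop in userActivities, every id survives the split.

```go
package main

import "fmt"

// splitUntilFits recursively halves a kind's resource-id list until each
// piece satisfies the size predicate, keeping every id.
func splitUntilFits(ids []uint64, fits func([]uint64) bool) [][]uint64 {
	if len(ids) <= 1 || fits(ids) {
		return [][]uint64{ids}
	}
	mid := len(ids) / 2
	return append(splitUntilFits(ids[:mid], fits),
		splitUntilFits(ids[mid:], fits)...)
}

func main() {
	ids := make([]uint64, 100)
	fits := func(s []uint64) bool { return len(s) <= 30 } // mock size limit
	chunks := splitUntilFits(ids, fits)
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	fmt.Println(len(chunks), total) // 4 chunks of 25, 100 ids total
}
```

Halving avoids having to get the exact protobuf size arithmetic right: each recursion just re-checks the predicate against the actual serialized size.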
Actually, couldn't we just split resource slices in fixed-size chunks of 5000 or so? 🤔
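The fixed-size-chunk alternative floated here is even simpler. A minimal sketch (the `chunk` helper is hypothetical, and 5000 is just the number from this comment, not a Teleport constant):

```go
package main

import "fmt"

// chunk splits ids into fixed-size pieces; the last piece holds the
// remainder. Works because a repeated fixed64 has constant per-item cost.
func chunk(ids []uint64, size int) [][]uint64 {
	var out [][]uint64
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	return append(out, ids)
}

func main() {
	chunks := chunk(make([]uint64, 12345), 5000)
	fmt.Println(len(chunks), len(chunks[len(chunks)-1])) // 3 2345
}
```

Fixed chunks rely on the per-item wire size being constant; the halving approach is more robust if record overhead varies.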
Implemented your suggestions as follows:
- group per-kind reports together, skipping ones that don't fit
- for each of those kinds that don't fit repeatedly split the list of resources in half every time
Rewrote the test to match the expected data layout: the first kind has 100k records, the second kind 10k, the third 1k, and so on.
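That test layout (each kind ten times smaller than the previous one) could be generated with a helper like this; `buildCounts` is an illustrative name, not the actual test code:

```go
package main

import "fmt"

// buildCounts produces the layout described above: 100k records for the
// first kind, then ten times fewer for each following kind.
func buildCounts(kinds int) []int {
	counts := make([]int, kinds)
	n := 100_000
	for i := range counts {
		counts[i] = n
		n /= 10
	}
	return counts
}

func main() {
	fmt.Println(buildCounts(4)) // [100000 10000 1000 100]
}
```

A descending layout like this exercises both cases at once: kinds that need splitting and kinds that fit as-is.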
```go
require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
```
nit
```diff
-require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
-require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
+require.GreaterOrEqual(t, len(reports), len(resKindReports)) // some reports were split into two
+require.LessOrEqual(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
```
Can't agree here: I'm specifically checking that reports have been split into parts, so no OrEqual here.
Labeling
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
espadolini left a comment
I like the approach with the head and tail when splitting, we should've done that for the user activity reports too.
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
This LGTM now. I will open a buddy PR tomorrow.
* Report resource usage counts by handling heartbeat events

  Buddy PR for #34954. Closes #34954.

  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
  Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Copy latest protos from cloud and regenerate

---------

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
First draft of teleport-side collection of active Teleport Protected Resources for usage-based billing.
The approach itself may be totally wrong; creating this pull request to get feedback as soon as possible.