Skip to content

Send aggregated resource presence usage data in Teleport Enterprise#34954

Closed
Envek wants to merge 20 commits intogravitational:masterfrom
Envek:resource-count-reporting
Closed

Send aggregated resource presence usage data in Teleport Enterprise#34954
Envek wants to merge 20 commits intogravitational:masterfrom
Envek:resource-count-reporting

Conversation

@Envek
Copy link
Copy Markdown
Contributor

@Envek Envek commented Nov 24, 2023

First draft of teleport-side collection of active Teleport Protected Resources for usage-based billing.

Approach can be totally wrong by itself, creating pull request to get some feedback as soon as possible.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Nov 24, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Envek
❌ zmb3
You have signed the CLA already but the status is still pending? Let us recheck it.

@Envek Envek force-pushed the resource-count-reporting branch from f1d69cc to 972d9ea Compare November 28, 2023 14:47
@zmb3 zmb3 requested a review from espadolini November 28, 2023 15:47
Copy link
Copy Markdown
Collaborator

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure @espadolini gives a review too.

Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
return nil, trace.LimitExceeded("failed to marshal resource counts report within size limit (this is a bug)")
}

report.Records = report.Records[:len(report.Records)/2]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand exactly what this code is doing. Can you clarify?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is copied from userActivities reporting intact where it throws some “excess” data until report will become small enough to be stored.

It doesn't have any meaning when storing resource counts per kind as report size is constant in this case.

But if we change it to storing anonymized resource names as @espadolini suggested, then this size restriction become valid again:

// maxItemSize is the maximum size for a backend value; dynamodb has a limit of
// 400KiB per item, we store values in base64, there's some framing, and a
// healthy 50% margin gets us to about 128KiB.
maxItemSize = 128 * 1024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, for resource activity reports, we can split each report by resource kind, then one report for one resource kind can contain up to around 4k of anonymized resource names. Will it be enough?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not enough we're gonna have to split each report as well, unfortunately.

As far as the anonymized resource names go, we should also truncate them like we seem to have agreed upon in the yet-to-be-merged cloud rfd 88 https://github.com/gravitational/cloud/pull/6352/files#diff-6859f430e4c996afca151a4a59ffb7ad78613b952cc128a0bea1d4823fb3fa24R185-R222

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fixed64 is 8 bytes instead of 32 bytes of HMAC-SHA-256, so it is around 16k of resources per kind per report. Probably still need to split, given that RFD 88 suggests that 100k resources of all kinds should be expected…

Copy link
Copy Markdown
Contributor Author

@Envek Envek Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is how and when to re-assemble split reports? Is it possible at PostHog level? Should split reports share some identifier (report id and sequence number maybe?) Or maybe there is already some mechanism for doing that?

I can see that there is a note about merging in RFD 88:

Then a Clickhouse job will query for these, merge them and send appropriate statistics to Salesforce and Stripe.

But can't see the full picture yet 😵

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to handle multiple reports with potentially overlapping and potentially disjoint sets of resources for the same time period anyway, since different agents will heartbeat through different auth servers (hopefully without losing connectivity to the one auth they're connected to, which means that only that one auth server will see that the resource exists). Handling multiple disjoint reports from the same auth server is no different than that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Implemented naive report splitting. You can take a look at the last commit at the moment.

Need to figure out test that can check that it actually splits and doesn't lose any resource names along the way, will do it for tomorrow.

Comment thread lib/usagereporter/teleport/aggregating/service.go
}

// the kind of a "resource" (e.g. a node, a pod, a service, etc.)
enum ResourceKind {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here advising devs to keep this in sync with prehog/v1alpha/teleport.proto

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment in v1alpha/teleport.proto as well, or we will totally end up desyncing them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread proto/prehog/v1/teleport.proto
Copy link
Copy Markdown
Contributor

@espadolini espadolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each auth server will get an independent, potentially overlapping set of resources, so just sending a count for each resource type won't really cut it, we need to send over identifiers for each resource so we can deduplicate them from all the reports. (cc @bl-nero)

I think we also want two different time granularities for user activity and resource presence, so we need to duplicate the time window handling too.

Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
@Envek
Copy link
Copy Markdown
Contributor Author

Envek commented Nov 29, 2023

we need to send over identifiers for each resource so we can deduplicate them from all the reports

Changed to sending lists of anonymized resource names instead of counts

I think we also want two different time granularities for user activity and resource presence, so we need to duplicate the time window handling too.

Done.

@Envek Envek requested a review from espadolini November 29, 2023 15:02
Copy link
Copy Markdown
Contributor

@espadolini espadolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, needs the potential mid-report split in prepareResourceActivityReport (prepareResourcePresenceReport ideally).

@zmb3 how do we want to deal with non-heartbeating resources, like statically configured nodes and desktops?

Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
record.ResourceKind = kind
record.ResourceName = make([][]byte, 0, len(set))
for name := range set {
record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check in with @bl-nero on what sort of truncation we want here. If we go for a repeated fixed64 (i.e. a []uint64 on the go side) this might be as simple as the following.

Suggested change
record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
record.ResourceName = append(record.ResourceName, binary.LittleEndian.Uint64(r.anonymizer.AnonymizeNonEmpty(name)))

Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment thread proto/prehog/v1/teleport.proto Outdated
Comment thread proto/prehog/v1/teleport.proto Outdated
@Envek Envek marked this pull request as ready for review December 1, 2023 10:18
@Envek Envek requested review from espadolini and zmb3 December 1, 2023 10:18
@github-actions github-actions Bot requested a review from jimbishopp December 1, 2023 10:18
@espadolini espadolini requested a review from bl-nero December 1, 2023 11:00
Comment thread lib/usagereporter/teleport/aggregating/reporter.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment on lines +125 to +130
freeSize := maxItemSize - proto.Size(report)
elementsToFit := (freeSize - emptyKindReportSize) / singleItemSize
if elementsToFit <= 0 {
// This kind report is too big to fit even if we remove all elements,
break // try to insert it into next report
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kinda works here because a repeated fixed64 in protobuf wire format is just packed together, but honestly I'd make the logic simpler:

  • put KindNodes last in the order, as a special case
  • group per-kind reports together, either skipping ones that don't fit (if you go for this we don't really need to put nodes last) or stopping when the next one doesn't fit; at this point you should have some reports ready and some kinds that don't fit even if just by themselves in a report
  • for each of those kinds (likely only nodes, but maybe out there there's some cluster with 8000 databases) repeatedly split the list of resources in half every time (rather than trying to get an exact calculation right) until the report fits the limit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, couldn't we just split resource slices in fixed-size chunks of 5000 or so? 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented your suggestions as follows:

  • group per-kind reports together, skipping ones that don't fit
  • for each of those kinds that don't fit repeatedly split the list of resources in half every time

Rewrote test to match expected data layout: first kind is 100k records, second kind 10k, third 1k, so on.

Comment on lines +167 to +168
require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
require.GreaterOrEqual(t, len(reports), len(resKindReports)) // some reports were split into two
require.LessOrEqual(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't agree here: I'm specifically checking that reports has been split in parts, so no OrEqual here.

@espadolini
Copy link
Copy Markdown
Contributor

espadolini commented Dec 1, 2023

Labeling do-not-merge because https://github.com/gravitational/cloud/pull/6823 should be merged first (to lock in the .proto changes), remove it once that's in.

@Envek Envek requested a review from espadolini December 5, 2023 10:20
@Envek Envek requested a review from espadolini December 6, 2023 15:35
Copy link
Copy Markdown
Contributor

@espadolini espadolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach with the head and tail when splitting, we should've done that for the user activity reports too.

Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Comment thread lib/usagereporter/teleport/aggregating/service.go Outdated
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
@espadolini espadolini changed the title Report resource usage counts by handling heartbeat events Send aggregated resource presence usage data in Teleport Enterprise Dec 7, 2023
Envek added a commit to Envek/teleport that referenced this pull request Dec 8, 2023
@zmb3
Copy link
Copy Markdown
Collaborator

zmb3 commented Dec 20, 2023

This LGTM now. I will open a buddy PR tomorrow.

zmb3 pushed a commit that referenced this pull request Jan 2, 2024
Buddy PR for #34954
Closes #34954

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
@zmb3 zmb3 closed this in #35968 Jan 3, 2024
github-merge-queue Bot pushed a commit that referenced this pull request Jan 3, 2024
* Report resource usage counts by handling heartbeat events

Buddy PR for #34954
Closes #34954

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Copy latest protos from cloud and regenerate

---------

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
zmb3 added a commit that referenced this pull request Jan 4, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
@Envek Envek deleted the resource-count-reporting branch January 4, 2024 14:01
zmb3 added a commit that referenced this pull request Jan 4, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
zmb3 added a commit that referenced this pull request Jan 4, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
github-merge-queue Bot pushed a commit that referenced this pull request Jan 4, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
github-merge-queue Bot pushed a commit that referenced this pull request Jan 5, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
github-merge-queue Bot pushed a commit that referenced this pull request Jan 5, 2024
Buddy PR for #34954
Closes #34954

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants