Send aggregated resource presence usage data in Teleport Enterprise #34954
Envek wants to merge 20 commits into gravitational:master
Conversation
Force-pushed from f1d69cc to 972d9ea
zmb3 left a comment
Let's make sure @espadolini gives a review too.
```go
	return nil, trace.LimitExceeded("failed to marshal resource counts report within size limit (this is a bug)")
}
// ...
report.Records = report.Records[:len(report.Records)/2]
```
Not sure I understand exactly what this code is doing. Can you clarify?
It is copied intact from the userActivities reporting, where it throws away some "excess" data until the report becomes small enough to be stored.
It doesn't have any meaning when storing resource counts per kind, since the report size is constant in that case.
But if we change it to storing anonymized resource names, as @espadolini suggested, then this size restriction becomes valid again:
teleport/lib/usagereporter/teleport/aggregating/service.go
Lines 31 to 34 in 96ed71b
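The shrinking loop being discussed can be sketched as follows. This is a hedged illustration, not Teleport's actual code: `shrinkToFit`, the injected size function, and the 64-byte limit are all assumptions standing in for the real serialization and backend item-size limit.

```go
package main

import "fmt"

// maxSize stands in for the backend item-size limit; the real constant
// lives in the aggregating service (assumption for this sketch).
const maxSize = 64

// shrinkToFit repeatedly drops the tail half of the records until the
// (mock) serialized size fits, mirroring the userActivities loop quoted above.
func shrinkToFit(records []string, size func([]string) int) []string {
	for len(records) > 0 && size(records) > maxSize {
		records = records[:len(records)/2]
	}
	return records
}

func main() {
	// Mock size model: 10 bytes per record.
	size := func(rs []string) int { return 10 * len(rs) }
	records := make([]string, 100)
	fit := shrinkToFit(records, size)
	fmt.Println(len(fit)) // 100 -> 50 -> 25 -> 12 -> 6
}
```

Note that, unlike splitting, this loop discards data: the dropped half of the records is never reported.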
Probably, for resource activity reports, we can split each report by resource kind; then one report for one resource kind can contain up to around 4k anonymized resource names. Will that be enough?
If it's not enough we're gonna have to split each report as well, unfortunately.
As far as the anonymized resource names go, we should also truncate them, like we seem to have agreed upon in the yet-to-be-merged cloud RFD 88: https://github.com/gravitational/cloud/pull/6352/files#diff-6859f430e4c996afca151a4a59ffb7ad78613b952cc128a0bea1d4823fb3fa24R185-R222
OK, fixed64 is 8 bytes instead of the 32 bytes of HMAC-SHA-256, so that is around 16k resources per kind per report. We probably still need to split, given that RFD 88 suggests that 100k resources of all kinds should be expected…
The question is how and when to re-assemble split reports. Is it possible at the PostHog level? Should split reports share some identifier (a report id and sequence number, maybe)? Or is there already some mechanism for doing that?
I can see that there is a note about merging in RFD 88:
Then a Clickhouse job will query for these, merge them and send appropriate statistics to Salesforce and Stripe.
But can't see the full picture yet 😵
We have to handle multiple reports with potentially overlapping and potentially disjoint sets of resources for the same time period anyway, since different agents will heartbeat through different auth servers (hopefully without losing connectivity to the one auth they're connected to, which means that only that one auth server will see that the resource exists). Handling multiple disjoint reports from the same auth server is no different than that.
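The deduplication described here, merging potentially overlapping reports from different auth servers, amounts to a set union over anonymized resource ids. A minimal sketch of the receiving side (hypothetical helper names; not Teleport's or the cloud pipeline's actual code):

```go
package main

import "fmt"

// countDistinct merges several reports' anonymized resource ids by set
// union, so a resource heartbeating through two auth servers counts once.
func countDistinct(reports [][]uint64) int {
	seen := make(map[uint64]struct{})
	for _, r := range reports {
		for _, id := range r {
			seen[id] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	// Two auth servers saw overlapping subsets of the same four nodes.
	a := []uint64{1, 2, 3}
	b := []uint64{2, 3, 4}
	fmt.Println(countDistinct([][]uint64{a, b})) // 4 distinct resources
}
```

Because the merge is a union, disjoint split reports from one auth server and overlapping reports from different auth servers are handled by the same logic.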
Got it. Implemented naive report splitting; you can take a look at the last commit for now.
I still need to figure out a test that checks it actually splits and doesn't lose any resource names along the way; will do that tomorrow.
```proto
}

// the kind of a "resource" (e.g. a node, a pod, a service, etc.)
enum ResourceKind {
```
I would add a comment here advising devs to keep this in sync with prehog/v1alpha/teleport.proto
We should add a comment in v1alpha/teleport.proto as well, or we will totally end up desyncing them.
Added (and also added to https://github.com/gravitational/cloud/pull/6823)
espadolini left a comment
Each auth server will get an independent, potentially overlapping set of resources, so just sending a count for each resource type won't really cut it, we need to send over identifiers for each resource so we can deduplicate them from all the reports. (cc @bl-nero)
I think we also want two different time granularities for user activity and resource presence, so we need to duplicate the time window handling too.
Changed to sending lists of anonymized resource names instead of counts
Done.
espadolini left a comment
LGTM, needs the potential mid-report split in prepareResourceActivityReport (prepareResourcePresenceReport ideally).
@zmb3 how do we want to deal with non-heartbeating resources, like statically configured nodes and desktops?
```go
record.ResourceKind = kind
record.ResourceName = make([][]byte, 0, len(set))
for name := range set {
	record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
```
Check in with @bl-nero on what sort of truncation we want here. If we go for a repeated fixed64 (i.e. a []uint64 on the Go side), this might be as simple as the following.
```diff
-record.ResourceName = append(record.ResourceName, r.anonymizer.AnonymizeNonEmpty(name))
+record.ResourceName = append(record.ResourceName, binary.LittleEndian.Uint64(r.anonymizer.AnonymizeNonEmpty(name)))
```
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
```go
freeSize := maxItemSize - proto.Size(report)
elementsToFit := (freeSize - emptyKindReportSize) / singleItemSize
if elementsToFit <= 0 {
	// This kind report is too big to fit even if we remove all elements,
	break // try to insert it into next report
}
```
It kinda works here because a repeated fixed64 in protobuf wire format is just packed together, but honestly I'd make the logic simpler:
- put KindNodes last in the order, as a special case
- group per-kind reports together, either skipping ones that don't fit (if you go for this we don't really need to put nodes last) or stopping when the next one doesn't fit; at this point you should have some reports ready and some kinds that don't fit even if just by themselves in a report
- for each of those kinds (likely only nodes, but maybe out there there's some cluster with 8000 databases) repeatedly split the list of resources in half every time (rather than trying to get an exact calculation right) until the report fits the limit.
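The halving strategy from the last bullet can be sketched like this. It is a hedged illustration, not the PR's actual implementation: `splitUntilFits` and the injected `fits` predicate are assumptions, but unlike the shrinking loop in userActivities, every id survives the split.

```go
package main

import "fmt"

// splitUntilFits recursively halves a kind's resource-id list until each
// piece satisfies the size predicate, keeping every id.
func splitUntilFits(ids []uint64, fits func([]uint64) bool) [][]uint64 {
	if len(ids) <= 1 || fits(ids) {
		return [][]uint64{ids}
	}
	mid := len(ids) / 2
	return append(splitUntilFits(ids[:mid], fits),
		splitUntilFits(ids[mid:], fits)...)
}

func main() {
	ids := make([]uint64, 100)
	fits := func(s []uint64) bool { return len(s) <= 30 } // mock size limit
	chunks := splitUntilFits(ids, fits)
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	fmt.Println(len(chunks), total) // 4 chunks of 25, 100 ids total
}
```

Halving avoids having to get the exact protobuf size arithmetic right: each recursion just re-checks the predicate against the actual serialized size.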
Actually, couldn't we just split resource slices in fixed-size chunks of 5000 or so? 🤔
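The fixed-size-chunk alternative floated here is even simpler. A minimal sketch (the `chunk` helper is hypothetical, and 5000 is just the number from this comment, not a Teleport constant):

```go
package main

import "fmt"

// chunk splits ids into fixed-size pieces; the last piece holds the
// remainder. Works because a repeated fixed64 has constant per-item cost.
func chunk(ids []uint64, size int) [][]uint64 {
	var out [][]uint64
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	return append(out, ids)
}

func main() {
	chunks := chunk(make([]uint64, 12345), 5000)
	fmt.Println(len(chunks), len(chunks[len(chunks)-1])) // 3 2345
}
```

Fixed chunks rely on the per-item wire size being constant; the halving approach is more robust if record overhead varies.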
Implemented your suggestions as follows:
- group per-kind reports together, skipping ones that don't fit
- for each of those kinds that don't fit repeatedly split the list of resources in half every time
Rewrote the test to match the expected data layout: the first kind has 100k records, the second kind 10k, the third 1k, and so on.
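That test layout (each kind ten times smaller than the previous one) could be generated with a helper like this; `buildCounts` is an illustrative name, not the actual test code:

```go
package main

import "fmt"

// buildCounts produces the layout described above: 100k records for the
// first kind, then ten times fewer for each following kind.
func buildCounts(kinds int) []int {
	counts := make([]int, kinds)
	n := 100_000
	for i := range counts {
		counts[i] = n
		n /= 10
	}
	return counts
}

func main() {
	fmt.Println(buildCounts(4)) // [100000 10000 1000 100]
}
```

A descending layout like this exercises both cases at once: kinds that need splitting and kinds that fit as-is.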
```go
require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
```
nit
```diff
-require.Greater(t, len(reports), len(resKindReports)) // some reports were split into two
-require.Less(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
+require.GreaterOrEqual(t, len(reports), len(resKindReports)) // some reports were split into two
+require.LessOrEqual(t, len(reports[0].ResourceKindReports[0].ResourceIds), maxResourceIdsPerReport) // reports have less resource id than passed
```
Can't agree here: I'm specifically checking that reports have been split into parts, so no OrEqual here.
Labeling
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
espadolini left a comment
I like the approach with the head and tail when splitting, we should've done that for the user activity reports too.
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@gmail.com>
This LGTM now. I will open a buddy PR tomorrow.
* Report resource usage counts by handling heartbeat events

  Buddy PR for #34954. Closes #34954.

  Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
  Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Copy latest protos from cloud and regenerate

---------

Signed-off-by: Zac Bergquist <zac.bergquist@goteleport.com>
Co-authored-by: Andrey Novikov <envek@envek.name>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
First draft of teleport-side collection of active Teleport Protected Resources for usage-based billing.
The approach itself may be totally wrong; creating this pull request to get feedback as soon as possible.