Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: sanitize structured metadata during ingestion in the distributor #15141

Merged
merged 7 commits into from
Nov 29, 2024

Conversation

cstyan
Copy link
Contributor

@cstyan cstyan commented Nov 27, 2024

Still working here, want to see if I can get the benchmark results to be any better.

The benchmark itself needs some love as well, if you use it with -count you can get a segmentation fault/nil pointer dereference panic in the mock ingester code.

Just the addition of the two checks slows down distributor.Push significantly, and also it looks to me like when the log line doesn't contain a log field for otlp somehow we might be adding the unknown value multiple times? [{detected_level unknown} {detected_level unknown} {detected_level unknown} {detected_level unknown}]

Very basic benchmark results so far:

  • main is off main f65ab130725dc25c9d546fa4d5fb1e4a6d26009e
  • main with sm is with the modification to makeWriteRequestWithLabels to add structured metadata
  • pr-only-value is with the addition of the check in distributor.Push to check and sanitize the label value
  • pr-both is with the addition of both the SM name and value check + sanitization

Note that this is worst case scenario, since every entry in the write request has structured metadata that needs to be checked

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
                           │    main     │             main-with-sm             │             pr-only-value             │                 pr-both                 │
                           │   sec/op    │    sec/op     vs base                │    sec/op     vs base                 │    sec/op      vs base                  │
_Push-32                     53.22m ± 1%   53.74m ± ∞ ¹       ~ (p=0.250 n=7+1)   62.81m ± ∞ ¹        ~ (p=0.250 n=7+1)   107.62m ± ∞ ¹         ~ (p=0.250 n=7+1)
_PushWithLineTruncation-32                 57.32m ± ∞ ¹                           64.91m ± ∞ ¹                            110.90m ± ∞ ¹
geomean                      53.22m        55.50m        +0.98%                   63.85m        +18.03%                    109.2m        +102.23%
¹ need >= 6 samples for confidence interval at level 0.95

                           │     main     │             main-with-sm              │             pr-only-value             │                pr-both                 │
                           │     B/op     │     B/op       vs base                │     B/op       vs base                │     B/op       vs base                 │
_Push-32                     6.653Mi ± 3%   7.217Mi ± ∞ ¹       ~ (p=0.250 n=7+1)   7.219Mi ± ∞ ¹       ~ (p=0.250 n=7+1)   8.096Mi ± ∞ ¹        ~ (p=0.250 n=7+1)
_PushWithLineTruncation-32                  7.232Mi ± ∞ ¹                           7.501Mi ± ∞ ¹                           8.116Mi ± ∞ ¹
geomean                      6.653Mi        7.225Mi        +8.47%                   7.359Mi        +8.51%                   8.106Mi        +21.69%
¹ need >= 6 samples for confidence interval at level 0.95

                           │    main     │             main-with-sm             │            pr-only-value             │               pr-both                │
                           │  allocs/op  │  allocs/op    vs base                │  allocs/op    vs base                │  allocs/op    vs base                │
_Push-32                     20.09k ± 5%   18.09k ± ∞ ¹       ~ (p=0.250 n=7+1)   18.13k ± ∞ ¹       ~ (p=0.250 n=7+1)   20.44k ± ∞ ¹       ~ (p=0.500 n=7+1)
_PushWithLineTruncation-32                 18.31k ± ∞ ¹                           19.00k ± ∞ ¹                           20.74k ± ∞ ¹
geomean                      20.09k        18.20k        -9.94%                   18.56k        -9.75%                   20.59k        +1.74%
¹ need >= 6 samples for confidence interval at level 0.95

@cstyan
Copy link
Contributor Author

cstyan commented Nov 28, 2024

Some better benchmark results:

  • no-checks-no-sm is just this PR with the SM checks commented out, and not adding SM to the entries
  • no-checks-no-sm is just this PR with the SM checks commented out, and but with SM in the entries
  • pr-no-sm has no SM in the entries and does the checks for invalid SM name/value
  • pr-with-sm-no-invalid has SM in all the entries and does the checks for invalid SM name/value, but all the SM has valid data
  • pr-invalid-name has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid values but invalid names
  • pr-invalid-value has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid names but invalid values
  • pr-invalid-value has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid names but invalid values
goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-no-sm │           no-checks-with-sm           │            pr-no-sm            │       pr-with-sm-no-invalid        │           pr-invalid-name            │           pr-invalid-value            │            pr-invalid-both            │
         │     sec/op      │    sec/op     vs base                 │   sec/op     vs base           │   sec/op     vs base               │   sec/op     vs base                 │    sec/op     vs base                 │    sec/op     vs base                 │
_Push-32      67.68m ± 28%   77.94m ± 31%  +15.16% (p=0.022 n=7+6)   66.71m ± 7%  ~ (p=0.181 n=7+6)   83.64m ± 7%  +23.59% (p=0.017 n=7)   88.67m ± 3%  +31.02% (p=0.001 n=7+8)   93.22m ± ∞ ¹  +37.75% (p=0.003 n=7+5)   94.27m ± ∞ ¹  +39.29% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

         │ no-checks-no-sm │          no-checks-with-sm           │            pr-no-sm             │       pr-with-sm-no-invalid        │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │      B/op       │     B/op      vs base                │     B/op      vs base           │     B/op      vs base              │     B/op      vs base                 │     B/op       vs base                 │     B/op       vs base                 │
_Push-32      66.36Mi ± 0%   72.47Mi ± 0%  +9.20% (p=0.001 n=7+6)   66.35Mi ± 0%  ~ (p=0.731 n=7+6)   72.47Mi ± 0%  +9.21% (p=0.001 n=7)   74.02Mi ± 0%  +11.54% (p=0.000 n=7+8)   74.78Mi ± ∞ ¹  +12.68% (p=0.003 n=7+5)   76.30Mi ± ∞ ¹  +14.98% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

         │ no-checks-no-sm │          no-checks-with-sm           │            pr-no-sm            │       pr-with-sm-no-invalid        │           pr-invalid-name            │           pr-invalid-value            │            pr-invalid-both            │
         │    allocs/op    │  allocs/op   vs base                 │  allocs/op   vs base           │  allocs/op   vs base               │  allocs/op   vs base                 │  allocs/op    vs base                 │  allocs/op    vs base                 │
_Push-32       401.0k ± 0%   501.0k ± 0%  +24.94% (p=0.001 n=7+6)   400.8k ± 0%  ~ (p=0.731 n=7+6)   501.1k ± 0%  +24.98% (p=0.001 n=7)   601.4k ± 0%  +49.98% (p=0.000 n=7+8)   601.3k ± ∞ ¹  +49.95% (p=0.003 n=7+5)   701.4k ± ∞ ¹  +74.92% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

To break things down a bit more; if we compare the current code if the request had SM vs now with the checks but the request has SM but nothing is invalid.

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-with-sm │         pr-with-sm-no-invalid          │
         │      sec/op       │   sec/op     vs base                   │
_Push-32         77.94m ± 9%   85.90m ± 2%  +10.21% (p=0.015 n=12+13)

         │ no-checks-with-sm │       pr-with-sm-no-invalid       │
         │       B/op        │     B/op      vs base             │
_Push-32        72.47Mi ± 0%   72.49Mi ± 0%  ~ (p=0.840 n=12+13)

         │ no-checks-with-sm │      pr-with-sm-no-invalid       │
         │     allocs/op     │  allocs/op   vs base             │
_Push-32         501.0k ± 0%   501.2k ± 0%  ~ (p=0.677 n=12+13)

Next, no checks with SM vs the various cases of checks + invalid SM

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │      sec/op       │   sec/op     vs base                  │    sec/op     vs base                  │    sec/op     vs base                  │
_Push-32         77.94m ± 9%   88.67m ± 3%  +13.77% (p=0.007 n=12+8)   93.22m ± ∞ ¹  +19.61% (p=0.034 n=12+5)   94.27m ± ∞ ¹  +20.95% (p=0.019 n=12+5)

         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │       B/op        │     B/op      vs base                 │     B/op       vs base                 │     B/op       vs base                 │
_Push-32        72.47Mi ± 0%   74.02Mi ± 0%  +2.15% (p=0.000 n=12+8)   74.78Mi ± ∞ ¹  +3.19% (p=0.000 n=12+5)   76.30Mi ± ∞ ¹  +5.29% (p=0.000 n=12+5)

         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │     allocs/op     │  allocs/op   vs base                  │  allocs/op    vs base                  │  allocs/op    vs base                  │
_Push-32         501.0k ± 0%   601.4k ± 0%  +20.04% (p=0.000 n=12+8)   601.3k ± ∞ ¹  +20.02% (p=0.000 n=12+5)   701.4k ± ∞ ¹  +40.00% (p=0.000 n=12+5)

@cstyan cstyan closed this Nov 28, 2024
@cstyan cstyan reopened this Nov 28, 2024
we end up adding the detected log level on every iteration to the one
original write request and end up with N-1 detected log level labels in
the SM

Signed-off-by: Callum Styan <[email protected]>
@cstyan cstyan changed the title WIP sanitize structured metadata during ingestion in the distributor feat: sanitize structured metadata during ingestion in the distributor Nov 28, 2024
@cstyan cstyan marked this pull request as ready for review November 28, 2024 22:17
@cstyan cstyan requested a review from a team as a code owner November 28, 2024 22:17
Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Did you find why we have multiple detected_level ?

Copy link
Contributor

@ashwanthgoli ashwanthgoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cstyan
Copy link
Contributor Author

cstyan commented Nov 29, 2024

LGTM

Did you find why we have multiple detected_level ?

I did, it was just related to reuse of the same logproto.PushRequest for each iteration of the benchmark in b.N, since the distributor Push function actually modifies the structured metadata in the struct to add the detected_level.

I modified the benchmark to create a new write request for each iteration of b.N. Unfortunately it seems like the b.ResetTimer or start/stop timer calls break the benchmark in some way that I wasn't able to track down, related to the distributor/mock ingester health check in some way since that's what the panic stack trace shows.

@cstyan cstyan merged commit be4f17e into main Nov 29, 2024
58 checks passed
@cstyan cstyan deleted the callum-distributor-sm-sanitize branch November 29, 2024 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants