feat: sanitize structured metadata during ingestion in the distributor #15141

cstyan · 2024-11-27T02:13:59Z

Still working here, want to see if I can get the benchmark results to be any better.

The benchmark itself needs some love as well, if you use it with -count you can get a segmentation fault/nil pointer dereference panic in the mock ingester code.

Just the addition of the two checks slows down distributor.Push significantly, and also it looks to me like when the log line doesn't contain a log field for otlp somehow we might be adding the unknown value multiple times? [{detected_level unknown} {detected_level unknown} {detected_level unknown} {detected_level unknown}]

Very basic benchmark results so far:

main is off main f65ab130725dc25c9d546fa4d5fb1e4a6d26009e
main with sm is with the modification to makeWriteRequestWithLabels to add structured metadata
pr-only-value is with the addition of the check in distributor.Push to check and sanitize the label value
pr-both is with the addition of both the SM name and value check + sanitization

Note that this is worst case scenario, since every entry in the write request has structured metadata that needs to be checked

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
                           │    main     │             main-with-sm             │             pr-only-value             │                 pr-both                 │
                           │   sec/op    │    sec/op     vs base                │    sec/op     vs base                 │    sec/op      vs base                  │
_Push-32                     53.22m ± 1%   53.74m ± ∞ ¹       ~ (p=0.250 n=7+1)   62.81m ± ∞ ¹        ~ (p=0.250 n=7+1)   107.62m ± ∞ ¹         ~ (p=0.250 n=7+1)
_PushWithLineTruncation-32                 57.32m ± ∞ ¹                           64.91m ± ∞ ¹                            110.90m ± ∞ ¹
geomean                      53.22m        55.50m        +0.98%                   63.85m        +18.03%                    109.2m        +102.23%
¹ need >= 6 samples for confidence interval at level 0.95

                           │     main     │             main-with-sm              │             pr-only-value             │                pr-both                 │
                           │     B/op     │     B/op       vs base                │     B/op       vs base                │     B/op       vs base                 │
_Push-32                     6.653Mi ± 3%   7.217Mi ± ∞ ¹       ~ (p=0.250 n=7+1)   7.219Mi ± ∞ ¹       ~ (p=0.250 n=7+1)   8.096Mi ± ∞ ¹        ~ (p=0.250 n=7+1)
_PushWithLineTruncation-32                  7.232Mi ± ∞ ¹                           7.501Mi ± ∞ ¹                           8.116Mi ± ∞ ¹
geomean                      6.653Mi        7.225Mi        +8.47%                   7.359Mi        +8.51%                   8.106Mi        +21.69%
¹ need >= 6 samples for confidence interval at level 0.95

                           │    main     │             main-with-sm             │            pr-only-value             │               pr-both                │
                           │  allocs/op  │  allocs/op    vs base                │  allocs/op    vs base                │  allocs/op    vs base                │
_Push-32                     20.09k ± 5%   18.09k ± ∞ ¹       ~ (p=0.250 n=7+1)   18.13k ± ∞ ¹       ~ (p=0.250 n=7+1)   20.44k ± ∞ ¹       ~ (p=0.500 n=7+1)
_PushWithLineTruncation-32                 18.31k ± ∞ ¹                           19.00k ± ∞ ¹                           20.74k ± ∞ ¹
geomean                      20.09k        18.20k        -9.94%                   18.56k        -9.75%                   20.59k        +1.74%
¹ need >= 6 samples for confidence interval at level 0.95

Signed-off-by: Callum Styan <[email protected]>

pkg/distributor/distributor.go

cstyan · 2024-11-28T20:08:21Z

Some better benchmark results:

no-checks-no-sm is just this PR with the SM checks commented out, and not adding SM to the entries
no-checks-no-sm is just this PR with the SM checks commented out, and but with SM in the entries
pr-no-sm has no SM in the entries and does the checks for invalid SM name/value
pr-with-sm-no-invalid has SM in all the entries and does the checks for invalid SM name/value, but all the SM has valid data
pr-invalid-name has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid values but invalid names
pr-invalid-value has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid names but invalid values
pr-invalid-value has SM in all the entries and does the checks for invalid SM name/value, all the SM has valid names but invalid values

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-no-sm │           no-checks-with-sm           │            pr-no-sm            │       pr-with-sm-no-invalid        │           pr-invalid-name            │           pr-invalid-value            │            pr-invalid-both            │
         │     sec/op      │    sec/op     vs base                 │   sec/op     vs base           │   sec/op     vs base               │   sec/op     vs base                 │    sec/op     vs base                 │    sec/op     vs base                 │
_Push-32      67.68m ± 28%   77.94m ± 31%  +15.16% (p=0.022 n=7+6)   66.71m ± 7%  ~ (p=0.181 n=7+6)   83.64m ± 7%  +23.59% (p=0.017 n=7)   88.67m ± 3%  +31.02% (p=0.001 n=7+8)   93.22m ± ∞ ¹  +37.75% (p=0.003 n=7+5)   94.27m ± ∞ ¹  +39.29% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

         │ no-checks-no-sm │          no-checks-with-sm           │            pr-no-sm             │       pr-with-sm-no-invalid        │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │      B/op       │     B/op      vs base                │     B/op      vs base           │     B/op      vs base              │     B/op      vs base                 │     B/op       vs base                 │     B/op       vs base                 │
_Push-32      66.36Mi ± 0%   72.47Mi ± 0%  +9.20% (p=0.001 n=7+6)   66.35Mi ± 0%  ~ (p=0.731 n=7+6)   72.47Mi ± 0%  +9.21% (p=0.001 n=7)   74.02Mi ± 0%  +11.54% (p=0.000 n=7+8)   74.78Mi ± ∞ ¹  +12.68% (p=0.003 n=7+5)   76.30Mi ± ∞ ¹  +14.98% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

         │ no-checks-no-sm │          no-checks-with-sm           │            pr-no-sm            │       pr-with-sm-no-invalid        │           pr-invalid-name            │           pr-invalid-value            │            pr-invalid-both            │
         │    allocs/op    │  allocs/op   vs base                 │  allocs/op   vs base           │  allocs/op   vs base               │  allocs/op   vs base                 │  allocs/op    vs base                 │  allocs/op    vs base                 │
_Push-32       401.0k ± 0%   501.0k ± 0%  +24.94% (p=0.001 n=7+6)   400.8k ± 0%  ~ (p=0.731 n=7+6)   501.1k ± 0%  +24.98% (p=0.001 n=7)   601.4k ± 0%  +49.98% (p=0.000 n=7+8)   601.3k ± ∞ ¹  +49.95% (p=0.003 n=7+5)   701.4k ± ∞ ¹  +74.92% (p=0.003 n=7+5)
¹ need >= 6 samples for confidence interval at level 0.95

To break things down a bit more; if we compare the current code if the request had SM vs now with the checks but the request has SM but nothing is invalid.

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-with-sm │         pr-with-sm-no-invalid          │
         │      sec/op       │   sec/op     vs base                   │
_Push-32         77.94m ± 9%   85.90m ± 2%  +10.21% (p=0.015 n=12+13)

         │ no-checks-with-sm │       pr-with-sm-no-invalid       │
         │       B/op        │     B/op      vs base             │
_Push-32        72.47Mi ± 0%   72.49Mi ± 0%  ~ (p=0.840 n=12+13)

         │ no-checks-with-sm │      pr-with-sm-no-invalid       │
         │     allocs/op     │  allocs/op   vs base             │
_Push-32         501.0k ± 0%   501.2k ± 0%  ~ (p=0.677 n=12+13)

Next, no checks with SM vs the various cases of checks + invalid SM

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/v3/pkg/distributor
cpu: AMD Ryzen 9 5950X 16-Core Processor
         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │      sec/op       │   sec/op     vs base                  │    sec/op     vs base                  │    sec/op     vs base                  │
_Push-32         77.94m ± 9%   88.67m ± 3%  +13.77% (p=0.007 n=12+8)   93.22m ± ∞ ¹  +19.61% (p=0.034 n=12+5)   94.27m ± ∞ ¹  +20.95% (p=0.019 n=12+5)

         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │       B/op        │     B/op      vs base                 │     B/op       vs base                 │     B/op       vs base                 │
_Push-32        72.47Mi ± 0%   74.02Mi ± 0%  +2.15% (p=0.000 n=12+8)   74.78Mi ± ∞ ¹  +3.19% (p=0.000 n=12+5)   76.30Mi ± ∞ ¹  +5.29% (p=0.000 n=12+5)

         │ no-checks-with-sm │            pr-invalid-name            │            pr-invalid-value            │            pr-invalid-both             │
         │     allocs/op     │  allocs/op   vs base                  │  allocs/op    vs base                  │  allocs/op    vs base                  │
_Push-32         501.0k ± 0%   601.4k ± 0%  +20.04% (p=0.000 n=12+8)   601.3k ± ∞ ¹  +20.02% (p=0.000 n=12+5)   701.4k ± ∞ ¹  +40.00% (p=0.000 n=12+5)

Signed-off-by: Callum Styan <[email protected]>

we end up adding the detected log level on every iteration to the one original write request and end up with N-1 detected log level labels in the SM Signed-off-by: Callum Styan <[email protected]>

Signed-off-by: Callum Styan <[email protected]>

cyriltovena

LGTM

Did you find why we have multiple detected_level ?

ashwanthgoli

LGTM

cstyan · 2024-11-29T19:29:29Z

LGTM

Did you find why we have multiple detected_level ?

I did, it was just related to reuse of the same logproto.PushRequest for each iteration of the benchmark in b.N, since the distributor Push function actually modifies the structured metadata in the struct to add the detected_level.

I modified the benchmark to create a new write request for each iteration of b.N. Unfortunately it seems like the b.ResetTimer or start/stop timer calls break the benchmark in some way that I wasn't able to track down, related to the distributor/mock ingester health check in some way since that's what the panic stack trace shows.

cstyan added 2 commits November 26, 2024 17:32

sanitize structured metadata during ingestion in the distributor

28e31e9

Signed-off-by: Callum Styan <[email protected]>

add a test for ensuring the labels have been sanitized

2bc97f0

Signed-off-by: Callum Styan <[email protected]>

pull-request-size bot added the size/L label Nov 27, 2024

cyriltovena reviewed Nov 27, 2024

View reviewed changes

pkg/distributor/distributor.go Show resolved Hide resolved

cstyan closed this Nov 28, 2024

cstyan reopened this Nov 28, 2024

cstyan added 3 commits November 28, 2024 12:43

check if SM is valid in the correct place

20fccbd

Signed-off-by: Callum Styan <[email protected]>

create write request in each iteration of the benchmark test, otherwise

ea913f4

we end up adding the detected log level on every iteration to the one original write request and end up with N-1 detected log level labels in the SM Signed-off-by: Callum Styan <[email protected]>

remove the if check, doesn't impact performance at all

c27b8e3

Signed-off-by: Callum Styan <[email protected]>

cstyan changed the title ~~WIP sanitize structured metadata during ingestion in the distributor~~ feat: sanitize structured metadata during ingestion in the distributor Nov 28, 2024

cstyan marked this pull request as ready for review November 28, 2024 22:17

cstyan requested a review from a team as a code owner November 28, 2024 22:17

cstyan added 2 commits November 28, 2024 14:23

Merge branch 'main' into callum-distributor-sm-sanitize

d002e93

update the push benchmark to test structured metadata cases

ba1a96c

Signed-off-by: Callum Styan <[email protected]>

cyriltovena approved these changes Nov 29, 2024

View reviewed changes

ashwanthgoli approved these changes Nov 29, 2024

View reviewed changes

cstyan merged commit be4f17e into main Nov 29, 2024
58 checks passed

cstyan deleted the callum-distributor-sm-sanitize branch November 29, 2024 22:27

This was referenced Dec 23, 2024

chore(k234): release 3.4.0 #15536

Open

chore(k235): release 3.4.0 #15555

Open

loki-gh-app bot mentioned this pull request Jan 6, 2025

chore(k236): release 3.4.0 #15595

Open

loki-gh-app bot mentioned this pull request Jan 13, 2025

chore(k237): release 3.4.0 #15705

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: sanitize structured metadata during ingestion in the distributor #15141

feat: sanitize structured metadata during ingestion in the distributor #15141

cstyan commented Nov 27, 2024

cstyan commented Nov 28, 2024 •

edited

Loading

cyriltovena left a comment

ashwanthgoli left a comment

cstyan commented Nov 29, 2024

feat: sanitize structured metadata during ingestion in the distributor #15141

feat: sanitize structured metadata during ingestion in the distributor #15141

Conversation

cstyan commented Nov 27, 2024

cstyan commented Nov 28, 2024 • edited Loading

cyriltovena left a comment

Choose a reason for hiding this comment

ashwanthgoli left a comment

Choose a reason for hiding this comment

cstyan commented Nov 29, 2024

cstyan commented Nov 28, 2024 •

edited

Loading