
NETOBSERV-497 allow direct metrics pipeline #266

Merged (10 commits) on Sep 1, 2022

Conversation

jotak (Member) commented Jul 26, 2022:

This PoC creates a new "SimpleProm" Encode stage that directly extracts metrics from flows, without an intermediate Aggregate stage.

The rationale is that Prometheus already manages label aggregation, so this leaves more responsibilities (and more power) to the query side. For instance, aggregations such as sum/avg would be performed in PromQL at query time rather than upstream.

Pipelines to generate are simpler, as they don't need an "Aggregate" stage, and the prom-encode stage itself has fewer parameters, e.g.:

	enrichedStage.SimpleEncodePrometheus("prometheus", api.SimplePromEncode{
		Port:   int(b.desired.PrometheusPort),
		Prefix: "netobserv_",
		Metrics: []api.SimplePromMetricsItem{{
			Name:      "bytes_total",
			RecordKey: "Bytes",
			Type:      "counter",
			Labels:    []string{"SrcK8S_Namespace", "DstK8S_Namespace"},
		}, {
			Name:      "packets_total",
			RecordKey: "Packets",
			Type:      "counter",
			Labels:    []string{"SrcK8S_Namespace", "DstK8S_Namespace"},
		}},
	})
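For illustration, the sums that the Aggregate stage used to compute upstream could then be obtained at query time, e.g. with a (hypothetical) PromQL query such as sum(rate(netobserv_bytes_total[1m])) by (SrcK8S_Namespace).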

@jotak jotak marked this pull request as draft July 26, 2022 12:03
jotak (author) commented Jul 26, 2022:

This pipeline (cf. Go code above) results in creating these metrics, with both src/dest namespaces as labels:

(screenshot from 2022-07-26 13:48 showing the resulting metrics)

codecov-commenter commented Jul 26, 2022:

Codecov Report

Merging #266 (d83b532) into main (aa342a2) will increase coverage by 0.36%.
The diff coverage is 84.07%.

@@            Coverage Diff             @@
##             main     #266      +/-   ##
==========================================
+ Coverage   67.34%   67.70%   +0.36%     
==========================================
  Files          73       74       +1     
  Lines        4281     4357      +76     
==========================================
+ Hits         2883     2950      +67     
- Misses       1214     1219       +5     
- Partials      184      188       +4     
Flag        Coverage Δ
unittests   67.70% <84.07%> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                          Coverage Δ
pkg/api/encode_prom.go                  100.00% <ø> (ø)
pkg/pipeline/encode/encode_prom.go      76.13% <81.97%> (+0.99%) ⬆️
pkg/confgen/flowlogs2metrics_config.go  75.00% <90.00%> (+1.15%) ⬆️
pkg/confgen/confgen.go                  49.20% <100.00%> (ø)
pkg/confgen/encode.go                   61.90% <100.00%> (+11.90%) ⬆️
pkg/operational/metrics/metrics.go      52.50% <100.00%> (ø)
pkg/pipeline/utils/timed_cache.go       100.00% <100.00%> (ø)
pkg/test/prom.go                        100.00% <100.00%> (ø)
pkg/confgen/grafana_jsonnet.go          47.47% <0.00%> (+3.03%) ⬆️


eranra (Collaborator) commented Jul 26, 2022:

@jotak how many metrics will be sent to Prometheus? Don't you think this will exhaust Prometheus with too many metrics? The idea of Aggregate was to pre-process metrics so that the only thing Prometheus needs to handle is the aggregates. For example, in the case above, Prometheus will get namespace metrics without having to worry about the underlying metrics. Maybe I am missing something here ... we can find time to talk tomorrow maybe?

jotak (author) commented Jul 26, 2022:

@eranra I already proposed a call on Thursday (check your mail) :)

eranra (Collaborator) commented Jul 26, 2022:

@jotak I see that now ... can we move that to an earlier time, morning EU time? This one overlaps multiple meetings for me.

jotak (author) commented Jul 27, 2022:

btw we would also need to check how to combine that with confgen; maybe adding a config flag in the metrics definitions to tell whether we want the "direct prom" approach or the "aggregate + prom" approach.

eranra (Collaborator) commented Jul 27, 2022:

@jotak another thing I want to think about is whether there is a way to combine the code, so we don't end up with two options doing almost the same thing on top of two code bases. Maybe there is a way to split/reuse the code so that we don't duplicate it, but share the functions between the direct version and the split version.

jotak (author) commented Jul 27, 2022:

Also, for the record: we need to check whether there is an internal cache in the prom client to clean from time to time (similar to the expiry mechanism implemented in FLP caches).

KalmanMeth (Collaborator) commented Jul 28, 2022:

In encode_prom, we implemented a cache to clean up items that are inactive, and we encapsulated it in utils.timed_cache.
The Cleanup callback function in encode_prom.go does the cleanup in the prometheus client.
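A minimal sketch of the idea, with hypothetical names (the actual utils.timed_cache API may differ): entries not touched within the TTL are evicted, and a callback lets encode_prom delete the matching series from the Prometheus client.

	import (
		"sync"
		"time"
	)

	// entry tracks when a cache key was last touched
	type entry struct {
		lastTouch time.Time
	}

	type TimedCache struct {
		mu      sync.RWMutex
		entries map[string]*entry
	}

	// CleanupExpired evicts entries unseen for longer than ttl, invoking
	// onEvict for each, e.g. to call counterVec.DeleteLabelValues(...)
	func (tc *TimedCache) CleanupExpired(ttl time.Duration, onEvict func(key string)) {
		tc.mu.Lock()
		defer tc.mu.Unlock()
		now := time.Now()
		for key, e := range tc.entries {
			if now.Sub(e.lastTouch) > ttl {
				onEvict(key)
				delete(tc.entries, key)
			}
		}
	}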

KalmanMeth (Collaborator) commented:

The encode_prom stage does not depend on the extract_aggregate stage. The confgenerator builds a config in which extract_aggregate feeds encode_prom, but encode_prom can be used by itself, without aggregation. Once you add the cache to SimpleEncodeProm, it becomes essentially the existing encode_prom.

jotak (author) commented Jul 28, 2022:

@KalmanMeth yes, I see that now. Maybe with the exception of the histogram values, right?

jotak (author) commented Jul 28, 2022:

@KalmanMeth I think the existing PromEncode almost works well as an independent stage, like what I was trying to do. I see however two issues:

  • The PromMetricsFilter in the PromEncode API prevents exposing any metric we'd like
  • Histograms are not working

I believe these changes were added later, without having in mind that the prom-encode stage could be used independently?

In my last commit I've added a test of exposed metrics twice, see eb99bc0: once for the existing PromEncode, once for the new PoC prom encode. I would expect the test to work on both, but it only passes on the SimpleProm one, because of the two issues mentioned.

jotak (author) commented Jul 28, 2022:

So, rather than having a new Encode implementation, I will try to fix the existing one.

@jotak jotak force-pushed the poc-simple-prom branch 3 times, most recently from 57936bb to 1140493, July 29, 2022 09:32
jotak (author) commented Jul 29, 2022:

last commit: now focusing on fixing the existing PromEncode to work fine without Aggregation. Also spending some time on optimization (e.g. decreasing the number of allocs).

jotak (author) commented Jul 29, 2022:

I see another issue with the current Aggregation stage: we cannot mix "direct" PromEncode stages and "non-direct" ones (i.e. following Aggregation) in confgen: we would need to fork before Aggregation, something like:

transform -> aggregate
aggregate -> prom1
transform -> prom2

But confgen doesn't support that. Also, having two prom stages means we need to deal with port conflicts.
So at the moment it's all or nothing: either the pipeline has only "aggregate+prom", or only "direct-prom", but not a mix of them.

jotak (author) commented Aug 2, 2022:

FYI, Jira created: https://issues.redhat.com/browse/NETOBSERV-497

ronensc (Collaborator) left a comment:

This looks promising. It feels like the code is cleaner.
@jotak I'm amazed by how fast you've done this work 🤩

Comment on lines +181 to +185
exposed := test.ReadExposedMetrics(t)

for _, expected := range tt.expectedEncode {
require.Contains(t, exposed, expected)
}
👍
I like the idea of checking the exposed metrics rather than PrevRecords
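For context, reading the exposed metrics in a test can be done by scraping the handler in-process; a sketch of the kind of thing test.ReadExposedMetrics presumably does (names here are assumptions, not the repo's actual helper):

	import (
		"net/http"
		"net/http/httptest"

		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	// readExposedMetrics scrapes the /metrics handler directly and returns
	// the text exposition, i.e. lines like test_bytes_total{...} 42
	func readExposedMetrics() string {
		req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
		rr := httptest.NewRecorder()
		promhttp.Handler().ServeHTTP(rr, req)
		return rr.Body.String()
	}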

Comment on lines +88 to +89
tc.mu.RLock()
defer tc.mu.RUnlock()
good catch

Comment on lines +70 to +76
var errorsCounter = operationalMetrics.NewCounterVec(prometheus.CounterOpts{
Name: "encode_prom_errors",
Help: "Total errors during metrics generation",
}, []string{"error", "metric", "key"})

👍

reg := prometheus.NewRegistry()
prometheus.DefaultRegisterer = reg
prometheus.DefaultGatherer = reg
http.DefaultServeMux = http.NewServeMux()
ronensc (Collaborator) commented Aug 2, 2022:
What's the purpose of setting http.DefaultServeMux = http.NewServeMux()?

jotak (author) commented Aug 2, 2022:

this is because in encode_prom we run http.Handle("/metrics", promhttp.Handler()), which registers a handler on the default, global mux/router. When this is called several times, an error is fired (something like "cannot register /metrics, route already exists").

I don't remember exactly in which case that happened; maybe it was just in the benchmark I created below.

BTW it also shows that the prom_encode stage will need to be refactored if at some point we want to be able to define more than one prom-encode in a pipeline. I mentioned that in this Jira: https://issues.redhat.com/browse/NETOBSERV-498
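A sketch of how a prom-encode stage could avoid this global-state clash, assuming a dedicated registry and mux per stage (this is not the current FLP code, just an illustration):

	import (
		"net/http"

		"github.com/prometheus/client_golang/prometheus"
		"github.com/prometheus/client_golang/prometheus/promhttp"
	)

	// startMetricsServer gives the stage its own registry and its own mux,
	// so several prom-encode stages (on different ports) would not collide
	// on http.DefaultServeMux or the default registerer
	func startMetricsServer(addr string) *prometheus.Registry {
		reg := prometheus.NewRegistry()
		mux := http.NewServeMux()
		mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
		go func() {
			_ = http.ListenAndServe(addr, mux)
		}()
		return reg
	}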

ronensc (Collaborator) replied:

@jotak thanks. I tried commenting out that line to see what happens. It failed on the second unit test of encode_prom_test.go, because of the multiple registration you described. So resetting the DefaultServeMux to a new instance on each unit test solves this problem.

@ronensc ronensc requested a review from KalmanMeth August 2, 2022 11:07
@jotak jotak changed the title PoC simpler metrics pipeline NETOBSERV-497 allow direct metrics pipeline Aug 2, 2022
@jotak jotak marked this pull request as ready for review August 2, 2022 12:52
require.Contains(t, exposed, `test_packets_total{dstIP="10.0.0.1",srcIP="20.0.0.2"} 2`)
require.Contains(t, exposed, `test_packets_total{dstIP="30.0.0.3",srcIP="10.0.0.1"} 2`)
require.Contains(t, exposed, `test_latency_seconds_bucket{dstIP="10.0.0.1",srcIP="20.0.0.2",le="0.025"} 0`)
require.Contains(t, exposed, `test_latency_seconds_bucket{dstIP="10.0.0.1",srcIP="20.0.0.2",le="0.05"} 1`)

Where are these buckets defined? Are they a default?

jotak (author) replied:

yes, the prom client defines defaults: DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

This hasn't changed from the previous implementation.
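For reference, a sketch of how those defaults come into play when declaring a histogram with the Prometheus Go client (the metric and label names are just those used in the test above):

	import "github.com/prometheus/client_golang/prometheus"

	// leaving Buckets nil is equivalent to passing prometheus.DefBuckets
	var latency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "test_latency_seconds",
		Help:    "flow latency",
		Buckets: prometheus.DefBuckets,
	}, []string{"srcIP", "dstIP"})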


for i := 0; i < b.N; i++ {
prom.Encode(hundredFlows())
}
}

@jotak I learned a lot from this re-write of encode_prom and its test.

Amoghrd (Contributor) commented Aug 23, 2022:

/ok-to-test

Commit messages (excerpts):

  • TTL test doesn't pass on the existing encode_prom
  • Added type "AggHistogram" to differentiate histograms initiated from the Aggregate stage from histograms created only by prom
  • Make the Filter mechanism optional (it doesn't make sense when metrics aren't prepared by the Aggregation stage)
  • In confgen, detect automatically when AggHistogram should be used, so that the user doesn't have to worry about it
  • Modify aggregate_prom_test to read metrics output from the HTTP handler rather than prevrecords
  • Fixed race in timed_cache (only tests could be impacted)
  • Also, some performance improvement in EncodeProm (benchmark: ns/op divided by 2, allocs/op almost by 3) by having a more efficient cache key building (see the sketch below), removing prevrecords, and a few other things
  • Similar to the "count" operation in the Aggregation stage, but that can be done directly via promencode
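On the cache-key point, a hypothetical illustration of the kind of allocation savings described (not the actual FLP code): building the key with strings.Builder instead of, say, fmt.Sprintf over all label values.

	import "strings"

	// cacheKey concatenates the metric name and label values with a separator,
	// using a single growing buffer instead of repeated formatting allocations
	func cacheKey(metricName string, labelValues []string) string {
		var sb strings.Builder
		sb.WriteString(metricName)
		for _, v := range labelValues {
			sb.WriteByte('|')
			sb.WriteString(v)
		}
		return sb.String()
	}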