NETOBSERV-497 allow direct metrics pipeline #266
Conversation
Codecov Report
@@ Coverage Diff @@
## main #266 +/- ##
==========================================
+ Coverage 67.34% 67.70% +0.36%
==========================================
Files 73 74 +1
Lines 4281 4357 +76
==========================================
+ Hits 2883 2950 +67
- Misses 1214 1219 +5
- Partials 184 188 +4
@jotak How many metrics will be sent to Prometheus? Don't you think this will exhaust Prometheus with too many metrics? The idea of aggregate was to pre-process metrics so that the only thing Prometheus needs to handle is the aggregates. For example, in the case above, Prometheus would get namespace metrics without having to worry about the underlying metrics. Maybe I am missing something here ... we can find time to talk tomorrow maybe?
@eranra I already proposed a call on Thursday (check your mail) :)
@jotak I see that now ... can we move that to an earlier time, morning EU time? This overlaps multiple meetings for me.
btw we would also need to check how to combine that with confgen; maybe adding a config flag in metrics definitions to tell whether we want the "direct prom" approach or the "aggregate + prom" approach.
@jotak Another thing I want to think about is whether there is a way to combine the code, so that we don't have two options doing almost the same thing from two code bases. Maybe there is a way to split / reuse the code so that we don't duplicate it, but share the functions between the direct version and the aggregated version.
Also, for the record: we need to check whether there is an internal cache in the prom client to clean from time to time (similar to the expiry mechanism implemented in FLP caches).
In encode_prom, we implemented a cache to clean up items that are inactive, and we encapsulated it in utils.timed_cache. |
The encode_prom stage does not depend on the extract_aggregate stage. The confgenerator builds a config that uses extract_aggregate to feed encode_prom, but encode_prom can be used by itself without aggregation. Once you add the cache to SimpleEncodeProm, it becomes essentially the existing encode_prom.
@KalmanMeth yes, I see that now. Maybe with the exception of the histogram values, right? |
@KalmanMeth I think the existing
I believe these changes were added later, without keeping in mind that the prom-encode stage could be used independently. In my last commit I've added a test of exposed metrics twice, see eb99bc0; once for the existing PromEncode, once for the new PoC prom encode.
So, rather than having a new Encode implementation, I will try to fix the existing one. |
Force-pushed from 57936bb to 1140493.
Last commit: now focusing on fixing the existing PromEncode to work fine without Aggregation. Also spending some time on optimization (e.g. decreasing the number of allocs).
I see another issue with the current Aggregation stage: we cannot mix "direct" PromEncode stages and "non-direct" ones (i.e. following Aggregation) in confgen: we would need to fork the pipeline before Aggregation, something like:
But confgen doesn't support that. Also, having 2 prom stages means we need to deal with port conflicts. |
FYI, jira created: https://issues.redhat.com/browse/NETOBSERV-497
This looks promising. It feels like the code is cleaner.
@jotak I'm amazed by how fast you've done this work 🤩
exposed := test.ReadExposedMetrics(t)

for _, expected := range tt.expectedEncode {
	require.Contains(t, exposed, expected)
}
👍
I like the idea of checking the exposed metrics rather than PrevRecords
tc.mu.RLock()
defer tc.mu.RUnlock()
good catch
var errorsCounter = operationalMetrics.NewCounterVec(prometheus.CounterOpts{
	Name: "encode_prom_errors",
	Help: "Total errors during metrics generation",
}, []string{"error", "metric", "key"})
👍
reg := prometheus.NewRegistry()
prometheus.DefaultRegisterer = reg
prometheus.DefaultGatherer = reg
http.DefaultServeMux = http.NewServeMux()
What's the purpose of setting http.DefaultServeMux = http.NewServeMux()?
This is because in encode_prom, we run http.Handle("/metrics", promhttp.Handler()), which creates a handler on the default, global mux/router. When this is called several times, an error is fired (something like "cannot register /metrics, route already exists").
I don't remember exactly in which case that happened, maybe it was just in the benchmark I created below.
BTW it also shows that the prom_encode stage will need to be refactored if at some point we want to be able to define more than one prom-encode in a pipeline. I mentioned that in this jira: https://issues.redhat.com/browse/NETOBSERV-498
@jotak Thanks. I tried commenting out that line to see what happens: it failed on the second unit test of encode_prom_test.go because of the multiple registration you described. So resetting the DefaultServeMux to a new instance in each unit test solves the problem.
require.Contains(t, exposed, `test_packets_total{dstIP="10.0.0.1",srcIP="20.0.0.2"} 2`)
require.Contains(t, exposed, `test_packets_total{dstIP="30.0.0.3",srcIP="10.0.0.1"} 2`)
require.Contains(t, exposed, `test_latency_seconds_bucket{dstIP="10.0.0.1",srcIP="20.0.0.2",le="0.025"} 0`)
require.Contains(t, exposed, `test_latency_seconds_bucket{dstIP="10.0.0.1",srcIP="20.0.0.2",le="0.05"} 1`)
Where are these buckets defined? Are they a default?
Yes, the prom client defines defaults: DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
This hasn't changed from the previous implementation.
for i := 0; i < b.N; i++ {
	prom.Encode(hundredFlows())
}
}
@jotak I learned a lot from this re-write of encode_prom and its test.
/ok-to-test
This PoC creates a new "SimpleProm" Encode stage that directly extracts metrics from flows without an intermediate Aggregate stage. The rationale is that Prometheus already manages label aggregation. It leaves more responsibilities (and more power) to the query side. For instance, aggregations such as sum/avg would be performed in PromQL at query time rather than upstream.
TTL test doesn't pass on the existing encode_prom
Added type "AggHistogram" to differentiate histograms initiated from the Aggregate stage from histograms created directly by prom. Made the Filter mechanism optional (it doesn't make sense when metrics aren't prepared by the Aggregation stage). In confgen, automatically detect when AggHistogram should be used, so that the user doesn't have to worry about it. Modified aggregate_prom_test to read metrics output from the HTTP handler rather than prevrecords. Fixed a race in timed_cache (only tests could be impacted). Also, some performance improvement in EncodeProm (benchmark: ns/op divided by 2, allocs/op almost divided by 3) by building cache keys more efficiently, removing prevrecords, and a few other things.
Similar to the "count" operation in the Aggregation stage, but done directly via promencode.
Force-pushed from a5f10ba to d83b532.
Pipelines to generate are simpler as they don't need an "Aggregate" stage, and the prom-encode stage itself has fewer parameters, e.g.:
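As an illustration only, such a direct prom-encode parameter block could look roughly like this; the field names are a sketch in the spirit of FLP's YAML config, not the exact schema:

```yaml
parameters:
  - name: encode_prom
    encode:
      type: prom
      prom:
        port: 9102
        prefix: test_
        metrics:
          - name: packets_total
            type: counter
            valueKey: packets      # flow field to accumulate (illustrative)
            labels:
              - srcIP
              - dstIP
          - name: latency_seconds
            type: histogram        # uses the prom client's DefBuckets by default
            valueKey: latency
            labels:
              - srcIP
              - dstIP
```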