Great to see instrumentation for QUIC coming in the future!
```go
now := time.Now()
c.mutex.Lock()
for _, conn := range c.conns {
	if rtt, valid := conn.getSmoothedRTT(); valid {
```
Do I understand correctly that the RTT is recorded each time Prometheus scrapes the node, so the resolution of the metric data depends on the Prometheus scrape interval?
I would deem this a bit problematic for the following reasons:
- Data exposed is dependent on external processes.
- The way to achieve high availability is to run 2 or 3 instances of Prometheus each scraping the same set of targets (nodes). This would directly influence data resolution.
- In case Prometheus does not scrape the node at all (e.g. due to downtime) no data is recorded in the histogram.
Would it be feasible to record the RTT on each round trip instead?
There are two problems with recording the RTT on every round trip:
- If a connection is not sending any data, it will not generate any RTT samples.
- If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.
What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.
> If a connection is not sending any data, it will not generate any RTT samples.

This applies to both the status quo and my suggestion, correct?
> If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.

Good point. Thanks for clarifying!
> What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.

Would a collection across all connections every X seconds do the trick? Data collection would then not depend on external factors (the number of Prometheus instances and the scrape interval) while still keeping the number of measurements per connection equal.
> This applies to both the status quo and my suggestion, correct?

We're saving the most recent RTT sample for every connection, so this problem is solved in the PR. We might use an RTT measurement that is a few seconds old, but considering that we're sending a keep-alive PING every 15s, we're generating fresh RTT measurements quite frequently.
> Would a collection across all connections every X seconds do the trick? Data collection would then not depend on external factors (the number of Prometheus instances and the scrape interval) while still keeping the number of measurements per connection equal.

What's the advantage of that? Looping over all connections is moderately expensive, so it would be good not to do it more frequently than necessary.
> What's the advantage of that?

Looping over the connections on a fixed interval would make the data returned to Prometheus independent of its scrape interval and the number of Prometheus instances. E.g. if you have 3 Prometheus instances for high availability, you would be looping over the connections 3 times every scrape interval (e.g. every minute). If you instead collect on a fixed interval (e.g. 1 minute), you would loop over all connections only once per minute instead of 3 times per minute.
I think it is fine either way.
```go
var collector *aggregatingCollector

func init() {
```
If my Go skills are not misleading me, all packages importing this package will now expose the metrics below, even if they only use some small part of the library, e.g. a helper function. Can the same be achieved with local instead of global variables? E.g. can one pass a metric registry at construction time and keep a reference to the metrics in some local struct?
What's the best practice here? If a package wants to expose Prometheus metrics, does the user of that package call a `package.InitializeMetrics()` function? We could certainly do that, if that's the preferred way.
> What's the best practice here?

If I recall correctly, the best practice in Go is to use global metric variables and initialization via `init()`, like you did here. I am not in favor of this instrumentation style, due to the various issues that global state brings along. Instead I would pass a registry as an argument at object creation, e.g. `NewQuicTransport`. The returned `QuicTransport` object would then own the metrics registered with the provided registry.
I am happy to expand on my architecture, though in the end I think you should go with whatever works best for you.
> Instead I would pass a registry as an argument at object creation, e.g. `NewQuicTransport`.

What would happen if you create multiple transports then? We can only register each metric once.
If the `init()` way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.
> What would happen if you create multiple transports then? We can only register each metric once.

What I would do is have a `RegisterMetrics` function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

> If the `init()` way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.

That sounds good to me.
> What I would do is have a `RegisterMetrics` function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

That sounds like a much cleaner setup. I'm open to making this change in the future.
Force-pushed from ea3afa1 to fb1b1d8.
Looks good to me from the Prometheus side. Not sure my Go skills are good enough for a general approval. Your call.
Force-pushed from fb1b1d8 to 3acd350.