feat(proxy, worker): add metrics for the proxy and worker #1004
Comments
It might be good to have an approach similar to that of miden-node. @Mirko-von-Leipzig @bobbinth any take here will be useful. |
I don't think we have an approach in miden-node yet. |
I don't have much experience here, but Prometheus seems like a reasonable and popular choice - curious what other options are out there. @Mirko-von-Leipzig, what do you think? |
Maybe also the number of workers? |
This is something I've been thinking about recently; I've opened a discussion. |
Update: I'm using tracing + OpenTelemetry + Jaeger. I have traces in place, and those traces correctly include the internal logs. For the worker, instrumenting the RPC server was enough: it picked up the prover's internal instrumentation and traces out-of-the-box. For the proxy, I haven't arrived at the desired state yet. Pingora provides a trait to implement in order to handle the life-cycle of a request, and I instrumented each of these methods. But the caller is not instrumented, so the traces are shown separately. I'm looking for a way to join these traces somehow, most likely through the request ID. Any idea, tip, or method would be much appreciated @Mirko-von-Leipzig. This is currently how the traces of the proxy are shown: |
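For reference, instrumenting an async RPC handler with `tracing` looks roughly like this (a minimal sketch; `ProveRequest`, `ProveResponse`, and the field names are hypothetical placeholders, not the actual worker API):

```rust
use tracing::instrument;

// Hypothetical request/response types standing in for the worker's RPC API.
pub struct ProveRequest {
    pub id: String,
}
pub struct ProveResponse {
    pub proof: Vec<u8>,
}

// `#[instrument]` opens a span per call; any spans the prover creates
// internally become children of it, which is why the worker picked up the
// prover's traces out-of-the-box once the RPC server was instrumented.
#[instrument(skip(request), fields(request_id = %request.id))]
pub async fn prove(request: ProveRequest) -> ProveResponse {
    tracing::info!("proving transaction");
    ProveResponse { proof: Vec::new() }
}
```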
So if I understand correctly, it's because there is no parent span to connect these different methods into one trace? ^^ seems to hint that you need to manually insert the parent span into the request context. You should then be able to manually (aka no `#[instrument]` macro) enter that span in each life-cycle method. |
Yes. Awesome, checking it. Thank you! |
Update: I ended up using Mirko's approach, creating a parent span and using the context to hold it. Unluckily, I spent some time debugging an error that caused some transaction traces to be nested when they weren't supposed to be; restarting the Jaeger container resolved the issue. At the moment, a quick cleanup + some configuration are missing for the traces PR to be ready, though feel free to take a look at it. I'm taking some time to add metrics to it before finishing the PR. |
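In code, that approach looks roughly like this (a sketch, assuming a Pingora-style per-request context; all names here are illustrative, not the actual proxy code):

```rust
use tracing::{info_span, Instrument, Span};

// Hypothetical per-request context. Pingora builds one context per request,
// so a root span stored here is shared by every life-cycle method.
pub struct RequestContext {
    parent_span: Span,
}

impl RequestContext {
    pub fn new(request_id: &str) -> Self {
        Self {
            parent_span: info_span!("proxy_request", %request_id),
        }
    }
}

// Inside each life-cycle method, attach the stored span to the async body.
// Using `.instrument(...)` (rather than holding an `enter()` guard across
// `.await` points) keeps the span correct in async code.
pub async fn handle_upstream(ctx: &RequestContext) {
    async {
        tracing::info!("selecting upstream peer");
    }
    .instrument(ctx.parent_span.clone())
    .await;
}
```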
Yet another update: Been playing with metrics. Pingora provides a service that can be added to the main server this easily:

```rust
let mut prometheus_service_http =
    pingora::services::listening::Service::prometheus_http_service();
prometheus_service_http.add_tcp("127.0.0.1:6192");
server.add_service(prometheus_service_http);
```

And it just works! I added the queue size as a metric to test it with:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};
use uuid::Uuid;

lazy_static! {
    pub static ref QUEUE_SIZE: IntGauge =
        register_int_gauge!("queue_size", "Number of requests in the queue").unwrap();
}

// ...

/// Enqueue a request
pub async fn enqueue(&self, request_id: Uuid) {
    QUEUE_SIZE.inc();
    let mut queue = self.queue.write().await;
    queue.push_back(request_id);
}

/// Dequeue a request
pub async fn dequeue(&self) -> Option<Uuid> {
    let mut queue = self.queue.write().await;
    // If the queue was empty, the queue size does not change
    if let Some(request_id) = queue.pop_front() {
        QUEUE_SIZE.dec();
        Some(request_id)
    } else {
        None
    }
}
```

And now we can retrieve the metrics by doing:

```
$ curl http://0.0.0.0:6192/metrics
# HELP queue_size Number of requests in the queue
# TYPE queue_size gauge
queue_size 0
```

Demo: Screen.Recording.2024-12-12.at.17.58.52.mov

Now I'm adding more metrics and I was going to take a look at Grafana. |
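One design note on the gauge above: instead of paired `inc()`/`dec()` calls, which can drift if any code path forgets one of them, the gauge could be set from the queue length after each mutation while the lock is still held, e.g. `QUEUE_SIZE.set(queue.len() as i64);` (`IntGauge::set` takes an `i64`). That keeps the metric authoritative rather than derived.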
Opened #1017 as a draft including a couple of metrics. I will fully open the PR after fixing a metric that is not being updated properly and including some documentation. |
Very cool! One question: I understand that for things like queue size, number of available workers, etc. we need to track the metrics manually. But what about things like request latency, total number of requests in a given period of time, etc. - do we also need to track them manually, or could we tap into the instrumentation output to get them? |
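If those do end up tracked manually, a request-latency histogram with the prometheus crate could look like this (a sketch; the metric name and buckets are illustrative, not from the PR):

```rust
use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};

lazy_static! {
    // Buckets are illustrative; they should match expected proving times.
    pub static ref REQUEST_LATENCY: Histogram = register_histogram!(
        "request_latency_seconds",
        "Time spent serving a request",
        vec![0.1, 0.5, 1.0, 5.0, 15.0, 60.0]
    )
    .unwrap();
}

pub async fn timed_handle() {
    // `start_timer` observes the elapsed time when the guard is dropped.
    let _timer = REQUEST_LATENCY.start_timer();
    // ... handle the request ...
}
```

Alternatively, latency and request counts can often be derived from the existing spans by an OpenTelemetry collector, avoiding a second bookkeeping path in the application code.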
What should be done?
The Proxy and Worker services should have a way to report metrics about their internal status and transaction flow.
How should it be done?
When is this task done?
When a metrics system is in place and a dashboard is provided.
Additional context
No response
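For the dashboard criterion above, a minimal Prometheus scrape config pointing at the endpoint from the earlier snippet might look like this (assuming the `127.0.0.1:6192` listener; the job name is illustrative):

```yaml
scrape_configs:
  - job_name: "proving-service-proxy"
    scrape_interval: 15s
    static_configs:
      - targets: ["127.0.0.1:6192"]
```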