feat(proxy, worker): add metrics for the proxy and worker #1004

Open
SantiagoPittella opened this issue Dec 3, 2024 · 10 comments

@SantiagoPittella
Collaborator

What should be done?

The Proxy and Worker services should have a way to report metrics about their internal status and transaction flow.

How should it be done?

  • Define what monitoring technology should be used. Pingora already provides support for Prometheus.
  • Define what metrics are important for our use case:
    • Proxy queue size.
    • Proving time of the worker.
    • Total request time.
  • Add metrics report.
  • Add dashboard to read metrics.

When is this task done?

When a metrics system is in place and a dashboard is provided.

Additional context

No response

@SantiagoPittella
Collaborator Author

It might be good to have an approach similar to that of miden-node. @Mirko-von-Leipzig @bobbinth, any take here would be useful.

@bobbinth
Contributor

bobbinth commented Dec 3, 2024

It might be good to have an approach similar to that of miden-node.

I don't think we have an approach in miden-node yet (we do have an issue for this, though). So, whatever we land on here would probably be the path we'd take in miden-node as well.

Define what monitoring technology should be used. Pingora already provides support for Prometheus.

I don't have much experience here, but Prometheus seems like a reasonable and popular choice - though I'm curious what other options are out there. @Mirko-von-Leipzig, what do you think?

  • Proxy queue size.
  • Proving time of the worker.
  • Total request time.

Maybe also number of workers?

@bobbinth added this to the v0.7 milestone Dec 4, 2024
@Mirko-von-Leipzig
Contributor

I don't have much experience here, but Prometheus seems like a reasonable and popular choice - though I'm curious what other options are out there. @Mirko-von-Leipzig, what do you think?

This is something I've been thinking about recently; I've opened a discussion.

@SantiagoPittella
Collaborator Author

Update:

I'm using tracing + OpenTelemetry + Jaeger. I have traces in place, and those traces include the internal logs correctly.
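
For reference, a rough sketch of how that stack is typically wired together; the crate versions, exporter choice, and service name here are assumptions on my part, and the actual PR may configure this differently:

use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans to a local Jaeger agent; the service name is illustrative.
    let tracer = opentelemetry_jaeger::new_agent_pipeline()
        .with_service_name("miden-proving-service")
        .install_simple()?;

    // Bridge tracing spans into OpenTelemetry and install the subscriber,
    // keeping a fmt layer so logs still go to stdout.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(())
}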

For the worker, instrumenting the RPC server was enough; it picked up the prover's internal instrumentation and traces out of the box.

[screenshot: worker traces in Jaeger]
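
As a minimal sketch of what "instrumenting the RPC server" means here (the service and request types are illustrative, not the worker's actual code):

use tracing::instrument;

// Illustrative types only; the worker's real gRPC service and request types differ.
pub struct ProveRequest {
    pub id: u64,
}

pub struct ProverRpc;

impl ProverRpc {
    // Instrumenting the top-level RPC handler is enough: the prover's own internal
    // spans and log events attach to this request span automatically.
    #[instrument(skip(self, request), fields(request_id = %request.id))]
    pub async fn prove(&self, request: ProveRequest) -> Result<(), String> {
        tracing::info!("starting proof generation");
        // ... call into the prover here ...
        Ok(())
    }
}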

For the proxy, I haven't arrived at the desired state yet. Pingora provides a trait whose methods make up the life-cycle of a request, and I instrumented each of these methods. But the caller is not instrumented, so the traces are shown separately. I'm looking for a way to join these traces somehow, most likely through the request ID. Any idea, tip, or method would be much appreciated @Mirko-von-Leipzig.

This is currently how the traces of the proxy are shown:

[screenshot: proxy traces shown as separate, unlinked spans]

@Mirko-von-Leipzig
Contributor

So if I understand correctly, it's because there is no parent span to connect these different methods into one trace?

cloudflare/pingora#237

^^ seems to hint that you need to manually insert the parent span into the Ctx - I assume there would be some appropriate callback on construction by the caller. I'm not super familiar with pingora.

You should then be able to manually (aka no #[instrument]) instrument each method by entering this context span. I think. Wish they'd just spell it out in the issue instead of just hinting at it.

@SantiagoPittella
Collaborator Author

So if I understand correctly, it's because there is no parent span to connect these different methods into one trace?

yes

Awesome, checking it. Thank you!

@SantiagoPittella
Collaborator Author

SantiagoPittella commented Dec 12, 2024

Update:

I ended up using Mirko's approach, creating a parent span and using the context to hold it. Luckily, #[instrument] supports passing a parent, so using #[instrument(parent = ctx.parent_trace)] did the trick.
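
A minimal sketch of this pattern follows; the struct, field, and span names are illustrative, not the exact code from the PR:

use tracing::{info_span, instrument, Span};

/// Per-request context; Pingora lets the proxy construct one of these per request
/// and passes it to every life-cycle method.
pub struct RequestContext {
    pub parent_trace: Span,
}

impl RequestContext {
    pub fn new() -> Self {
        // Create the parent span once, when the request context is constructed.
        Self { parent_trace: info_span!("proxy_request") }
    }
}

pub struct Proxy;

impl Proxy {
    // Each life-cycle method attaches to the per-request parent span, so all
    // phases of one request end up in a single trace.
    #[instrument(parent = &ctx.parent_trace, skip_all)]
    pub async fn request_filter(&self, ctx: &mut RequestContext) {
        tracing::info!("request filter phase");
    }

    #[instrument(parent = &ctx.parent_trace, skip_all)]
    pub async fn upstream_peer(&self, ctx: &mut RequestContext) {
        tracing::info!("selecting upstream worker");
    }
}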

I spent some time debugging an error that caused some transaction traces to be nested when they weren't supposed to be. Restarting the Jaeger container resolved the issue.

[screenshot: traces after the fix]

At the moment, a quick cleanup + some configuration are still missing for the traces PR to be ready, though feel free to take a look at it.

I'm taking some time to add metrics to it before finishing the PR.

@SantiagoPittella
Collaborator Author

Yet another update:

I've been playing with metrics. Pingora provides a Prometheus service that can be added to the main server as easily as this:

// Expose a /metrics endpoint on 127.0.0.1:6192 using Pingora's built-in
// Prometheus HTTP service.
let mut prometheus_service_http =
    pingora::services::listening::Service::prometheus_http_service();
prometheus_service_http.add_tcp("127.0.0.1:6192");

// Register it alongside the proxy's other services on the main server.
server.add_service(prometheus_service_http);

And it just works! I added the queue size as a metric to test it with:

use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    pub static ref QUEUE_SIZE: IntGauge =
        register_int_gauge!("queue_size", "Number of requests in the queue").unwrap();
}

// ...

    /// Enqueue a request
    pub async fn enqueue(&self, request_id: Uuid) {
        QUEUE_SIZE.inc();
        let mut queue = self.queue.write().await;
        queue.push_back(request_id);
    }

    /// Dequeue a request
    pub async fn dequeue(&self) -> Option<Uuid> {
        let mut queue = self.queue.write().await;
        // If the queue was empty, the queue size does not change
        if let Some(request_id) = queue.pop_front() {
            QUEUE_SIZE.dec();
            Some(request_id)
        } else {
            None
        }
    }

And now we can retrieve the metrics by doing:

$ curl http://0.0.0.0:6192/metrics

# HELP queue_size Number of requests in the queue
# TYPE queue_size gauge
queue_size 0

Demo:

Screen.Recording.2024-12-12.at.17.58.52.mov

Now I'm adding more metrics, and next I'll take a look at Grafana.
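
For duration-style metrics like the proving time and total request time listed at the top of this issue, the same prometheus-crate pattern extends to histograms. A hedged sketch, where the metric name, buckets, and helper function are illustrative rather than what the PR actually adds:

use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};
use std::time::Instant;

lazy_static! {
    // Buckets are illustrative; tune them to the expected proving-time distribution.
    pub static ref PROVING_TIME_SECONDS: Histogram = register_histogram!(
        "proving_time_seconds",
        "Time spent proving a transaction, in seconds",
        vec![0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
    )
    .unwrap();
}

/// Wraps a proving call and records how long it took.
pub fn prove_with_metrics<T>(prove: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = prove();
    PROVING_TIME_SECONDS.observe(start.elapsed().as_secs_f64());
    result
}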

@SantiagoPittella
Collaborator Author

Opened #1017 as a draft including a couple of metrics. I will mark the PR as ready after fixing a metric that is not being updated properly and adding some documentation.

@bobbinth
Contributor

Very cool! One question: I understand that for things like queue size, number of available workers, etc., we need to track the metrics manually. But what about things like request latency, total number of requests in a given period of time, etc. - do we also need to track them manually, or could we tap into the instrumentation output to get them?
