feat(proxy, worker): add metrics for the proxy and worker #1004

Open
SantiagoPittella opened this issue Dec 3, 2024 · 10 comments

@SantiagoPittella
Collaborator

What should be done?

The Proxy and Worker services should have a way to report metrics about their internal status and transaction flow.

How should it be done?

  • Define what monitoring technology should be used. Pingora already provides support for Prometheus.
  • Define what metrics are important for our use case:
    • Proxy queue size.
    • Proving time of the worker.
    • Total request time.
  • Add metrics report.
  • Add dashboard to read metrics.

When is this task done?

When a metrics system is in place and a dashboard is provided.

Additional context

No response

@SantiagoPittella
Collaborator Author

It might be good to have an approach similar to that of miden-node. @Mirko-von-Leipzig @bobbinth, any take here would be useful.

@bobbinth
Contributor

bobbinth commented Dec 3, 2024

It might be good to have an approach similar to that of miden-node.

I don't think we have an approach in miden-node yet (we do have an issue for this, though). So, whatever we land on here would probably be the path we'd take in miden-node as well.

Define what monitoring technology should be used. Pingora already provides support for Prometheus.

I don't have much experience here, but Prometheus seems like a reasonable and popular choice - though I'm curious what other options are out there. @Mirko-von-Leipzig, what do you think?

  • Proxy queue size.
  • Proving time of the worker.
  • Total request time.

Maybe also number of workers?

@bobbinth added this to the v0.7 milestone Dec 4, 2024
@Mirko-von-Leipzig
Contributor

I don't have much experience here, but Prometheus seems like a reasonable and popular choice - though I'm curious what other options are out there. @Mirko-von-Leipzig, what do you think?

This is something I've been thinking about recently; I've opened a discussion.

@SantiagoPittella
Collaborator Author

Update:

I'm using tracing + OpenTelemetry + Jaeger. I have traces in place, and those traces include the internal logs correctly.
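
For reference, a rough sketch of how that stack is typically wired together; the crate versions, exporter choice, and service name here are assumptions on my part, and the actual PR may configure this differently:

use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // Export spans to a local Jaeger agent; the service name is illustrative.
    let tracer = opentelemetry_jaeger::new_agent_pipeline()
        .with_service_name("miden-proving-service")
        .install_simple()?;

    // Bridge tracing spans into OpenTelemetry and install the subscriber,
    // keeping a fmt layer so logs still go to stdout.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(())
}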

For the worker, instrumenting the RPC server was enough; it picked up the prover's internal instrumentation and traces out of the box.

[screenshot: worker traces in Jaeger]
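
As a minimal sketch of what "instrumenting the RPC server" means here (the service and request types are illustrative, not the worker's actual code):

use tracing::instrument;

// Illustrative types only; the worker's real gRPC service and request types differ.
pub struct ProveRequest {
    pub id: u64,
}

pub struct ProverRpc;

impl ProverRpc {
    // Instrumenting the top-level RPC handler is enough: the prover's own internal
    // spans and log events attach to this request span automatically.
    #[instrument(skip(self, request), fields(request_id = %request.id))]
    pub async fn prove(&self, request: ProveRequest) -> Result<(), String> {
        tracing::info!("starting proof generation");
        // ... call into the prover here ...
        Ok(())
    }
}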

For the proxy, I haven't arrived at the desired state yet. Pingora provides a trait whose methods make up the life-cycle of a request, and I instrumented each of these methods. But the caller is not instrumented, so the traces are shown separately. I'm looking for a way to join these traces somehow, most likely through the request ID. Any idea, tip, or method would be much appreciated @Mirko-von-Leipzig.

This is currently how the traces of the proxy are shown:

[screenshot: proxy traces shown as separate, unlinked spans]

@Mirko-von-Leipzig
Contributor

So if I understand correctly, it's because there is no parent span to connect these different methods into one trace?

cloudflare/pingora#237

^^ seems to hint that you need to manually insert the parent span into the Ctx - I assume there would be some appropriate callback on construction by the caller. I'm not super familiar with pingora.

You should then be able to manually (aka no #[instrument]) instrument each method by entering this context span. I think. Wish they'd just spell it out in the issue instead of just hinting at it.

@SantiagoPittella
Collaborator Author

So if I understand correctly, it's because there is no parent span to connect these different methods into one trace?

yes

Awesome, checking it. Thank you!

@SantiagoPittella
Collaborator Author

SantiagoPittella commented Dec 12, 2024

Update:

I ended up using Mirko's approach, creating a parent span and using the context to hold it. Luckily, #[instrument] supports passing a parent, so using #[instrument(parent = ctx.parent_trace)] did the trick.
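
A minimal sketch of this pattern follows; the struct, field, and span names are illustrative, not the exact code from the PR:

use tracing::{info_span, instrument, Span};

/// Per-request context; Pingora lets the proxy construct one of these per request
/// and passes it to every life-cycle method.
pub struct RequestContext {
    pub parent_trace: Span,
}

impl RequestContext {
    pub fn new() -> Self {
        // Create the parent span once, when the request context is constructed.
        Self { parent_trace: info_span!("proxy_request") }
    }
}

pub struct Proxy;

impl Proxy {
    // Each life-cycle method attaches to the per-request parent span, so all
    // phases of one request end up in a single trace.
    #[instrument(parent = &ctx.parent_trace, skip_all)]
    pub async fn request_filter(&self, ctx: &mut RequestContext) {
        tracing::info!("request filter phase");
    }

    #[instrument(parent = &ctx.parent_trace, skip_all)]
    pub async fn upstream_peer(&self, ctx: &mut RequestContext) {
        tracing::info!("selecting upstream worker");
    }
}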

I spent some time debugging an error that caused some transaction traces to be nested when they weren't supposed to be. Restarting the Jaeger container resolved the issue.

[screenshot: traces after the fix]

At the moment, a quick cleanup + some configuration are still missing for the traces PR to be ready, though feel free to take a look at it.

I'm taking some time to add metrics to it before finishing the PR.

@SantiagoPittella
Collaborator Author

Yet another update:

I've been playing with metrics. Pingora provides a Prometheus service that can be added to the main server as easily as this:

// Expose a /metrics endpoint on 127.0.0.1:6192 using Pingora's built-in
// Prometheus HTTP service.
let mut prometheus_service_http =
    pingora::services::listening::Service::prometheus_http_service();
prometheus_service_http.add_tcp("127.0.0.1:6192");

// Register it alongside the proxy's other services on the main server.
server.add_service(prometheus_service_http);

And it just works! I added the queue size as a metric to test it with:

use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    pub static ref QUEUE_SIZE: IntGauge =
        register_int_gauge!("queue_size", "Number of requests in the queue").unwrap();
}

// ...

    /// Enqueue a request
    pub async fn enqueue(&self, request_id: Uuid) {
        QUEUE_SIZE.inc();
        let mut queue = self.queue.write().await;
        queue.push_back(request_id);
    }

    /// Dequeue a request
    pub async fn dequeue(&self) -> Option<Uuid> {
        let mut queue = self.queue.write().await;
        // If the queue was empty, the queue size does not change
        if let Some(request_id) = queue.pop_front() {
            QUEUE_SIZE.dec();
            Some(request_id)
        } else {
            None
        }
    }

And now we can retrieve the metrics by doing:

$ curl http://0.0.0.0:6192/metrics

# HELP queue_size Number of requests in the queue
# TYPE queue_size gauge
queue_size 0

Demo:

Screen.Recording.2024-12-12.at.17.58.52.mov

Now I'm adding more metrics, and next I'll take a look at Grafana.
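
For duration-style metrics like the proving time and total request time listed at the top of this issue, the same prometheus-crate pattern extends to histograms. A hedged sketch, where the metric name, buckets, and helper function are illustrative rather than what the PR actually adds:

use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};
use std::time::Instant;

lazy_static! {
    // Buckets are illustrative; tune them to the expected proving-time distribution.
    pub static ref PROVING_TIME_SECONDS: Histogram = register_histogram!(
        "proving_time_seconds",
        "Time spent proving a transaction, in seconds",
        vec![0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]
    )
    .unwrap();
}

/// Wraps a proving call and records how long it took.
pub fn prove_with_metrics<T>(prove: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = prove();
    PROVING_TIME_SECONDS.observe(start.elapsed().as_secs_f64());
    result
}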

@SantiagoPittella
Collaborator Author

Opened #1017 as a draft including a couple of metrics. I will mark the PR as ready after fixing a metric that is not being updated properly and adding some documentation.

@bobbinth
Contributor

Very cool! One question: I understand that for things like queue size, number of available workers, etc., we need to track the metrics manually. But what about things like request latency, total number of requests in a given period of time, etc. - do we also need to track them manually, or could we tap into the instrumentation output to get them?
