Added worker capacity to Prometheus metrics #194

mtelvers · 2022-10-13T16:03:01Z

Added a capacity metric to ocluster-worker to allow us to centrally report on how busy a worker is.

In Grafana, we could then graph the fraction of a worker's capacity that is in use:

ocluster_worker_running_jobs{job="workername"}/ocluster_worker_capacity{job="workername"}

As this value is fixed it only needs to be set once but for readability, I have kept it together with the other metrics which are set once per job. Happy to move it if there is a preferred/better place.

MisterDA

Seems OK to me. I don't know much about Prometheus or Grafana, though.

talex5 · 2022-10-14T09:35:59Z

> As this value is fixed it only needs to be set once but for readability, I have kept it together with the other metrics which are set once per job. Happy to move it if there is a preferred/better place.

This also has the problem that the capacity will be reported as zero until the first job starts. It would be better to move this to the run function, setting the capacity as soon as it is known.
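A minimal sketch of that ordering, runnable with the standard library only. The `Gauge` module is a hypothetical stand-in for a Prometheus gauge, and `run` and `with_job` are illustrative names, not the worker's actual functions. The point is only that the capacity is published at startup while the per-job code touches the running-jobs gauge alone.

```ocaml
(* Hypothetical stand-in for a Prometheus gauge: a mutable float.
   It exists only so this sketch runs without external libraries. *)
module Gauge = struct
  type t = { mutable value : float }
  let create () = { value = 0.0 }
  let set t v = t.value <- v
  let inc_one t = t.value <- t.value +. 1.0
  let dec_one t = t.value <- t.value -. 1.0
  let read t = t.value
end

let capacity_gauge = Gauge.create ()
let running_jobs = Gauge.create ()

(* Illustrative worker entry point: publish the capacity once,
   as soon as it is known. *)
let run ~capacity =
  Gauge.set capacity_gauge (float_of_int capacity)

(* Illustrative per-job wrapper: only the running-jobs gauge changes. *)
let with_job f =
  Gauge.inc_one running_jobs;
  Fun.protect ~finally:(fun () -> Gauge.dec_one running_jobs) f

let () =
  run ~capacity:4;
  (* The capacity is visible here although no job has started yet. *)
  Printf.printf "capacity before first job: %.0f\n" (Gauge.read capacity_gauge);
  with_job (fun () ->
    Printf.printf "running during job: %.0f\n" (Gauge.read running_jobs))
```

With this arrangement a freshly started, idle worker reports its real capacity rather than zero, so the ratio graphed in Grafana is defined from the first scrape.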

mtelvers · 2022-10-14T11:52:55Z

Thanks @talex5. Do the workers know the name of the pool they are connected to? In Grafana, there doesn't seem to be an easy way to select workers by pool. It would be great if there were a metric label that carried the pool name, say linux-x86_64.

talex5 · 2022-10-14T14:40:22Z

> Do the workers know the name of the pool they are connected to?

No, but the scheduler does, and it knows the capacity too. You could report the metric there instead. Two places would need changing, one where a worker connects and one where it disconnects:

ocluster/scheduler/pool.ml, lines 485 to 486 in 6f5f7c3:

    t.cluster_capacity <- t.cluster_capacity +. float capacity;
    Prometheus.Gauge.inc_one (Metrics.workers_connected t.pool);

ocluster/scheduler/pool.ml, lines 592 to 593 in 6f5f7c3:

    t.cluster_capacity <- t.cluster_capacity -. float w.capacity;
    Prometheus.Gauge.dec_one (Metrics.workers_connected t.pool);
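As a rough illustration of reporting capacity per pool from the scheduler, the sketch below keeps one value per pool name and adjusts it on connect and disconnect. The hash table is a hypothetical stand-in for a gauge labelled by pool, and `on_worker_connect` and `on_worker_disconnect` are made-up names; in the real code the change would presumably sit next to the two `workers_connected` updates shown above.

```ocaml
(* Hypothetical stand-in for a gauge family labelled by pool name. *)
let pool_capacity : (string, float) Hashtbl.t = Hashtbl.create 8

(* Add [delta] (which may be negative) to the capacity recorded for [pool]. *)
let add_capacity ~pool delta =
  let current =
    Option.value (Hashtbl.find_opt pool_capacity pool) ~default:0.0 in
  Hashtbl.replace pool_capacity pool (current +. delta)

(* Illustrative hooks for the two places that would need changing. *)
let on_worker_connect ~pool ~capacity =
  add_capacity ~pool (float_of_int capacity)

let on_worker_disconnect ~pool ~capacity =
  add_capacity ~pool (-. float_of_int capacity)

let () =
  on_worker_connect ~pool:"linux-x86_64" ~capacity:4;
  on_worker_connect ~pool:"linux-x86_64" ~capacity:8;
  on_worker_disconnect ~pool:"linux-x86_64" ~capacity:4;
  Printf.printf "linux-x86_64 capacity: %.0f\n"
    (Hashtbl.find pool_capacity "linux-x86_64")
```

Because the value is keyed by pool name, a dashboard could select or aggregate by pool directly, which is the selection that was hard to do with worker-side metrics.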

mtelvers · 2022-10-15T14:01:31Z

We could use PR #195 instead.

Added worker capacity to Prometheus metrics

d079fc6

mtelvers requested a review from MisterDA October 13, 2022 16:03

MisterDA approved these changes Oct 13, 2022

Moved value to run function

560052b

mtelvers closed this Oct 15, 2022
