Skip to content

Conversation

@mtelvers
Copy link
Member

Added a capacity metric to ocluster-worker to allow us to centrally report on how busy a worker is.

Thus in Grafana, we could graph on

ocluster_worker_running_jobs{job="workername"}/ocluster_worker_capacity{job="workername"}

As this value is fixed it only needs to be set once but for readability, I have kept it together with the other metrics which are set once per job. Happy to move it if there is a preferred/better place.

@mtelvers mtelvers requested a review from MisterDA October 13, 2022 16:03
Copy link
Contributor

@MisterDA MisterDA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok to me. Don't know much about Prometheus nor Grafana, though.

@talex5
Copy link
Contributor

talex5 commented Oct 14, 2022

As this value is fixed it only needs to be set once but for readability, I have kept it together with the other metrics which are set once per job. Happy to move it if there is a preferred/better place.

This also has the problem that the capacity will be reported as zero until the first job starts. It would be better to move this to the run function, setting the capacity as soon as it is known.

@mtelvers
Copy link
Member Author

Thanks @talex5. Do the workers know the name of the pool they are connected to? Using Grafana, there doesn't seem to be an easy way to select workers by pool? It would be great if there was a metric which reflected, say linux-x86_64.

@talex5
Copy link
Contributor

talex5 commented Oct 14, 2022

Do the workers know the name of the pool they are connected to?

No, but the scheduler does, and it knows the capacity too. You could report the metric there instead. There are two places that would need changing: when a worker connects and when it disconnects:

ocluster/scheduler/pool.ml

Lines 485 to 486 in 6f5f7c3

t.cluster_capacity <- t.cluster_capacity +. float capacity;
Prometheus.Gauge.inc_one (Metrics.workers_connected t.pool);

ocluster/scheduler/pool.ml

Lines 592 to 593 in 6f5f7c3

t.cluster_capacity <- t.cluster_capacity -. float w.capacity;
Prometheus.Gauge.dec_one (Metrics.workers_connected t.pool);

@mtelvers
Copy link
Member Author

We could use PR #195 instead.

@mtelvers mtelvers closed this Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants