
Emit cadence worker's hardware utilization inside worker once per host #1260

Merged
20 commits merged on Jul 18, 2023

Conversation

timl3136 (Member)

What changed?
We can now emit the cadence worker's hardware utilization while polling for decision/activity tasks.

Why?
This enables us to monitor the worker's utilization.

How did you test it?
Tested in a dev environment.

Potential risks
No risk, since it does not affect the overall logic and only adds read-only functions.

@CLAassistant commented Jun 26, 2023

CLA assistant check: all committers have signed the CLA.
@timl3136 changed the title from "Emit cadence worker's hardware utilization while polling" to "Emit cadence worker's hardware utilization inside worker once per host" on Jun 28, 2023
@@ -209,6 +215,9 @@ func (bw *baseWorker) Start() {
	bw.shutdownWG.Add(1)
	go bw.runTaskDispatcher()

	bw.shutdownWG.Add(1)
	bw.EmitHardwareUsage()
Member Author

Not sure if this is the best practice for using a WaitGroup with a goroutine (since the goroutine is started inside this function), but the tests are not reporting any leaks via goleak.
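
For reference, the conventional pairing is Add(1) at the spawn site and a deferred Done() inside the goroutine itself, which keeps the WaitGroup balanced and keeps goleak quiet once the shutdown channel is closed. A minimal, self-contained sketch; the emitter type and field names here are illustrative, not the actual baseWorker fields:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// emitter is illustrative only; the real baseWorker differs.
type emitter struct {
	shutdownCh chan struct{}
	shutdownWG sync.WaitGroup
}

func (e *emitter) start(emit func()) {
	e.shutdownWG.Add(1) // Add at the spawn site, before the goroutine is scheduled
	go func() {
		defer e.shutdownWG.Done() // always released, however the goroutine exits
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-e.shutdownCh:
				return
			case <-ticker.C:
				emit()
			}
		}
	}()
}

func main() {
	e := &emitter{shutdownCh: make(chan struct{})}
	e.start(func() { fmt.Println("emit hardware usage") })
	time.Sleep(300 * time.Millisecond)
	close(e.shutdownCh)
	e.shutdownWG.Wait() // a goleak check after this point sees no leftover goroutine
}
```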

@@ -27,7 +27,9 @@ import (
	"context"
	"errors"
	"fmt"
	"github.com/shirou/gopsutil/cpu"
Contributor

Standard imports should be separated from external ones. I think goimports should do the trick.

@@ -54,6 +57,8 @@ var (

var errShutdown = errors.New("worker shutting down")

var emitOnce sync.Once
Contributor

I would suggest extracting this into WorkerResult as a separate dependency, instead of trying to fit it into the workers with sync.Once.

Member Author

What do you mean by WorkerResult? Are you referring to the baseWorker struct?

Contributor

Oh well, I thought we provided an fx.Module for this as we do in the monorepo. There is a NewCadenceWorkers that returns a single result, which could be a good place for this component.
If we only care about the monorepo, then it should be part of the monorepo.
If we want to have this in OSS, then I think it should be a separate, optional component that developers can use as an extra option.
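
A sketch of what such a separate, optional component could look like: a standalone emitter that an application constructs and starts only if it wants host-level metrics. The package, type, and metric names below are hypothetical; only tally and gopsutil are taken from the PR.

```go
// Package hardwareusage is a hypothetical home for an opt-in emitter.
package hardwareusage

import (
	"sync"
	"time"

	"github.com/shirou/gopsutil/cpu"
	"github.com/uber-go/tally"
)

// Emitter periodically reports host CPU usage to a tally scope.
type Emitter struct {
	scope    tally.Scope
	interval time.Duration
	stopCh   chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

func New(scope tally.Scope, interval time.Duration) *Emitter {
	return &Emitter{scope: scope, interval: interval, stopCh: make(chan struct{})}
}

// Start spawns one background goroutine that emits until Stop is called.
func (e *Emitter) Start() {
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		ticker := time.NewTicker(e.interval)
		defer ticker.Stop()
		for {
			select {
			case <-e.stopCh:
				return
			case <-ticker.C:
				if pct, err := cpu.Percent(0, false); err == nil && len(pct) > 0 {
					e.scope.Gauge("cpu_percentage").Update(pct[0])
				}
			}
		}
	}()
}

// Stop is idempotent and blocks until the emitter goroutine has exited.
func (e *Emitter) Stop() {
	e.stopOnce.Do(func() { close(e.stopCh) })
	e.wg.Wait()
}
```

An application that does not want these metrics simply never constructs the emitter, which leaves the workers themselves unchanged.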

Contributor

I'm not even sure we want to have this in OSS, since a developer could use something like https://github.com/prometheus/client_golang/blob/main/prometheus/go_collector.go#L216 to expose this data for collection.
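
For comparison, registering the stock client_golang collectors in the application exposes Go runtime and process metrics without any worker involvement. A sketch, assuming the application serves Prometheus metrics itself; the registry setup and port are illustrative:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(
		collectors.NewGoCollector(),                                       // goroutines, GC, heap, threads
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}), // CPU time, RSS, open fds
	)
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```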

@shijiesheng (Member) commented Jul 17, 2023

If we want to do it in OSS, we cannot use go_collector, which only works with Prometheus. Cadence uses tally on top of the underlying metrics package.
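
The point about tally is that the worker only sees a tally.Scope, and the application chooses the backend (Prometheus, M3, statsd, ...) when it builds the root scope. A minimal sketch using Go's runtime stats; the metric names and the test scope are illustrative:

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/uber-go/tally"
)

// emitRuntimeUsage reports a few process-level numbers through tally,
// independent of whichever reporter backs the scope.
func emitRuntimeUsage(scope tally.Scope) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	scope.Gauge("num_goroutines").Update(float64(runtime.NumGoroutine()))
	scope.Gauge("memory_heap_inuse").Update(float64(m.HeapInuse))
}

func main() {
	// NewTestScope keeps the example self-contained; a real application
	// would pass a reporter-backed scope from its metrics setup.
	scope := tally.NewTestScope("cadence_worker", nil)
	emitRuntimeUsage(scope)
	fmt.Println(scope.Snapshot().Gauges())
}
```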

@3vilhamster (Contributor)

I'm not sure I understand what you mean by worker utilization. If we track CPU/memory, then it will be dominated by workload that is unrelated to the worker itself.
What exactly do we want to achieve here?

	cpuCores, _ := cpu.Counts(false)

	scope.Gauge(metrics.NumCPUCores).Update(float64(cpuCores))
	scope.Gauge(metrics.CPUPercentage).Update(cpuPercent[0])
Member

We might still panic here if cpu.Percent(0, false) returns an error and an empty slice. We either need to handle the error or do a length check.
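
A sketch of the guarded version this comment asks for, written against the snippet above (scope and the metrics constants are assumed to be as they appear in the diff):

```go
// Skip the gauge updates when gopsutil reports an error or an empty
// slice instead of indexing cpuPercent[0] unconditionally.
cpuPercent, err := cpu.Percent(0, false)
if err == nil && len(cpuPercent) > 0 {
	scope.Gauge(metrics.CPUPercentage).Update(cpuPercent[0])
}
if cpuCores, err := cpu.Counts(false); err == nil {
	scope.Gauge(metrics.NumCPUCores).Update(float64(cpuCores))
}
```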

@@ -209,6 +215,12 @@ func (bw *baseWorker) Start() {
	bw.shutdownWG.Add(1)
	go bw.runTaskDispatcher()

	// We want the emit function to run once per host instead of once per worker
	emitOnce.Do(func() {
Member

How about moving emitOnce into bw.emitHardwareUsage?

Member Author

I have tried that, but unfortunately it results in a goroutine leak.
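
For reference, a condensed sketch of the shape the thread converges on: the once-per-host guard stays at the Start() call site, next to the WaitGroup bookkeeping. Field and method names follow the diff and discussion above; that emitHardwareUsage calls shutdownWG.Done() before returning is an assumption.

```go
// Only the first worker on the host spawns the emitter; keeping Add(1)
// beside the go statement keeps the WaitGroup balanced on shutdown.
emitOnce.Do(func() {
	bw.shutdownWG.Add(1)
	go bw.emitHardwareUsage() // assumed to call bw.shutdownWG.Done() before returning
})
```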

@timl3136 timl3136 merged commit f5eb256 into cadence-workflow:master Jul 18, 2023