Nomad Periodic job enabled=false , recommended way to gain insights into this? #24119

Closed
dmclf opened this issue Oct 3, 2024 · 4 comments
Labels: stage/waiting-reply, theme/batch (Issues related to batch jobs and scheduling), type/question

dmclf commented Oct 3, 2024

Following up on issue #19671, I'm wondering about HashiCorp Nomad's recommended way to be aware of a disabled periodic job.

As of Nomad 1.8.3, when you have a periodic job that is disabled:

  1. on the Nomad UI:
  • the job is marked as Running
  • one can only see Job.Periodic.Enabled = false by going to the job definition
    • which does not make it a 'simple' thing to check if you have dozens or hundreds of jobs
  2. on the Nomad CLI:
  • with nomad job status the job is marked as Running
  • one can check with nomad job inspect my-disabled-periodic-job | jq '.Job.Periodic.Enabled', but that is not 'convenient' to monitor on a 24x7 basis and raise alerts from
  3. in the Nomad Prometheus exported metrics (preferred):
  • no visibility at all.

So the generic question:

  • What would be the programmatic way to get proper insight into such disabled jobs?
    (as issue #19671 was closed as 'not worth doing')
    (or, would it not be cleaner in that case to deprecate the 'enabled' flag from the periodic jobspec, to avoid confusion?)

tgross added the type/question and theme/batch labels and removed type/enhancement on Oct 3, 2024

jrasell (Member) commented Oct 4, 2024

Hi @dmclf and thanks for raising this issue. I have tried to respond to each question below, with a further note on a particular line that caught my attention.

> monitor on a 24x7 basis and raise alerts

The question that comes to mind when reading this is why do you need to raise alerts on this? If periodic jobs are having their enabled flag altered (or deployed with the wrong setting) and that is cause to trigger an alert, I would lean towards tighter access control on the Nomad cluster, job specifications within source control, and CI/CD automation for job deployments.

> on the Nomad UI

The periodic block is only available when reading the job specification from Nomad and is not available in the job listing. In order to display this information on the job list page, the UI would need to list and then read every job. That is expensive in time, network, and computation, and would have to be done for every job, regardless of whether it is periodic, just to discover this fact.

> on the Nomad CLI

The API is likely the better tool for this kind of work. The following example uses curl and jq to get the periodic enabled value of each running job, printing the information in a readable manner:

$ curl -s localhost:4646/v1/jobs | jq -r '.[].ID' | while read -r jobID; do curl -s "localhost:4646/v1/job/${jobID}" | jq -r '"Job: \(.ID), Namespace: \(.Namespace),  PeriodicEnabled: \(.Periodic.Enabled)"'; done
Job: not-periodic, Namespace: default,  PeriodicEnabled: null
Job: periodic-disabled, Namespace: default,  PeriodicEnabled: false
Job: periodic-enabled, Namespace: default,  PeriodicEnabled: true
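
If you run jobs in more than one namespace, a variation along these lines should cover them by using the wildcard namespace query parameter (an untested sketch; pass your ACL token via the X-Nomad-Token header if your cluster requires one):

# Sketch only: list jobs across all namespaces, then read each one to report
# Periodic.Enabled. Assumes the API is reachable on localhost:4646.
curl -s "localhost:4646/v1/jobs?namespace=*" \
  | jq -r '.[] | "\(.Namespace) \(.ID)"' \
  | while read -r ns jobID; do
      curl -s "localhost:4646/v1/job/${jobID}?namespace=${ns}" \
        | jq -r '"Job: \(.ID), Namespace: \(.Namespace), PeriodicEnabled: \(.Periodic.Enabled)"'
    done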

> Nomad Prometheus exported metrics (preferred)

This would require the Nomad servers to emit telemetry based on static job specification parameters rather than runtime information. It raises numerous questions, adds considerable computational overhead, and would increase the cardinality of our metrics, as 100 periodic jobs would produce 100 new data points. If this is the route you want to go, I would first consider building a small Prometheus sidecar exporter which could consume the Nomad API and present the required data for scraping.
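
As one possible shape for such an exporter, a small script run on a timer could write a metrics file for node_exporter's textfile collector (a rough, untested sketch; the metric name and output path below are placeholders, not existing Nomad metrics):

#!/usr/bin/env bash
# Sketch: export one gauge per periodic job for node_exporter's textfile collector.
set -euo pipefail

NOMAD_ADDR="${NOMAD_ADDR:-http://localhost:4646}"
OUT="/var/lib/node_exporter/textfile/nomad_periodic_jobs.prom"   # illustrative path

{
  echo "# HELP nomad_periodic_job_enabled 1 if Periodic.Enabled is true, 0 if false."
  echo "# TYPE nomad_periodic_job_enabled gauge"
  curl -s "${NOMAD_ADDR}/v1/jobs?namespace=*" \
    | jq -r '.[] | "\(.Namespace) \(.ID)"' \
    | while read -r ns jobID; do
        curl -s "${NOMAD_ADDR}/v1/job/${jobID}?namespace=${ns}" \
          | jq -r 'select(.Periodic != null)
                   | "nomad_periodic_job_enabled{namespace=\"\(.Namespace)\",job_id=\"\(.ID)\"} \(if .Periodic.Enabled then 1 else 0 end)"'
      done
} > "${OUT}.tmp" && mv "${OUT}.tmp" "${OUT}"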

I would also be worried about the potential for this enhancement to creep out to other job specification parameters if we chose to include it.

dmclf (Author) commented Oct 4, 2024

Hi @jrasell

re 1: the jobs are actually fully in CI/CD, including job deployments, but that will not stop a person from possibly disabling one manually in the UI.

Which, if it happens, will go unnoticed (the job may look to be 'running' at first glance), and for infrequent jobs there won't be any children left due to GC, so it will be hard to notice.

Will likely explore options for custom API monitoring; thank you for confirming this path.

dmclf closed this as completed Oct 4, 2024

tgross (Member) commented Oct 4, 2024

Another thought: this is the sort of thing you can monitor with Sentinel, if you are on Nomad Enterprise.

dmclf (Author) commented Oct 4, 2024

OK, just to briefly comment on Nomad (and Consul/Vault) Enterprise: for our use case, the cost/benefit ratio was disproportionate.

For comparison with other players in the market, that other container orchestration platform does not seem to have issues with exposing this as a specific metric (which does not make them right, nor a standard):

| Metric name | Metric type | Description | Labels/tags | Status |
| --- | --- | --- | --- | --- |
| kube_cronjob_spec_suspend | Gauge | Suspend flag of a cronjob. | cronjob=<cronjob-name>, namespace=<cronjob-namespace> | STABLE |

Anyway, the recommended path was clarified, so case closed.
