Nomad Periodic job enabled=false , recommended way to gain insights into this? #24119

Closed
dmclf opened this issue Oct 3, 2024 · 4 comments
Labels: stage/waiting-reply, theme/batch (Issues related to batch jobs and scheduling), type/question

dmclf commented Oct 3, 2024

Following up on issue #19671, I'm wondering about HashiCorp Nomad's recommended way to be aware of a disabled periodic job.

As of Nomad 1.8.3, when you have a periodic job that is disabled:

  1. on the Nomad UI:
  • the job is marked as Running
  • one can only see Job.Periodic.Enabled = false by going to the job definition
    • which does not make it a 'simple' thing to check if you have dozens or hundreds of jobs
  2. on the Nomad CLI:
  • with nomad job status the job is marked as Running
  • one can check with nomad job inspect my-disabled-periodic-job | jq '.Job.Periodic.Enabled', but that is not 'convenient' to monitor on a 24x7 basis and raise alerts from
  3. in the Nomad Prometheus exported metrics (preferred):
  • no visibility at all.

So the generic question:

  • What would be the programmatic way to get proper insight into such disabled jobs?
    (as issue #19671 was closed as 'not worth doing')
    (or, would it not be cleaner in that case to deprecate the 'enabled' flag from the periodic jobspec, to avoid confusion?)

tgross added the type/question and theme/batch labels and removed type/enhancement on Oct 3, 2024

jrasell (Member) commented Oct 4, 2024

Hi @dmclf and thanks for raising this issue. I have tried to respond to each question below, with a further note on a particular line that caught my attention.

> monitor on a 24x7 basis and raise alerts

The question that comes to mind when reading this is why do you need to raise alerts on this? If periodic jobs are having their enabled flag altered (or deployed with the wrong setting) and that is cause to trigger an alert, I would lean towards tighter access control on the Nomad cluster, job specifications within source control, and CI/CD automation for job deployments.

> on the Nomad UI

The periodic block is only available when reading the job specification from Nomad and is not available in the job listing. In order to display this information on the job list page, the UI would need to list and then read every job. That is expensive in time, network, and computation, and would have to be done for every job, regardless of whether it is periodic, just to discover this fact.

> on the Nomad CLI

The API is likely the better tool for this kind of work. The following example uses curl and jq to get the periodic enabled value of each running job, printing the information in a readable manner:

$ curl -s localhost:4646/v1/jobs | jq -r '.[].ID' | while read -r jobID; do curl -s "localhost:4646/v1/job/${jobID}" | jq -r '"Job: \(.ID), Namespace: \(.Namespace),  PeriodicEnabled: \(.Periodic.Enabled)"'; done
Job: not-periodic, Namespace: default,  PeriodicEnabled: null
Job: periodic-disabled, Namespace: default,  PeriodicEnabled: false
Job: periodic-enabled, Namespace: default,  PeriodicEnabled: true
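
If you run jobs in more than one namespace, a variation along these lines should cover them by using the wildcard namespace query parameter (an untested sketch; pass your ACL token via the X-Nomad-Token header if your cluster requires one):

# Sketch only: list jobs across all namespaces, then read each one to report
# Periodic.Enabled. Assumes the API is reachable on localhost:4646.
curl -s "localhost:4646/v1/jobs?namespace=*" \
  | jq -r '.[] | "\(.Namespace) \(.ID)"' \
  | while read -r ns jobID; do
      curl -s "localhost:4646/v1/job/${jobID}?namespace=${ns}" \
        | jq -r '"Job: \(.ID), Namespace: \(.Namespace), PeriodicEnabled: \(.Periodic.Enabled)"'
    done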

> Nomad Prometheus exported metrics (preferred)

This would require the Nomad servers to emit telemetry based on static job specification parameters rather than runtime information. It raises numerous questions, adds considerable computational overhead, and would increase the cardinality of our metrics, as 100 periodic jobs would produce 100 new data points. If this is the route you want to go, I would first consider building a small Prometheus sidecar exporter which could consume the Nomad API and present the required data for scraping.
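
As one possible shape for such an exporter, a small script run on a timer could write a metrics file for node_exporter's textfile collector (a rough, untested sketch; the metric name and output path below are placeholders, not existing Nomad metrics):

#!/usr/bin/env bash
# Sketch: export one gauge per periodic job for node_exporter's textfile collector.
set -euo pipefail

NOMAD_ADDR="${NOMAD_ADDR:-http://localhost:4646}"
OUT="/var/lib/node_exporter/textfile/nomad_periodic_jobs.prom"   # illustrative path

{
  echo "# HELP nomad_periodic_job_enabled 1 if Periodic.Enabled is true, 0 if false."
  echo "# TYPE nomad_periodic_job_enabled gauge"
  curl -s "${NOMAD_ADDR}/v1/jobs?namespace=*" \
    | jq -r '.[] | "\(.Namespace) \(.ID)"' \
    | while read -r ns jobID; do
        curl -s "${NOMAD_ADDR}/v1/job/${jobID}?namespace=${ns}" \
          | jq -r 'select(.Periodic != null)
                   | "nomad_periodic_job_enabled{namespace=\"\(.Namespace)\",job_id=\"\(.ID)\"} \(if .Periodic.Enabled then 1 else 0 end)"'
      done
} > "${OUT}.tmp" && mv "${OUT}.tmp" "${OUT}"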

I would also be worried about the potential for this enhancement to creep out to other job specification parameters if we chose to include it.

dmclf (Author) commented Oct 4, 2024

Hi @jrasell

re 1: the jobs are actually fully in CI/CD, including job deployments, but that will not stop a person from possibly disabling one manually in the UI.

Which, if it happens, will go unnoticed (the job may look to be 'running' at first glance), and for infrequent jobs there won't be any children left due to GC, so it will be hard to notice.

Will likely explore options for custom API monitoring; thank you for confirming this path.

dmclf closed this as completed Oct 4, 2024

tgross (Member) commented Oct 4, 2024

Another thought: this is the sort of thing you can monitor with Sentinel, if you are on Nomad Enterprise.

dmclf (Author) commented Oct 4, 2024

OK, just to briefly comment on Nomad (and Consul/Vault) Enterprise: for our use case, the cost/benefit ratio was disproportionate.

For comparison with other players in the market, that other container orchestration platform does not seem to have issues with exposing this as a specific metric (which does not make them right, nor a standard):

| Metric name | Metric type | Description | Labels/tags | Status |
| --- | --- | --- | --- | --- |
| kube_cronjob_spec_suspend | Gauge | Suspend flag of a cronjob. | cronjob=<cronjob-name>, namespace=<cronjob-namespace> | STABLE |

Anyway, the recommended path was clarified, so case closed.
