JobMon is a job monitoring system, which is capable of:
- Monitoring child processes; generally, this means that JobMon can start a process for you, allow you to query its status (if it is running or not), kill the process, and receive notifications when the process is started or it dies.
- Configuring the environment that child processes run it. Currently, this
is a useful but small list of things that can be configured. These include:
- Telling the child processes what files it should hook its standard IO streams up to.
- Setting up environment variables for the child process.
- Setting the working directory of the child process.
- What signal is sent to the child process to kill it.
- Handling commands, and sending responses, over UNIX sockets. The protocol (which uses JSON) is covered later in this document.
- Dispatching events over UNIX sockets using a similar protocol to that employed by commands and responses.
- Automatically starting and restarting commands (assuming it is configured in the job's configuration).
JobMon is, conceptually, two pieces - an API (which can be used from Python code) and a command line utility. The pieces common to both (which is mostly configuration) are described first. The API is documented by the modules themselves, and will be elided here.
Configuring JobMon is done via a JSON-formatted configuration file, which may "include" any number of other JSON-formatted configuration files. The main file is called the "master", while the files it includes are called "job files". These have slightly different formats.
The master configuration can contain two sections - one which configures the JobMon supervisor itself, and another which specifies jobs. The configuration format for jobs is discussed in the next section, since that is common to both the master configuration as well as job files.
Here is an example of a master configuration:
{ "supervisor": { "working-dir": ".", "control-port": 6666, "event-port": 6667, "include-dirs": [ "jobs/*.json" ], "log-file": "$TMP/supervisor.log", "log-level": "WARNING" }, "jobs": { ... } }
Each part of the supervisor configuration is described below:
working-dir
sets the working directory for the supervisor daemon. This can be useful if the paths used in the configuration file are relative paths. The default is not to change the working directory (i.e. use.
).control-port
sets the TCP port over which JobMon accepts commands. By default, it is the port 6666.event-port
sets the TCP port over which JobMon will dispatch events. By default, it is the port 6667.include-dirs
is a list of globs, each of which should reference a list of job files to include. The default is that no files are included.log-file
is the path to the daemon's logs. Note that file is appended to, so no previous log data is lost on subsequent uses (but the file can also grow to large sizes, depending upon what is logged). The default is/dev/null
.log-level
indicates what kinds of messages are logged. The following constants are used in this field (note that this field is case-insensitive):DEBUG
prints out the most messages, and is really only useful for development.INFO
prints out messages which are generally not of interest outside of development (although fewer are printed than whenDEBUG
is used).WARN
(also,WARNING
) print out messages which indicate some kind of error in a configuration file or the code which, while not fatal, is generally worth paying attention to. This is the default logging level.ERROR
prints out serious error messages.CRITICAL
prints out messages which are extremely important.
Note that working-dir
, include-dirs
and log-file
will expand
shell variables using the traditional $NAME
syntax. Note that $$
escapes into a single $
.
Job files are a subset of the full master configuration file. The section elided in the original example is where job information would be stored, that is, inside the hash called "jobs".
When using a jobs file (that is included by the master), the top-level "jobs" can be elided. A sample jobs file follows:
{ "true-job": { "command": "/bin/true", "stdin": "$DEVNULL", "stdout": "$DEVNULL", "stderr": "$DEVNULL", "env": { "HOME": "/home/bob", }, "working-dir": "/home/bob", "signal": "SIGSTOP", "autostart": false, "restart: true } }
"true-job"
is the name of the job. These names can include any character, but must be globally unique (that is, neither the master nor any other files included by the master can use this same name).The
command
option (which is mandatory) indicates the command to launch. Note that this command can use syntax supported by/bin/sh
.stdin
,stdout
, andstderr
give a filename which is hooked up to the named standard IO stream. Each of these has a default of/dev/null
. Note thatstdout
andstderr
are appended to, not cleared.env
is a hash which gives a set of environment variables to pass to the child process, and their values. Note that all of the daemon's environment variables are passed in as well, in addition to these variables, but the variables in the configuration file take precedence.working-dir
sets the working directory of the child - the default is.
signal
sets the signal that is sent to the child process when it is stopped. The values allowed in this (case-insensitive) field can be found by runningkill -l
on your system - however, the precedingSIG
is required. The default signal isSIGTERM
.autostart
dictates whether or not the job should be started automatically by the daemon (the default is that the job is not started automatically).restart
dictates whether or not the job will be restarted when it crashes.Note that this is subject to an important restriction - if the job dies within 5 seconds of it dying previously, then JobMon will force the job to wait for 15 seconds before it can be restarted again. This is meant to prevent the job from burning CPU cycles by restarting repeatedly. The 5 and 15 second parameters cannot currently be changed.
Note that the stdin
, stdout
, stderr
, and working-dir
fields do
environment substitution in the same way as in the supervisor configuration
discussed above.
The command line tool, called jobmon
, is designed to give a convenient
interface to the capabilities of JobMon. The tool's internal documentation
can be viewed by calling jobmon help
.
The first thing to remember about the command line tool is the special
environment variable called $JOBMON_PORT
. This variable must be
set if you are using any subcommand which is not help
or daemon
; this
is because it is used to store the command and event sockets:
# When starting the daemon... $ export JOBMON_PORT=`jobmon daemon CONFIG`
As a general rule, note that any command (other than status
) will return
0 on success and nonzero on failure (and will also print a message on
standard error). status
is special in this regard - if it encounters an
error, it returns a negative status code; if the job that it queries is
running, the it returns a 0, while if the job it queries is stopped, it
returns a positive status code.
jobmon list-jobs
and jobmon listen
share a common output format. For
example, consider a JobMon instance with two jobs, Job A which is running and
Job B which is stopped. jobmon list-jobs
should print:
RUNNING Job A STOPPED Job B
Let's say that Job A was started, then Job B was started, and then Job B
stopped. jobmon listen
might produce the following event stream:
RUNNING Job A RUNNNIG Job B STOPPED Job B
Finally, the jobmon wait
command will wait until the given job has
changed status. To find out what the status is afterwords, run
jobmon status
, since jobmon wait
does not print out anything.
Simply run python3 setup.py install
to install this package. Note that
Python 3 is required (I have not tested this on any version but 3.4, and
thus this code probably requires Python >=3.4).
JobMon is currently tested, although not completely (and the tests could probably be a bit neater too). To run a test, use the unittest module's automatic test discovery:
$ python3 -m unittest
Written by Adam Marchetti <[email protected]>, and released under the 2-clause BSD license.
The file jobmon/daemon.py
was written by David Mytton <[email protected]>
and released under a Creative Commons BY-SA 3.0 license. Modifications were
made by Adam Marchetti <[email protected]>. The original version can
be found at the link provided in the source file itself.