
Design and Discussion for One-off Tasks/Jobs/Whatever #2852

Open
dperny opened this issue May 3, 2019 · 8 comments

Comments

@dperny
Collaborator

dperny commented May 3, 2019

A highly requested feature for Swarm Mode is the ability to run one-off operations of some kind. However, the scope of exactly what users need from these one-off jobs is too broad to make satisfactory progress on at the moment. Therefore, this issue is for design, discussion, and sharing of use cases, in order to pin down exactly what features users need.

The goal is to converge on a simple but powerful service mode that accomplishes one-off operations for the large majority of users while not compromising on Swarm's promises of simplicity.

In technical terms, currently, swarm Tasks can enter several different terminal states. Relevant among these is the COMPLETED state, which a Task enters if it exits with exit code zero. Therefore, it should be possible to create a new service mode (in addition to replicated or global) which has special handling for this terminal state in order to run one-off tasks. This would require a new Orchestrator component to handle this new service mode.

Some examples of open questions are below. This is a non-exhaustive list, and you should feel free to bring up anything else in this space.

  • Should one-off tasks be only one task per service, or should you be able to schedule more than one concurrent task as part of the same job?
  • What happens if a one-off task fails? Should it be rescheduled and retried? If there are multiple tasks in a job, what happens if some fail and some succeed? What happens if a task persistently fails? Should there be a failure threshold for when we stop trying?
  • Is the existing docker service CLI command adequate to express the desired behaviors, or is a new CLI command needed? What should it look like?
  • What kind of workloads do you want to run? What do you currently do to solve those use cases?
  • Is cron-style periodic scheduling support a necessary part of this for you?

The discussion in this issue will lead to a full design document for community review before we start building anything.

@mshirley

mshirley commented May 7, 2019

I come from a Python background, and what I would love to see is something similar to Celery, which has a lot of the functionality that would be nice:

https://docs.celeryproject.org/en/latest/getting-started/introduction.html

• Should one-off tasks be only one task per service, or should you be able to schedule more than one concurrent task as part of the same job?

I think that each task should be a standalone entity which is scheduled and put on the queue. If you want concurrent tasks as part of a job, perhaps you can simply tag each task with a parent job and abstract the linking of tasks to a job to a higher level.

• What happens if a one-off task fails? Should it be rescheduled and retried? If there are multiple tasks in a job, what happens if some fail and some succeed? What happens if a task persistently fails? Should there be a failure threshold for when we stop trying?

If a single task fails, there should be an option to retry within a certain configurable limit in time or count. It would be nice to have the option to configure the job as a whole to be marked as failed if more than n tasks fail, but the status of every task should be reported.

• Is the existing docker service CLI command adequate to express the desired behaviors, or is a new CLI command needed? What should it look like?

The existing CLI would be fine for me.

• What kind of workloads do you want to run? What do you currently do to solve those use cases?

Any periodic or on-demand command; basically anything you can think of, long-running and short-running.

• Cron-like jobs that need to be run asynchronously on a daily or hourly schedule.
• Submitting jobs to other systems based on programmatic input, such as reading from a Kafka queue and executing a task every time a specific message comes across. This could be used for alerting, or simply for kicking off a job in another system when a certain condition is met.

• Is cron-style periodic scheduling support a necessary part of this for you?

yes

@ohnotnow

ohnotnow commented May 7, 2019

My use-case for one-off tasks is mostly the problem of running DB migrations (or similar "update something once that the new code needs" operations). So I'd be really, really delighted by this. Especially (yes, feature creep!) if there was something we could tag like k8s "init containers".

At the moment we have a special part of our entrypoint scripts that pretty much does migrate_db(); while true; do sleep 86400; done - which always feels a bit hacky.

@olljanat
Contributor

Btw, Portainer (which IMO is the best open source management tool for swarm) recently introduced their cron-type scheduler. It works by scheduling one-off containers to run those tasks.

You can try it on their public demo instance at http://demo.portainer.io/ (log in with the username admin and the password tryportainer). Just enable "Enable host management features" from Settings and you will see the Scheduler -> Host jobs view.

I can see that one-off service support in swarmkit would allow them to improve it to schedule jobs which, for example, run once on all nodes with certain labels, etc.

@usbrandon

We wanted to use swarm as a fabric of connected hosts that could take on the various ETL and automation jobs we have. We wanted to schedule them, like we do now in cron, and be able to capture the log output. It is also important that Swarm detect which host has enough free resources to run the job/container. Successful jobs should end and clean themselves up. Failed jobs should stick around in some way so that we can study them to see what went wrong, but they should not get in the way of the next scheduled run.

@markbirbeck

I don't know if this helps with specifying this functionality, but from the implementation standpoint I split the problem in two: first, the ability to run a one-off job on a swarm, which requires it to be specified as a service, and second, the ability to schedule jobs.

I've only tackled the first part so far, for which I developed 'Docker Job':

https://github.com/markbirbeck/docker-job

"The docker-job command-line application launches an image by wrapping it in a service and running it. Options are available to determine whether to output the logs when the job has completed, run more than one replica at the same time, and so on."

@dperny
Collaborator Author

dperny commented Jul 1, 2019

I've opened a proposal for this, which is at moby/moby#39447. PTAL if you're interested in this feature, and confirm that the proposal meets your needs as a user.

@jnovack

jnovack commented Oct 27, 2019

Can you provide some examples (actual or contrived) of jobs ("community"-use, "enterprise"-use) that one would submit?

I'm just not that worldly, so my frame-of-reference is limited (I'm a low-impact swarm-user, home/hobby/small-business area; so I don't understand "enterprise-level" requirements); or would the intention be that a third-party product pick up scheduling (much like portainer provides UI management; Docker makes the backend, someone can make the front-end, like alexellis/jaas).

(1) What kind of job(s) would you want to create, on-demand, that wouldn't already be a running service sitting idle, waiting for its next call?

(2) How would you envision these jobs to be created, and separately, to be run (on-demand), if not for the scheduler? (Through the third-party front-end which uses the API?)

Is one-of-the-potentially-many use-cases to be a CI/CD task-runner? Am I thinking about this right? Some third-party scheduler additionally has a queue, and then just sends it over to swarm to execute and capture output/return status?

Help me want this. :)

Sidebar: Having the discussion here to not clog moby/moby#39447.

@cblomart

Reading the thread, it feels like too much is sometimes being asked of jobs.

When thinking of jobs I mainly think about being able to:

  • run a one-off task when requested
  • run tasks on schedules; this includes equivalents of "at" and "cron"

One of the main use cases I have is CI/CD integration:
Most of the existing tools (Jenkins/GitLab runners/Drone/...) have a way to integrate with Docker/Swarm. Most of the time they add an agent per swarm node and run containers directly on the node, so a dependent job (i.e. a build of a git repo) will happen on only one node. I think this can limit scalability and parallelism.
Swarm could be there to handle load balancing (scheduling) while the CI/CD tool still handles the job orchestration.

I don't know the inner workings of FaaS (function as a service), but I can imagine the same principle applying there.

Going from there:

  • docker logs in a swarm: access the logs of a task from the swarm without needing to address the specific node
  • docker exec in a swarm: same principle as the previous point, but for execution. Some CI/CD platforms work along the lines of: start a container and do things in it
