Proposal: Device Support #2682

Open
dperny opened this issue Jul 2, 2018 · 56 comments

@dperny
Collaborator

dperny commented Jul 2, 2018

This is a rough overview of a proposed design for device support in Swarm. This is a possible implementation of #1244. The objective is to implement, in a way sensible to the cluster, support for devices. Please note that this is not yet on the road map; this is an early-stage proposal.

For community members, even if you don't or haven't contributed directly to swarmkit:

Does this meet or exceed the community's needs for device support? Is the UI flexible, ergonomic, and easy to use? Feel free to leave a comment explaining what is good and bad about this proposal.

Overview

Devices will be added as a first-class feature of swarm. The user will be able to define device classes, to which devices belong. The user will be able to register devices on specific nodes, indicating which class the device belongs to and the path at which it is located. The user can then specify the device classes that a task needs in order to execute, and the swarmkit scheduler will assign a device to the task and place the task on the node with that device.

Goals

The goal of this proposal is to implement the most basic device-aware scheduling system that lets swarmkit fully support devices in a clustered environment.

Non-Goals

Non-goals of this proposal are to support things like security profiles or permissions. Additionally, though the device management workflow presented in this proposal is a bit onerous and requires manual registration of devices, implementing automatic device detection and registration is out of scope.

Detailed Design

Data Model

The basic data model of devices is as follows:

  1. Device Classes represent a set of interchangeable and equivalent devices equally suited for scheduling. All devices belong to exactly one device class.
  2. Individual devices will be registered belonging to a class on a per-node basis. Once registered, a task may be assigned to use them.
  3. Task Specs will be updated to include the desired device classes and attachment options.

Devices are host-specific resources, but different devices on the same or different hosts may possibly be treated as interchangeable or equivalent. For example, many nodes in the cluster could possibly be attached to some GPU. Though the actual GPU on different nodes may be different, and there may even be more than one GPU per node, their functionality is equivalent, and any of these nodes is an equally suitable candidate for scheduling. Further, some devices should only be used by one task in the cluster, whereas others can be shared between as many tasks as needed.

Device classes are the object that represents the top-level concept of a device. Tasks can only specify devices in terms of device classes they desire. The specific device chosen is the prerogative of the swarmkit scheduler.

The individual devices available are a property of the node. A node may have as many devices specified as necessary. In keeping with the security pattern of not trusting workers, devices are always registered through the swarmkit manager, never self-reported or self-discovered.

Task Specs will include a list of device classes and options desired, including where in the task’s file system to place the device. Tasks must be prepared to accept any device in the class as equivalent. When a task is created, it will have the full run-time device parameter included in the object.
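
The data model above can be sketched in a few lines of Python. This is only an illustration, not swarmkit code: the names DeviceClass, Device, Node, DeviceRequest and nodes_offering are assumptions chosen to mirror the prose, and the device paths are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DeviceClass:
    """A set of interchangeable devices (hypothetical model, not swarmkit's)."""
    name: str
    shared: bool = False  # whether many tasks may use one device of this class


@dataclass(frozen=True)
class Device:
    """One device registered on a node; it belongs to exactly one class."""
    device_class: str
    path_on_host: str


@dataclass
class Node:
    """A node and the devices that were registered on it through a manager."""
    name: str
    devices: List[Device] = field(default_factory=list)


@dataclass(frozen=True)
class DeviceRequest:
    """What a task spec asks for: a class and a path inside the task."""
    device_class: str
    path_in_task: str


def nodes_offering(nodes: List[Node], request: DeviceRequest) -> List[str]:
    """Return the names of nodes with at least one device of the class."""
    return [
        n.name
        for n in nodes
        if any(d.device_class == request.device_class for d in n.devices)
    ]


gpu = DeviceClass(name="gpu")
nodes = [
    Node("node-a", [Device("gpu", "/dev/nvidia0"), Device("gpu", "/dev/nvidia1")]),
    Node("node-b", [Device("gpu", "/dev/dri/card0")]),
    Node("node-c", []),
]
request = DeviceRequest(device_class=gpu.name, path_in_task="/dev/gpu0")
candidates = nodes_offering(nodes, request)
```

The task names only the class, so node-a and node-b are equally valid candidates even though their GPUs live at different host paths.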

User Interface

Adding devices will introduce a new command and subcommands for the management of devices. The first command, and the biggest change, will be to add new subcommands to manage device classes:

Usage: docker swarm device COMMAND

Manage Swarm devices

Commands:
  add      Add a new device class to the swarm
  ls       List device classes on this swarm
  inspect  Show information about a device class and its devices
  rm       Remove a device class from the swarm

The add command adds a new device class to the swarm:

Usage: docker swarm device add [OPTIONS] CLASS

Add a new device class to the swarm

Options:
     --shared      Allow this class to be shared between tasks
     --label list  Set metadata on this device class

The ls command will allow listing all available device classes:

Usage: docker swarm device ls [OPTIONS]

List device classes on this swarm

Options:
  -q, --quiet   Only display IDs
  -f, --filter  Filter output based on conditions provided

The inspect command will allow showing full information about device classes, as well as allowing the user to include all devices currently registered belonging to a device class.

Usage: docker swarm device inspect [OPTIONS] CLASS [CLASS...]

Display detailed information on one or more device classes

Options:
  -f, --format string   Format the output
      --pretty          Print the information in a human friendly format
      --devices         Include devices belonging to this class

The remove command is similar to all other rm commands, and its usage is obvious, with the caveat that removal of a device class will be disallowed if a device in that class is in use by a task. There is no update command, as device classes will not be treated as updateable.

To manage particular devices on nodes, the existing node update command will receive new flags:

--device-add device  Register a device on a node with the swarm
--device-rm device   Deregister a device with the swarm

Similar to other options like ports and volumes, devices will accept both short- and long-form versions.

The short form will take the format target:class, where target is the path of the device on the host and class is the device class to register with. For example:

--device-add /dev/nvidia0:gpu

The long form of the command allows specifying these options independently, and allows future expansion of options for devices (such as host-specific cgroup options):

--device-add target="/dev/nvidia0",class="gpu"
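
As a rough sketch of how the two forms could be parsed into the same structure (assuming Python; parse_node_device is a hypothetical helper and not part of the docker CLI, and the rule that any value containing "=" is long form is my assumption):

```python
def parse_node_device(value: str) -> dict:
    """Parse a proposed --device-add value in short or long form."""
    if "=" in value:
        # Long form: comma-separated key=value pairs, values optionally quoted.
        fields = {}
        for pair in value.split(","):
            key, _, raw = pair.partition("=")
            fields[key.strip()] = raw.strip().strip('"')
    else:
        # Short form: target:class. Split on the last colon.
        target, sep, device_class = value.rpartition(":")
        if not sep:
            raise ValueError("expected target:class, got %r" % value)
        fields = {"target": target, "class": device_class}

    # Both forms must yield a non-empty target and class.
    for key in ("target", "class"):
        if not fields.get(key):
            raise ValueError("missing %r in %r" % (key, value))
    return fields


short = parse_node_device("/dev/nvidia0:gpu")
long_form = parse_node_device('target="/dev/nvidia0",class="gpu"')
```

Both calls produce the same dictionary, which is the point of offering two spellings; unknown keys in the long form are kept so that later options could be added without changing the parser.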

The device rm option for node update acts as expected, but will disallow removing a device that is in use.

Services would also support new flags. Service create will have a new option, --device, with both a short form and a long form. The short form will be the reciprocal of the --device-add flag on the node, taking the form class:path. It will also optionally support a third rwm field, mirroring the --device flag on docker run. The long form will take discrete arguments, and allow the user to specify cgroup options as supported in th

The short form, for mounting a GPU:

--device gpu:/dev/nvidia0

Services would also support a long form of the command:

--device class="gpu",path="/dev/nvidia0"

Note: the long form of the command could possibly support further cgroup options, as allowed in the docker REST API for container creation.

Service update would include --device-add and --device-rm flags. --device-add syntax will be equivalent to the --device flag of create. Because a task may have more than one device of a class mounted into its running container, --device-rm would require both the class and path of the device to disambiguate the specific device that is to be removed.

--device-rm class="gpu",path="/dev/nvidia0"
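
To illustrate why the class alone is not enough for removal, here is a small Python sketch. Attachment and remove_attachment are hypothetical names, and the service is assumed to have two devices of the gpu class mounted at different paths.

```python
from typing import List, NamedTuple


class Attachment(NamedTuple):
    """One device entry on a service: a class plus a path in the task."""
    device_class: str
    path: str


def remove_attachment(
    attachments: List[Attachment], device_class: str, path: str
) -> List[Attachment]:
    """Remove exactly one entry, identified by both class and path."""
    target = Attachment(device_class, path)
    if target not in attachments:
        raise ValueError("no such device attachment: %s at %s" % (device_class, path))
    return [a for a in attachments if a != target]


attachments = [
    Attachment("gpu", "/dev/nvidia0"),
    Attachment("gpu", "/dev/nvidia1"),
]
remaining = remove_attachment(attachments, "gpu", "/dev/nvidia0")
```

Matching on the class only would hit both entries here, so the path is what makes the request unambiguous.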

REST API

The Docker engine REST API would require a new set of endpoints to accommodate the concept of device classes. These endpoints would return the JSON representation of the objects described in the example Protocol Buffers. These endpoints would be as follows:

GET    /devices             List device classes
POST   /devices/create      Create a new device class
GET    /devices/{id}        Inspect a device class
POST   /devices/{id}/update Update a device class
DELETE /devices/{id}        Delete a device class
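
A client for these endpoints might build requests along the following lines. This is a sketch only: the endpoints are proposed and do not exist in any Docker release, the JSON field names (Name, Shared, Labels) are assumptions modelled on the style of the existing API, and BASE_URL is a placeholder. The requests are constructed but never sent.

```python
import json
import urllib.request

# Placeholder address. A real daemon usually listens on a Unix socket, and
# the /devices endpoints below are only proposed, not implemented.
BASE_URL = "http://localhost:2375"


def create_device_class_request(name, shared=False, labels=None):
    """Build (but do not send) a POST /devices/create request."""
    body = {"Name": name, "Shared": shared, "Labels": labels or {}}
    return urllib.request.Request(
        BASE_URL + "/devices/create",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_device_class_request(class_id):
    """Build (but do not send) a DELETE /devices/{id} request."""
    return urllib.request.Request(BASE_URL + "/devices/" + class_id, method="DELETE")


create = create_device_class_request("gpu", shared=False)
delete = delete_device_class_request("abc123")
```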

Protocol Buffers

In swarm, protocol buffers define the internal API and object structure.

The DeviceClass proto will form a new top-level type, like a Network or a Service. It will have an ID and a name.

// DeviceClass is a specification for a particular device, zero or more of
// which may be available on the cluster. It refers to the general class of
// devices that the user wishes to be assumed as interchangeably usable. For
// example, a cluster may have many possible block devices on many nodes, but
// any of them are valid. The specific implementation of a specific device on a
// node is provided by the node. A particular device may only belong to one
// device class.
message DeviceClass {
  string id = 1;

  Meta meta = 2 [(gogoproto.nullable) = false];

  // Shared represents whether this device can be shared between many tasks, or
  // whether it should be uniquely mapped to a particular task. Shared devices
  // may have any number of tasks assigned to them.
  //
  // Note that Shared has strong security risks; shared devices may be used by
  // tasks to communicate with one another.
  bool shared = 3;
}

The Device proto is included as a repeated field on Node specs. It defines a particular available device belonging to a class.

// Device represents a particular available device on a node. It is one
// particular instance of a DeviceClass, and is interchangeable with other
// devices in the DeviceClass
message Device {
  // DeviceClassID is the ID of the device class that this device belongs to.
  string device_class_id = 1;

  // PathOnHost is the path in the host's filesystem that this device should be
  // mounted from.  For example, a block device may have this value as
  // "/dev/sda". A particular device may belong to only 1 device class;
  // assigning a device to more than one class may cause it to be conflictingly
  // scheduled.
  string path_on_host = 2;
}

The DeviceAttachmentSpec is a repeated field found in the TaskSpec proto, and defines the devices that a task should be attached to.

// DeviceAttachmentSpec represents the spec for a device attachment
message DeviceAttachmentSpec {
  // DeviceClass is the ID or name of the device class that is to be used for
  // this spec. The actual device may be any device of this class on any node.
  string device_class = 1;

  // Path represents the path in the task's filesystem that this device should
  // be mounted at.
  string path = 2;
 
  // DeviceCgroupRules represents the cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}

The DeviceAttachment is a repeated field on Tasks which defines specifically the run-time parameters of a device attachment for a particular task.

// DeviceAttachment represents the run-time configuration of a device in use.
// It includes both the path on the host and the path in the Task of the
// device, because a Task may have many devices of the same class reserved, and
// those reservations would be otherwise indistinguishable.
message DeviceAttachment {
  // DeviceClassID is the ID of the device class used for this device
  string device_class_id = 1;

  // PathOnHost is the path on the host's filesystem of the device to be used
  // by the task.
  string path_on_host = 2;

  // PathInTask is the path in the task's filesystem that the device will be
  // mounted at.
  string path_in_task = 3;

  // DeviceCgroupRules represents the Cgroup rules that should be applied to
  // this device.
  repeated string device_cgroup_rules = 16;
}

Swarmkit Implementation

The device allocator will be implemented as a sub-component of the Scheduler. It will be created when a scheduler is created, and keep track of the available devices in the cluster. Scheduling for available devices forms part of the constraint-solving portion of the scheduler.
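
A minimal sketch of what such an allocator could track, written in Python for brevity (swarmkit itself is written in Go). DeviceAllocator and its methods are hypothetical names, and the first-fit node order is an assumption; the real scheduler would combine this with its other constraints.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


class DeviceAllocator:
    """Tracks registered devices and which are in use (illustrative only)."""

    def __init__(self, shared_classes: Iterable[str] = ()) -> None:
        self.shared_classes: Set[str] = set(shared_classes)
        # node name -> list of (device class, path on host)
        self.devices: Dict[str, List[Tuple[str, str]]] = {}
        # (node name, path on host) pairs held by a task
        self.in_use: Set[Tuple[str, str]] = set()

    def register(self, node: str, device_class: str, path: str) -> None:
        self.devices.setdefault(node, []).append((device_class, path))

    def allocate(self, device_class: str) -> Optional[Tuple[str, str]]:
        """Return (node, path) for a usable device, or None if none is free."""
        for node, devices in self.devices.items():
            for cls, path in devices:
                if cls != device_class:
                    continue
                if cls in self.shared_classes:
                    return node, path  # shared devices are never exhausted
                if (node, path) not in self.in_use:
                    self.in_use.add((node, path))
                    return node, path
        return None

    def release(self, node: str, path: str) -> None:
        self.in_use.discard((node, path))


allocator = DeviceAllocator()
allocator.register("node-a", "gpu", "/dev/nvidia0")
allocator.register("node-b", "gpu", "/dev/nvidia0")
first = allocator.allocate("gpu")
second = allocator.allocate("gpu")
third = allocator.allocate("gpu")  # both exclusive devices are taken
```

Choosing a device also chooses the node, which is why the allocation fits naturally into the constraint-solving step rather than running after placement.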

Task updates present a difficulty for devices. If devices in the class can be shared between tasks (marked --shared), then there is no problem. However, the start-first update strategy would fail unless at least one device in the class were free, so that the new task could start with a fresh device before the old task shuts down and frees its in-use device. There is no easy solution for this, I think. We should instead document thoroughly that using start-first with devices may cause trouble.
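
The start-first problem can be shown with a small counting model. The function below is a hypothetical illustration, assuming every task holds exactly one exclusive (non-shared) device of the class.

```python
def can_replace_task(total_devices: int, running_tasks: int, order: str) -> bool:
    """Can one running task be replaced under the given update order?

    Assumes each task holds exactly one exclusive device of the class.
    """
    free = total_devices - running_tasks
    if order == "stop-first":
        # The old task stops first, so its device is free for the new task.
        return True
    if order == "start-first":
        # The new task needs a device while the old one still holds its own.
        return free >= 1
    raise ValueError("unknown update order: %r" % order)


fully_used = can_replace_task(total_devices=2, running_tasks=2, order="start-first")
one_spare = can_replace_task(total_devices=3, running_tasks=2, order="start-first")
stop_first = can_replace_task(total_devices=2, running_tasks=2, order="stop-first")
```

With two devices and two tasks, start-first cannot make progress, while a single spare device or the stop-first order is enough to let the update proceed.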

Error Handling

Because of the nature of distrusting the workers, it is difficult or impossible for swarm to “prove” that a given device exists on a node, or performs as the user expects. Swarm will therefore make no attempts to verify the correctness of provided user data. If a device is mistakenly assigned to the wrong class, or if it does not exist at all, the task is expected to fail to start. It should enter a terminal state of FAILED and should include an error message explaining that the errant device is at fault.
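
On the worker side, the expected behaviour could look roughly like this. It is a sketch under assumptions: try_start is a hypothetical function, the existence check stands in for whatever the container runtime would actually report, and the device path in the example is deliberately bogus.

```python
import os
from typing import List, Tuple


def try_start(device_paths: List[str]) -> Tuple[str, str]:
    """Return (state, message) for a task that needs the given host devices."""
    for path in device_paths:
        if not os.path.exists(path):
            # Terminal failure; the message names the errant device.
            return "FAILED", "device %s does not exist on this node" % path
    return "RUNNING", ""


state, message = try_start(["/dev/this-device-does-not-exist"])
```

The manager never verifies the registration up front; the mistake only surfaces here, when the task tries to start.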

Notably, in this proposal, there will be no attempt to “downweight” or otherwise attempt to avoid a node with a failing device. This functionality may come later, but not as part of this proposal.

Security

It must be understood that once on the host, swarmkit has no control over how a task uses devices. If improperly used, devices can be an extreme security hole for swarm tasks. For example, mounting block devices may allow read or write access to all of their contents. If the host’s primary block device were mounted into a task, that task could have full access to the host filesystem.

About Generic Resources

Swarmkit currently includes a feature called “Generic Resources”, which serves to allow scheduling based on kinds of resources. The design doc for Generic Resources [2] outlines their use, which overlaps with the use case of this proposal. Specifically, Generic Resource already keeps track of resources which are available and in use on a cluster.

However, GenericResource has a notable deficiency: it lacks context about the runtime usage of a particular reserved resource. Essentially, a task is only informed of a resource at runtime, and the swarmkit worker has no way to know how to make use of a particular resource, which makes the feature quite useless.

The obvious solution would be to include in the TaskSpec instructions for how to make use of a resource. However, this puts the information about how to use a resource separate from the information about what resource is required. A TaskSpec might, for example, request in its ResourceReservations 3 GPUs, but in its ContainerSpec in a hypothetical Devices field, only use 2 of them, leaving 1 wasted. Or, alternatively, a TaskSpec might include instructions for mounting an audio device, but not include a reservation for one. This means that run-time checks would be needed to make sure that the requested resources match the runtime instructions for using resources. Instead, this proposal uses the type system to make this kind of mismatch impossible to express.
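
The difference between the two approaches can be sketched as follows. The Python names are hypothetical; only DeviceAttachmentSpec echoes a message from this proposal. The first function is the kind of run-time check the split model would need, and the second shows the reservation being derived from the attachment specs so that the two cannot disagree.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


def mismatches(reserved: Dict[str, int], mounted: List[str]) -> Dict[str, int]:
    """Split model: compare reservations with mounts, kind by kind.

    Returns kind -> (reserved - mounted) for every kind where they differ.
    """
    used = Counter(mounted)
    kinds = set(reserved) | set(used)
    return {
        k: reserved.get(k, 0) - used.get(k, 0)
        for k in kinds
        if reserved.get(k, 0) != used.get(k, 0)
    }


@dataclass(frozen=True)
class DeviceAttachmentSpec:
    """Combined model: asking for a device and saying where it goes are one entry."""
    device_class: str
    path: str


def reservations(specs: List[DeviceAttachmentSpec]) -> Counter:
    """The reservation is derived from the specs, never stated separately."""
    return Counter(s.device_class for s in specs)


# 3 GPUs reserved but 2 mounted, plus an audio device with no reservation.
problems = mismatches({"gpu": 3}, ["gpu", "gpu", "audio"])

specs = [
    DeviceAttachmentSpec("gpu", "/dev/nvidia0"),
    DeviceAttachmentSpec("gpu", "/dev/nvidia1"),
]
derived = reservations(specs)
```

In the split model the mismatch has to be detected after the fact; in the combined model there is no separate reservation that could drift out of sync.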

Additionally, we cannot simply annotate or augment the GenericResource type in the task resource reservations, because the same type is shared between the TaskSpec (requested resources), the Task itself (assigned resources), and the Node (available resources). The same type is used to express which resources are available, which resources are assigned, and which resources are requested. However, these types all serve different purposes. Available resources don’t need to be aware of how they should be used by a task and requested resources can’t be aware of what resource will be assigned. This means that fields on the GenericResource would either mean different things in different places, or there would only be a subset of fields in use on any given object.

[2] https://github.com/docker/swarmkit/blob/de950a7/design/generic_resources.md

@connormcmk

While I cannot claim to know anything about the implementation or protocols, I can say that this is a desperately needed feature for any sort of IoT development for which current solutions (however clever) are insufficient. +1 due to that.

The user interface that's proposed also seems fairly intuitive. My question is, would this then support docker-compose files?

@dperny
Collaborator Author

dperny commented Jul 2, 2018

I don't have a design for compose support, but I imagine it would be straightforward. You would just include devices in a service definition, like you do networks or ports. Something like this (very rough, not part of the proposal):

version: '3'
services:
  iot:
    ports:
     - "5000:5000"
    volumes:
     - .:/datastore
    devices:
    - target: sensor
      path: /dev/sensor

The only open question is whether a compose file should also be able to define device classes and devices per node. That's a better question for the compose team, after we've passed this phase of design.

@connormcmk

@dperny I like the plan, would be great to see this!

@apollo13

apollo13 commented Jul 7, 2018

@dperny This would cover our needs for using hardware security modules in containers. I cannot find anything wrong in the proposal.

@dperny
Collaborator Author

dperny commented Jul 9, 2018

I'm... kind of a doofus? And totally forgot that swarm supports Generic Resource constraints, design doc here: https://github.com/docker/swarmkit/blob/master/design/generic_resources.md

This work, which everyone seems to have forgotten even happened, handles the difficulty of managing which resources are in use on which nodes and by which tasks, which is the more complicated part of this proposal.

However, there is a big problem with the generic resources: the resource availability is decoupled at the data model from the way the resource is used. Essentially, you can keep track of which and how many resources a node has, but not how to actually make use of those resources. This is an explicit non-goal of the Generic Resource design. Quote,

As swarmkit is not responsible for exposing the resources to the container (or acquiring them),
it needs a way to communicate how many generic resources were assigned (in the case of
discrete resources) or / and what resources were selected (in the case of sets).

The reference implementation of the executor exposes the resource value to
software running in containers through environment variables.
The exposed environment variable is prefixed with DOCKER_RESOURCE_ and it's key
uppercased.

This implies that tasks should be responsible for requisitioning their own resources at run time. However, this is impossible for devices. A task, from within a container, cannot attach devices after it has started. So the task has an awareness of what resources are available to it, but no actual way to make use of them. This basically explains why nobody uses this feature; the only way to do so would be to create tasks mounting the docker socket that spawn new containers.
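
For reference, this is roughly all a task can do with a generic resource from inside its container, per the quoted design doc. A sketch only: resource_env_name and assigned_resource are hypothetical helpers, and the value is returned as an opaque string because its exact format is not described in the quote above.

```python
import os
from typing import Mapping, Optional


def resource_env_name(kind: str) -> str:
    """Variable name per the quoted doc: DOCKER_RESOURCE_ prefix, key uppercased."""
    return "DOCKER_RESOURCE_" + kind.upper()


def assigned_resource(
    kind: str, environ: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the raw value the executor exposed for this kind, if any."""
    env = os.environ if environ is None else environ
    return env.get(resource_env_name(kind))


# Simulated container environment; the value here is made up.
fake_env = {"DOCKER_RESOURCE_GPU": "example-value"}
value = assigned_resource("gpu", fake_env)
```

Knowing the value does not help with devices: nothing in the container can turn that string into a device node that was not attached when the container was created.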

The executor will have to be aware of how devices are accessed for devices to work. The responsibility for putting those devices into the task will have to live entirely within the agent.

I'll need to rewrite this proposal to accommodate this existing GenericResource feature, so we don't have two overlapping features with different but similar purposes.

@dperny
Collaborator Author

dperny commented Jul 9, 2018

I'm poking at how to leverage the existing GenericResource code, and it's honestly not that sensible. The use case is too different. The amount of mogrification to the GenericResource concept that one would have to do is untenable.

Honestly... GenericResource isn't a super sensible implementation anyway. It totally decouples a task's resource demands from the actual use of resources, which is a serious problem. If a Task reserves a resource, but does not have any way to use it, the resource is wasted. However, if a Task specifies how to use a resource, but no such reservation was made, then the Task will fail in strange ways.

I think, despite the slight duplication of efforts, the use case for actually using devices is sufficiently different to warrant a separate design.

@dperny
Collaborator Author

dperny commented Jul 9, 2018

Updated the design document to include section on GenericResource

@mbonato

mbonato commented Jul 12, 2018

@dperny I would love to see this implemented! This would allow us to properly use hardware security modules (HSM), which are required by our application, in swarm mode.

@connormcmk

@dperny Any update on progress for those of us who are eagerly waiting?

@dperny
Collaborator Author

dperny commented Aug 22, 2018

Yes, I'm gonna do it, I just keep getting pulled away on other things internally. But it's gonna happen. Soon™.

@connormcmk

connormcmk commented Aug 22, 2018 via email

dperny added a commit to dperny/swarmkit-1 that referenced this issue Aug 24, 2018
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in moby#2682.

Signed-off-by: Drew Erny <[email protected]>
dperny added a commit to dperny/swarmkit-1 that referenced this issue Aug 27, 2018
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in moby#2682.

Signed-off-by: Drew Erny <[email protected]>
dperny added a commit to dperny/swarmkit-1 that referenced this issue Aug 27, 2018
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in moby#2682.

Signed-off-by: Drew Erny <[email protected]>
@connormcmk

connormcmk commented Sep 20, 2018

@dperny Any updates or timeline? Thanks!

@swift1911

Is there any progress on this issue?

dperny added a commit to dperny/swarmkit-1 that referenced this issue Dec 17, 2018
Adds the protocol buffers needed to support allocation of devices in
swarmkit. Part of the proposal in moby#2682.

Signed-off-by: Drew Erny <[email protected]>
@flopon

flopon commented Feb 13, 2019

Seems not ^^

@dperny
Collaborator Author

dperny commented Feb 13, 2019

i had a bunch of free time for a little while, and then it rapidly became not a bunch of free time, and now i'm doing other things. i'm really sorry, i started promising this a year ago and i feel The Guilt over not delivering on it.

@apollo13

@dperny Please do not feel any guilt. No matter how snarky the comments from people like @flopon are (and I am sure he didn't mean to put any pressure on you), without throwing loads of money towards you there is no right to expect any progress.

Please do not ever feel bad for not delivering on a ticket on an (mostly) OSS project. Your work is highly appreciated and please do not let any comments get your motivation down!

@dperny
Collaborator Author

dperny commented Feb 13, 2019

i mean, i am having loads of money thrown at me, it's just being thrown at me to work on other features.

@joekrom

joekrom commented Aug 7, 2020

Is the devices functionality already available in swarm mode? If not, how could I hack it? I would like to access dev/video on a PC and pin a USB port on the Raspberry Pi.

@n1nj4888

@dperny - Any update on this? Really hoping this is on the roadmap for swarm?

@TeoTN

TeoTN commented Dec 29, 2020

I understand you may not want to hear yet another "any updates?" from me... Yet, with this issue running for years now, it would really be nice to hear something like "yea, we will deliver it in Q1 2021" or whatever is the plan for it or "no, we're not implementing this, swarm is not for you if you need to access devices". It'd really help making informed decisions.

@TeoTN

TeoTN commented Apr 23, 2021

Okay, now I realize the magnitude of the problem, the project is apparently silently deprecated, looking at the support. I'm gonna start migrating to Kubernetes and would advise others to consider doing the same.

@prologic

I really don't think the project is deprecated at all

@johny-mnemonic

@prologic What makes you think so? 😲

@prologic

@prologic What makes you think so? 😲

Oh I dunno, the fact that there were 5 fixes/enhancements in Docker Engine two versions ago in 20.10.5? 🤷‍♂️ Some of which were long awaited feature requests?

@johny-mnemonic

@prologic I admit, that I didn't expect docker engine getting updates in 2021, but still according to changelog link you sent and also according to Github there is no change for docker engine in 20.10.5.
Anyway, I am happy to be proven wrong that Docker is dead. So there is still some hope. That's always good to hear.

@prologic

@prologic I admit, that I didn't expect docker engine getting updates in 2021, but still according to changelog link you sent and also according to Github there is no change for docker engine in 20.10.5.
Anyway, I am happy to be proven wrong that Docker is dead. So there is still some hope. That's always good to hear.

Just because it doesn't see any new hot 'n shiny new features doesn't mean it's dead 😀 -- Just because Kubernetes is getting all the attention also doesn't mean Docker is dead or somehow worth less. In fact AFAIK Kubernetes still uses Docker as the default container engine anyway. 🤷‍♂️

@johny-mnemonic

Well, Docker is now considered deprecated in Kubernetes and will switch to unsupported in one of the next versions this year.
The Kubernetes clusters I work with switched from Docker almost two years ago and with the recent deprecation of it as a container engine and rumors about it being sold and killed I thought the development is close to zero...

I am using Swarm on my cluster of Raspberry Pis at home as I love its simplicity, and it is also not a resource hog as Kubernetes is for these small SBCs.
Issues like this one are driving me in K3s way though...

@prologic

I dunno @johny-mnemonic I think the reason for Docker's deprecation in Kubernetes is more "political" than "technical". But what would I know 🤷‍♂️ 😂

OTOH Docker is Open Source Software. Anyone is welcome to contribute to it. There is absolutely no reason Docker as a software, platform, set of libraries (whatever) should die -- But this happens all the time in open-source. I guess it's human nature? 🤔

@johny-mnemonic

Yeah, I don't know either.
My bet is that, since containerd was taken out of Docker, there is no reason why Kubernetes would even want to use Docker. The only thing it needs is an engine capable of running containers, and that's containerd. Most of the other features Docker has are not of much use to Kubernetes.

@Obskuro

Obskuro commented Jun 7, 2021

There are no updates on this issue? @dperny the last message from you in this thread is over a year ago. maybe something has changed?

@dperny
Collaborator Author

dperny commented Jun 7, 2021

Volume support wrapping up soon, this is probably next on the list (but might not be).

@neben

neben commented Jun 11, 2021

Volume support wrapping up soon, this is probably next on the list (but might not be).

@dperny Where can I follow the volume support development?

@se7enXF

se7enXF commented Jan 28, 2022

This issue was updated six months ago, could you tell me is the problem solved now please? I urgently need to mount the device in swarm mode.

@djmaze

djmaze commented Jan 28, 2022

@se7enXF If you need it urgently, you can use a workaround and run your container from a wrapper service: https://serverfault.com/a/1089792

@se7enXF

se7enXF commented Jan 29, 2022

@djmaze Thank you for your advice. In reality, I need to use compose-file to start the service, so the method you propose is not applicable.

@djmaze

djmaze commented Jan 29, 2022

@se7enXF And I guess this doesn't count as using a compose file?

version: "3.7"

services:
  app:
    image: docker
    command: docker run --rm --device <DEVICE> <IMAGE> <ARGS>
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

@allfro

allfro commented Nov 17, 2022

I've created a docker plugin for this but unfortunately I do not know how to make it work for cgroup v2 yet. If you are using a cgroup v1 OS like alpine to run docker swarm then this plugin may work for you https://github.com/allfro/device-volume-driver. Help is appreciated if anyone knows how to manipulate cgroup v2 EBPF device controller programs.

@zikaeroh

I sent a chain of PRs which implement something like this at #3106 (see also #1244 (comment)).

But, I didn't go so far as to actually do everything in this proposal, instead opting for the plain plumbing which is done for other options. I personally feel like this is fine.

@Hobbit44

I would also like just a basic implementation to start. Just something that I can work with. I accept the risks associated with that.

To me it's better to have a simple version and let others guide the focus of development than go fully featured in one go. You may find that half of the implementation goes unused.

@yodog

yodog commented Dec 17, 2023

everybody ready for this christmas present :)

@demsking

...
