Application Planning

As you begin to contemplate containerizing your application, there are a number of factors to consider before authoring a Dockerfile. You will want to plan everything from how to start the application, to network considerations, to making sure your image is architected in a way that can run in multiple environments like Atomic Host or OpenShift.

The very act of containerizing any application presents a few hurdles around things that are perhaps taken for granted in a traditional Linux environment. The following sections highlight these hurdles and offer solutions that are typical in a containerized environment.

Security and user requirements

Host and image synchronization

Some applications that run in containers require the host and container to be more or less synchronized on certain attributes so that their behaviors are also similar. One common such attribute is time. The following sections discuss best practices for keeping those attributes similar or the same.

Time

Consider a case where multiple containers (and the host) are running applications and logging to something like a log server. The log timestamps and information would be almost entirely useless if each container reported a different time than the host.

The best way to synchronize the time between a container and its host is through the use of bind mounts: simply bind mount the host’s /etc/localtime to the container’s /etc/localtime. The ro flag ensures the container can’t modify the host’s time zone by default.

In your Dockerfile, this can be accomplished by adding the following to your RUN label:

Synchronizing the timezone of the host to the container.
-v /etc/localtime:/etc/localtime:ro
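
For example, a complete atomic RUN label carrying this bind mount might look like the following sketch; the NAME and IMAGE values here are placeholders for your own container and image names:

A hypothetical RUN label including the time zone bind mount
LABEL RUN="docker run -d --name NAME -v /etc/localtime:/etc/localtime:ro IMAGE"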

Machine ID

The /etc/machine-id file is often used as an identifier for things like applications and logging. Depending on your application, it might be beneficial to also bind mount the machine ID of the host into the container. For example, in many cases journald relies on the machine ID for identification, and the sosreport application also uses it. To bind mount the machine ID of the host into the container, specify the following when running your container, or add it to your atomic RUN label so atomic can utilize it:

Synchronizing the host machine ID with a container
-v /etc/machine-id:/etc/machine-id:ro

Starting your applications within a container

At some point in the design of your Dockerfile and image, you will need to determine how to start your application. There are three prevalent methods for starting applications:

  • Call the application binary directly

  • Call a script that results in your binary starting

  • Use systemd to start the application

For the most part, there is no single right answer; however, there are some softer decision points that might help you decide which method would be easiest for you as the Dockerfile owner.

Calling the Binary Directly

If your application is not service-oriented, calling the binary directly might be the simplest and most straightforward method to start a container. There is no memory overhead and no additional packages are needed (like systemd and its dependencies). However, setting environment variables is more difficult with this method.
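
As a minimal sketch, assuming a hypothetical /usr/bin/myapp binary and configuration path, the Dockerfile would end with a CMD in exec form:

Calling the application binary directly from the Dockerfile
CMD ["/usr/bin/myapp", "--config", "/etc/myapp/app.conf"]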

Using a Script

Using a special script to start your application in a container can be a handy way to deal with slightly more complex applications. One upside is that it is generally trivial to set environment variables. This method is also good when you need to call more than a single binary to start the application correctly. One downside is that you now have to maintain the script and ensure it is always present in the image.
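
A minimal sketch of such a wrapper script, again assuming a hypothetical /usr/bin/myapp binary:

A hypothetical start script copied into the image and used as the CMD
#!/bin/sh
# Set the environment the application expects, then hand control to the binary.
export MYAPP_ENV="${MYAPP_ENV:-production}"
exec /usr/bin/myapp --config /etc/myapp/app.conf

Using exec ensures the binary replaces the shell as PID 1 and receives signals directly.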

Use systemd

Using systemd to start your application is a great choice if your application is service-oriented (like httpd). It benefits from the well-tested unit files generally delivered with the applications themselves, which can make complex applications that require environment variables easy to work with. One disadvantage is that systemd will increase the size of your image, and systemd itself uses a small amount of memory.

Note
As of docker-1.10, the docker run parameter of --privileged is no longer needed to use systemd within a container.

Implementing systemd in your Dockerfile is fairly simple.
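
A minimal sketch, assuming a systemd-capable CentOS/RHEL base image (the base image name below is a placeholder) and httpd as the service:

Starting httpd through systemd in a Dockerfile
FROM registry.example.com/rhel7
RUN yum -y install httpd && yum clean all && systemctl enable httpd
EXPOSE 80
CMD [ "/usr/sbin/init" ]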

Network Considerations

Single Host

Multi Host

Networking in OpenShift

Establishing network connections between containers in OpenShift differs from the standard Docker container-linking approach. OpenShift provides a built-in DNS so that services can be reached by their service DNS names and service IP addresses. In other words, an application running in a container can connect to another container using the service name. For example, if a container running a MySQL database is exposed as a service endpoint named database, then database is the hostname other containers use to connect to it. In addition to the DNS record, environment variables containing the IP address of the service are provided to every container running in the same project as the service. However, if the IP address (environment variable) changes, you will need to redeploy the container, so using service names is recommended.
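
For example, another container in the same project could reach the MySQL service described above using nothing more than the service name as the hostname; the credentials and database name here are placeholders:

Connecting to a service by its DNS name
mysql -h database -u user -p db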

For details, see OpenShift Documentation.

Storage Considerations

When you architect your container image, storage can certainly be a critical consideration. The power of containers is that they can mimic, replicate, or replace all kinds of applications, and that is exactly why you must carefully consider how you deal with storage needs.

By nature, the storage for containers is ephemeral. This makes sense, because one of the attractions of containers is that they can be easily created, deleted, replicated, and so on. If no consideration is given to storage, the container will only have access to its own filesystem. This means that if the container is deleted, any information it holds, whether logs or data, will be lost. For some applications, this is perfectly acceptable, if not preferred.

However, if your application generates important data that should be retained or perhaps could be shared amongst multiple containers, you will need to ensure that this storage is set up for the user.

Persistent Storage for Containers: Data Volumes

Docker defines persistent storage in two ways.

  1. Data volumes

  2. Data volume containers

At present, however, data volumes are emerging as the preferred storage option for users of Docker. The Docker website defines a data volume as "a specially-designated directory within one or more containers that bypasses the Union File System." Data volumes have the distinct advantage that they can be shared and reused by one or more containers, and a data volume persists even if its associated container is deleted.

Data volumes must be explicitly created and should preferably be given a meaningful name. You can manually create a data volume with the docker volume create command.

Creating a data volume for persistent storage
$ docker volume create <volume_name>

Note
You can also specify a driver name with the -d option.

Using Data Volumes in a Dockerfile

For developers whose applications require persistent storage, the trick is instantiating the data volume prior to running the image. This can be achieved by leveraging LABEL metadata and applications like atomic.

We recommend creating the data volume through the use of the INSTALL label. The INSTALL label identifies a script that should be run prior to ever running the image. In that install script, something like the following can be used to create the data volume.

Creating a data volume in your install script
chroot /host /usr/bin/docker volume create <volume_name>
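
Put together, a minimal install script might look like this sketch; the volume name myapp-data is a placeholder:

A hypothetical install script run via the INSTALL label
#!/bin/sh
# The install container runs with the host root mounted at /host, so chroot
# lets us drive the host's docker binary to create the named volume.
chroot /host /usr/bin/docker volume create myapp-data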

To then use the data volume, the RUN label would need to use the bind mount feature. Adding the following to your RUN label would bind mount the data volume by name:

Adding a data volume by name into your RUN label
-v <data_volume_name>:/<mount_path_inside_container>

Persistent Storage for Containers: Mounting a Directory from the Host

You can also leverage the host filesystem for persistent storage through the use of bind mounts. The basic idea is to use a directory on the host filesystem that is bind mounted into the container at runtime. This can be done simply by adding a bind mount to your RUN label:

Bind mounting a directory from the rootfs to a running container
-v /<path_on_the_rootfs>:/<mount_path_inside_container>
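
For instance, a hypothetical invocation of the image might mount a host directory like so (both paths and the image name are placeholders):

Bind mounting a host directory at run time
docker run -d -v /srv/myapp/data:/var/lib/myapp myimage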

One downside to this approach is that anyone with privileges to that directory on the host will be able to view and possibly alter the content.

OpenShift Persistent Storage

Storage Backends for Persistent Storage

Logging

If your application logs actions, errors, and warnings to some sort of log mechanism, you will want to consider how to allow users to obtain, review, and possibly retain those logs. The flexibility of a container environment can, however, present some challenges when it comes to logging, because your containers are typically separated by namespace and cannot leverage the system logging without some explicit action by the users. There are several solutions for logging containers, such as:

  • using a logging service like rsyslog or fluentd

  • setting the docker daemon’s log driver

  • logging to a file shared with the host (bind mounting)

As a developer, if your application uses logging of some kind, you should be thinking about how you will handle your log files. Each of the aforementioned solutions has its pros and cons.

Using a Logging Service

Most traditional Linux systems use a logging service like rsyslog to collect and store log files. Often the logging service coordinates with journald, but it too is a service and will accept log input.

If your application uses a logger and you want to take advantage of the host’s logger, you can bind mount /dev/log between the host and container as part of the RUN label like so:

-v /dev/log:/dev/log

Depending on the host distribution, log messages will now be in the host’s journald and subsequently in /var/log/messages, assuming the host is using something like rsyslog.
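
A quick way to verify the wiring, as a sketch (the myapp tag is arbitrary):

Testing the /dev/log bind mount
logger -t myapp "hello from the container"   # inside the container
journalctl -t myapp                          # on the host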

Setting the Log Driver for the docker Daemon

Docker has the ability to configure a logging driver for the daemon. When implemented, it affects all containers on the system, which might impact containers other than yours. This method is therefore only useful when you can ensure that the final runtime environment runs your application alone.
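
For example, to have the daemon route all container logs to journald, you could set the driver in the daemon configuration; journald is one of several available drivers:

Setting a system-wide log driver in /etc/docker/daemon.json
{
  "log-driver": "journald"
}

A per-container alternative is the --log-driver option to docker run, which avoids the system-wide effect.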

Using Shared Storage with the Host

The use of persistent storage can be another effective way to deal with log files, whether you choose a simple bind mount with the host or data volumes. Like using a logging service, it has the advantage that the logs can be preserved regardless of the state of the container. Shared storage also reduces the potential to chew up filesystem space assigned to the container itself. You can bind mount either a file or a directory between host and container using the -v flag in your RUN label.

-v <host_dir|file>:<image_dir|file>

Logging in OpenShift

OpenShift automatically collects logs of image builds and of processes running inside containers. The recommended way to log for containers running in OpenShift is to send the logs to standard output or standard error rather than storing them in a file. This way, OpenShift can capture the logs and output them directly in the console, as seen in the picture below, or on the command line (oc logs).

[Figure: container logs displayed in the OpenShift console]

For details on log aggregation, see the OpenShift documentation.

Security and User Considerations

Passing Credentials and Secrets

Storing sensitive data like credentials is a hot topic, especially because Docker does not provide a designated option for storing and passing secret values to containers.

Currently, a very popular way to pass credentials and secrets to a container is to specify them as environment variables at container runtime. As a consumer of such an image, you don’t expose any of your sensitive data publicly in the image itself.

However, this approach also has caveats:

  • If you commit such a container and push your changes to a registry, the final image will also contain all of your sensitive data, publicly.

  • Processes inside your container and other containers linked to your container might be able to access this information. Similarly, everything you pass as an environment variable is accessible from the host machine using docker inspect, as seen in the mysql example below.

# docker run -d --name mysql_database -e MYSQL_USER=user -e MYSQL_PASSWORD=password -e MYSQL_DATABASE=db -p 3306:3306 openshift/mysql-56-centos
# docker inspect mysql_database

<snip>

"Env": [
            "MYSQL_USER=user",
            "MYSQL_PASSWORD=password",
            "MYSQL_DATABASE=db",

<snip>

There are other ways to store secrets, and although using environment variables might lead to leaking private data in certain corner cases, it still belongs among the safest workarounds available.

There are a couple of things to keep in mind when operating with secrets:

  • For obvious reasons, you should avoid using default passwords - users tend to forget to change the default configuration, and if a known password leaks, it can be easily misused.

  • Although squashing removes intermediate layers from the final image, secrets from those layers will still be present in the build cache.

Handling Secrets in Kubernetes

Containers running through Kubernetes can take advantage of the secret resource type to store sensitive data such as passwords or tokens. Kubernetes uses tmpfs volumes for storing secrets. To learn how to create and access these, refer to the Kubernetes User Guide.
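
As a brief sketch, where the secret name and key are placeholders:

Creating a secret from a literal value
kubectl create secret generic db-credentials --from-literal=password=changeme

Pods can then consume the secret as a tmpfs-backed volume mount or as environment variables.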

Other Projects Facilitating Secret Management

Custodia

Custodia is an open-source project that defines an API for storing and sharing secrets such as passwords and certificates in a way that keeps data secure, manageable, and auditable. Custodia uses the HTTP protocol and a RESTful API as an IPC mechanism over a local Unix socket. It is fully modular, and users can control how authentication, authorization, and API plugins are combined and exposed. You can learn more at the project’s github repository or wiki page.

Vault

Another open-source project that aims to handle secure accessing and storing of secrets is Vault. Detailed information about the tool and use cases can be found on the Vault project website.

User Namespace Mapping (docker-1.10 feature)

Preventing Your Image Users from Filling the Filesystem

Most default docker deployments set aside only about 20GB of storage for each container. For many applications, that amount of storage is more than enough. But if your containerized application produces significant log output, or your deployment scenario restarts containers infrequently, filesystem space can become a concern.

The first step to preventing the container filesystem from filling up is to make sure your images are small and concise, which immediately reduces how much space your image consumes from the filesystem. Beyond that, as a developer you can use the following techniques to manage the container filesystem size.

Ask for a Larger Storage Space on Run

One solution to dealing with filesystem space could be to increase the amount of storage allocated to the container when your image is run. This can be achieved with the following switch to docker run.

Increase the container storage space to 60GB
--storage-opt size=60G

If you are using a defined RUN label in your Dockerfile, you could add the switch to that label. However, overuse of this switch could irk users, so be prudent in using it.
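
As a usage sketch (the image name is a placeholder; this option requires a storage driver that supports per-container quotas, such as devicemapper, or overlay2 backed by xfs):

Running an image with a 60GB container filesystem
docker run -d --storage-opt size=60G myimage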

Space Considerations for Logging

Logging can sometimes unknowingly consume disk space, particularly when a service or daemon has failed. If your application performs logging, or more specifically verbose logging, consider approaches such as those described in the Logging section above to help keep filesystem usage down.

Deployment Considerations

When preparing applications for production distribution and deployment, you must carefully consider the supported deployment platforms. Production services require high uptime, injection of private or sensitive data, storage integration, and configuration control. The deployment platform determines methods for load balancing, scheduling, and upgrading. A platform that does not provide these services requires additional work when developing the container packaging.

Platform

Lifecycle

Maintenance

Build infrastructure