@1ambda
Member

@1ambda 1ambda commented Dec 14, 2016

What is this PR for?

This PR

  • added Dockerfiles for released Zeppelin binaries (0.6.0, 0.6.1, 0.6.2)
  • refactored the Dockerfiles under scripts/docker/zeppelin-base so that the base images stay small
  • added the R and Python packages that were missing for running the tutorials properly
  • also included scripts/docker/create-dockerfile.sh to generate Dockerfiles for newly released Zeppelin binaries
├── zeppelin-base
│   └── alpine
│       ├── java
│       │   └── Dockerfile   # base image just including openjdk7 and dumb init
│       ├── python
│       │   └── Dockerfile   # base image including python related package based on alpine/java
│       └── r
│           └── Dockerfile   # base image including R related package based on alpine/java
└── zeppelin-bin-all
    ├── Dockerfile.template  # Dockerfile template for zeppelin binary images
    ├── alpine
    │   ├── 0.6.0_java
    │   │   └── Dockerfile
    │   ├── 0.6.0_python
    │   │   └── Dockerfile
    │   ├── 0.6.0_r
    │   │   └── Dockerfile
    │   ├── 0.6.1_java
    │   │   └── Dockerfile
    │   ├── 0.6.1_python
    │   │   └── Dockerfile
    │   ├── 0.6.1_r
    │   │   └── Dockerfile
    │   ├── 0.6.2_java
    │   │   └── Dockerfile
    │   ├── 0.6.2_python
    │   │   └── Dockerfile
    │   └── 0.6.2_r
    │       └── Dockerfile
    └── create-dockerfile.sh  # shell script used to create zeppelin binary images

For Reviewers

I have a few things to discuss in this PR

  1. Support other linux distributions (centos) or not: Alpine linux is designed to be a lightweight OS, so it doesn't have graphic devices. But people usually use zeppelin to render graphs and similar output, and in that case Alpine is not enough.
  2. Extract the scripts that install required packages into testing/install_external_dependencies.sh or not: Alpine linux has its own command for installing packages (apk add), so I think such a script is not reusable. Extracting scripts from the Dockerfile also causes other problems. For example, we would need an extra script just for building the python and r base images, to copy the external dependency scripts, because the Docker COPY and ADD instructions cannot reference paths outside the build context. That means we have to copy the external dependency scripts into the directory where the Dockerfile is built before building.
  3. Include git in the base images or not: I am not sure git is used in zeppelin by default. It would increase the size of the base images by about 20MB.
  4. Add the binary Zeppelin image to docker-library/official-images: I have no experience with this.
  5. We can reduce the size by removing packages such as googleVis, data.table and ramnathv/rCharts from the R images, for example. But then we might not be able to run the tutorials properly.

What type of PR is it?

[Feature]

What is the Jira issue?

ZEPPELIN-1711

How should this be tested?

Building Base Images

$ cd scripts/docker/zeppelin-base/alpine/java
$ docker build . -t zeppelin:alpine-base_java

$ cd scripts/docker/zeppelin-base/alpine/python
$ docker build . -t zeppelin:alpine-base_python

$ cd scripts/docker/zeppelin-base/alpine/r
$ docker build . -t zeppelin:alpine-base_r

Building Zeppelin Images

Make sure you have base images before building

$ docker images
REPOSITORY               TAG                   IMAGE ID            CREATED             SIZE
zeppelin                 alpine-base_java      5a24572968d6        5 hours ago         145 MB
zeppelin                 alpine-base_python    e41dcf04760b        4 hours ago         684.9 MB
zeppelin                 alpine-base_r         80c2d7aa7156        4 hours ago         538.9 MB

$ cd scripts/docker/zeppelin-bin-all/alpine/0.6.2_java
$ docker build . -t zeppelin:alpine-0.6.2_java

$ cd scripts/docker/zeppelin-bin-all/alpine/0.6.2_python
$ docker build . -t zeppelin:alpine-0.6.2_python

$ cd scripts/docker/zeppelin-bin-all/alpine/0.6.2_r
$ docker build . -t zeppelin:alpine-0.6.2_r

# build 0.6.1, 0.6.0 images too
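
Building every version and platform by hand is repetitive, so the nine builds could be scripted. The following is a minimal sketch, not part of this PR: it only prints the `docker build` commands for the directory layout shown above, and the function name `print_build_commands` is made up for illustration. It assumes it is run from the repository root and that the base images already exist.

```shell
#!/bin/sh
# Versions and platforms taken from the directory tree in this PR.
VERSIONS="0.6.0 0.6.1 0.6.2"
PLATFORMS="java python r"

# Print one docker build command per version/platform pair.
# Commands are printed, not executed, so they can be reviewed first.
print_build_commands() {
  for version in $VERSIONS; do
    for platform in $PLATFORMS; do
      dir="scripts/docker/zeppelin-bin-all/alpine/${version}_${platform}"
      echo "docker build ${dir} -t zeppelin:alpine-${version}_${platform}"
    done
  done
}

print_build_commands
```

Piping the output to `sh` would run the builds one after another.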

Running Zeppelin Images

Make sure you have zeppelin docker images before running containers

$ docker images
REPOSITORY               TAG                   IMAGE ID            CREATED             SIZE
zeppelin                 alpine-0.6.2_java     c65b2f6a1128        3 hours ago         786.4 MB
...
...

Then run the containers, replacing the version in the tag as needed (0.6.2, 0.6.1, 0.6.0)

$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_java
$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_python
$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_r

# run 0.6.1, 0.6.0 also

Testing Zeppelin Tutorials

Here are things you need to know before testing.

  • zeppelin:alpine-$TAG_java images can run Zeppelin Tutorial since they have openjdk7
  • zeppelin:alpine-$TAG_python images can run Zeppelin Tutorial: Python - matplotlib basic as well as Zeppelin Tutorial since they have python related packages
  • zeppelin:alpine-$TAG_r images can run R Tutorial as well as Zeppelin Tutorial since they have R related packages
  • zeppelin:alpine-0.6.0_$PLATFORM images will not run Zeppelin Tutorial properly; they throw errors like java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy. I am not sure whether this is a problem in zeppelin or in spark.
  • Since alpine linux doesn't have a graphical device, some functions may not work (e.g. plot in R). Some R examples produce invalid images like chunked.png
  • For the same reason, we have to change some code in Python - matplotlib basic before testing; otherwise it will throw tkinter errors in the python interpreter
// before
%python
import numpy as np
import matplotlib.pyplot as plt

// should be
%python
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

Then run each zeppelin container, open localhost:8080 in your browser, and run the tutorials, replacing the version in the tag as needed

# run: Zeppelin Tutorial
$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_java

# run: Zeppelin Tutorial, Python - matplotlib basic
# and modify the code mentioned above before running paragraphs
$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_python

# run: Zeppelin Tutorial,  R Tutorial
$ docker run -it --name zeppelin --rm -p 8080:8080 -p 7077:7077 zeppelin:alpine-0.6.2_r

# test 0.6.1, 0.6.0 containers too

Testing create-dockerfile.sh

$ scripts/docker/zeppelin-bin-all/create-dockerfile.sh -h 

USAGE: ./create-dockerfile.sh version linux platform
* version: 0.6.2 (released zeppelin binary version)
* linux: [alpine]
* platform: [java, python, r]

# For example,
# This command will create `scripts/docker/zeppelin-bin-all/alpine/0.7.0_java/Dockerfile`
$ ./create-dockerfile.sh 0.7.0 alpine java
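
To illustrate how the three arguments map to the generated file, here is a hedged sketch of the argument check and output path implied by the usage text above. It is not the actual create-dockerfile.sh: the function name `dockerfile_path` and the error messages are hypothetical, and how the real script fills in Dockerfile.template is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: validate the arguments described in the usage text
# and print the path where the Dockerfile would be created.
dockerfile_path() {
  if [ "$#" -ne 3 ]; then
    echo "USAGE: dockerfile_path version linux platform" >&2
    return 1
  fi
  version="$1"
  linux="$2"
  platform="$3"

  # Only the values listed in the usage text are accepted.
  case "$linux" in
    alpine) ;;
    *) echo "unsupported linux: $linux" >&2; return 1 ;;
  esac
  case "$platform" in
    java|python|r) ;;
    *) echo "unsupported platform: $platform" >&2; return 1 ;;
  esac

  echo "scripts/docker/zeppelin-bin-all/${linux}/${version}_${platform}/Dockerfile"
}

dockerfile_path 0.7.0 alpine java
```

Rejecting unknown values early keeps a typo in the platform name from creating a stray directory.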

Screenshots (if appropriate)

Image Sizes

zeppelin base image size added in #1538: 301.3 MB

REPOSITORY               TAG                   IMAGE ID            CREATED             SIZE
zeppelin                 alpine-base_java      5a24572968d6        5 hours ago         145 MB
zeppelin                 alpine-base_python    e41dcf04760b        4 hours ago         684.9 MB
zeppelin                 alpine-base_r         80c2d7aa7156        4 hours ago         538.9 MB
zeppelin                 alpine-0.6.0_java     85b4502fd255        2 hours ago         737.7 MB
zeppelin                 alpine-0.6.0_python   709dc585b94c        2 hours ago         1.278 GB
zeppelin                 alpine-0.6.0_r        f6fba0c66563        2 hours ago         1.131 GB
zeppelin                 alpine-0.6.1_java     ed95ca6204a9        3 hours ago         751.3 MB
zeppelin                 alpine-0.6.1_python   34ceea2aeba8        3 hours ago         1.291 GB
zeppelin                 alpine-0.6.1_r        3886596f3759        3 hours ago         1.145 GB
zeppelin                 alpine-0.6.2_java     c65b2f6a1128        3 hours ago         786.4 MB
zeppelin                 alpine-0.6.2_python   16c2c3630545        3 hours ago         1.326 GB
zeppelin                 alpine-0.6.2_r        544311ec1e6a        3 hours ago         1.18 GB

We can reduce the size by removing packages such as googleVis, data.table and ramnathv/rCharts from the R images, for example. But then we might not be able to run the tutorials properly.

Questions:

  • Does the licenses files need update? - NO
  • Is there breaking changes for older versions? - NO
  • Does this needs documentation? - YES, I updated docs/install/docker.md

@1ambda
Member Author

1ambda commented Dec 14, 2016

cc @bzz @mfelgamal @astroshim Thanks!

@1ambda 1ambda changed the title [ZEPPELIN-1711] Create Released Binary Zeppelin Docker Images [ZEPPELIN-1711] Create Docker Images for Release Zeppelin Binaries Dec 14, 2016
@1ambda 1ambda changed the title [ZEPPELIN-1711] Create Docker Images for Release Zeppelin Binaries [ZEPPELIN-1711] Create Docker Images for Released Zeppelin Binaries Dec 15, 2016
@1ambda 1ambda force-pushed the feat/docker-images branch from 4af7ceb to 8c71f6d Compare December 16, 2016 05:39
@1ambda 1ambda force-pushed the feat/docker-images branch from 8c71f6d to b7b0fa0 Compare December 17, 2016 03:26
@mfelgamal
Contributor

@1ambda awesome!


RUN echo "$LOG_TAG Cleanup" && \
apk del build_deps && \
apk del python_build_deps
Contributor

@mfelgamal mfelgamal Dec 17, 2016

I think it's better to install and delete the packages in the same layer (a single RUN instruction) so that they are never committed to the image in a separate layer, which reduces the image size. What do you think?

Member Author

@1ambda 1ambda Dec 23, 2016

@mfelgamal Thanks for review :)

  • I agree, but I think we need to keep a balance between readability and size. Many official images have multiple RUN commands (openjdk, for example). Additionally, separating layers also affects build time (productivity) while developing docker images.
  • Let me compare image sizes and add comments about it :)

Member

@felixcheung felixcheung Dec 30, 2016

From experience, it might also be useful to have multiple layers (i.e. multiple RUN commands) for reliability: these cached intermediate images are helpful as checkpoints to resume from if one of the steps fails.
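
The trade-off discussed in this thread can be checked by looking at per-layer sizes. Files added in one RUN instruction stay in that layer even if a later RUN deletes them, so a cleanup step in its own layer does not shrink the image. The sketch below wraps `docker history` for that comparison. The function name `layer_sizes` is hypothetical, and it assumes a local Docker daemon with the image already built.

```shell
#!/bin/sh
# Print the size and creating command of each layer in an image,
# newest layer first, using docker history's Go-template output.
layer_sizes() {
  docker history --format '{{.Size}}: {{.CreatedBy}}' "$1"
}

# Example (requires Docker and a previously built image):
#   layer_sizes zeppelin:alpine-base_python
```

Running it before and after merging the install and cleanup steps into one RUN shows whether the extra layer is worth the lost build cache.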

apk add --no-cache --virtual=python_build_deps \
musl-dev linux-headers gfortran \
freetype-dev py-numpy-dev@testing \
py-numpy python-dev libpng-dev libxml2-dev libxslt-dev \
Member

py-numpy is here and L30?

Member Author

I will fix it

curl --silent --location https://github.com/sgerrand/alpine-pkg-R/releases/download/${R_VERSION}/R-dev-${R_VERSION}.apk \
--output /var/cache/apk/R-dev-${R_VERSION}.apk && \
apk add --no-cache --allow-untrusted /var/cache/apk/R-dev-${R_VERSION}.apk && \
R -e "install.packages('knitr', repos='http://cran.us.r-project.org', lib='$R_LIBS')" && \
Member

you could pass a list of packages to install.packages()?

Member Author

I thought separating the install.packages statements into multiple lines would be easier to maintain. But it's ok to keep it on one line. I will fix it.
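
As a sketch of the suggestion above, `install.packages()` accepts a character vector, so a single call can replace one call per package. The helper below builds such a call from a space-separated list. The function name `r_install_expr` is hypothetical, and the package list is only illustrative (names taken from this thread), since the full list in the Dockerfile is not shown here.

```shell
#!/bin/sh
# Illustrative package list and the CRAN mirror used in the snippet above.
R_PACKAGES="knitr googleVis data.table"
CRAN_REPO="http://cran.us.r-project.org"

# Build one install.packages() expression covering every listed package.
r_install_expr() {
  quoted=""
  for pkg in $R_PACKAGES; do
    # Append 'pkg', adding a comma separator after the first entry.
    quoted="${quoted:+$quoted, }'$pkg'"
  done
  echo "install.packages(c($quoted), repos='$CRAN_REPO')"
}

r_install_expr
```

The printed expression can be passed to `R -e`, which keeps the package list on one maintainable line.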

ENV LOG_TAG="[ZEPPELIN_BASE_R]:" \
LANG=C.UTF-8 \
R_VERSION="3.3.1-r0" \
R_LIBS="/usr/local/rbin/R"
Member

@felixcheung felixcheung Dec 30, 2016

R_LIBS is optional - is there a reason you want to create/pass this, instead of using just the default location?

Member Author

I didn't know it. I will fix it.

@felixcheung
Member

My apologies if discussed before, but is there a reason for separating into Scala, Python, R individual Docker images? Don't people want to run Python and R together, for example?

Also, +1 for graphing being important for Python and R.

@jongyoul
Member

jongyoul commented Jan 8, 2017

@1ambda ping

@1ambda
Member Author

1ambda commented Jan 10, 2017

@felixcheung Thanks for review.

Previously, we discussed separating the images in ZEPPELIN-1711

IMO, having one image that includes both the R and Python related packages is better, since

  • our users usually use zeppelin with R and python, not with java only
  • it's easier to maintain

I would like to use ubuntu, which is comfortable as a desktop OS, instead of alpine

  • alpine is not designed for rendering images. It throws some errors when you draw charts using R or python.

What do you think? @felixcheung @jongyoul @bzz

@1ambda
Member Author

1ambda commented Apr 19, 2017

Let me create a new PR that includes only one OS (ubuntu), based on this PR.

asfgit pushed a commit that referenced this pull request Apr 28, 2017
### What is this PR for?

Created `Dockerfile` for released bin

- based on **Ubuntu:16.04 (LTS)** for desktop usage
- **JDK 8**
- **R** with basic packages
- **Python 2** with basic packages
- **miniconda2** for `%python.conda`

### Details

We already discussed using the alpine image in #1761.

- However, it's not designed for desktop usage
- Doesn't have some official packages (R, ...)
- Not familiar to users as a desktop OS

That is the reason why ubuntu is used as the base image

```
zeppelin                  base                b3818f9ae4b1        11 hours ago        1.67 GB
zeppelin                  0.6.2               c0a4d8556f92        7 hours ago         2.29 GB
zeppelin                  0.7.0               c4a5ad0d04bd        8 hours ago         2.5 GB
zeppelin                  0.7.1               54173b77743b        7 hours ago         2.49 GB
```

### What type of PR is it?
[Feature]

### Todos
* [x] - base image
* [x] - script for creating bin images
* [x] - bin image template

### What is the Jira issue?

[ZEPPELIN-1711](https://issues.apache.org/jira/browse/ZEPPELIN-1711)

### How should this be tested?

1. build base image `cd scripts/docker/zeppelin/base; docker build -t zeppelin:base ./`
2. build bin image `cd scripts/docker/zeppelin/0.7.1; docker build -t zeppelin:0.7.1 ./`
3. execute docker images

```
docker run -p 8080:8080 --rm --name zeppelin zeppelin:0.7.1
```

since it takes time to build, you can use already [published docker images](https://hub.docker.com/r/1ambda/docker-zeppelin/)

```
docker run -p 8080:8080 --rm --name zeppelin 1ambda/docker-zeppelin:0.7.1
```

4. should be able to run spark, python and R tutorials

### Screenshots (if appropriate)

NO

### Questions:
* Does the licenses files need update? - NO
* Is there breaking changes for older versions? - NO
* Does this needs documentation? - YES, updated

Author: 1ambda <[email protected]>

Closes #2264 from 1ambda/ZEPPELIN-1711/bin-dockerfile and squashes the following commits:

69a0b1f [1ambda] docs: Update docker.md
ced897f [1ambda] fix: DON'T remove /tmp
1f6da76 [1ambda] feat: Dockerfiles for 060, 070, 071
0fc3f75 [1ambda] feat: Add template for bin image
5cba56e [1ambda] feat: Use ubuntu for base image
@1ambda
Member Author

1ambda commented May 2, 2017

#2264 was merged.

@1ambda 1ambda closed this May 2, 2017