Skip to content

pmcrosta/datascience-r

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

datascience-r

Docker files for data science with R

This repo contains a Dockerfile for building a container with RStudio and some often-used packages. The container is based on the rocker/verse Docker container. The build procedure below assumes a docker client is installed locally. The GCP piece assumes there is a GCP project set up (e.g. $GCP_PROJECT).

Building the image

Clone the repo and run the following in the directory to build the container:

docker build -t datascience-r .

Adding to Google Container Registry

Launching GCE instance with container

  • Set environment variable to control machine size and desired rstudio password
    • $ MACHINETYPE=g1-small
    • $ PASSWORD=change_this
    • $ SERVICE_ACCOUNT=943134138766-compute@developer.gserviceaccount.com (change to yours)
  • Use gcloud SDK to launch instance (can also be done through GUI)
    • $ gcloud beta compute --project=$GCP_PROJECT instances create-with-container dsserver --zone=us-east1-c --machine-type=$MACHINETYPE --subnet=default --maintenance-policy=MIGRATE --service-account=$SERVICE_ACCOUNT --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --min-cpu-platform=Automatic --tags=rstudio,http-server,shiny-server --image=cos-stable-65-10323-75-0 --image-project=cos-cloud --boot-disk-size=100GB --boot-disk-type=pd-ssd --boot-disk-device-name=dsserver --container-image=gcr.io/$GCP_PROJECT/datascience-r:latest --container-restart-policy=always --container-privileged --container-env=ROOT=TRUE,PASSWORD=$PASSWORD
    • Currently this command requires "gcloud beta compute..."
  • Also create firewall rule to allow access to 8787
    • $ gcloud compute --project=$GCP_PROJECT firewall-rules create rstudio --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:8787 --source-ranges=0.0.0.0/0 --target-tags=rstudio
    • $ gcloud compute --project=$GCP_PROJECT firewall-rules create shiny-server --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:3838 --source-ranges=0.0.0.0/0 --target-tags=shiny-server
    • $ gcloud compute --project=$GCP_PROJECT firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
  • Access RStudio via the http://external_ip:8787
  • Start shiny server by opening a terminal in Rstudio and executing sudo su && shiny-server
  • To shut down instance
    • $ gcloud compute instances stop dsserver
  • To delete instance
    • $ gcloud compute instances delete dsserver

TODO

Thus far we have created a custom/extended/modified Docker image, uploaded it to GCP Container Registry, and launched a VM on GCE running the container. This is enough to get things going and to run some code. However, there are other items that are common in the analysis workflow, such as:

  • Streamlining Git setup to quickly clone repository code and integrate into R project
  • Mounting GCS storage over Fuse to facilitate access to persistent storage
  • Adding nginx and suitable proxies (helps with prettier URLs, avoiding explicit ports in URLs)
  • Investigate GCP Container Builder to automate build process when Github Dockerfile is modified

Helpful commands

To build a container:

$ docker build -t datascience-r .

To for a rebuild:

$ docker build --no-cache -t datascience-r .

To list images:

$ docker image ls -a

To run the container:

$ docker run -d -e ROOT=TRUE -p 8787:8787 datascience-r

To list containers:

$ docker container ls -a

To ssh into a container with ID container_id

$ docker exec -it container_id /bin/bash

To ssh into a version of the container as root:

$ docker run --rm -ti datascience-r bash

To stop the container with ID container_id

$ docker container stop container_id

To remove all stopped containers

$ docker container prune

To get a list of installed debian packages

$ dpkg -l | less

About

Docker files for data science with R on GCP

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages