n0k0m3/pyspark-notebook-deltalake-docker
PySpark Notebook with DeltaLake for production

This repo tries to replicate the Databricks runtime while adding the feature-rich tooling of jupyter/docker-stacks.

Base image: rapidsai/rapidsai:22.02-cuda11.5-runtime-ubuntu20.04-py3.8

Additional packages:

Planning:

  • All additional packages from jupyter/r-notebook
  • All additional packages on top of the Databricks runtime dependency tree (10.3 ML GPU runtime)
  • xgboost and the Spark distribution of xgboost (waiting for this PR)
  • hyperopt
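Once inside the container, a SparkSession has to be configured for Delta Lake before Delta tables can be read or written. The sketch below uses the two configuration keys from the Delta Lake quickstart; it assumes the Delta jars are already on the image's classpath, as they would be in a Databricks-like runtime, and the function name `delta_session` is illustrative.

```python
# Sketch: build a Delta-enabled SparkSession inside the container.
# The two config keys below come from the Delta Lake quickstart docs.
DELTA_CONFS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

def delta_session(app_name="delta-demo"):
    """Return a SparkSession pre-configured for Delta Lake.

    pyspark is imported lazily so the config dict can be inspected
    even where pyspark is not installed.
    """
    from pyspark.sql import SparkSession  # assumes pyspark is in the image
    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONFS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With the session created, `spark.read.format("delta").load(path)` and `df.write.format("delta")` work as usual.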

Starting Docker

Generate environment variables

Check .env.template for the environment variable template, or modify and copy these lines:

echo "JUPYTER_PATH=<path-to-notebook-directory>" > .env
echo "NB_UID=`id -u`" >> .env
echo "NB_GID=`id -g`" >> .env

Get <path-to-notebook-directory> by running pwd in the notebook directory.
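Equivalently, if you run the command from the notebook directory itself, all three variables can be written in one step (a sketch, assuming a POSIX shell):

```shell
# Write JUPYTER_PATH, NB_UID and NB_GID to .env in one command.
# $(pwd) fills in the notebook directory, so run this from inside it.
printf 'JUPYTER_PATH=%s\nNB_UID=%s\nNB_GID=%s\n' \
  "$(pwd)" "$(id -u)" "$(id -g)" > .env
```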

Docker Compose

docker-compose up -d
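docker-compose picks up the .env file automatically and substitutes the variables into the repo's docker-compose.yml. As a rough orientation, a minimal compose file wired to those variables might look like the sketch below; the service name, port, mount point, and image tag are illustrative, not the repo's actual file.

```yaml
# Illustrative sketch only; see the repo's docker-compose.yml for the real file.
version: "3.8"
services:
  notebook:
    image: n0k0m3/pyspark-notebook-deltalake-docker:latest  # hypothetical tag
    user: "${NB_UID}:${NB_GID}"
    ports:
      - "8888:8888"                       # Jupyter
    volumes:
      - "${JUPYTER_PATH}:/home/jovyan/work"
```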
