diff --git a/README.md b/README.md
index 759a24ac2..6f94e70c8 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,6 @@
 [![CircleCI](https://circleci.com/gh/projectglow/glow.svg?style=svg&circle-token=7511f70b2c810a18e88b5c537b0410e82db8617d)](https://circleci.com/gh/projectglow/glow)
+[![Documentation
+Status](https://readthedocs.org/projects/glow/badge/?version=latest)](https://glow.readthedocs.io/en/latest/?badge=latest)
 
 # Building and Testing
 This project is built using sbt: https://www.scala-sbt.org/1.0/docs/Setup.html
diff --git a/docs/source/additional-resources.rst b/docs/source/additional-resources.rst
new file mode 100644
index 000000000..c632877a0
--- /dev/null
+++ b/docs/source/additional-resources.rst
@@ -0,0 +1,15 @@
+Additional Resources
+====================
+
+Blog posts
+----------
+
+- `Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers
+  `_
+- `Parallelizing SAIGE Across Hundreds of Cores `_
+
+  + Parallelize SAIGE using Glow and the Pipe Transformer
+
+- `Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL `_
+
+  + Joint genotyping with Glow and Databricks
diff --git a/docs/source/conf.py b/docs/source/conf.py
index d2f4dfc5f..f1b6b3cc6 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -107,6 +107,10 @@
 # so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ['_static']
 
+html_logo = '../../static/glow_logo_horiz_color_dark_bg.png'
+
+html_favicon = '../../static/favicon.ico'
+
 # Custom sidebar templates, must be a dictionary that maps document names
 # to template names.
 #
diff --git a/docs/source/etl/index.rst b/docs/source/etl/index.rst
index a3a265ab3..7ab93b8eb 100644
--- a/docs/source/etl/index.rst
+++ b/docs/source/etl/index.rst
@@ -10,9 +10,9 @@ Glow offers functionalities to perform genomic variant data ETL, manipulation, a
 .. toctree::
   :maxdepth: 2
 
-  variant-data.rst
-  vcf2delta.rst
-  variant-qc.rst
-  sample-qc.rst
-  lift-over.rst
-  utility-functions.rst
+  variant-data
+  vcf2delta
+  variant-qc
+  sample-qc
+  lift-over
+  utility-functions
diff --git a/docs/source/getting-started.rst b/docs/source/getting-started.rst
new file mode 100644
index 000000000..7135c506d
--- /dev/null
+++ b/docs/source/getting-started.rst
@@ -0,0 +1,58 @@
+Getting Started
+===============
+
+Running Locally
+---------------
+
+Glow requires Apache Spark 2.4.2 or above. If you don't have a local Apache Spark installation,
+you can install it from PyPI:
+
+.. code-block:: sh
+
+  pip install pyspark==2.4.2
+
+or `download a specific distribution `_.
+
+Install the Python frontend from pip:
+
+.. code-block:: sh
+
+  pip install glow.py
+
+and then start the `Spark shell `_
+with the Glow maven package:
+
+.. code-block:: sh
+
+  ./bin/pyspark --packages io.projectglow:glow_2.11:0.1.0
+
+To start a Jupyter notebook instead of a shell:
+
+.. code-block:: sh
+
+  PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow_2.11:0.1.0
+
+And now your notebook is glowing! To access the Glow functions, you need to register them with the
+Spark session.
+
+.. code-block:: python
+
+  import glow
+  glow.register(spark)
+  df = spark.read.format('vcf').load('my_first.vcf')
+
+Running in the cloud
+--------------------
+
+The easiest way to use Glow in the cloud is with the `Databricks Runtime for Genomics
+`_. However, it works with any cloud
+provider or Spark distribution. You need to install the maven package
+``io.projectglow:glow_2.11:${version}`` and optionally the Python frontend ``glow.py``.
+
+Example notebook
+----------------
+
+This notebook demonstrates some of the key functionality of Glow, like reading in a genomic dataset,
+saving it as a `Delta Lake `_, and performing a genome-wide association study.
+
+.. notebook:: _static/notebooks/tertiary/gwas.html
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 0d18947a9..1303c961f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,11 +1,17 @@
 Glow
 ====
 
-Glow is an open-source genomic data analysis tool using `Apache Spark `__.
+Glow is an open-source toolkit for working with genomic data at biobank-scale and beyond. The
+toolkit is natively built on Apache Spark, the leading unified engine for big data processing and
+machine learning, enabling the scale of the cloud for genomics workflows.
 
 .. toctree::
   :maxdepth: 2
 
+  introduction
+  getting-started
   etl/index
   tertiary/index
+  additional-resources
+
 .. modules
diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst
new file mode 100644
index 000000000..a5550094e
--- /dev/null
+++ b/docs/source/introduction.rst
@@ -0,0 +1,23 @@
+Introduction to Glow
+====================
+
+Glow aims to simplify genomic workflows at scale. The best way to accomplish this goal is to take a system that
+has already been proven to work and adapt it to fit into the genomics ecosystem.
+
+Apache Spark and in particular `Spark SQL `_, its module for working with
+structured data, is used at organizations across industries with datasets at the petabyte scale and
+beyond. Glow smooths the rough edges so that you can be productive immediately.
+
+Glow features:
+
+- Genomic datasources: To read datasets in common file formats like VCF and BGEN into Spark SQL DataFrames.
+- Genomic functions: Common operations like computing quality control statistics, running regression
+  tests, and performing simple transformations are provided as Spark SQL functions that can be
+  called from Python, SQL, Scala, or R.
+- Data preparation building blocks: Glow includes transformations like variant normalization and
+  lift over to help produce analysis ready datasets.
+- Integration with existing tools: With Spark SQL, you can write user-defined functions (UDFs) in
+  Python, R, or Scala. Glow also makes it easy to run DataFrames through command line tools.
+- Integration with other data types: Genomic data can generate additional insights when joined with data sets
+  such as electronic health records, real world evidence, and medical images. Since Glow returns native Spark
+  SQL DataFrames, it's simple to join multiple data sets together.
diff --git a/static/favicon.ico b/static/favicon.ico
new file mode 100644
index 000000000..dff1072c5
Binary files /dev/null and b/static/favicon.ico differ
diff --git a/static/glow_logo_horiz_color.png b/static/glow_logo_horiz_color.png
new file mode 100644
index 000000000..961215b80
Binary files /dev/null and b/static/glow_logo_horiz_color.png differ
diff --git a/static/glow_logo_horiz_color_dark_bg.png b/static/glow_logo_horiz_color_dark_bg.png
new file mode 100644
index 000000000..091f92bfa
Binary files /dev/null and b/static/glow_logo_horiz_color_dark_bg.png differ