2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
[![CircleCI](https://circleci.com/gh/projectglow/glow.svg?style=svg&circle-token=7511f70b2c810a18e88b5c537b0410e82db8617d)](https://circleci.com/gh/projectglow/glow)
[![Documentation
Status](https://readthedocs.org/projects/glow/badge/?version=latest)](https://glow.readthedocs.io/en/latest/?badge=latest)

# Building and Testing
This project is built using sbt: https://www.scala-sbt.org/1.0/docs/Setup.html
Expand Down
15 changes: 15 additions & 0 deletions docs/source/additional-resources.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
Additional Resources
====================

Blog posts
----------

- `Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers
<https://databricks.com/blog/2019/06/26/scaling-genomic-workflows-with-spark-sql-bgen-and-vcf-readers.html>`_
- `Parallelizing SAIGE Across Hundreds of Cores <https://databricks.com/blog/2019/10/02/parallelizing-saige-across-hundreds-of-cores.html>`_

  Parallelize SAIGE using Glow and the Pipe Transformer

- `Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL <https://databricks.com/blog/2019/06/19/accurately-building-genomic-cohorts-at-scale-with-delta-lake-and-spark-sql.html>`_

  Joint genotyping with Glow and Databricks
4 changes: 4 additions & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,10 @@
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

html_logo = '../../static/glow_logo_horiz_color_dark_bg.png'

html_favicon = '../../static/favicon.ico'

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
Expand Down
12 changes: 6 additions & 6 deletions docs/source/etl/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,9 @@ Glow offers functionalities to perform genomic variant data ETL, manipulation, a
.. toctree::
:maxdepth: 2

variant-data.rst
vcf2delta.rst
variant-qc.rst
sample-qc.rst
lift-over.rst
utility-functions.rst
variant-data
vcf2delta
variant-qc
sample-qc
lift-over
utility-functions
58 changes: 58 additions & 0 deletions docs/source/getting-started.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
Getting Started
===============

Running Locally
---------------

Glow requires Apache Spark 2.4.2 or above. If you don't have a local Apache Spark installation,
you can install it from PyPI:

.. code-block:: sh

pip install pyspark==2.4.2

or `download a specific distribution <https://spark.apache.org/downloads.html>`_.

Install the Python frontend from PyPI:

.. code-block:: sh

pip install glow.py

and then start the `Spark shell <http://spark.apache.org/docs/latest/rdd-programming-guide.html#using-the-shell>`_
with the Glow maven package:

.. code-block:: sh

./bin/pyspark --packages io.projectglow:glow_2.11:0.1.0

To start a Jupyter notebook instead of a shell:

.. code-block:: sh

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark --packages io.projectglow:glow_2.11:0.1.0

And now your notebook is glowing! To access the Glow functions, you need to register them with the
Spark session.

.. code-block:: python

import glow
glow.register(spark)
df = spark.read.format('vcf').load('my_first.vcf')
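
Once registered, Glow's functions are available through Spark SQL, and the loaded DataFrame can be
queried like any other. A minimal follow-up sketch (the column names below are assumptions based on
Glow's VCF schema; verify with ``printSchema`` on your own DataFrame):

.. code-block:: python

    # Inspect the schema produced by the VCF reader
    df.printSchema()

    # Query the variants with plain Spark SQL (contigName, start, and
    # referenceAllele are assumed column names from Glow's VCF schema)
    df.createOrReplaceTempView('variants')
    spark.sql('SELECT contigName, start, referenceAllele FROM variants LIMIT 5').show()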

Running in the cloud
--------------------

The easiest way to use Glow in the cloud is with the `Databricks Runtime for Genomics
<https://docs.databricks.com/runtime/genomicsruntime.html>`_. However, it works with any cloud
provider or Spark distribution. You need to install the maven package
``io.projectglow:glow_2.11:${version}`` and optionally the Python frontend ``glow.py``.
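
For example, on a generic Spark distribution you can attach the package when submitting a job. A
sketch, assuming a hypothetical application script ``my_gwas.py`` and the 0.1.0 release:

.. code-block:: sh

    # Resolve the Glow artifact from Maven Central at launch time
    spark-submit --packages io.projectglow:glow_2.11:0.1.0 my_gwas.py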

Example notebook
----------------

This notebook demonstrates some of the key functionality of Glow, like reading in a genomic dataset,
saving it in `Delta Lake <https://delta.io>`_ format, and performing a genome-wide association study.

.. notebook:: _static/notebooks/tertiary/gwas.html
8 changes: 7 additions & 1 deletion docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -1,11 +1,17 @@
Glow
====

Glow is an open-source genomic data analysis tool using `Apache Spark <https://spark.apache.org>`__.
Glow is an open-source toolkit for working with genomic data at biobank scale and beyond. The
toolkit is natively built on Apache Spark, the leading unified engine for big data processing and
machine learning, bringing the scale of the cloud to genomics workflows.

.. toctree::
:maxdepth: 2

introduction
getting-started
etl/index
tertiary/index
additional-resources

.. modules
23 changes: 23 additions & 0 deletions docs/source/introduction.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
Introduction to Glow
====================

Glow aims to simplify genomic workflows at scale. The best way to accomplish this goal is to take a system that
has already been proven to work and adapt it to fit into the genomics ecosystem.

Apache Spark, and in particular `Spark SQL <https://spark.apache.org/sql/>`_, its module for working with
structured data, is used across industries on datasets at petabyte scale and beyond. Glow smooths
the rough edges so that you can be productive immediately.

Glow features:

- Genomic datasources: Read datasets in common file formats such as VCF and BGEN into Spark SQL DataFrames.
- Genomic functions: Common operations such as computing quality control statistics, running regression
  tests, and performing simple transformations are provided as Spark SQL functions that can be
  called from Python, SQL, Scala, or R (see the sketch after this list).
- Data preparation building blocks: Glow includes transformations like variant normalization and
  lift over to help produce analysis-ready datasets.
- Integration with existing tools: With Spark SQL, you can write user-defined functions (UDFs) in
  Python, R, or Scala. Glow also makes it easy to run DataFrames through command-line tools.
- Integration with other data types: Genomic data can generate additional insights when joined with
  datasets such as electronic health records, real-world evidence, and medical images. Since Glow
  returns native Spark SQL DataFrames, it's simple to join multiple datasets together.
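
As a sketch of the genomic functions bullet above: the function ``hardy_weinberg`` and the
``genotypes`` column are assumptions based on Glow's variant QC functionality, so treat the exact
names as illustrative and consult the variant QC docs for your version.

.. code-block:: python

    from pyspark.sql.functions import expr

    import glow
    glow.register(spark)

    # Load a VCF (hypothetical path) and compute per-variant
    # Hardy-Weinberg equilibrium statistics (assumed function/column names)
    df = spark.read.format('vcf').load('genotypes.vcf')
    df.select('contigName', 'start', expr('hardy_weinberg(genotypes)')).show()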
Binary file added static/favicon.ico
Binary file not shown.
Binary file added static/glow_logo_horiz_color.png
Binary file added static/glow_logo_horiz_color_dark_bg.png