Merge branch 'master' of github.com:pepkit/looper
nsheff committed May 24, 2018
2 parents 431466f + 7751d1e commit f42d59c
Showing 20 changed files with 129 additions and 372 deletions.
18 changes: 10 additions & 8 deletions README.md
@@ -5,33 +5,35 @@
[![Documentation Status](http://readthedocs.org/projects/looper/badge/?version=latest)](http://looper.readthedocs.io/en/latest/?badge=latest)
[![Build Status](https://travis-ci.org/pepkit/looper.svg?branch=master)](https://travis-ci.org/pepkit/looper)

`Looper` is a command-line pipeline submission engine written in Python. It reads project metadata in [standard PEP format](http://pepkit.github.io) and maps sample inputs to any command-line tool. The typical use case is to run a bioinformatics pipeline across many different samples. Looper provides a convenient interface for submitting pipelines for each sample to any cluster resource manager. Looper was conceived to use [pypiper pipelines](https://github.com/epigen/pypiper/), but it is in fact compatible with any tool that can be run via the command line.
`Looper` is a pipeline submission engine. The typical use case is to run a bioinformatics pipeline across many different input samples. `Looper`'s key advantages are the following:

You can download the latest version from the [releases page](https://github.com/pepkit/looper/releases) or at [pypi](https://pypi.python.org/pypi/looper/). You can find a list of looper-compatible pipelines in the [hello looper! repository](https://github.com/pepkit/hello_looper/blob/master/looper_pipelines.md). `Looper` is built on the Python [peppy](https://github.com/pepkit/peppy) package. `Looper` and `peppy` were originally developed together as a single package, but `peppy` has been extracted to make the projects more modular. `Looper` now imports `peppy` for its sample input and processing, and `peppy` can be used independently of `looper`.
* **it completely divides sample handling from pipeline processing**. This modular approach simplifies the pipeline-building process because pipelines no longer need to worry about sample metadata parsing.
* **it subscribes to a single, standardized project metadata format**. `Looper` reads project metadata in [standard PEP format](http://pepkit.github.io) using [peppy](https://github.com/pepkit/peppy). It then maps sample inputs to any command-line tool. This means you only need to learn one way to format your project metadata, and it will work with any pipeline.
* **it provides a universal implementation of sample parallelization**, so individual pipelines do not need to reinvent the wheel. This allows looper to provide a convenient interface for submitting pipelines either to local compute or to any cluster resource manager, so individual pipeline authors do not need to worry about cluster job submission at all.
* **it is compatible with any command-line tool**. Looper was conceived to use [pypiper pipelines](https://github.com/databio/pypiper/), but it is in fact compatible with any tool that can be run via the command line.

You should start with the documentation, which is hosted at [Read the Docs](http://looper.readthedocs.org/).
You can download the latest version from the [releases page](https://github.com/pepkit/looper/releases). You can find a list of looper-compatible pipelines in the [hello looper! repository](https://github.com/pepkit/hello_looper/blob/master/looper_pipelines.md).

# Install and quick start
## Install and quick start

Detailed instructions are in the [Read the Docs documentation](http://looper.readthedocs.org/), and that's the best place to start. To get running quickly, you can install the latest release and put the `looper` executable in your `$PATH`:


```
pip install --user https://github.com/pepkit/looper/zipball/master
export PATH=$PATH:~/.local/bin
```

To use looper with your project, you must define your project using [standard PEP project definition format](http://pepkit.github.io), which is a `yaml` config file passed as an argument to looper. To test, grab an [example PEP](https://pepkit.github.io/docs/example_PEPs/) and run it through looper with this command:
To use looper with your project, you must define your project using [standard PEP project definition format](http://pepkit.github.io), which is a `yaml` config file passed as an argument to looper. To test, follow the basic tutorial in the [hello looper repository](https://github.com/pepkit/hello_looper). You can run it through looper with this command:

```bash
looper run project_config.yaml
```

# Installation troubleshooting
## Installation troubleshooting

Looper supports Python 2.7 and Python 3, and has been tested in Linux. If you clone this repository and an attempt at local installation, e.g. with `pip install --upgrade ./`, fails, this may be due to an issue with `setuptools` and `six`. A `FileNotFoundError` (Python 3) or an `IOError` (Python 2), with a message/traceback about a nonexistent `METADATA` file, makes this cause even more likely. To work around this, you can first manually `pip install --upgrade six` or `pip install six==1.11.0`, as upgrading `six` from 1.10.0 to 1.11.0 resolves this issue, then retry the `looper` installation.

# Contributing
## Contributing

Pull requests or issues are welcome.

19 changes: 16 additions & 3 deletions doc/source/advanced.rst
@@ -7,7 +7,7 @@ Handling multiple input files
Sometimes you have multiple input files that you want to merge for one sample. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, we introduce two ways to merge these:

1. Use shell expansion characters (like '*' or '[]') in your `data_source` definition or filename (good for simple merges)
2. Specify a *sample subannotation* which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).
2. Specify a *merge table* which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).

To do the first option, just change your data source specifications, like this:

@@ -16,15 +16,16 @@
data_R1: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz"
data_R2: "${DATA}/{id}_S{nexseq_num}_L00*_R2_001.fastq.gz"
To do the second option, just provide a sample subannotation in the *metadata* section of your project config:
To do the second option, just provide a merge table in the *metadata* section of your project config:

metadata:
sample_subannotation: mergetable.csv
merge_table: mergetable.csv

Make sure the ``sample_name`` column of this table matches your sample annotation sheet, and then include any columns you need to point to the data. ``Looper`` will automatically include all of these files as input passed to the pipelines. Warning: do not use both of these options simultaneously for the same sample, as it will lead to multiple merges.
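For illustration, a merge table is just a small ``csv`` keyed on ``sample_name``; each row maps one input file to a sample, and rows sharing a ``sample_name`` are merged. The sample names and file paths below are hypothetical:

```csv
sample_name,file
frog_1,data/frog1_lane1.fq.gz
frog_1,data/frog1_lane2.fq.gz
frog_2,data/frog2_lane1.fq.gz
```

Here ``frog_1`` would receive both lane files as a single merged input, while ``frog_2`` gets just one.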

Note: to handle different *classes* of input files, like read1 and read2, these are *not* merged and should be handled as different derived columns in the main sample annotation sheet (and therefore different arguments to the pipeline).

.. _connecting-multiple-pipelines:

Connecting to multiple pipelines
****************************************
@@ -41,7 +42,19 @@ For example:
In this case, for a given sample, looper will first look in ``pipeline_iface1.yaml`` to see if an appropriate pipeline exists for this sample type. If it finds one, it will use this pipeline (or set of pipelines, as specified in the ``protocol_mappings`` section of the ``pipeline_interface.yaml`` file). Having submitted a suitable pipeline, it will ignore the ``pipeline_iface2.yaml`` interface. However, if there is no suitable pipeline in the first interface, looper will check the second and, if it finds a match, will submit that. If no suitable pipelines are found in any of the interfaces, the sample will be skipped as usual.

If your project contains samples with different protocols, you can use this to run several different pipelines. For example, if you have ATAC-seq, RNA-seq, and ChIP-seq samples in your project, you may want to include a `pipeline interface` for 3 different pipelines, each accepting one of those protocols. In the event that more than one of the `pipeline interface` files provides pipelines for the same protocol, looper will only submit the pipeline from the first interface. Thus, this list specifies a `priority order` for pipeline repositories.
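Concretely, the prioritized interface list lives in the ``metadata`` section of the project config. A sketch, using the hypothetical file names from the example above:

```yaml
metadata:
  pipeline_interfaces: [pipeline_iface1.yaml, pipeline_iface2.yaml]
```

Order matters: interfaces earlier in the list win when two of them map the same protocol.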



Command-line argument pass-through
****************************************

Any command-line arguments passed to `looper run` *that are not consumed by looper* will be added to the command of every pipeline submitted in that looper run. This gives you a handy way to pass through command-line arguments that you want applied to every job in a given looper run. For example, pypiper pipelines understand the `--recover` flag; so if you want to pass this flag through `looper` to all your pipeline runs, you may run `looper run config.yaml --recover`. Since `looper` does not understand `--recover`, it will be passed to every pipeline. Obviously, this feature is limited to flags that `looper` does not understand, because arguments it does understand will be consumed by `looper` and not passed through to individual pipelines.


Project models
****************************************

Looper uses the ``peppy`` package to model Project and Sample objects under the hood. These project objects are actually useful outside of looper. If you define your project using looper's :doc:`standardized project definition format <define-your-project>`, you can use the project models to instantiate an in-memory representation of your project and all of its samples, without using looper.

If you're interested in this, you should check out the `peppy package <http://peppy.readthedocs.io/en/latest/models.html>`_. All the documentation for model objects has moved there.
7 changes: 0 additions & 7 deletions doc/source/api.rst

This file was deleted.

3 changes: 1 addition & 2 deletions doc/source/cluster-computing.rst
@@ -1,9 +1,8 @@
.. _cluster-resource-managers:

Cluster computing
How to run jobs on a cluster
=============================================


By default, looper will build a shell script for each sample and then run each sample serially on the local computer. But where looper really excels is in large projects that require submitting these jobs to a cluster resource manager (like SLURM, SGE, LSF, etc.). Looper handles the interface to the resource manager so that projects and pipelines can be moved to different environments with ease.

To configure looper to use cluster computing, all you have to do is tell looper a few things about your cluster setup: you create a configuration file (`compute_config.yaml`) and point an environment variable (``PEPENV``) to this file, and that's it!
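As a sketch of what such a ``compute_config.yaml`` might contain — the package names, template paths, and partition value here are illustrative assumptions, not shipped defaults — each named compute package pairs a submission template with a submission command:

```yaml
compute:
  default:
    submission_template: templates/localhost_template.sub
    submission_command: sh
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
    partition: standard
```

You would then point the ``PEPENV`` environment variable at this file, e.g. ``export PEPENV=/path/to/compute_config.yaml``.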
2 changes: 1 addition & 1 deletion doc/source/config-files.rst
@@ -27,4 +27,4 @@ If you want to add a new pipeline to looper, tweak the way looper interacts with

Finally, if you're using Pypiper to develop pipelines, it uses a pipeline-specific configuration file (detailed in the Pypiper documentation):

- `Pypiper pipeline config file <http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files>`_: Each pipeline may (optionally) provide a configuration file describing where software is, and parameters to use for tasks within the pipeline. This configuration file is by default named identical to the pypiper script name, with a `.yaml` extension instead of `.py` (So `rna_seq.py` looks for an accompanying `rna_seq.yaml` file by default). These files can be changed on a per-project level using the :ref:`pipeline_config section in your project config file <pipeline-config-section>`.
- `Pypiper pipeline config file <http://pypiper.readthedocs.io/en/latest/advanced.html#pipeline-config-files>`_: Each pipeline may (optionally) provide a configuration file describing where software is, and parameters to use for tasks within the pipeline. This configuration file is by default named identically to the pypiper script name, with a `.yaml` extension instead of `.py` (so `rna_seq.py` looks for an accompanying `rna_seq.yaml` file by default). These files can be changed on a per-project level using the `pipeline_config` section in your project config file.
42 changes: 5 additions & 37 deletions doc/source/define-your-project.rst
@@ -3,45 +3,13 @@
How to define a project
=============================================

To use ``looper`` with your project, you must define your project using Looper's standard project definition format. If you follow this format, then your project can be read not only by looper for submitting pipelines, but also for other tasks, like: summarizing pipeline output, analysis in R (using the pending `pepr package <https://github.com/pepkit/pepr>`_), or building UCSC track hubs.
``Looper`` subscribes to standard PEP format. So to use ``looper``, you first define your project using Portable Encapsulated Project (PEP) structure. PEP is a standardized way to represent metadata about your project and each of its samples. If you follow this format, then your project can be read not only by ``looper``, but also by other software, like the `pepr R package <https://github.com/pepkit/pepr>`_, or the `peppy python package <https://github.com/pepkit/peppy>`_. This will let you use the same metadata description for downstream data analysis.

The format is simple and modular, so you only need to define the components you plan to use. You need to supply 2 files:
The PEP format is simple and modular and uses 2 key files:

1. **Project config file** - a ``yaml`` file describing input and output file paths and other (optional) project settings
1. **Project config file** - a ``yaml`` file describing file paths and project settings
2. **Sample annotation sheet** - a ``csv`` file with 1 row per sample

**Quick example**: In the simplest case, ``project_config.yaml`` is just a few lines of ``yaml``. Here's a minimal example **project_config.yaml**:


.. code-block:: yaml

    metadata:
      sample_annotation: /path/to/sample_annotation.csv
      output_dir: /path/to/output/folder
      pipeline_interfaces: path/to/pipeline_interface.yaml

The **output_dir** key specifies where to save results. The **pipeline_interfaces** key points to your looper-compatible pipelines (described in :doc:`linking the pipeline interface <pipeline-interface>`). The **sample_annotation** key points to another file, which is a comma-separated value (``csv``) file describing samples in the project. Here's a small example of **sample_annotation.csv**:


.. csv-table:: Minimal Sample Annotation Sheet
:header: "sample_name", "library", "file"
:widths: 30, 30, 30

"frog_1", "RNA-seq", "frog1.fq.gz"
"frog_2", "RNA-seq", "frog2.fq.gz"
"frog_3", "RNA-seq", "frog3.fq.gz"
"frog_4", "RNA-seq", "frog4.fq.gz"


With those two simple files, you could run looper, and that's fine for just running a quick test on a few files. In practice, you'll probably want to use some of the more advanced features of looper by adding additional information to your configuration ``yaml`` file and your sample annotation ``csv`` file.

For example, by default, your jobs will run serially on your local computer, where you're running ``looper``. If you want to submit to a cluster resource manager (like SLURM or SGE), you just need to specify a ``compute`` section.

Let's go through the more advanced details of both annotation sheets and project config files:

.. include:: sample-annotation-sheet.rst.inc

.. include:: project-config.rst.inc

You can find complete details of both files at `the official documentation for Portable Encapsulated Projects <https://pepkit.github.io/docs/home/>`_.

Once you've specified a PEP, it's time to link it to the looper pipelines you want to use with the project. You'll do this by adding one more line to your project config file to indicate the **pipeline_interfaces** you need. This is described in the next section on how to :doc:`link a project to a pipeline <linking-a-pipeline>`.
7 changes: 6 additions & 1 deletion doc/source/faq.rst
@@ -19,7 +19,12 @@ FAQ
- How can I resubmit a subset of jobs that failed?
By default, looper **will not submit a job that has already run**. If you want to re-run a sample (maybe you've updated the pipeline, or you want to restart a failed attempt), you can do so by passing ``--ignore-flags`` to looper at startup, but this will **resubmit all samples**. If you only want to re-run or restart a few samples, it's best to just delete the flag files manually for the samples you want to restart, then use ``looper run`` as normal.


- Can I pass additional command-line arguments to my pipeline on-the-fly?
Yes! Any command-line arguments passed to `looper run` *that are not consumed by looper* will simply be handed off untouched to *all the pipelines*. This gives you a handy way to pass through command-line arguments that you want applied to every job in a given looper run. For example, you may run `looper run config.yaml -R` -- since `looper` does not understand `-R`, it will be passed to every pipeline.

For example, pypiper pipelines understand the `--recover` flag; so if you want to pass this flag through `looper` to all your pipeline runs, you may run `looper run config.yaml --recover`. Since `looper` does not understand `--recover`, it will be passed to every pipeline. Obviously, this feature is limited to flags that `looper` does not understand, because arguments it does understand will be consumed by `looper` and not passed through to individual pipelines.

- How can I analyze my project interactively?
Looper uses the ``peppy`` package to model Project and Sample objects under the hood. These project objects are actually useful outside of looper. If you define your project using looper's :doc:`standardized project definition format <define-your-project>`, you can use the project models to instantiate an in-memory representation of your project and all of its samples, without using looper.

If you're interested in this, you should check out the `peppy package <http://peppy.readthedocs.io/en/latest/models.html>`_. All the documentation for model objects has moved there.
2 changes: 1 addition & 1 deletion doc/source/features.rst
@@ -1,5 +1,5 @@

Features
Features and benefits
******************************

Simplicity for the beginning, power when you need to expand.
11 changes: 11 additions & 0 deletions doc/source/how-to-merge-inputs.rst
@@ -0,0 +1,11 @@
How to handle multiple input files
=============================================

Sometimes you have multiple input files that you want to merge for one sample. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, PEP has two ways to merge these:

1. Use shell expansion characters (like '*' or '[]') in your `data_source` definition or filename (good for simple merges)
2. Specify a *merge table* which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).

Dealing with multiple input files is described in detail in the `PEP documentation <https://pepkit.github.io/docs/sample_subannotation/>`_.

Note: to handle different *classes* of input files, like read1 and read2, these are *not* merged and should be handled as different derived columns in the main sample annotation sheet (and therefore different arguments to the pipeline).
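As a self-contained illustration of the first (shell expansion) option, the sketch below creates a few hypothetical lane files for one sample and then resolves them with the same ``*`` wildcard you would put in a ``data_source`` pattern:

```python
import glob
import os
import tempfile

# Create a few hypothetical lane files for one sample in a scratch directory.
data = tempfile.mkdtemp()
for lane in ("L001", "L002", "L003"):
    open(os.path.join(data, "frog_1_S1_%s_R1_001.fastq.gz" % lane), "w").close()

# Pattern analogous to: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz"
pattern = os.path.join(data, "frog_1_S1_L00*_R1_001.fastq.gz")
matches = sorted(glob.glob(pattern))
print(len(matches))  # prints 3
```

All three matched files would then be passed together as one merged input for the sample.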
