diff --git a/README.md b/README.md index 04a423cb9..7020aad50 100644 --- a/README.md +++ b/README.md @@ -5,44 +5,4 @@ [![Documentation Status](http://readthedocs.org/projects/looper/badge/?version=latest)](http://looper.readthedocs.io/en/latest/?badge=latest) [![Build Status](https://travis-ci.org/pepkit/looper.svg?branch=master)](https://travis-ci.org/pepkit/looper) -`Looper` is a pipeline submission engine. The typical use case is to run a bioinformatics pipeline across many different input samples. `Looper`'s key advantages are the following: - -* **it completely divides sample handling from pipeline processing**. This modular approach simplifies the pipeline-building process because pipelines no longer need to worry about sample metadata parsing. -* **it subscribes to a single, standardized project metadata format**. `Looper` reads project metadata in [standard PEP format](http://pepkit.github.io) using [peppy](http://github.com/pepkit/peppy). It then maps sample inputs to any command-line tool. This means you only need to learn 1 way to format your project metadata, and it will work with any pipeline. -* **it provides a universal implementation of sample-parallelization**, so individual pipelines do not need to reinvent the wheel. This allows looper to provide a convenient interface for submitting pipelines either to local compute or to any cluster resource manager, so individual pipeline authors do not need to worry about cluster job submission at all. -* **it is compatible with any command-line tool**. Looper was conceived to use [pypiper pipelines](https://github.com/databio/pypiper/), but it is in fact compatible with any tool that can be run via the command line. - -You can download the latest version from the [releases page](https://github.com/pepkit/looper/releases). You can find a list of looper-compatible pipelines in the [hello looper! repository](https://github.com/pepkit/hello_looper/blob/master/looper_pipelines.md).
- -## Install and quick start - -Detailed instructions are in the [Read the Docs documentation](http://looper.readthedocs.org/), and that's the best place to start. To get running quickly, you can install the latest release and put the `looper` executable in your `$PATH`: - -``` -pip install --user https://github.com/pepkit/looper/zipball/master -export PATH=$PATH:~/.local/bin -``` - -To use looper with your project, you must define your project using the [standard PEP project definition format](http://pepkit.github.io), which is a `yaml` config file passed as an argument to looper. To test, follow the basic tutorial in the [hello looper repository](https://github.com/pepkit/hello_looper). You run it through looper with this command: - -```bash -looper run project_config.yaml -``` - -## Installation troubleshooting - -Looper supports Python 2.7 and Python 3, and has been tested on Linux. If you clone this repository and a local installation attempt, e.g. with `pip install --upgrade ./`, fails, this may be due to an issue with `setuptools` and `six`. A `FileNotFoundError` (Python 3) or an `IOError` (Python 2) with a message/traceback about a nonexistent `METADATA` file makes this cause even more likely. To get around this, you can first manually run `pip install --upgrade six` or `pip install six==1.11.0`, as upgrading `six` from 1.10.0 to 1.11.0 resolves this issue; then retry the `looper` installation. - -## Contributing - -Pull requests or issues are welcome. - -- After adding tests in `tests` for a new feature or a bug fix, please run the test suite. -- The only additional dependencies needed beyond those for the package itself can be installed with: - - ```pip install -r requirements/requirements-dev.txt``` - -- Once those are installed, the tests can be run with `pytest`. Alternatively, `python setup.py test` can be used. - +`Looper` is a pipeline submission engine.
The typical use case is to run a bioinformatics pipeline across many different input samples. Instructions are in the [Read the Docs documentation](http://looper.readthedocs.org/).
diff --git a/doc/source/_static/HTML.svg b/doc/source/_static/HTML.svg new file mode 100644 index 000000000..3282c9982 --- /dev/null +++ b/doc/source/_static/HTML.svg @@ -0,0 +1,526 @@ [SVG markup omitted]
diff --git a/doc/source/_static/cli.svg b/doc/source/_static/cli.svg new file mode 100644 index 000000000..803ad3b99 --- /dev/null +++ b/doc/source/_static/cli.svg @@ -0,0 +1,379 @@ [SVG markup omitted]
diff --git a/doc/source/_static/collate.svg b/doc/source/_static/collate.svg new file mode 100644 index 000000000..c536fff2e --- /dev/null +++ b/doc/source/_static/collate.svg @@ -0,0 +1,133 @@ [SVG markup omitted]
diff --git a/doc/source/_static/computing.svg b/doc/source/_static/computing.svg new file mode 100644 index 000000000..eb3fb2f8d --- /dev/null +++ b/doc/source/_static/computing.svg @@ -0,0 +1,756 @@ [SVG markup omitted]
diff --git a/doc/source/_static/favicon_looper.ico b/doc/source/_static/favicon_looper.ico new file mode 100644 index 000000000..d118e4754 Binary files /dev/null and
b/doc/source/_static/favicon_looper.ico differ
diff --git a/doc/source/_static/favicon_looper.svg b/doc/source/_static/favicon_looper.svg new file mode 100644 index 000000000..8b16d8fee --- /dev/null +++ b/doc/source/_static/favicon_looper.svg @@ -0,0 +1,72 @@ [SVG markup omitted]
diff --git a/doc/source/_static/file_yaml.svg b/doc/source/_static/file_yaml.svg new file mode 100644 index 000000000..2aaa54142 --- /dev/null +++ b/doc/source/_static/file_yaml.svg @@ -0,0 +1,394 @@ [SVG markup omitted]
diff --git a/doc/source/_static/flexible_pipelines.svg b/doc/source/_static/flexible_pipelines.svg new file mode 100644 index 000000000..5a331625c --- /dev/null +++ b/doc/source/_static/flexible_pipelines.svg @@ -0,0 +1,270 @@ [SVG markup omitted]
diff --git a/doc/source/_static/job_monitoring.svg b/doc/source/_static/job_monitoring.svg new file mode 100644 index 000000000..3f09da534 --- /dev/null +++ b/doc/source/_static/job_monitoring.svg @@ -0,0 +1,286 @@ [SVG markup omitted]
diff --git a/doc/source/_static/logo_looper.svg b/doc/source/_static/logo_looper.svg new file mode 100644 index 000000000..57c3f0245 --- /dev/null +++ b/doc/source/_static/logo_looper.svg @@ -0,0 +1,93 @@ [SVG markup omitted]
diff --git a/doc/source/_static/modular.svg b/doc/source/_static/modular.svg new file mode 100644 index 000000000..10e1edf81 --- /dev/null +++ b/doc/source/_static/modular.svg @@ -0,0 +1,118 @@ [SVG markup omitted]
diff --git a/doc/source/_static/resources.svg b/doc/source/_static/resources.svg new file mode 100644 index 000000000..944f83f2e --- /dev/null +++ b/doc/source/_static/resources.svg @@ -0,0 +1,635 @@ [SVG markup omitted]
diff --git a/doc/source/_static/subprojects.svg b/doc/source/_static/subprojects.svg new file mode 100644 index 000000000..e35e1db46 --- /dev/null +++ b/doc/source/_static/subprojects.svg @@ -0,0 +1,293 @@ [SVG markup omitted]
diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst deleted file mode 100644 index 407f41732..000000000 --- a/doc/source/advanced.rst +++ /dev/null @@ -1,60 +0,0 @@ -Advanced features -===================================== - -Handling multiple input files -**************************************** - -Sometimes you have multiple input files that you want to merge for one sample. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, we introduce two ways to merge these: - -1.
Use shell expansion characters (like '*' or '[]') in your `data_source` definition or filename (good for simple merges) -2. Specify a *merge table* which maps input files to samples, for samples with more than one input file (infinitely customizable for more complicated merges). - -To do the first option, just change your data source specifications, like this: - -.. code-block:: yaml - - data_R1: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz" - data_R2: "${DATA}/{id}_S{nexseq_num}_L00*_R2_001.fastq.gz" - -To do the second option, just provide a merge table in the *metadata* section of your project config: - -.. code-block:: yaml - - metadata: - merge_table: mergetable.csv - -Make sure the ``sample_name`` column of this table matches the sample names in your annotation sheet, and then include any columns you need to point to the data. ``Looper`` will automatically include all of these files as input passed to the pipelines. Warning: do not use both of these options simultaneously for the same sample; it will lead to multiple merges. - -Note: different *classes* of input files, like read1 and read2, are *not* merged; handle these as different derived columns in the main sample annotation sheet (and therefore different arguments to the pipeline). - -.. _connecting-multiple-pipelines: - -Connecting to multiple pipelines -**************************************** - -If you have a project that contains samples of different types, then you may need to specify multiple pipeline repositories for your project. Starting in version 0.5, looper can handle a priority list of pipelines. Starting with version 0.6, the ``metadata.pipeline_interfaces`` attribute should point directly at pipeline interface files (instead of at directories, as previously). - -For example: - ..
code-block:: yaml - - metadata: - pipeline_interfaces: [pipeline_iface1.yaml, pipeline_iface2.yaml] - - -In this case, for a given sample, looper will first look in ``pipeline_iface1.yaml`` to see if an appropriate pipeline exists for this sample type. If it finds one, it will use this pipeline (or set of pipelines, as specified in the ``protocol_mappings`` section of the ``pipeline_interface.yaml`` file). Having submitted a suitable pipeline, it will ignore the ``pipeline_iface2.yaml`` interface. However, if there is no suitable pipeline in the first interface, looper will check the second and, if it finds a match, will submit that. If no suitable pipelines are found in any of the interfaces, the sample will be skipped as usual. - -If your project contains samples with different protocols, you can use this to run several different pipelines. For example, if you have ATAC-seq, RNA-seq, and ChIP-seq samples in your project, you may want to include a `pipeline interface` for 3 different pipelines, each accepting one of those protocols. In the event that more than one of the `pipeline interface` files provides pipelines for the same protocol, looper will only submit the pipeline from the first interface. Thus, this list specifies a `priority order` for pipeline repositories. - - - -Command-line argument pass-through -**************************************** - -Any command-line arguments passed to `looper run` *that are not consumed by looper* will be added to the command of every pipeline submitted in that looper run. This gives you a handy way to pass through command-line arguments that you want passed to every job in a given looper run. For example, pypiper pipelines understand the `--recover` flag; so if you want to pass this flag through `looper` to all your pipeline runs, you may run `looper run config.yaml --recover`. Since `looper` does not understand `--recover`, this will be passed to every pipeline.
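The pass-through behavior described above can be sketched with Python's ``argparse`` module (a hypothetical illustration of the idea, not looper's actual CLI code; the pipeline command below is invented):

```python
import argparse

# Sketch of argument pass-through: anything the parser does not
# recognize is collected by parse_known_args and appended to every
# pipeline command. Invented example, not looper's real implementation.
parser = argparse.ArgumentParser()
parser.add_argument("config")

known, extras = parser.parse_known_args(["config.yaml", "--recover"])
pipeline_cmd = "pipeline.py --sample sample1 " + " ".join(extras)
print(pipeline_cmd)  # pipeline.py --sample sample1 --recover
```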
Obviously, this feature is limited to passing flags that `looper` does not understand, because arguments that `looper` does understand will be consumed by `looper` and not passed through to individual pipelines. - - -Project models -**************************************** - -Looper uses the ``peppy`` package to model Project and Sample objects under the hood. These project objects are actually useful outside of looper. If you define your project using looper's :doc:`standardized project definition format ` , you can use the project models to instantiate an in-memory representation of your project and all of its samples, without using looper. - -If you're interested in this, you should check out the `peppy package `_. All the documentation for model objects has moved there. diff --git a/doc/source/changelog.rst b/doc/source/changelog.rst index a764cbd73..0ed650189 100644 --- a/doc/source/changelog.rst +++ b/doc/source/changelog.rst @@ -1,5 +1,17 @@ Changelog ****************************** +- **v0.9.0** (*Unreleased*): + + - New + + - Support for custom summarizers + + - Add ``allow-duplicate-names`` command-line option + + - Allow any variables in environment config files or other ``compute`` sections to be used in submission templates + + - Add nice universal project-level HTML reporting + - **v0.8.1** (*2018-04-02*): diff --git a/doc/source/conf.py b/doc/source/conf.py index 2037083e2..83fea6055 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -134,7 +134,7 @@ # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. -#html_favicon = None +html_favicon = "_static/favicon_looper.ico" # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory.
They are copied after the builtin static files, diff --git a/doc/source/containers.rst b/doc/source/containers.rst new file mode 100644 index 000000000..464b429f6 --- /dev/null +++ b/doc/source/containers.rst @@ -0,0 +1,62 @@ +.. _containers: + +How to run jobs in a linux container +============================================= + +Looper uses a template system to build scripts for each job. To start, looper includes a few built-in templates so you can run basic jobs without messing with anything, but the template system provides ultimate flexibility to customize your job scripts however you wish. This template system is how we can use looper to run jobs on any cluster resource manager, by simply setting up a template that fits our particular cluster manager. We can also exploit the template system to run any job in a linux container (for example, using docker or singularity). + +Here is a guide on how to run a job in a container: + +1. Get your container image. This could be a docker image (hosted on dockerhub), which you would download via `docker pull`, or it could be a `singularity` image you have saved in a local folder. This is pipeline-specific, and you'll need to download the image recommended by the authors of the pipeline or pipelines you want to run. + + +2. Specify the location of the image for your pipeline. Looper will need to know what image you are planning to use. Probably, the author of the pipeline has already done this for you by specifying the image in the `pipeline_interface.yaml` file. That `yaml` file will need a `compute` section for each pipeline that can be run in a container, specifying the location of the container. For example: + + + +.. code-block:: yaml + + compute: + singularity_image: ${SIMAGES}myimage + docker_image: databio/myimage + + +For singularity images, you just need to make sure that the images indicated in the `pipeline_interface` are available in those locations on your system. 
For docker, make sure you have the docker images pulled. + + +3. Configure your `PEPENV`. Looper will need a computing environment configuration that provides templates that work with the container system of your choice. This just needs to be set up once for your compute environment, which would enable you to run any pipeline in a container (as long as you have an image). You should set up the PEPENV compute environment configuration by following instructions in the `pepenv readme `_. If it's not already container-aware, you will just need to add a new container-aware "compute package" to your PEPENV file. Here's an example of how to add one for using singularity in a SLURM environment: + +.. code-block:: yaml + + singularity_slurm: + submission_template: templates/slurm_singularity_template.sub + submission_command: sbatch + singularity_args: -B /sfs/lustre:/sfs/lustre,/nm/t1:/nm/t1 + +In `singularity_args` you'll need to pass any mounts or other settings to be passed to singularity. The actual `slurm_singularity_template.sub` file looks something like this. Notice how these values will be used to populate a template that will run the pipeline in a container. + +.. code-block:: bash + + #!/bin/bash + #SBATCH --job-name='{JOBNAME}' + #SBATCH --output='{LOGFILE}' + #SBATCH --mem='{MEM}' + #SBATCH --cpus-per-task='{CORES}' + #SBATCH --time='{TIME}' + #SBATCH --partition='{PARTITION}' + #SBATCH -m block + #SBATCH --ntasks=1 + + echo 'Compute node:' `hostname` + echo 'Start time:' `date +'%Y-%m-%d %T'` + + singularity instance.start {SINGULARITY_ARGS} {SINGULARITY_IMAGE} {JOBNAME}_image + srun singularity exec instance://{JOBNAME}_image {CODE} + + singularity instance.stop {JOBNAME}_image + + +Now, to use singularity, you just need to activate this compute package in the usual way, which is using the `compute` argument: ``looper run --compute singularity_slurm``. More detailed instructions can be found in the `pepenv readme `_ under `containers`. 
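Under the hood, populating a submission template like the one above amounts to simple placeholder substitution. A minimal sketch in plain Python (the variable values are invented examples, not looper's actual internals):

```python
# Fill a (shortened) submission template by substituting the braced
# placeholders with computed settings. Invented values for illustration;
# not looper's real template-rendering code.
template = """#!/bin/bash
#SBATCH --job-name='{JOBNAME}'
#SBATCH --mem='{MEM}'
singularity exec {SINGULARITY_ARGS} {SINGULARITY_IMAGE} {CODE}
"""

settings = {
    "JOBNAME": "sample1_pipeline",
    "MEM": "16000",
    "SINGULARITY_ARGS": "-B /sfs/lustre:/sfs/lustre",
    "SINGULARITY_IMAGE": "/images/myimage.simg",
    "CODE": "pipeline.py --sample sample1",
}

script = template.format(**settings)
print(script)
```

The rendered script would then be written to disk and handed to the configured `submission_command` (here, `sbatch`).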
+ + + diff --git a/doc/source/contributing.rst b/doc/source/contributing.rst new file mode 100644 index 000000000..957fedd98 --- /dev/null +++ b/doc/source/contributing.rst @@ -0,0 +1,14 @@ +Contributing +===================================== + +Pull requests or issues are welcome. + +- After adding tests in `tests` for a new feature or a bug fix, please run the test suite. +- The only additional dependencies needed beyond those for the package itself can be installed with: + + ```pip install -r requirements/requirements-dev.txt``` + +- Once those are installed, the tests can be run with `pytest`. Alternatively, `python setup.py test` can be used. + diff --git a/doc/source/derived-columns.rst b/doc/source/derived-columns.rst deleted file mode 100644 index 8ab4b26f2..000000000 --- a/doc/source/derived-columns.rst +++ /dev/null @@ -1,59 +0,0 @@ -.. _advanced-derived-columns: - -Derived columns -============================================= - -On your sample sheet, you will need to point to the input file or files for each sample. Of course, you could just add a column with the file path, like ``/path/to/input/file.fastq.gz``. For example: - - -.. csv-table:: Sample Annotation Sheet (bad example) - :header: "sample_name", "library", "organism", "time", "file_path" - :widths: 20, 20, 20, 10, 30 - - "pig_0h", "RRBS", "pig", "0", "/data/lab/project/pig_0h.fastq" - "pig_1h", "RRBS", "pig", "1", "/data/lab/project/pig_1h.fastq" - "frog_0h", "RRBS", "frog", "0", "/data/lab/project/frog_0h.fastq" - "frog_1h", "RRBS", "frog", "1", "/data/lab/project/frog_1h.fastq" - - -This is common, and it works in a pinch with Looper, but what if the data get moved, or your filesystem changes, or you switch servers or move institutes? Will this data still be there in 2 years? Do you want long file paths cluttering your annotation sheet? What if you have 2 or 3 input files? Do you want to manually manage these unwieldy absolute paths?
- -``Looper`` makes it really easy to do better: you can make one of your annotation columns into a flexible ``derived column`` that will be populated based on a source template you specify in the ``project_config.yaml``. What was originally ``/long/path/to/sample.fastq.gz`` would instead contain just a key, like ``source1``. Columns that use a key like this are called ``derived columns``. Here's an example of the same sheet using a ``derived column`` for ``file_path``: - -.. csv-table:: Sample Annotation Sheet (good example) - :header: "sample_name", "library", "organism", "time", "file_path" - :widths: 20, 20, 20, 10, 30 - - "pig_0h", "RRBS", "pig", "0", "source1" - "pig_1h", "RRBS", "pig", "1", "source1" - "frog_0h", "RRBS", "frog", "0", "source1" - "frog_1h", "RRBS", "frog", "1", "source1" - -To do this, your project config file must specify two things: first, which columns are to be derived (in this case, ``file_path``); and second, a ``data_sources`` section mapping keys to strings that will construct your path, like this: - - -.. code-block:: yaml - - derived_columns: [file_path] - data_sources: - source1: /data/lab/project/{sample_name}.fastq - source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq - -That's it! The source string can use other sample attributes (columns) using braces, as in ``{sample_name}``. The attributes will be automatically populated separately for each sample. To take this a step further, you'd get the same result with this config file, which substitutes ``{sample_name}`` for other sample attributes, ``{organism}`` and ``{time}``: - -.. code-block:: yaml - - derived_columns: [file_path] - data_sources: - source1: /data/lab/project/{organism}_{time}h.fastq - source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq - - -As long as your file naming system is systematic, you can easily deal with any external naming scheme, no problem at all.
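The substitution described above can be sketched in a few lines of plain Python (a hypothetical illustration of the idea, not looper's actual code; the sample records below are invented):

```python
# Minimal sketch of the derived-column idea: each sample's attributes
# fill in a data-source template. Invented example data, not looper code.
data_sources = {
    "source1": "/data/lab/project/{organism}_{time}h.fastq",
    "source2": "/path/from/collaborator/weirdNamingScheme_{external_id}.fastq",
}

samples = [
    {"sample_name": "pig_0h", "organism": "pig", "time": "0", "file_path": "source1"},
    {"sample_name": "frog_1h", "organism": "frog", "time": "1", "file_path": "source1"},
]

# Replace each sample's data-source key with the populated path.
for s in samples:
    s["file_path"] = data_sources[s["file_path"]].format(**s)

print(samples[0]["file_path"])  # /data/lab/project/pig_0h.fastq
```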
The idea is: don't put absolute paths to files in your annotation sheet. Instead, specify a data source and then provide a source template in the config file. This way, if your data change locations (which happens more often than we would like), or you change servers, or you want to share or publish the project, you just have to change the config file and not update paths in the annotation sheet; this makes the annotation sheet universal across environments, users, publication, etc. The whole project is now portable. - -You can specify as many derived columns as you want (``data_source`` is considered a derived column by default). An expression including any sample attributes (using ``{attribute}``) will be populated for each of those columns. - -Think of each sample as belonging to a certain type (for simple experiments, the type will be the same); then define the location of these samples in the project configuration file. As a side bonus, you can easily include samples from different locations, and you can also share the same sample annotation sheet on different environments (i.e. servers or users) by having multiple project config files (or, better yet, by defining a subproject for each environment). The only thing you have to change is the project-level expression describing the location, not any sample attributes (plus, you get to eliminate those annoying long/path/arguments/in/your/sample/annotation/sheet). - -Check out the complete working example in the `microtest repository `__. diff --git a/doc/source/features.rst b/doc/source/features.rst index c49c44561..32cfc21e2 100644 --- a/doc/source/features.rst +++ b/doc/source/features.rst @@ -1,26 +1,48 @@ -Features and benefits +.. |logo| image:: _static/logo_looper.svg + +|logo| Features and benefits ****************************** -Simplicity for the beginning, power when you need to expand. +.. |cli| image:: _static/cli.svg +.. |computing| image:: _static/computing.svg +..
|flexible_pipelines| image:: _static/flexible_pipelines.svg +.. |job_monitoring| image:: _static/job_monitoring.svg +.. |resources| image:: _static/resources.svg +.. |subprojects| image:: _static/subprojects.svg +.. |collate| image:: _static/collate.svg +.. |file_yaml| image:: _static/file_yaml.svg +.. |html| image:: _static/HTML.svg +.. |modular| image:: _static/modular.svg + + +|modular| **Modular approach to sample handling** + Looper **completely divides sample handling from pipeline processing**. This modular approach simplifies the pipeline-building process because pipelines no longer need to worry about sample metadata parsing. -- **Flexible pipelines:** - Use looper with any pipeline, any library, in any domain. We designed it to work with `pypiper `_, but looper has an infinitely flexible command-line argument system that will let you configure it to work with any script (pipeline) that accepts command-line arguments. You can also configure looper to submit multiple pipelines per sample. +|file_yaml| **Standardized project format** + Looper subscribes to a single, standardized project metadata format called `standard PEP format `_. This means **you only need to learn 1 way to format your project metadata, and it will work with any pipeline**. You can also use the `pepr `_ R package or the `peppy `_ python package to import all your sample metadata (and pipeline results) in an R or python analysis environment. -- **Flexible compute:** - If you don't change any settings, looper will simply run your jobs serially. But Looper includes a templating system that will let you process your pipelines on any cluster resource manager (SLURM, SGE, etc.). We include default templates for SLURM and SGE, but it's easy to add your own as well. 
Looper also gives you a way to determine which compute queue/partition to submit on-the-fly, by passing the ``--compute`` parameter to your call to ``looper run``, making it simple to use by default, but very flexible if you have complex resource needs. +|computing| **Universal parallelization implementation** + Looper's sample-level parallelization applies to all pipelines, so individual pipelines do not need to reinvent the wheel. This allows looper to provide a convenient interface for submitting pipelines either to local compute or to any cluster resource manager, so individual pipeline authors do not need to worry about cluster job submission at all. If you don't change any settings, looper will simply run your jobs serially. But Looper includes a template system that will let you process your pipelines on any cluster resource manager (SLURM, SGE, etc.). We include default templates for SLURM and SGE, but it's easy to add your own as well. Looper also gives you a way to determine which compute queue/partition to submit on-the-fly, by passing the ``--compute`` parameter to your call to ``looper run``, making it simple to use by default, but very flexible if you have complex resource needs. -- **Standardized project definition:** - Looper reads a flexible standard format for describing projects, which we call PEP (Portable Encapsulated Projects). Once you describe your project in this format, other PEP-compatible tools can also read your project. For example, you may use the `pepr `_ R package or the (pending) ``pep`` python package to import all your sample metadata (and pipeline results) in an R or python analysis environment. With a standardized project definition, the possibilities are endless. +|flexible_pipelines| **Flexible pipelines** + Use looper with any pipeline, any library, in any domain.
We designed it to work with `pypiper `_, but **looper has an infinitely flexible command-line argument system that will let you configure it to work with any script (pipeline) that accepts command-line arguments**. You can also configure looper to submit multiple pipelines per sample. -- **Subprojects:** +|subprojects| **Subprojects** Subprojects make it easy to define two very similar projects without duplicating project metadata. -- **Job completion monitoring:** +|job_monitoring| **Job completion monitoring** Looper is job-aware and will not submit new jobs for samples that are already running or finished, making it easy to add new samples to existing projects, or re-run failed samples. -- **Flexible input files:** +|collate| **Flexible input files** Looper's *derived column* feature makes projects portable. You can use it to collate samples with input files on different file systems or from different projects, with different naming conventions. How it works: you specify a variable filepath like ``/path/to/{sample_name}.txt``, and looper populates these file paths on the fly using metadata from your sample sheet. This makes it easy to share projects across compute environments or individuals without having to change sample annotations to point at different places. -- **Flexible resources:** +|resources| **Flexible resources** Looper has an easy-to-use resource requesting scheme. With a few lines to define CPU, memory, clock time, or anything else, pipeline authors can specify different computational resources depending on the size of the input sample and pipeline to run. Or, just use a default if you don't want to mess with setup. + +|cli| **Command line interface** + Looper uses a command-line interface so you have total power at your fingertips. + +|html| **Beautiful linked result reports** + Looper automatically creates an internally linked, portable HTML report highlighting all results for your pipeline, for every pipeline. 
+ diff --git a/doc/source/hello-world.rst b/doc/source/hello-world.rst index 54310bab4..4598f61d4 100644 --- a/doc/source/hello-world.rst +++ b/doc/source/hello-world.rst @@ -1,5 +1,6 @@ +.. |logo| image:: _static/logo_looper.svg -Installing and Hello, World! +|logo| Installing and Hello, World! ===================================== Release versions are posted on the GitHub `looper releases page `_. You can install the latest release directly from GitHub using pip: diff --git a/doc/source/implied-columns.rst b/doc/source/implied-columns.rst deleted file mode 100644 index aef12c51a..000000000 --- a/doc/source/implied-columns.rst +++ /dev/null @@ -1,25 +0,0 @@ -.. _advanced-implied-columns: - -Implied columns -============================================= - -At some point, you may have a situation where you need a single sample attribute (or column) to populate several different pipeline arguments with different values. In other words, the value of a given attribute may **imply** values for other attributes. It would be nice if you didn't have to enumerate all of these secondary, implied attributes, and could instead just infer them from the value of the original attribute. For example, if my `organism` attribute is ``human``, this implies a few other secondary attributes (which may be project-specific): for one project, I want to set ``genome`` to ``hg38`` and ``macs_genome_size`` to ``hs``. Of course, I could just define columns called ``genome`` and ``macs_genome_size``, but these would be invariant, so it feels inefficient and unwieldy; and then, changing the aligned genome would require changing the sample annotation sheet (every sample, in fact). You can do this with looper, of course, but a better way would be to handle these things at the project level. - -As a more elegant alternative, Looper offers a ``project_config`` section called ``implied_columns``.
Instead of hard-coding ``genome`` and ``macs_genome_size`` in the sample annotation sheet, you can simply specify that the attribute ``organism`` **implies** additional attribute-value pairs (which may vary by sample based on the value of the ``organism`` attribute). This lets you specify the genome, transcriptome, genome size, and other similar variables all in your project configuration file. - -To do this, just add an ``implied_columns`` section to your project_config.yaml file. Example: - -.. code-block:: yaml - - implied_columns: - organism: - human: - genome: "hg38" - macs_genome_size: "hs" - mouse: - genome: "mm10" - macs_genome_size: "mm" - -There are 3 layers in the ``implied_columns`` hierarchy. The first layer (sub-values under ``implied_columns``; here, ``organism``) consists of the primary columns on which new attributes depend. The second layer (here, ``human`` or ``mouse``) lists the possible values your samples may take in the primary column. The third layer (``genome`` and ``macs_genome_size``) holds the key-value pairs of new, implied columns for any samples with the required value for that primary column. In this example, any samples with organism set to "human" will automatically also have attributes for genome (hg38) and for macs_genome_size (hs). Any samples with organism set to "mouse" will have the corresponding values. A sample with organism set to ``frog`` would lack attributes for ``genome`` and ``macs_genome_size``, since those columns are not implied by ``frog``. - -This system essentially lets you set global, species-level attributes at the project level instead of duplicating that information for every sample that belongs to a species. Even better, it's generic -- so you can do this for any subdivision of samples (just replace ``organism`` with whatever you like). 
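The three-layer lookup described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not looper's internal code; the function name `imply_attributes` is invented for the example.

```python
# Illustration of implied-columns semantics (not looper's internal code).
# Layer 1: primary column; layer 2: its possible values; layer 3: implied pairs.
implied_columns = {
    "organism": {
        "human": {"genome": "hg38", "macs_genome_size": "hs"},
        "mouse": {"genome": "mm10", "macs_genome_size": "mm"},
    }
}

def imply_attributes(sample):
    """Return a copy of the sample dict with any implied attributes added."""
    result = dict(sample)
    for primary, value_map in implied_columns.items():
        # Only samples whose primary-column value is mapped gain attributes;
        # an unmapped value (e.g. "frog") implies nothing.
        result.update(value_map.get(sample.get(primary), {}))
    return result

print(imply_attributes({"sample_name": "s1", "organism": "human"}))
print(imply_attributes({"sample_name": "s2", "organism": "frog"}))
```

A "human" sample gains ``genome`` and ``macs_genome_size``, while a "frog" sample passes through unchanged, matching the behavior described in the text.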
This makes your project more portable and does a better job of conceptually separating sample attributes from project attributes; after all, a reference genome assembly is really not an inherent property of a sample, but of a sample with respect to a particular project or alignment. diff --git a/doc/source/index.rst b/doc/source/index.rst index 0e083f439..f4bd8b9be 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -1,5 +1,7 @@ -Looper documentation -^^^^^^^^^^^^^^^^^^^^ +.. |logo| image:: _static/logo_looper.svg + +|logo| Looper documentation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Looper is a Python application that deploys pipelines across samples with minimal effort. To get started, proceed with the :doc:`Introduction `. If you're looking for actual pipelines, you can find a list in the `Hello, Looper! example repository <https://github.com/pepkit/hello_looper>`_. @@ -7,6 +9,7 @@ Contents ^^^^^^^^ .. toctree:: + :titlesonly: :caption: Getting started :maxdepth: 1 @@ -24,6 +27,7 @@ Contents linking-multiple-pipelines.rst pipeline-interface.rst cluster-computing.rst + containers.rst how-to-merge-inputs.rst .. toctree:: @@ -35,6 +39,7 @@ Contents faq.rst changelog.rst support.rst + contributing.rst diff --git a/doc/source/intro.rst b/doc/source/intro.rst index e67d06458..46a75b082 100644 --- a/doc/source/intro.rst +++ b/doc/source/intro.rst @@ -1,5 +1,6 @@ +.. |logo| image:: _static/logo_looper.svg -Introduction +|logo| Introduction ===================================== Looper is a pipeline submission engine. Once you've built a command-line pipeline, Looper helps you deploy that pipeline across lots of samples. Looper standardizes the way the user (you) communicates with pipelines. While most pipelines specify a unique interface, looper lets you use the same interface for every pipeline and every project. As you have more projects, this will save you time. 
diff --git a/doc/source/summarizers.rst b/doc/source/summarizers.rst new file mode 100644 index 000000000..785670de8 --- /dev/null +++ b/doc/source/summarizers.rst @@ -0,0 +1,30 @@ +How to write a summarizer +============================================= + +One of looper's primary functions is ``looper summarize``. By default, this will summarize any basic stats reported by your pipeline in the ``stats.tsv`` file. But it can also be made much more powerful, and can run any custom `summarizers` that accompany the pipeline. + +A custom summarizer is any script that summarizes pipeline results as a whole, across an entire project. If you specify these in your pipeline interface, ``looper`` will run them automatically when you run ``looper summarize``. You do this by adding a ``summarizers`` section to the pipelines section in the ``pipeline_interface.yaml`` file. If you want to include any output in the HTML reports generated by ``looper summarize``, you can also use a ``summary_results`` section to specify any files produced by your summarizer. + +A basic example of including a custom summarizer may look like this: + + +.. code-block:: yaml + + summarizers: + - tools/summary_plot.R + summary_results: + - alignment_percent_file: + caption: "Alignment percent file" + description: "Plots percent of total alignment to all pre-alignments and primary genome." + thumbnail_path: "summary/{name}_alignmentPercent.png" + path: "summary/{name}_alignmentPercent.pdf" + - alignment_raw_file: + caption: "Alignment raw file" + description: "Plots raw alignment rates to all pre-alignments and primary genome." + thumbnail_path: "summary/{name}_alignmentRaw.png" + path: "summary/{name}_alignmentRaw.pdf" + + +Here, the ``summarizers`` section is just a list of scripts, which will be run by ``looper summarize``. The script must be executable, and should take a single command-line argument, which is the path to the ``project_config.yaml`` file. 
Looper will execute the script with the command: ``script /path/to/project_config.yaml``. Other than that, your summarizer can do whatever you want. We recommend using our pre-built PEP parsing software, `peppy <http://github.com/pepkit/peppy>`_ for Python code and `pepr <https://github.com/pepkit/pepr>`_ for R code; these will simplify handling the project input. + +The ``summary_results`` section includes information about what your summarizer produces. These files, which can use ``{variable}`` codes for any project attributes, will be used by ``looper summarize`` to construct an HTML project report. diff --git a/doc/source/support.rst b/doc/source/support.rst index bd5e9e453..1cb1b6063 100644 --- a/doc/source/support.rst +++ b/doc/source/support.rst @@ -2,3 +2,5 @@ Support ===================================== Please use the issue tracker at GitHub to file bug reports or feature requests: https://github.com/epigen/looper/issues. + +Looper supports Python 2.7 and Python 3, and has been tested on Linux. If you clone this repository and a local installation attempt, e.g. with `pip install --upgrade ./`, fails, this may be due to an issue with `setuptools` and `six`. A `FileNotFoundError` (Python 3) or an `IOError` (Python 2) with a message/traceback about a nonexistent `METADATA` file makes this cause even more likely. To get around this, first manually run `pip install --upgrade six` or `pip install six==1.11.0` (upgrading `six` from 1.10.0 to 1.11.0 resolves this issue), then retry the `looper` installation. diff --git a/doc/source/tutorials.rst b/doc/source/tutorials.rst index 5f6c4cb2c..46cdc8d29 100644 --- a/doc/source/tutorials.rst +++ b/doc/source/tutorials.rst @@ -1,4 +1,6 @@ -Extended tutorial +.. |logo| image:: _static/logo_looper.svg + +|logo| Extended tutorial *************************************************** The best way to learn is by example, so here's an extended tutorial to get you started using looper to run pre-made pipelines on a pre-made project. 
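Returning to the summarizer contract described above — an executable script that receives the project config path as its single argument — a hypothetical minimal summarizer might look like the sketch below. The filename, function name, and message are invented for illustration; a real summarizer would load the project (e.g. with peppy) and aggregate per-sample results.

```python
#!/usr/bin/env python
# Hypothetical minimal summarizer; looper would invoke it roughly as:
#   summarize_sketch.py /path/to/project_config.yaml
# This sketch only validates its input and reports which config it was handed.
import os
import sys


def summarize(config_path):
    """Return a one-line summary message for the given project config file."""
    if not os.path.isfile(config_path):
        raise ValueError("Not a file: {}".format(config_path))
    return "Summarized project defined by {}".format(os.path.basename(config_path))


if __name__ == "__main__" and len(sys.argv) > 1 and os.path.isfile(sys.argv[1]):
    print(summarize(sys.argv[1]))
```

Remember to mark the script executable (``chmod +x``) and list it under ``summarizers`` in the pipeline interface so ``looper summarize`` picks it up.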
diff --git a/doc/source/usage.rst b/doc/source/usage.rst index 7585d861b..a8e58c4f9 100644 --- a/doc/source/usage.rst +++ b/doc/source/usage.rst @@ -22,7 +22,7 @@ Here you can see the command-line usage instructions for the main looper command .. code-block:: none - version: 0.8.1 + version: 0.9.0-dev usage: looper [-h] [-V] [--logfile LOGFILE] [--verbosity {0,1,2,3,4}] [--dbg] {run,summarize,destroy,check,clean} ... @@ -46,16 +46,17 @@ Here you can see the command-line usage instructions for the main looper command --dbg Turn on debug mode (default: False) For subcommand-specific options, type: 'looper -h' - https://github.com/peppykit/looper + https://github.com/pepkit/looper ``looper run --help`` ---------------------------------- .. code-block:: none - version: 0.8.1 - usage: looper run [-h] [-t TIME_DELAY] [--ignore-flags] [--compute COMPUTE] - [--env ENV] [--limit LIMIT] [--lump LUMP] [--lumpn LUMPN] + version: 0.9.0-dev + usage: looper run [-h] [-t TIME_DELAY] [--ignore-flags] + [--allow-duplicate-names] [--compute COMPUTE] [--env ENV] + [--limit LIMIT] [--lump LUMP] [--lumpn LUMPN] [--file-checks] [-d] [--exclude-protocols [EXCLUDE_PROTOCOLS [EXCLUDE_PROTOCOLS ...]] | --include-protocols @@ -77,11 +78,16 @@ Here you can see the command-line usage instructions for the main looper command exists marking the run (e.g. as 'running' or 'failed'). Set this option to ignore flags and submit the runs anyway. + --allow-duplicate-names + Allow duplicate names? Default: False. By default, + pipelines will not be submitted if a sample name is + duplicated, since samples names should be unique. Set + this option to override this setting. --compute COMPUTE YAML file with looper environment compute settings. --env ENV Employ looper environment compute settings. --limit LIMIT Limit to n samples. 
--lump LUMP Maximum total input file size for a lump/batch of - commands in a single job + commands in a single job (in GB) --lumpn LUMPN Number of individual scripts grouped into single submission --file-checks Perform input file checks. Default=True. @@ -100,7 +106,7 @@ Here you can see the command-line usage instructions for the main looper command .. code-block:: none - version: 0.8.1 + version: 0.9.0-dev usage: looper summarize [-h] [--file-checks] [-d] [--exclude-protocols [EXCLUDE_PROTOCOLS [EXCLUDE_PROTOCOLS ...]] | --include-protocols @@ -131,7 +137,7 @@ Here you can see the command-line usage instructions for the main looper command .. code-block:: none - version: 0.8.1 + version: 0.9.0-dev usage: looper destroy [-h] [--file-checks] [-d] [--exclude-protocols [EXCLUDE_PROTOCOLS [EXCLUDE_PROTOCOLS ...]] | --include-protocols @@ -162,7 +168,7 @@ Here you can see the command-line usage instructions for the main looper command .. code-block:: none - version: 0.8.1 + version: 0.9.0-dev usage: looper check [-h] [-A] [-F [FLAGS [FLAGS ...]]] [--file-checks] [-d] [--exclude-protocols [EXCLUDE_PROTOCOLS [EXCLUDE_PROTOCOLS ...]] | --include-protocols @@ -198,7 +204,7 @@ Here you can see the command-line usage instructions for the main looper command .. 
code-block:: none - version: 0.8.1 + version: 0.9.0-dev usage: looper clean [-h] [--file-checks] [-d] [--exclude-protocols [EXCLUDE_PROTOCOLS [EXCLUDE_PROTOCOLS ...]] | --include-protocols diff --git a/looper/_version.py b/looper/_version.py index 8ea1e3450..3e2f46a3a 100644 --- a/looper/_version.py +++ b/looper/_version.py @@ -1 +1 @@ -__version__ = "0.9.0-dev" +__version__ = "0.9.0" diff --git a/looper/html_reports.py b/looper/html_reports.py new file mode 100644 index 000000000..5cc6e92e5 --- /dev/null +++ b/looper/html_reports.py @@ -0,0 +1,1267 @@ +""" Generate HTML reports """ + +import os +import glob +import pandas as _pd +import logging + +from peppy.utils import alpha_cased + +_LOGGER = logging.getLogger('HTMLReportBuilder') + +__author__ = "Jason Smith" +__email__ = "jasonsmith@virginia.edu" + +# HTML generator vars +HTML_HEAD_OPEN = \ +"""\ + + + + + + + + + +""" +HTML_TITLE = \ +"""\ + Looper project summary for {project_name} +""" +HTML_HEAD_CLOSE = \ +"""\ + + +""" +HTML_BUTTON = \ +"""\ +
+
+

+ {label} +

+
+
+""" +HTML_FIGURE = \ +"""\ +
+
+ +
'{label}'
+
+
\ +""" +HTML_FOOTER = \ +"""\ + + + + + + + +""" + +HTML_VARS = ["HTML_HEAD_OPEN", "HTML_TITLE", "HTML_HEAD_CLOSE", + "HTML_BUTTON", "HTML_FIGURE", "HTML_FOOTER"] + +# Navigation-related vars +NAVBAR_HEADER = \ +"""\ +
+ +""" +NAVBAR_FOOTER = \ +"""\ + + + +""" +HTML_NAVBAR_STYLE_BASIC = \ +"""\ + +""" +HTML_NAVBAR_BASIC = \ +"""\ + +""" + +NAVBAR_VARS = ["HTML_NAVBAR_STYLE_BASIC", "HTML_NAVBAR_BASIC", "NAVBAR_HEADER", + "NAVBAR_LOGO", "NAVBAR_DROPDOWN_HEADER", "NAVBAR_DROPDOWN_LINK", + "NAVBAR_DROPDOWN_DIVIDER", "NAVBAR_DROPDOWN_FOOTER", + "NAVBAR_MENU_LINK", "NAVBAR_SEARCH_FOOTER", "NAVBAR_FOOTER"] + +# Generic HTML vars +GENERIC_HEADER = \ +"""\ +

{header}

+""" +GENERIC_LIST_HEADER = \ +"""\ +
    +""" + +GENERIC_LIST_ENTRY = \ +"""\ +
  • {label}
  • +""" + +GENERIC_LIST_FOOTER = \ +""" +
+""" + +GENERIC_VARS = ["HTML_HEAD_OPEN", "HTML_TITLE", "HTML_HEAD_CLOSE", + "HTML_FOOTER", "GENERIC_HEADER", "GENERIC_LIST_HEADER", + "GENERIC_LIST_ENTRY", "GENERIC_LIST_FOOTER"] + +# Table-related +TABLE_STYLE_BASIC = \ +""" + +""" +TABLE_STYLE_ROTATED_HEADER = \ +"""\ + .table-header-rotated th.row-header{ + width: auto; + } + .table-header-rotated td{ + width: 60px; + border-left: 1px solid #dddddd; + border-right: 1px solid #dddddd; + vertical-align: middle; + text-align: center; + } + .table-header-rotated th.rotate-45{ + height: 120px; + width: 60px; + min-width: 60px; + max-width: 60px; + position: relative; + vertical-align: bottom; + padding: 0; + font-size: 14px; + line-height: 0.8; + border-top: none !important; + } + .table-header-rotated th.rotate-45 > div{ + position: relative; + top: 0px; + left: 60px; + height: 100%; + -ms-transform:skew(-45deg,0deg); + -moz-transform:skew(-45deg,0deg); + -webkit-transform:skew(-45deg,0deg); + -o-transform:skew(-45deg,0deg); + transform:skew(-45deg,0deg); + overflow: ellipsis; + border-left: 1px solid #dddddd; + border-right: 1px solid #dddddd; + } + .table-header-rotated th.rotate-45 span { + -ms-transform:skew(45deg,0deg) rotate(315deg); + -moz-transform:skew(45deg,0deg) rotate(315deg); + -webkit-transform:skew(45deg,0deg) rotate(315deg); + -o-transform:skew(45deg,0deg) rotate(315deg); + transform:skew(45deg,0deg) rotate(315deg); + position: absolute; + bottom: 30px; + left: -25px; + display: inline-block; + width: 85px; + text-align: left; + } +""" +TABLE_STYLE_TEXT = \ +""" + .table td.text { + max-width: 150px; + + padding: 0px 4px 0px 4px; + } + .table td.text span { + white-space: nowrap; + overflow: hidden; + text-overflow: ellipsis; + display: inline-block; + max-width: 100%; + vertical-align: middle; + } + .table td.text span:active { + white-space: normal; + text-overflow: clip; + max-width: 100%; + } + .table-condensed > tbody > tr > td, + .table-condensed > tbody > tr > th { + padding: 0px 2px 0px 2px; 
+ vertical-align: middle; + } +""" +TABLE_HEADER = \ +""" +
Looper stats summary
+ +
+ + + +""" + +TABLE_COLS = \ +"""\ + +""" + +TABLE_COLS_FOOTER = \ +"""\ + + + +""" +TABLE_ROW_HEADER = \ +"""\ + +""" + +TABLE_ROWS = \ +"""\ + +""" +TABLE_ROW_FOOTER = \ +"""\ + +""" + +TABLE_FOOTER = \ +"""\ + +
{col_val}
{row_val}
+
+""" + +TABLE_ROWS_LINK = \ +"""\ + {link_name} +""" + +LINKS_STYLE_BASIC = \ +""" +a.LN1 { + font-style:normal; + font-weight:bold; + font-size:1.0em; +} + +a.LN2:link { + color:#A4DCF5; + text-decoration:none; +} + +a.LN3:visited { + color:#A4DCF5; + text-decoration:none; +} + +a.LN4:hover { + color:#A4DCF5; + text-decoration:none; +} + +a.LN5:active { + color:#A4DCF5; + text-decoration:none; +} +""" +TABLE_VARS = ["TABLE_STYLE_BASIC", "TABLE_HEADER", "TABLE_COLS", + "TABLE_COLS_FOOTER", "TABLE_ROW_HEADER", "TABLE_ROWS", + "TABLE_ROW_FOOTER", "TABLE_FOOTER", + "TABLE_ROWS_LINK", "LINKS_STYLE_BASIC", + "TABLE_STYLE_ROTATED_HEADER", "TABLE_STYLE_TEXT"] + +# Sample-page-related +SAMPLE_HEADER = \ +"""\ + +

{sample_name} figures

+ +

Return to summary page

+ +""" + +SAMPLE_BUTTONS = \ +"""\ +
+ +
+""" +SAMPLE_TABLE_HEADER = \ +"""\ +
Looper stats summary
+
+ + +""" +SAMPLE_TABLE_FIRSTROW = \ +"""\ + + + + +""" +SAMPLE_TABLE_ROW = \ +"""\ + + + + +""" +SAMPLE_TABLE_STYLE = \ +"""\ + .table td.text { + max-width: 50%; + + padding: 0px 0px 0px 0px; + } + .table td.text span { + white-space: nowrap; + overflow: hidden; + text-overflow: ellipsis; + display: inline-block; + max-width: 100%; + vertical-align: middle; + } + .table td.text span:active { + white-space: normal; + text-overflow: clip; + max-width: 100%; + } + .table-condensed > tbody > tr > td, + .table-condensed > tbody > tr > th { + padding: 0px 2px 0px 2px; + vertical-align: middle; + } +""" + +SAMPLE_PLOTS = \ +"""\ +
+ +
'{label}'
+
+""" + +SAMPLE_FOOTER = \ +""" +

Return to summary page

+ + +""" +SAMPLE_VARS = ["SAMPLE_HEADER", "SAMPLE_BUTTONS", "SAMPLE_PLOTS", + "SAMPLE_FOOTER", "SAMPLE_TABLE_HEADER", "SAMPLE_TABLE_STYLE", + "SAMPLE_TABLE_FIRSTROW", "SAMPLE_TABLE_ROW"] + +# Status-page-related +STATUS_HEADER = \ +"""\ +
+
+""" +STATUS_TABLE_HEAD = \ +"""\ +
+
+
{row_name}{link_name}
{row_name}{row_val}
+ + + + + + + +""" +STATUS_BUTTON = \ +"""\ + + + +""" +STATUS_ROW_HEADER = \ +"""\ + +""" +STATUS_ROW_VALUE = \ +"""\ + +""" +STATUS_ROW_LINK = \ +"""\ + +""" +STATUS_ROW_FOOTER = \ +"""\ + +""" +STATUS_FOOTER = \ +"""\ + +
Sample nameStatusLog fileRuntime
{value}{link_name}
+
+ +
+""" + +STATUS_VARS = ["STATUS_HEADER", "STATUS_TABLE_HEAD", "STATUS_BUTTON", + "STATUS_ROW_HEADER", "STATUS_ROW_VALUE", "STATUS_ROW_LINK", + "STATUS_ROW_FOOTER", "STATUS_FOOTER"] + +# Objects-page-related +OBJECTS_HEADER = \ +"""\ + +

{object_type} figures

+ +

Return to summary page

+ +""" +OBJECTS_LIST_HEADER = \ +"""\ +
    +""" +OBJECTS_LINK = \ +"""\ +
  • '{label}'
  • \ +""" +OBJECTS_LIST_FOOTER = \ +"""\ +
+""" +OBJECTS_PLOTS = \ +"""\ +
+ +
'{label}'
+
+""" +OBJECTS_FOOTER = \ +""" +

Return to summary page

+ + +""" +OBJECTS_VARS = ["OBJECTS_HEADER", "OBJECTS_LIST_HEADER", "OBJECTS_LINK", + "OBJECTS_LIST_FOOTER", "OBJECTS_PLOTS", "OBJECTS_FOOTER"] + +__all__ = HTML_VARS + NAVBAR_VARS + GENERIC_VARS + \ + TABLE_VARS + SAMPLE_VARS + STATUS_VARS + OBJECTS_VARS + +class HTMLReportBuilder(): + """ Generate HTML summary report for project/samples """ + + def __init__(self, prj): + """ + The Project defines the instance; establish an iteration counter. + + :param Project prj: Project with which to work/operate on + """ + super(HTMLReportBuilder, self).__init__() + self.prj = prj + + def __call__(self, objs, stats): + + def create_object_parent_html(objs): + # Generates a page listing all the project objects with links + # to individual object pages + reports_dir = os.path.join(self.prj.metadata.output_dir, + "reports") + object_parent_path = os.path.join(reports_dir, "objects.html") + if not os.path.exists(os.path.dirname(object_parent_path)): + os.makedirs(os.path.dirname(object_parent_path)) + with open(object_parent_path, 'w') as html_file: + html_file.write(HTML_HEAD_OPEN) + html_file.write(create_navbar(objs, reports_dir)) + html_file.write(HTML_HEAD_CLOSE) + html_file.write(GENERIC_HEADER.format(header="Objects")) + html_file.write(GENERIC_LIST_HEADER) + for key in objs['key'].drop_duplicates().sort_values(): + page_name = key + ".html" + page_path = os.path.join(reports_dir, page_name.replace(' ', '_').lower()) + page_relpath = os.path.relpath(page_path, reports_dir) + html_file.write(GENERIC_LIST_ENTRY.format( + page=page_relpath, label=key)) + html_file.write(HTML_FOOTER) + html_file.close() + + def create_sample_parent_html(objs): + # Generates a page listing all the project samples with links + # to individual sample pages + reports_dir = os.path.join(self.prj.metadata.output_dir, + "reports") + sample_parent_path = os.path.join(reports_dir, "samples.html") + if not os.path.exists(os.path.dirname(sample_parent_path)): + 
os.makedirs(os.path.dirname(sample_parent_path)) + with open(sample_parent_path, 'w') as html_file: + html_file.write(HTML_HEAD_OPEN) + html_file.write(create_navbar(objs, reports_dir)) + html_file.write(HTML_HEAD_CLOSE) + html_file.write(GENERIC_HEADER.format(header="Samples")) + html_file.write(GENERIC_LIST_HEADER) + for sample in self.prj.samples: + sample_name = str(sample.sample_name) + page_name = sample_name + ".html" + page_path = os.path.join(reports_dir, page_name.replace(' ', '_').lower()) + page_relpath = os.path.relpath(page_path, reports_dir) + html_file.write(GENERIC_LIST_ENTRY.format( + page=page_relpath, label=sample_name)) + html_file.write(HTML_FOOTER) + html_file.close() + + def create_object_html(objs, nb, type, filename, index_html): + # Generates a page for an individual object type with all of its + # plots from each sample + reports_dir = os.path.join(self.prj.metadata.output_dir, "reports") + object_path = os.path.join(reports_dir, + filename.replace(' ', '_').lower()) + if not os.path.exists(os.path.dirname(object_path)): + os.makedirs(os.path.dirname(object_path)) + with open(object_path, 'w') as html_file: + html_file.write(HTML_HEAD_OPEN) + html_file.write(create_navbar(nb, reports_dir)) + html_file.write(HTML_HEAD_CLOSE) + html_file.write("\t\t

{} objects

\n".format(str(type))) + links = [] + figures = [] + warnings = [] + for i, row in objs.iterrows(): + page_path = os.path.join( + self.prj.metadata.results_subdir, + row['sample_name'], row['filename']) + image_path = os.path.join( + self.prj.metadata.results_subdir, + row['sample_name'], row['anchor_image']) + page_relpath = os.path.relpath(page_path, reports_dir) + # Check for the presence of both the file and thumbnail + if os.path.isfile(image_path) and os.path.isfile(page_path): + image_relpath = os.path.relpath(image_path, reports_dir) + # If the object has a valid image, use it! + if str(image_path).lower().endswith(('.png', '.jpg', '.jpeg', '.svg', '.gif')): + figures.append(OBJECTS_PLOTS.format( + path=page_relpath, + image=image_relpath, + label=str(row['sample_name']))) + # Otherwise treat as a link + elif os.path.isfile(page_path): + links.append(GENERIC_LIST_ENTRY.format( + page=page_relpath, + label=str(row['sample_name']))) + else: + warnings.append(str(row['filename'])) + # If no thumbnail image is present, add as a link + elif os.path.isfile(page_path): + links.append(GENERIC_LIST_ENTRY.format( + page=page_relpath, + label=str(row['sample_name']))) + else: + warnings.append(str(row['filename'])) + + html_file.write(GENERIC_LIST_HEADER) + html_file.write("\n".join(links)) + html_file.write(GENERIC_LIST_FOOTER) + html_file.write("\t\t\t
\n") + html_file.write("\n".join(figures)) + html_file.write("\t\t\t
\n") + html_file.write(HTML_FOOTER) + html_file.close() + + if warnings: + _LOGGER.warn("Warning: " + filename.replace(' ', '_').lower() + + " references nonexistent object files") + _LOGGER.debug(filename.replace(' ', '_').lower() + + " nonexistent files: " + + ','.join(str(file) for file in warnings)) + + def create_status_html(all_samples): + # Generates a page listing all the samples, their run status, their + # log file, and the total runtime if completed. + reports_dir = os.path.join(self.prj.metadata.output_dir, "reports") + status_html_path = os.path.join(reports_dir, "status.html") + if not os.path.exists(os.path.dirname(status_html_path)): + os.makedirs(os.path.dirname(status_html_path)) + with open(status_html_path, 'w') as html_file: + html_file.write(HTML_HEAD_OPEN) + html_file.write(create_navbar(all_samples, reports_dir)) + html_file.write(HTML_HEAD_CLOSE) + html_file.write(STATUS_HEADER) + html_file.write(STATUS_TABLE_HEAD) + warning = False + for sample in self.prj.samples: + sample_name = str(sample.sample_name) + # Grab the status flag for the current sample + flag = glob.glob(os.path.join( + self.prj.metadata.results_subdir, + sample_name, '*.flag')) + if not flag: + button_class = "table-danger" + flag = "Missing" + _LOGGER.warn("create_status_html: No flag file found for {}".format(sample_name)) + elif len(flag) > 1: + button_class = "table-warning" + flag = "Multiple" + _LOGGER.warn("create_status_html: Multiple flag files found for {}".format(sample_name)) + else: + if "completed" in str(flag): + button_class = "table-success" + flag = "Completed" + elif "running" in str(flag): + button_class = "table-warning" + flag = "Running" + elif "failed" in str(flag): + button_class = "table-danger" + flag = "Failed" + else: + button_class = "table-secondary" + flag = "Unknown" + + # Create table entry for each sample + html_file.write(STATUS_ROW_HEADER) + # First Col: Sample_Name (w/ link to sample page) + page_name = sample_name + ".html" + 
page_path = os.path.join(reports_dir, page_name.replace(' ', '_').lower()) + page_relpath = os.path.relpath(page_path, reports_dir) + html_file.write(STATUS_ROW_LINK.format( + row_class="", + file_link=page_relpath, + link_name=sample_name)) + # Second Col: Status (color-coded) + html_file.write(STATUS_ROW_VALUE.format( + row_class=button_class, + value=flag)) + # Third Col: Log File (w/ link to file) + single_sample = all_samples[all_samples['sample_name'] == sample_name] + if single_sample.empty: + # When there is no objects.tsv file, search for the + # presence of log, profile, and command files + log_name = os.path.basename(str(glob.glob(os.path.join( + self.prj.metadata.results_subdir, + sample_name, '*log.md'))[0])) + # Currently unused. Future? + # profile_name = os.path.basename(str(glob.glob(os.path.join( + # self.prj.metadata.results_subdir, + # sample_name, '*profile.tsv'))[0])) + # command_name = os.path.basename(str(glob.glob(os.path.join( + # self.prj.metadata.results_subdir, + # sample_name, '*commands.sh'))[0])) + else: + log_name = str(single_sample.iloc[0]['annotation']) + "_log.md" + # Currently unused. Future? 
+ # profile_name = str(single_sample.iloc[0]['annotation']) + "_profile.tsv" + # command_name = str(single_sample.iloc[0]['annotation']) + "_commands.sh" + log_file = os.path.join(self.prj.metadata.results_subdir, + sample_name, log_name) + log_relpath = os.path.relpath(log_file, reports_dir) + if os.path.isfile(log_file): + html_file.write(STATUS_ROW_LINK.format( + row_class="", + file_link=log_relpath, + link_name=log_name)) + else: + # Leave cell empty + html_file.write(STATUS_ROW_LINK.format( + row_class="", + file_link="", + link_name="")) + # Fourth Col: Sample runtime (if completed) + # If Completed, use stats.tsv + stats_file = os.path.join( + self.prj.metadata.results_subdir, + sample_name, "stats.tsv") + if os.path.isfile(stats_file): + t = _pd.read_table(stats_file, header=None, + names=['key', 'value', 'pl']) + t.drop_duplicates(subset=['key', 'pl'], + keep='last', inplace=True) + try: + time = str(t[t['key'] == 'Time'].iloc[0]['value']) + html_file.write(STATUS_ROW_VALUE.format( + row_class="", + value=str(time))) + except IndexError: + warning = True + else: + html_file.write(STATUS_ROW_VALUE.format( + row_class=button_class, + value="Unknown")) + html_file.write(STATUS_ROW_FOOTER) + + html_file.write(STATUS_FOOTER) + html_file.write(HTML_FOOTER) + html_file.close() + if warning: + _LOGGER.warn("The stats_summary.tsv file is incomplete") + + def create_sample_html(all_samples, sample_name, sample_stats, + index_html): + # Produce an HTML page containing all of a sample's objects + # and the sample summary statistics + reports_dir = os.path.join(self.prj.metadata.output_dir, "reports") + html_filename = sample_name + ".html" + html_page = os.path.join( + reports_dir, html_filename.replace(' ', '_').lower()) + sample_page_relpath = os.path.relpath( + html_page, self.prj.metadata.output_dir) + single_sample = all_samples[all_samples['sample_name'] == sample_name] + if not os.path.exists(os.path.dirname(html_page)): + 
os.makedirs(os.path.dirname(html_page)) + with open(html_page, 'w') as html_file: + html_file.write(HTML_HEAD_OPEN) + html_file.write("\t\t\n") + html_file.write(create_navbar(all_samples, reports_dir)) + html_file.write(HTML_HEAD_CLOSE) + html_file.write("\t\t

{}

\n".format(str(sample_name))) + if single_sample.empty: + # When there is no objects.tsv file, search for the + # presence of log, profile, and command files + log_name = os.path.basename(str(glob.glob(os.path.join( + self.prj.metadata.results_subdir, + sample_name, '*log.md'))[0])) + profile_name = os.path.basename(str(glob.glob(os.path.join( + self.prj.metadata.results_subdir, + sample_name, '*profile.tsv'))[0])) + command_name = os.path.basename(str(glob.glob(os.path.join( + self.prj.metadata.results_subdir, + sample_name, '*commands.sh'))[0])) + else: + log_name = str(single_sample.iloc[0]['annotation']) + "_log.md" + profile_name = str(single_sample.iloc[0]['annotation']) + "_profile.tsv" + command_name = str(single_sample.iloc[0]['annotation']) + "_commands.sh" + # Get relative path to the log file + log_file = os.path.join(self.prj.metadata.results_subdir, + sample_name, log_name) + log_relpath = os.path.relpath(log_file, reports_dir) + # Grab the status flag for the current sample + flag = glob.glob(os.path.join(self.prj.metadata.results_subdir, + sample_name, '*.flag')) + if not flag: + button_class = "btn btn-danger" + flag = "Missing" + _LOGGER.warn("create_sample_html: No flag file found for {}".format(sample_name)) + elif len(flag) > 1: + button_class = "btn btn-warning" + flag = "Multiple" + _LOGGER.warn("create_sample_html: Multiple flag files found for {}".format(sample_name)) + else: + if "completed" in str(flag): + button_class = "btn btn-success" + flag = "Completed" + elif "running" in str(flag): + button_class = "btn btn-warning" + flag = "Running" + elif "failed" in str(flag): + button_class = "btn btn-danger" + flag = "Failed" + else: + button_class = "btn btn-secondary" + flag = "Unknown" + # Create buttons linking the sample's STATUS, LOG, PROFILE, + # COMMANDS, and STATS files + stats_relpath = os.path.relpath(os.path.join( + self.prj.metadata.results_subdir, + sample_name, "stats.tsv"), reports_dir) + profile_relpath = 
os.path.relpath(os.path.join( + self.prj.metadata.results_subdir, + sample_name, profile_name), reports_dir) + command_relpath = os.path.relpath(os.path.join( + self.prj.metadata.results_subdir, + sample_name, command_name), reports_dir) + html_file.write(SAMPLE_BUTTONS.format( + button_class=button_class, + flag=flag, + log_file=log_relpath, + profile_file=profile_relpath, + commands_file=command_relpath, + stats_file=stats_relpath)) + + # Add the sample's statistics as a table + html_file.write("\t
\n") + html_file.write(SAMPLE_TABLE_HEADER) + # Produce table rows + for key, value in sample_stats.items(): + # Treat sample_name as a link to sample page + if key == 'sample_name': + page_relpath = os.path.relpath(html_page, reports_dir) + html_file.write(SAMPLE_TABLE_FIRSTROW.format( + row_name=str(key), + html_page=page_relpath, + page_name=html_filename, + link_name=str(value))) + # Otherwise add as a static cell value + else: + html_file.write(SAMPLE_TABLE_ROW.format( + row_name=str(key), + row_val=str(value))) + + html_file.write(TABLE_FOOTER) + html_file.write("\t
\n") + # Add all the objects for the current sample + html_file.write("\t\t
\n") + html_file.write("\t\t
{sample} objects
\n".format(sample=sample_name)) + links = [] + figures = [] + warnings = [] + for sample_name in single_sample['sample_name'].drop_duplicates().sort_values(): + o = single_sample[single_sample['sample_name'] == sample_name] + for i, row in o.iterrows(): + image_path = os.path.join( + self.prj.metadata.results_subdir, + sample_name, row['anchor_image']) + image_relpath = os.path.relpath(image_path, reports_dir) + page_path = os.path.join( + self.prj.metadata.results_subdir, + sample_name, row['filename']) + page_relpath = os.path.relpath(page_path, reports_dir) + # If the object has a thumbnail image, add as a figure + if os.path.isfile(image_path) and os.path.isfile(page_path): + # If the object has a valid image, add as a figure + if str(image_path).lower().endswith(('.png', '.jpg', '.jpeg', '.svg', '.gif')): + figures.append(SAMPLE_PLOTS.format( + label=str(row['key']), + path=page_relpath, + image=image_relpath)) + # Otherwise treat as a link + elif os.path.isfile(page_path): + links.append(GENERIC_LIST_ENTRY.format( + label=str(row['key']), + page=page_relpath)) + # If neither, there is no object by that name + else: + warnings.append(str(row['filename'])) + # If no thumbnail image, it's just a link + elif os.path.isfile(page_path): + links.append(GENERIC_LIST_ENTRY.format( + label=str(row['key']), + page=page_relpath)) + # If no file present, there is no object by that name + else: + warnings.append(str(row['filename'])) + + html_file.write(GENERIC_LIST_HEADER) + html_file.write("\n".join(links)) + html_file.write(GENERIC_LIST_FOOTER) + html_file.write("\t\t\t
\n") + html_file.write("\n".join(figures)) + html_file.write("\t\t
\n") + html_file.write("\t\t
\n") + html_file.write(HTML_FOOTER) + html_file.close() + + # TODO: accumulate warnings from these functions and only display + # after all samples are processed + # _LOGGER.warn("Warning: The following files do not exist: " + + # '\t'.join(str(file) for file in warnings)) + # Return the path to the newly created sample page + return sample_page_relpath + + def create_navbar(objs, wd): + # Return a string containing the navbar prebuilt html + # Includes link to all the pages + objs_html_path = "{root}_summary.html".format( + root=os.path.join(self.prj.metadata.output_dir, self.prj.name)) + reports_dir = os.path.join(self.prj.metadata.output_dir, + "reports") + index_page_relpath = os.path.relpath(objs_html_path, wd) + navbar_header = NAVBAR_HEADER.format(logo=NAVBAR_LOGO, + index_html=index_page_relpath) + # Add link to STATUS page + status_page = os.path.join(reports_dir, "status.html") + # Use relative linking structure + relpath = os.path.relpath(status_page, wd) + status_link = NAVBAR_MENU_LINK.format(html_page=relpath, + page_name="Status") + # Create list of object page links + obj_links = [] + # If the number of objects is 20 or less, use a drop-down menu + if len(objs['key'].drop_duplicates()) <= 20: + # Create drop-down menu item for all the objects + obj_links.append(NAVBAR_DROPDOWN_HEADER.format(menu_name="Objects")) + objects_page = os.path.join(reports_dir, "objects.html") + relpath = os.path.relpath(objects_page, wd) + obj_links.append(NAVBAR_DROPDOWN_LINK.format( + html_page=relpath, + page_name="All objects")) + obj_links.append(NAVBAR_DROPDOWN_DIVIDER) + for key in objs['key'].drop_duplicates().sort_values(): + page_name = key + ".html" + page_path = os.path.join(reports_dir, page_name.replace(' ', '_').lower()) + relpath = os.path.relpath(page_path, wd) + obj_links.append(NAVBAR_DROPDOWN_LINK.format( + html_page=relpath, + page_name=key)) + obj_links.append(NAVBAR_DROPDOWN_FOOTER) + else: + # Create a menu link to the objects parent page + 
objects_page = os.path.join(reports_dir, "objects.html") + relpath = os.path.relpath(objects_page, wd) + obj_links.append(NAVBAR_MENU_LINK.format( + html_page=relpath, + page_name="Objects")) + + # Create list of sample page links + sample_links = [] + # If the number of samples is 20 or less, use a drop-down menu + if len(objs['sample_name'].drop_duplicates()) <= 20: + # Create drop-down menu item for all the samples + sample_links.append(NAVBAR_DROPDOWN_HEADER.format(menu_name="Samples")) + samples_page = os.path.join(reports_dir, "samples.html") + relpath = os.path.relpath(samples_page, wd) + sample_links.append(NAVBAR_DROPDOWN_LINK.format( + html_page=relpath, + page_name="All samples")) + sample_links.append(NAVBAR_DROPDOWN_DIVIDER) + for sample_name in objs['sample_name'].drop_duplicates().sort_values(): + page_name = sample_name + ".html" + page_path = os.path.join(reports_dir, page_name.replace(' ', '_').lower()) + relpath = os.path.relpath(page_path, wd) + sample_links.append(NAVBAR_DROPDOWN_LINK.format( + html_page=relpath, + page_name=sample_name)) + sample_links.append(NAVBAR_DROPDOWN_FOOTER) + else: + # Create a menu link to the samples parent page + samples_page = os.path.join(reports_dir, "samples.html") + relpath = os.path.relpath(samples_page, wd) + sample_links.append(NAVBAR_MENU_LINK.format( + html_page=relpath, + page_name="Samples")) + + return ("\n".join([navbar_header, status_link, + "\n".join(obj_links), + "\n".join(sample_links), + NAVBAR_FOOTER])) + + def create_project_objects(): + # If a protocol produces project level summaries add those as + # additional figures/links + all_protocols = [sample.protocol for sample in self.prj.samples] + # For each protocol report the project summarizers' results + for protocol in set(all_protocols): + obj_figs = [] + num_figures = 0 + obj_links = [] + warnings = [] + ifaces = self.prj.interfaces_by_protocol[alpha_cased(protocol)] + # Check the interface files for summarizers + for iface in ifaces: + pl 
= iface.fetch_pipelines(protocol) + summary_results = iface.get_attribute(pl, "summary_results") + # Build the HTML for each summary result + for result in summary_results: + caption = str(result['caption']) + result_file = str(result['path']).replace( + '{name}', str(self.prj.name)) + result_img = str(result['thumbnail_path']).replace( + '{name}', str(self.prj.name)) + search = os.path.join(self.prj.metadata.output_dir, + '{}'.format(result_file)) + # Confirm the file itself was produced + if glob.glob(search): + file_path = str(glob.glob(search)[0]) + file_relpath = os.path.relpath( + file_path, + self.prj.metadata.output_dir) + search = os.path.join(self.prj.metadata.output_dir, + '{}'.format(result_img)) + # Add as a figure if thumbnail exists + if glob.glob(search): + img_path = str(glob.glob(search)[0]) + img_relpath = os.path.relpath( + img_path, + self.prj.metadata.output_dir) + # Add to single row + if num_figures < 3: + obj_figs.append(HTML_FIGURE.format( + path=file_relpath, + image=img_relpath, + label=caption)) + num_figures += 1 + # Close the previous row and start a new one + else: + num_figures = 1 + obj_figs.append("\t\t\t
") + obj_figs.append("\t\t\t
") + obj_figs.append(HTML_FIGURE.format( + path=file_relpath, + image=img_relpath, + label=caption)) + # No thumbnail exists, add as a link in a list + else: + obj_links.append(OBJECTS_LINK.format( + path=file_relpath, + label=caption)) + else: + warnings.append(caption) + + if warnings: + _LOGGER.warn("Summarizer was unable to find: " + + ', '.join(str(file) for file in warnings)) + while num_figures < 3: + # Add additional empty columns for clean format + obj_figs.append("\t\t\t
") + obj_figs.append("\t\t\t
") + num_figures += 1 + return ("\n".join(["\t\t
Looper project objects
", + "\t\t
", + "\t\t\t
", + "\n".join(obj_figs), + "\t\t\t
", + "\t\t
", + OBJECTS_LIST_HEADER, + "\n".join(obj_links), + OBJECTS_LIST_FOOTER])) + + def create_index_html(objs, stats): + # Generate an index.html style project home page w/ sample summary + # statistics + objs.drop_duplicates(keep='last', inplace=True) + reports_dir = os.path.join(self.prj.metadata.output_dir, "reports") + # Generate parent index.html page + objs_html_path = "{root}_summary.html".format( + root=os.path.join(self.prj.metadata.output_dir, self.prj.name)) + # Generate parent objects.html page + object_parent_path = os.path.join(reports_dir, "objects.html") + # Generate parent samples.html page + sample_parent_path = os.path.join(reports_dir, "samples.html") + + objs_html_file = open(objs_html_path, 'w') + objs_html_file.write(HTML_HEAD_OPEN) + objs_html_file.write("\t\t\n") + objs_html_file.write(HTML_TITLE.format(project_name=self.prj.name)) + navbar = create_navbar(objs, self.prj.metadata.output_dir) + objs_html_file.write(navbar) + objs_html_file.write(HTML_HEAD_CLOSE) + + # Add stats_summary.tsv button link + tsv_outfile_path = os.path.join(self.prj.metadata.output_dir, + self.prj.name) + if hasattr(self.prj, "subproject") and self.prj.subproject: + tsv_outfile_path += '_' + self.prj.subproject + tsv_outfile_path += '_stats_summary.tsv' + stats_relpath = os.path.relpath(tsv_outfile_path, + self.prj.metadata.output_dir) + objs_html_file.write(HTML_BUTTON.format( + file_path=stats_relpath, label="Stats Summary File")) + + # Add stats summary table to index page + if os.path.isfile(tsv_outfile_path): + objs_html_file.write(TABLE_HEADER) + # Produce table columns + sample_pos = 0 + # Get unique column name list + col_names = [] + while sample_pos < len(stats): + for key, value in stats[sample_pos].items(): + col_names.append(key) + sample_pos += 1 + unique_columns = uniqify(col_names) + # Write table column names to index.html file + for key in unique_columns: + objs_html_file.write(TABLE_COLS.format(col_val=str(key))) + 
objs_html_file.write(TABLE_COLS_FOOTER) + + # Produce table rows + sample_pos = 0 + col_pos = 0 + num_columns = len(unique_columns) + for row in stats: + # Match row value to column + table_row = [] + while col_pos < num_columns: + value = row.get(unique_columns[col_pos]) + if value is None: + value = '' + table_row.append(value) + col_pos += 1 + # Reset column position counter + col_pos = 0 + sample_name = str(stats[sample_pos]['sample_name']) + objs_html_file.write(TABLE_ROW_HEADER) + for value in table_row: + if value == sample_name: + # Generate individual sample page and return link + sample_page = create_sample_html( + objs, sample_name, + stats[sample_pos], objs_html_path) + # Treat sample_name as a link to sample page + objs_html_file.write(TABLE_ROWS_LINK.format( + html_page=sample_page, + page_name=sample_page, + link_name=sample_name)) + # If not the sample name, add as an unlinked cell value + else: + objs_html_file.write(TABLE_ROWS.format( + row_val=str(value))) + objs_html_file.write(TABLE_ROW_FOOTER) + sample_pos += 1 + objs_html_file.write(TABLE_FOOTER) + else: + _LOGGER.warn("No stats file '%s'", tsv_outfile_path) + # Create parent samples page with links to each sample + create_sample_parent_html(objs) + + # Create objects pages + for key in objs['key'].drop_duplicates().sort_values(): + objects = objs[objs['key'] == key] + object_filename = str(key) + ".html" + create_object_html( + objects, objs, key, object_filename, objs_html_path) + + # Create parent objects page with links to each object type + create_object_parent_html(objs) + + # Create status page with each sample's status listed + create_status_html(objs) + + # Add project level objects + prj_objs = create_project_objects() + objs_html_file.write("\t\t
\n") + objs_html_file.write(prj_objs) + objs_html_file.write("\t\t
\n") + + # Complete and close HTML file + objs_html_file.write(HTML_FOOTER) + objs_html_file.close() + + _LOGGER.info( + "Summary (n=" + str(len(stats)) + "): " + tsv_outfile_path) + + # Generate HTML report + create_index_html(objs, stats) + +def uniqify(seq): + """ + Fast way to uniqify while preserving input order. + """ + # http://stackoverflow.com/questions/480214/ + seen = set() + seen_add = seen.add + return [x for x in seq if not (x in seen or seen_add(x))] \ No newline at end of file diff --git a/looper/looper.py b/looper/looper.py index 51557cc7e..7b5223826 100755 --- a/looper/looper.py +++ b/looper/looper.py @@ -1,6 +1,6 @@ #!/usr/bin/env python """ -Looper: a pipeline submission engine. https://github.com/peppykit/looper +Looper: a pipeline submission engine. https://github.com/pepkit/looper """ import abc @@ -26,12 +26,13 @@ from .submission_manager import SubmissionConductor from .utils import fetch_flag_files, sample_folder +from .html_reports import HTMLReportBuilder + from peppy import \ ProjectContext, COMPUTE_SETTINGS_VARNAME, SAMPLE_EXECUTION_TOGGLE from peppy.utils import alpha_cased - SUBMISSION_FAILURE_MESSAGE = "Cluster resource failure" # Descending by severity for correspondence with logic inversion. @@ -43,7 +44,6 @@ _LOGGER = logging.getLogger() - def parse_arguments(): """ Argument Parsing. @@ -56,8 +56,8 @@ def parse_arguments(): # Main looper program help text messages banner = "%(prog)s - Loop through samples and submit pipelines." 
additional_description = "For subcommand-specific options, type: " \ - "'%(prog)s -h'" + "'%(prog)s -h'" + additional_description += "\nhttps://github.com/pepkit/looper" parser = _VersionInHelpParser( description=banner, @@ -87,11 +87,12 @@ def parse_arguments(): msg_by_cmd = { "run": "Main Looper function: Submit jobs for samples.", "summarize": "Summarize statistics of project samples.", - "destroy": "Remove all files of the project.", - "check": "Checks flag status of current runs.", + "destroy": "Remove all files of the project.", + "check": "Checks flag status of current runs.", "clean": "Runs clean scripts to remove intermediate " "files of already processed jobs."} subparsers = parser.add_subparsers(dest="command") + def add_subparser(cmd): message = msg_by_cmd[cmd] return subparsers.add_parser(cmd, description=message, help=message) @@ -110,6 +111,13 @@ def add_subparser(cmd): "flag file exists marking the run (e.g. as " "'running' or 'failed'). Set this option to ignore flags " "and submit the runs anyway.") + run_subparser.add_argument( + "--allow-duplicate-names", + action="store_true", + help="Allow duplicate names? Default: False. " + "By default, pipelines will not be submitted if a sample name" + " is duplicated, since sample names should be unique. 
" + " Set this option to override this setting.") run_subparser.add_argument( "--compute", dest="compute", help="YAML file with looper environment compute settings.") @@ -127,7 +135,7 @@ def add_subparser(cmd): run_subparser.add_argument( "--lump", type=float, default=None, help="Maximum total input file size for a lump/batch of commands " - "in a single job") + "in a single job (in GB)") run_subparser.add_argument( "--lumpn", type=int, default=None, help="Number of individual scripts grouped into single submission") @@ -148,7 +156,7 @@ def add_subparser(cmd): # Common arguments for subparser in [run_subparser, summarize_subparser, - destroy_subparser, check_subparser, clean_subparser]: + destroy_subparser, check_subparser, clean_subparser]: subparser.add_argument( "config_file", help="Project configuration file (YAML).") @@ -167,8 +175,8 @@ def add_subparser(cmd): "for which protocol is not in this collection.") protocols.add_argument( "--include-protocols", nargs='*', dest="include_protocols", - help="Operate only on samples associated with these protocols; " - "if not provided, all samples are used.") + help="Operate only on samples associated with these protocols;" + " if not provided, all samples are used.") subparser.add_argument( "--sp", dest="subproject", help="Name of subproject to use, as designated in the " @@ -202,16 +210,21 @@ def add_subparser(cmd): return args, remaining_args - class Executor(object): - """ Base class that ensures the program's Sample counter starts. """ + """ Base class that ensures the program's Sample counter starts. + + Looper is made up of a series of child classes that each extend the base + Executor class. Each child class does a particular task (such as run the + project, summarize the project, destroy the project, etc). 
The parent + Executor class simply holds the code that is common to all child classes, + such as counting samples as the class does its thing.""" __metaclass__ = abc.ABCMeta def __init__(self, prj): """ The Project defines the instance; establish an iteration counter. - + :param Project prj: Project with which to work/operate on """ super(Executor, self).__init__() @@ -224,7 +237,6 @@ def __call__(self, *args, **kwargs): pass - class Checker(Executor): def __call__(self, flags=None, all_folders=False, max_file_count=30): @@ -274,16 +286,15 @@ def __call__(self, flags=None, all_folders=False, max_file_count=30): len(files), "\n".join(files)) - class Cleaner(Executor): """ Remove all intermediate files (defined by pypiper clean scripts). """ - + def __call__(self, args, preview_flag=True): """ Execute the file cleaning process. - + :param argparse.Namespace args: command-line options and arguments - :param bool preview_flag: whether to halt before actually removing files + :param bool preview_flag: whether to halt before actually removing files """ _LOGGER.info("Files to clean:") @@ -318,20 +329,19 @@ def __call__(self, args, preview_flag=True): return self(args, preview_flag=False) - class Destroyer(Executor): """ Destroyer of files and folders associated with Project's Samples """ - + def __call__(self, args, preview_flag=True): """ Completely remove all output produced by any pipelines. - + :param argparse.Namespace args: command-line options and arguments :param bool preview_flag: whether to halt before actually removing files """ - + _LOGGER.info("Results to destroy:") - + for sample in self.prj.samples: _LOGGER.info( self.counter.show(sample.sample_name, sample.protocol)) @@ -341,15 +351,15 @@ def __call__(self, args, preview_flag=True): _LOGGER.info(str(sample_output_folder)) else: destroy_sample_results(sample_output_folder, args) - + if not preview_flag: _LOGGER.info("Destroy complete.") return 0 - + if args.dry_run: _LOGGER.info("Dry run. 
No files destroyed.") + return 0 - + if not query_yes_no("Are you sure you want to permanently delete " "all pipeline results for this project?"): _LOGGER.info("Destroy action aborted by user.") @@ -361,17 +371,16 @@ def __call__(self, args, preview_flag=True): return self(args, preview_flag=False) - class Runner(Executor): """ The true submitter of pipelines """ def __call__(self, args, remaining_args): """ Do the Sample submission. - - :param argparse.Namespace args: parsed command-line options and - arguments, recognized by looper - :param list remaining_args: command-line options and arguments not + + :param argparse.Namespace args: parsed command-line options and + arguments, recognized by looper + :param list remaining_args: command-line options and arguments not recognized by looper, germane to samples/pipelines """ @@ -405,7 +414,7 @@ def __call__(self, args, remaining_args): pl_key, pl_iface, script_with_flags, self.prj, args.dry_run, args.time_delay, sample_subtype, remaining_args, args.ignore_flags, - self.prj.compute.partition, + self.prj.compute, max_cmds=args.lumpn, max_size=args.lump) submission_conductors[pl_key] = conductor pipe_keys_by_protocol[proto_key].append(pl_key) @@ -436,9 +445,12 @@ def __call__(self, args, remaining_args): sample.sample_name, sample.protocol)) skip_reasons = [] - # Don't submit samples with duplicate names unless suppressed. + if sample.sample_name in processed_samples: - skip_reasons.append("Duplicate sample name") + if args.allow_duplicate_names: + _LOGGER.warn("Duplicate name detected, but submitting anyway") + else: + skip_reasons.append("Duplicate sample name") # Check if sample should be run. 
if sample.is_dormant(): @@ -473,7 +485,7 @@ def __call__(self, args, remaining_args): sample.to_yaml(subs_folder_path=self.prj.metadata.submission_subdir) pipe_keys = pipe_keys_by_protocol.get(alpha_cased(sample.protocol)) \ - or pipe_keys_by_protocol.get(GENERIC_PROTOCOL_KEY) + or pipe_keys_by_protocol.get(GENERIC_PROTOCOL_KEY) _LOGGER.debug("Considering %d pipeline(s)", len(pipe_keys)) pl_fails = [] @@ -555,20 +567,23 @@ def __call__(self, args, remaining_args): """ - class Summarizer(Executor): """ Project/Sample output summarizer """ - + def __call__(self): """ Do the summarization. """ import csv columns = [] stats = [] - figs = _pd.DataFrame() - + objs = _pd.DataFrame() + + # First, the generic summarize will pull together all the fits + # and stats from each sample into project-combined spreadsheets. + # Create stats_summary file for sample in self.prj.samples: - _LOGGER.info(self.counter.show(sample.sample_name, sample.protocol)) + _LOGGER.info(self.counter.show(sample.sample_name, + sample.protocol)) sample_output_folder = sample_folder(self.prj, sample) # Grab the basic info from the annotation sheet for this sample. 
@@ -588,7 +603,6 @@ def __call__(self): t.drop_duplicates(subset=['key', 'pl'], keep='last', inplace=True) # t.duplicated(subset= ['key'], keep = False) - t.loc[:, 'plkey'] = t['pl'] + ":" + t['key'] dupes = t.duplicated(subset=['key'], keep=False) t.loc[dupes, 'key'] = t.loc[dupes, 'plkey'] @@ -597,26 +611,25 @@ def __call__(self): stats.append(sample_stats) columns.extend(t.key.tolist()) - self.counter.reset() - + self.counter.reset() + + # Create objects summary file for sample in self.prj.samples: + # Process any reported objects _LOGGER.info(self.counter.show(sample.sample_name, sample.protocol)) sample_output_folder = sample_folder(self.prj, sample) - # Now process any reported figures - figs_file = os.path.join(sample_output_folder, "figures.tsv") - if os.path.isfile(figs_file): - _LOGGER.info("Found figures file: '%s'", figs_file) + objs_file = os.path.join(sample_output_folder, "objects.tsv") + if os.path.isfile(objs_file): + _LOGGER.info("Found objects file: '%s'", objs_file) else: - _LOGGER.warn("No figures file '%s'", figs_file) + _LOGGER.warn("No objects file '%s'", objs_file) continue - - t = _pd.read_table( - figs_file, header=None, names=['key', 'value', 'pl']) + t = _pd.read_table(objs_file, header=None, + names=['key', 'filename', 'anchor_text', + 'anchor_image', 'annotation']) t['sample_name'] = sample.name - figs = figs.append(t, ignore_index=True) - - # all samples are parsed. Produce file. 
- + objs = objs.append(t, ignore_index=True) + tsv_outfile_path = os.path.join(self.prj.metadata.output_dir, self.prj.name) if hasattr(self.prj, "subproject") and self.prj.subproject: tsv_outfile_path += '_' + self.prj.subproject @@ -633,35 +646,32 @@ def __call__(self): tsv_outfile.close() - figs_tsv_path = "{root}_figs_summary.tsv".format( - root=os.path.join(self.prj.metadata.output_dir, self.prj.name)) - - figs_html_path = "{root}_figs_summary.html".format( - root=os.path.join(self.prj.metadata.output_dir, self.prj.name)) - - figs_html_file = open(figs_html_path, 'w') - html_header = "

Summary of sample figures for project {}

\n".format(self.prj.name) - figs_html_file.write(html_header) - sample_img_header = "

{sample_name}

\n" - sample_img_code = "

{key}

\n" - - figs.drop_duplicates(keep='last', inplace=True) - for sample_name in figs['sample_name'].drop_duplicates().sort_values(): - f = figs[figs['sample_name'] == sample_name] - figs_html_file.write(sample_img_header.format(sample_name=sample_name)) - - for i, row in f.iterrows(): - figs_html_file.write(sample_img_code.format( - key=str(row['key']), path=row['value'])) - - html_footer = "" - figs_html_file.write(html_footer) - - figs_html_file.close() _LOGGER.info( "Summary (n=" + str(len(stats)) + "): " + tsv_outfile_path) - + # Next, looper can run custom summarizers, if they exist. + all_protocols = [sample.protocol for sample in self.prj.samples] + + _LOGGER.debug("Protocols: " + str(all_protocols)) + _LOGGER.debug(self.prj.interfaces_by_protocol) + for protocol in set(all_protocols): + ifaces = self.prj.interfaces_by_protocol[alpha_cased(protocol)] + for iface in ifaces: + _LOGGER.debug(iface) + pl = iface.fetch_pipelines(protocol) + summarizers = iface.get_attribute(pl, "summarizers") + for summarizer in summarizers: + summarizer_abspath = os.path.join( + os.path.dirname(iface.pipe_iface_file), summarizer) + _LOGGER.debug([summarizer_abspath, self.prj.config_file]) + try: + subprocess.call([summarizer_abspath, self.prj.config_file]) + except OSError: + _LOGGER.warn("Summarizer was unable to run: " + str(summarizer)) + + # Produce HTML report + report_builder = HTMLReportBuilder(self.prj) + report_builder(objs, stats) def aggregate_exec_skip_reasons(skip_reasons_sample_pairs): """ @@ -680,7 +690,6 @@ def aggregate_exec_skip_reasons(skip_reasons_sample_pairs): return samples_by_skip_reason - def create_failure_message(reason, samples): """ Explain lack of submission for a single reason, 1 or more samples. """ color = Fore.LIGHTRED_EX @@ -689,7 +698,6 @@ def create_failure_message(reason, samples): return "{}: {}".format(reason_text, samples_text) - def query_yes_no(question, default="no"): """ Ask a yes/no question via raw_input() and return their answer. 
@@ -726,7 +734,6 @@ def query_yes_no(question, default="no"): "(or 'y' or 'n').\n") - def destroy_sample_results(result_outfolder, args): """ This function will delete all results for this sample @@ -743,7 +750,6 @@ def destroy_sample_results(result_outfolder, args): _LOGGER.info(result_outfolder + " does not exist.") - def uniqify(seq): """ Fast way to uniqify while preserving input order. @@ -754,7 +760,6 @@ def uniqify(seq): return [x for x in seq if not (x in seen or seen_add(x))] - class LooperCounter(object): """ Count samples as you loop through them, and create text for the @@ -792,7 +797,6 @@ def __str__(self): return "LooperCounter of size {}".format(self.total) - def _submission_status_text(curr, total, sample_name, sample_protocol, color): return color + \ "## [{n} of {N}] {sample} ({protocol})".format( @@ -800,7 +804,6 @@ def _submission_status_text(curr, total, sample_name, sample_protocol, color): Style.RESET_ALL - class _VersionInHelpParser(argparse.ArgumentParser): def format_help(self): """ Add version information to help text. """ @@ -808,7 +811,6 @@ def format_help(self): super(_VersionInHelpParser, self).format_help() - def main(): # Parse command-line arguments and establish logger. args, remaining_args = parse_arguments() @@ -839,7 +841,7 @@ def main(): # pipelines directory in project metadata pipelines directory. 
if not hasattr(prj.metadata, "pipelines_dir") or \ - len(prj.metadata.pipelines_dir) == 0: + len(prj.metadata.pipelines_dir) == 0: raise AttributeError( "Looper requires at least one pipeline(s) location.") @@ -872,7 +874,6 @@ def main(): return Cleaner(prj)(args) - if __name__ == '__main__': try: sys.exit(main()) diff --git a/looper/pipeline_interface.py b/looper/pipeline_interface.py index 50a2b2624..1538b1c32 100644 --- a/looper/pipeline_interface.py +++ b/looper/pipeline_interface.py @@ -150,6 +150,16 @@ def choose_resource_package(self, pipeline_name, file_size): raise ValueError("Attempted selection of resource package for " "negative file size: {}".format(file_size)) + universal_compute = {} + try: + universal_compute = self.select_pipeline(pipeline_name)["compute"] + except KeyError: + msg = "No universal compute settings for pipeline '{}'".format(pipeline_name) + if self.pipe_iface_file is not None: + msg += " in file '{}'".format(self.pipe_iface_file) + _LOGGER.warn(msg) + + try: resources = self.select_pipeline(pipeline_name)["resources"] except KeyError: @@ -208,6 +218,7 @@ def file_size_ante(name, data): msg = "Selected '{}' package with min file size {} Gb for file " \ "of size {} Gb.".format(rp_name, size_ante, file_size) _LOGGER.debug(msg) + rp_data.update(universal_compute) return rp_data @@ -229,7 +240,8 @@ def finalize_pipeline_key_and_paths(self, pipeline_key): """ # The key may contain extra command-line flags; split key from flags. - # The strict key is the script name itself, something like "ATACseq.py" + # The strict key was previously the script name itself, something like + # "ATACseq.py", but now is typically just something like "atacseq". 
strict_pipeline_key, _, pipeline_key_args = pipeline_key.partition(' ') full_pipe_path = \ diff --git a/looper/project.py b/looper/project.py index e582c995e..0f51e9b9f 100644 --- a/looper/project.py +++ b/looper/project.py @@ -56,24 +56,6 @@ def project_folders(self): return ["output_dir", "results_subdir", "submission_subdir"] - @staticmethod - def infer_name(path_config_file): - """ - Infer project name from config file path. - - The assumption is that the config file lives in a 'metadata' subfolder - within a folder with a name representative of the project. - - :param str path_config_file: path to the project's config file. - :return str: inferred name for project. - """ - import os - metadata_folder_path = os.path.dirname(path_config_file) - proj_root_path, _ = os.path.split(metadata_folder_path) - _, proj_root_name = os.path.split(proj_root_path) - return proj_root_name - - def build_submission_bundles(self, protocol, priority=True): """ Create pipelines to submit for each sample of a particular protocol. @@ -206,7 +188,7 @@ def build_submission_bundles(self, protocol, priority=True): pipe_iface, sample_subtype, strict_pipe_key, full_pipe_path_with_flags) - # Enforce bundle unqiueness for each strict pipeline key. + # Enforce bundle uniqueness for each strict pipeline key. maybe_new_bundle = (full_pipe_path_with_flags, sample_subtype, pipe_iface) old_bundle = bundle_by_strict_pipe_key.setdefault( @@ -241,8 +223,7 @@ def process_pipeline_interfaces(pipeline_interface_locations): :param Iterable[str] pipeline_interface_locations: locations, each of which should be either a directory path or a filepath, that specifies pipeline interface and protocol mappings information. Each such file - should be have a pipelines section and a protocol mappings section - whereas each folder should have a file for each of those sections. + should have a pipelines section and a protocol mappings section. 
:return Mapping[str, Iterable[PipelineInterfaec]]: mapping from protocol name to interface(s) for which that protocol is mapped """ diff --git a/looper/submission_manager.py b/looper/submission_manager.py index fe181964a..8dc26a994 100644 --- a/looper/submission_manager.py +++ b/looper/submission_manager.py @@ -40,7 +40,7 @@ class SubmissionConductor(object): def __init__(self, pipeline_key, pipeline_interface, cmd_base, prj, dry_run=False, delay=0, sample_subtype=None, extra_args=None, - ignore_flags=False, partition=None, + ignore_flags=False, compute_variables=None, max_cmds=None, max_size=None, automatic=True): """ Create a job submission manager. @@ -70,8 +70,9 @@ def __init__(self, pipeline_key, pipeline_interface, cmd_base, prj, each command within each job generated :param bool ignore_flags: Whether to ignore flag files present in the sample folder for each sample considered for submission - :param str partition: Name of the cluster partition to which job(s) - will be submitted + :param dict compute_variables: A dict with variables that will be made + available to the compute package. For example, this should include + the name of the cluster partition to which job(s) will be submitted + :param int | NoneType max_cmds: Upper bound on number of commands + to include in a single job script. 
:param int | float | NoneType max_size: Upper bound on total file @@ -91,7 +92,7 @@ def __init__(self, pipeline_key, pipeline_interface, cmd_base, prj, self.delay = float(delay) self.sample_subtype = sample_subtype or Sample - self.partition = partition + self.compute_variables = compute_variables if extra_args: self.extra_args_text = "{}".format(" ".join(extra_args)) else: @@ -302,8 +303,9 @@ def submit(self, force=False): "(%.2f Gb)", len(self._pool), self._curr_size) settings = self.pl_iface.choose_resource_package( self.pl_key, self._curr_size) - if self.partition: - settings["partition"] = self.partition + if self.compute_variables: + #settings["partition"] = self.partition + settings.update(self.compute_variables) if self.uses_looper_args: settings.setdefault("cores", 1) looper_argtext = \ @@ -313,7 +315,7 @@ def submit(self, force=False): prj_argtext = self.prj.get_arg_string(self.pl_key) assert all(map(lambda cmd_part: isinstance(cmd_part, str), [self.cmd_base, prj_argtext, looper_argtext])), \ - "Each command component mut be a string." + "Each command component must be a string." 
# Ensure that each sample is individually represented on disk, # specific to subtype as applicable (should just be a single @@ -441,6 +443,11 @@ def _write_script(self, template_values, prj_argtext, looper_argtext): for k, v in template_values.items(): placeholder = "{" + str(k).upper() + "}" script_data = script_data.replace(placeholder, str(v)) + + keys_left = re.findall(r'\{(.+?)\}', script_data) + + if len(keys_left) > 0: + _LOGGER.warn("Submission template variables are not all populated: '%s'", str(keys_left)) submission_script = submission_base + ".sub" script_dirpath = os.path.dirname(submission_script) @@ -453,3 +460,4 @@ def _write_script(self, template_values, prj_argtext, looper_argtext): sub_file.write(script_data) return submission_script + diff --git a/setup.py index 46e35009c..11d99fb2c 100644 --- a/setup.py +++ b/setup.py @@ -77,7 +77,7 @@ def get_static(name, condition=None): keywords="bioinformatics, sequencing, ngs", url="https://github.com/epigen/looper", author=u"Nathan Sheffield, Vince Reuter, Johanna Klughammer, Andre Rendeiro", - license="GPL2", + license="BSD2", entry_points={ "console_scripts": [ 'looper = looper.looper:main' diff --git a/update-usage-docs.sh index e15af80aa..2afef7105 100755 --- a/update-usage-docs.sh +++ b/update-usage-docs.sh @@ -2,7 +2,7 @@ cp doc/source/usage.template usage.template #looper --help > USAGE.temp 2>&1 -for cmd in "--help" "run --help" "summarize --help" "destroy --help" "check --help" "clean --help" "--help"; do +for cmd in "--help" "run --help" "summarize --help" "destroy --help" "check --help" "clean --help"; do echo $cmd echo -e "\n\`\`looper $cmd\`\`" > USAGE_header.temp echo -e "----------------------------------" >> USAGE_header.temp
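The leftover-placeholder check added to `_write_script` in the `submission_manager.py` hunk above can be exercised in isolation. This is a minimal sketch, assuming the `{VARIABLE}`-style placeholders that the substitution loop uses; `fill_template` and the example template string are illustrative names for this sketch, not part of looper's API:

```python
import re

def fill_template(template, values):
    # Substitute {KEY} placeholders (keys uppercased, mirroring the loop
    # in SubmissionConductor._write_script), then report any placeholders
    # that were never populated.
    script = template
    for k, v in values.items():
        placeholder = "{" + str(k).upper() + "}"
        script = script.replace(placeholder, str(v))
    keys_left = re.findall(r'\{(.+?)\}', script)
    return script, keys_left

filled, missing = fill_template(
    "#SBATCH --partition={PARTITION}\n#SBATCH --cpus-per-task={CORES}\n{CODE}",
    {"partition": "standard", "cores": 4})
# 'missing' lists the placeholders left unfilled, here ['CODE'],
# which is what triggers the new _LOGGER.warn call.
```

Note the regex must be `r'\{(.+?)\}'` for this to work; a pattern containing `$` before the brace can never match, so the warning would never fire.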