
Commit 0d02010

Merge remote-tracking branch 'origin/master'
2 parents f4692fa + d4a97b5 commit 0d02010


43 files changed

+175
-173
lines changed

Diff for: docs/peppy/README.md

-1
@@ -1,4 +1,3 @@
1-
[![PEP compatible](http://pepkit.github.io/img/PEP-compatible-green.svg)](http://pepkit.github.io)
21
[![pypi-badge](https://img.shields.io/pypi/v/peppy)](https://pypi.org/project/peppy)
32
[![GitHub source](https://img.shields.io/badge/source-github-354a75?logo=github)](https://github.com/pepkit/peppy)
43

Diff for: docs/pepspec/README.md

-1
This file was deleted.

Diff for: docs/pepspec/howto_geofetch.md

-33
This file was deleted.

Diff for: docs/pepspec/howto_multiple_inputs.md

-87
This file was deleted.
File renamed without changes.
File renamed without changes.

Diff for: docs/pepspec/howto_automatic_groups.md renamed to docs/spec/howto-automatic-groups.md

-1
@@ -1,6 +1,5 @@
11
---
22
title: Create automatic sample groups
3-
redirect_from: "/docs/implied_attributes/"
43
---
54

65
# How to create automatic sample groups

Diff for: docs/pepspec/howto_eliminate_paths.md renamed to docs/spec/howto-eliminate-paths.md

-1
@@ -1,6 +1,5 @@
11
---
22
title: Eliminate paths from sample table
3-
redirect_from: "/docs/derived_attributes/"
43
---
54

65
# How to eliminate paths from a sample table

Diff for: docs/pepspec/howto_genome_id.md renamed to docs/spec/howto-genome-id.md

-1
@@ -1,6 +1,5 @@
11
---
22
title: Remove genome from sample table
3-
redirect_from: "/docs/implied_attributes/"
43
---
54

65
# How to remove genome from a sample table

Diff for: docs/spec/howto-geofetch.md

+11
@@ -0,0 +1,11 @@
1+
---
2+
title: "Create a PEP from GEO or SRA"
3+
---
4+
5+
# How to create a PEP from GEO or SRA
6+
7+
`geofetch` is a command-line utility that converts GEO or SRA accessions into PEP projects. You provide an accession (or a spreadsheet with a list of accessions), and `geofetch` will produce the PEP (both project config and sample annotation). `geofetch` can also download the data from SRA, so your project will be ready for direct input into any PEP-compatible tool.
8+
9+
For more information, see [`geofetch` user documentation and vignettes](/geofetch).
10+
11+
File renamed without changes.
File renamed without changes.

Diff for: docs/spec/howto-multi-value-attributes.md

+114
@@ -0,0 +1,114 @@
1+
---
2+
title: Specify multi-value attributes
3+
---
4+
5+
# How to specify multiple input files
6+
7+
Occasionally, a sample needs to have more than one value for an attribute. For example, you may have multiple input files for one sample, such as a single library that was spread across multiple sequencing lanes. This doesn't fit naturally into a tabular data format because it requires a one-to-many relationship. For these kinds of attributes, the PEP specification provides three possibilities:
8+
9+
1. Use shell expansion characters (like `*` or `[]`) in a `derive.sources` definition (good for situations where the multi-valued attributes are file paths).
10+
2. Specify a *subsample table* (infinitely customizable for more complicated merges).
11+
3. Use multiple rows per sample (useful when a single sample table is required).
12+
13+
To explain how each option works, we'll use the most common case: a single sample with an attribute that points to a file, where there are multiple input files for that attribute.
14+
15+
## Option 1: wildcards
16+
17+
To use wildcards, just use asterisks in your data source specifications, like this:
18+
19+
```{yaml}
derive:
  sources:
    data_R1: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz"
    data_R2: "${DATA}/{id}_S{nexseq_num}_L00*_R2_001.fastq.gz"
```
25+
26+
PEP will automatically glob these to whatever files are on disk, and use those as the multiple values for the attribute. This only works if:
27+
28+
1. the attribute with multiple values is a file,
29+
2. the file paths are systematic, computable with a shell wildcard, and
30+
3. the files are on a file system local to where the PEP is processed, so the paths can be resolved at runtime.
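
As a rough illustration of the expansion described above, here is a minimal Python sketch. It is not the code any PEP implementation actually runs: the function name `expand_source` and the order of operations (environment variables first, then sample attributes, then globbing) are assumptions made for the example.

```python
import glob
import os


def expand_source(template, sample):
    """Fill a source template for one sample and glob the result.

    Hypothetical helper for illustration only; not part of any PEP library.
    """
    # Expand environment variables such as ${DATA} first, so the braces in
    # ${DATA} are not mistaken for a sample-attribute placeholder.
    pattern = os.path.expandvars(template)
    # Substitute sample attributes such as {id} and {nexseq_num}.
    # (Assumes the expanded path itself contains no literal braces.)
    pattern = pattern.format(**sample)
    # Shell-style wildcard expansion against the local file system;
    # sorted so the order of the multiple values is stable.
    return sorted(glob.glob(pattern))
```

Calling `expand_source` with the `data_R1` template and a sample such as `{"id": "frog", "nexseq_num": "1"}` would return every matching lane file on disk, which is why the three conditions above must hold.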
31+
32+
## Option 2: the subsample table
33+
34+
A subsample table is a table with *one row per attribute value*, so a single sample appears multiple times in the table. To use one, provide a subsample table in your project config:
35+
36+
```{yaml}
sample_table: annotation.csv
subsample_table: subsample_table.csv
```
40+
41+
Make sure the `sample_name` column of this table matches the `sample_name` column in your `sample_table`, and then include any columns that require multiple values. `PEP` will automatically include all of these values as appropriate.
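
A quick way to check that the two `sample_name` columns agree is sketched below in Python. The function name `unmatched_subsample_names` is hypothetical, and the tables are passed as CSV text to keep the example self-contained; this is a sanity check you might run yourself, not a feature of any PEP tool.

```python
import csv
import io


def unmatched_subsample_names(sample_csv, subsample_csv):
    """Return subsample names that do not appear in the sample table.

    Hypothetical helper for illustration; both arguments are CSV text.
    """
    known = {
        row["sample_name"] for row in csv.DictReader(io.StringIO(sample_csv))
    }
    referenced = {
        row["sample_name"] for row in csv.DictReader(io.StringIO(subsample_csv))
    }
    # Any name listed here would have nothing to merge into.
    return sorted(referenced - known)
```

An empty list means every subsample row has a matching sample; a misspelled name shows up in the returned list.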
42+
43+
Here's a simple example of a PEP that uses subsamples. If you define `annotation.csv` like this:
44+
45+
```{csv}
sample_name,library
frog_1,anySampleType
frog_2,anySampleType
```
50+
51+
Then define `subsample_table.csv` as follows, mapping `sample_name` to a new column called `file`:
52+
53+
```{csv}
sample_name,file
frog_1,data/frog1a_data.txt
frog_1,data/frog1b_data.txt
frog_1,data/frog1c_data.txt
frog_2,data/frog2a_data.txt
frog_2,data/frog2b_data.txt
```
61+
62+
This sets up a simple relational database that maps multiple files to each sample. You can also combine these multi-value attributes with derived columns; columns will first be derived, and then merged. Let's now look at a slightly more complex example that has two multi-value attributes (as in a paired-end sequencing experiment with multiple files for each of R1 and R2).
63+
64+
The sample table is:
65+
66+
```{csv}
sample_name,library
frog_1,anySampleType
frog_2,anySampleType
```
71+
72+
And the subsample table is:
73+
74+
```{csv}
sample_name,read1,read2
frog_1,data/frog1a_R1.txt,data/frog1a_R2.txt
frog_1,data/frog1b_R1.txt,data/frog1b_R2.txt
frog_1,data/frog1c_R1.txt,data/frog1c_R2.txt
frog_2,data/frog2a_R1.txt,data/frog2a_R2.txt
frog_2,data/frog2b_R1.txt,data/frog2b_R2.txt
```
82+
83+
This yields 2 samples, each with a single-valued attribute of `library`, and multi-valued attributes of `read1` and `read2`.
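
To make the merge concrete, the following Python sketch combines a sample table and a subsample table into the two samples described above. It only illustrates the relational idea; `merge_subsamples` is a hypothetical name, the tables are inlined as CSV text, and real PEP processors may represent the result differently.

```python
import csv
import io
from collections import defaultdict

SAMPLE_TABLE = """sample_name,library
frog_1,anySampleType
frog_2,anySampleType
"""

SUBSAMPLE_TABLE = """sample_name,read1,read2
frog_1,data/frog1a_R1.txt,data/frog1a_R2.txt
frog_1,data/frog1b_R1.txt,data/frog1b_R2.txt
frog_1,data/frog1c_R1.txt,data/frog1c_R2.txt
frog_2,data/frog2a_R1.txt,data/frog2a_R2.txt
frog_2,data/frog2b_R1.txt,data/frog2b_R2.txt
"""


def merge_subsamples(sample_csv, subsample_csv):
    """Attach multi-valued attributes from the subsample table to samples."""
    # One dict per sample, keyed by sample_name (assumed unique here).
    samples = {
        row["sample_name"]: dict(row)
        for row in csv.DictReader(io.StringIO(sample_csv))
    }
    # Collect every subsample value into a list per (sample, attribute).
    multi = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(subsample_csv)):
        name = row.pop("sample_name")
        for attr, value in row.items():
            multi[name][attr].append(value)
    # Merge the lists into the matching sample records.
    for name, attrs in multi.items():
        samples[name].update(attrs)
    return samples


merged = merge_subsamples(SAMPLE_TABLE, SUBSAMPLE_TABLE)
```

Here `merged["frog_1"]["library"]` stays a single string, while `merged["frog_1"]["read1"]` and `merged["frog_1"]["read2"]` are lists of three paths each, kept in row order so the R1 and R2 files stay paired.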
84+
85+
## Option 3: Multiple rows per sample
86+
87+
In PEP v2.1.0, we relaxed the constraint that each sample must correspond to exactly one row in the sample table, and introduced the possibility of encoding multi-valued samples with multiple rows. The same data as the two-table approach above can be represented like this:
88+
89+
```{csv}
sample_name,library,read1,read2
frog_1,anySampleType,data/frog1a_R1.txt,data/frog1a_R2.txt
frog_1,,data/frog1b_R1.txt,data/frog1b_R2.txt
frog_1,,data/frog1c_R1.txt,data/frog1c_R2.txt
frog_2,anySampleType,data/frog2a_R1.txt,data/frog2a_R2.txt
frog_2,,data/frog2b_R1.txt,data/frog2b_R2.txt
```
97+
98+
Now, this is the `sample_table`, but the `sample_name` column is not unique: the first row for each sample has values for any single-valued attributes (`library` in the example), and the other columns can be specified in multiple rows. The PEP processor will integrate these rows and correctly represent them as two samples.
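
The row integration can be pictured with the Python sketch below. The collapsing rule used here (skip empty cells, keep an attribute as a single value if only one non-empty cell was seen, otherwise keep a list) is an assumption chosen for illustration, and `collapse_rows` is a hypothetical name; the rule an actual PEP processor applies may differ.

```python
import csv
import io


def collapse_rows(table_csv):
    """Collapse a multi-row sample table into one record per sample."""
    collected = {}
    for row in csv.DictReader(io.StringIO(table_csv)):
        attrs = collected.setdefault(row["sample_name"], {})
        for key, value in row.items():
            # Empty cells carry no value for this row, so skip them.
            if key == "sample_name" or value == "":
                continue
            attrs.setdefault(key, []).append(value)
    # Assumed rule: one value stays a scalar, several values become a list.
    return {
        name: {k: (v[0] if len(v) == 1 else v) for k, v in attrs.items()}
        for name, attrs in collected.items()
    }
```

Applied to the table above, this yields two records, with `library` as a single value and `read1` and `read2` as lists.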
99+
100+
## Further thoughts
101+
102+
- You can find more examples of multi-valued attributes in the examples at [PEP in practice](pep_in_practice.md).
103+
104+
- There are pros and cons to each of these approaches. Representing a single sample with multiple rows in the sample table can cause conceptual and analytical challenges. There is value in the simplicity of being able to see each row as a separate sample, which is preserved in the subsample table approach. However, this leads to the requirement of an additional file to represent samples, which can be problematic in other settings. One advantage of PEP is that either approach can be used, so you can use the approach that fits your situation best.
105+
106+
- Subsample tables are intended to handle multiple values *of the same type*. Different *classes* of input files, like read1 and read2, are *not* the same type, so they should *not* be combined into a single subsample attribute. Instead, handle them as different columns in the main sample annotation sheet (and therefore different arguments to the pipeline). If you have read1 and read2 and each of these has multiple inputs, those values go in separate columns of the subsample table.
107+
108+
- If your project has some samples with subsample entries, but others without, then you only need to include samples in the subsample table if they have subsample attributes. Other samples can just be included in the primary annotation table. However, this means you'll need to make sure you provide the correct columns in the primary `sample_table` sheet; the simple example above assumes every sample has subsample attributes, so it doesn't need to define `file` in the `sample_table`. If you had samples without attributes specified in the subsample table, you'd need to specify that column in the primary sheet.
109+
110+
- In practice, we've found that most projects can be solved using wildcards, and subsample tables are not necessary. If you start to think about how to use subsample tables, first double-check that you can't solve the problem using a simple wildcard; that makes it much easier to think about, and it should be possible as long as the files are named systematically.
111+
112+
- **Don't combine these approaches**. Using both wildcards and a subsample table simultaneously for the same sample can lead to recursion. Using a subsample table requires your sample table to have unique sample names, which precludes the possibility of multi-row samples. So, just pick an approach and stick to it.
113+
114+
- Keep in mind: **subsample tables are for files of the same type**. A paired-end sequencing experiment, for example, yields two input files for each sample (read1 and read2). These are *not equivalent*, so **you do not use subsample tables to put read1 and read2 files into the same attribute**; instead, you would list in a subsample table all files (perhaps from different lanes) for read1, and then, separately, all files for read2. This is two separate attributes, which each
File renamed without changes (20 files).

Diff for: docs/pepspec/peppy.md renamed to docs/spec/peppy.md

+1-1
@@ -4,7 +4,7 @@
44

55
### Code and documentation
66

7-
* [User documentation and vignettes](http://peppy.databio.org/)
7+
* [User documentation and vignettes](/peppy)
88
* [peppy API](http://peppy.databio.org/en/latest/autodoc_build/peppy/)
99
* [Source code at Github](https://github.com/pepkit/peppy)
1010

Diff for: docs/pepspec/pepr.md renamed to docs/spec/pepr.md

File renamed without changes.
File renamed without changes.

Diff for: docs/pepspec/simple_example.md renamed to docs/spec/simple-example.md

+2-1
@@ -1,5 +1,6 @@
1+
# PEP specification: A simple example
12

2-
# How do I create my own PEP? A simple example
3+
How do I create my own PEP?
34

45
<img src="../img/pep_contents.svg" alt="" style="float:right; margin-left:20px" width="250px">
56
