|
| 1 | +--- |
| 2 | +title: Specify multi-value attributes |
| 3 | +--- |
| 4 | + |
| 5 | +# How to specify multiple input files |
| 6 | + |
| 7 | +Occasionally, a sample needs to have more than one value for an attribute. For example, you may have multiple input files for one sample, such as a single library that was spread across multiple sequencing lanes. This doesn't fit naturally into a tabular data format because it requires a one-to-many relationship. For these kinds of attributes, the PEP specification provides three possibilities: |
| 8 | + |
| 9 | +1. Use shell expansion characters (like `*` or `[]`) in a `derived.source` definition (good for situations where the multi-valued attributes are file paths) |
| 10 | +2. Specify a *subsample table* (infinitely customizable for more complicated merges). |
| 11 | +3. Use a multiple rows per sample (useful when a single sample table is required). |
| 12 | + |
| 13 | +To explain how this works, we'll use the most common example case of needing it: a single sample with an attribute that points to a file, but there are multiple input files for that attribute. |
| 14 | + |
| 15 | +## Option 1: wildcards |
| 16 | + |
| 17 | +To use wildcards, just use asterisks in your data source specifications, like this: |
| 18 | + |
| 19 | +```{yaml} |
| 20 | +derive: |
| 21 | + sources: |
| 22 | + data_R1: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz" |
| 23 | + data_R2: "${DATA}/{id}_S{nexseq_num}_L00*_R2_001.fastq.gz" |
| 24 | +``` |
| 25 | + |
| 26 | +PEP will automatically glob these to whatever files are on disk, and use those as the multiple values for the attribute. This only works if: |
| 27 | + |
| 28 | +1. the attribute with multiple values is a file, |
| 29 | +2. the file paths are systematic, computable with a shell wildcard, and |
| 30 | +3. the file system is local to the PEP and can be computed at runtime. |
| 31 | + |
| 32 | +## Option 2: the subsample table |
| 33 | + |
| 34 | +A subsample table is a table with *one row per attribute* -- so a single sample appears multiple times in the table. Just provide a subsample table in your project config: |
| 35 | + |
| 36 | +```{yaml} |
| 37 | +sample_table: annotation.csv |
| 38 | +subsample_table: subsample_table.csv |
| 39 | +``` |
| 40 | + |
| 41 | +Make sure the `sample_name` column of this table matche the `sample_name` column in your sample_table, and then include any columns that require multiple values. `PEP` will automatically include all of these values as appropriate. |
| 42 | + |
| 43 | +Here's a simple example of a PEP that uses subsamples. If you define `annotation.csv` like this: |
| 44 | + |
| 45 | +```{csv} |
| 46 | +sample_name,library |
| 47 | +frog_1,anySampleType |
| 48 | +frog_2,anySampleType |
| 49 | +``` |
| 50 | + |
| 51 | +Then point `subsample_table.csv` to the following, which maps `sample_name` to a new column called `file` |
| 52 | + |
| 53 | +```{csv} |
| 54 | +sample_name,file |
| 55 | +frog_1,data/frog1a_data.txt |
| 56 | +frog_1,data/frog1b_data.txt |
| 57 | +frog_1,data/frog1c_data.txt |
| 58 | +frog_2,data/frog2a_data.txt |
| 59 | +frog_2,data/frog2b_data.txt |
| 60 | +``` |
| 61 | + |
| 62 | +This sets up a simple relational database that maps multiple files to each sample. You can also combine these multi-value attributes with derived columns; columns will first be derived, and then merged. Let's now look at a slightly more complex example that has two multi-value attributes (such as the case with a paired-end sequencing experiment with multiple files for each R1 and R2): |
| 63 | + |
| 64 | +The sample table is: |
| 65 | + |
| 66 | +```{csv} |
| 67 | +sample_name library |
| 68 | +frog_1 anySampleType |
| 69 | +frog_2 anySampleType |
| 70 | +``` |
| 71 | + |
| 72 | +And the subsample table is: |
| 73 | + |
| 74 | +```{csv} |
| 75 | +sample_name,read1,read2 |
| 76 | +frog_1,data/frog1a_R1.txt,data/frog1a_R2.txt |
| 77 | +frog_1,data/frog1b_R1.txt,data/frog1b_R2.txt |
| 78 | +frog_1,data/frog1c_R1.txt,data/frog1c_R2.txt |
| 79 | +frog_2,data/frog2a_R1.txt,data/frog2a_R2.txt |
| 80 | +frog_2,data/frog2b_R1.txt,data/frog2b_R2.txt |
| 81 | +``` |
| 82 | + |
| 83 | +This yields 2 samples, each with a single-valued attribute of `library`, and multi-valued attributes of `read1` and `read2`. |
| 84 | + |
| 85 | +## Option 3: Multiple rows per sample |
| 86 | + |
| 87 | +In PEP v2.1.0, we relaxed the constraint that each sample must correspond to exactly one row in the sample table, and introduce the possibility of encoding multi-valued samples with multiple rows. The same data as the two-table approach above can be represented like this: |
| 88 | + |
| 89 | +```{csv} |
| 90 | +sample_name,library,read1,read2 |
| 91 | +frog_1,anySampleType,data/frog1a_R1.txt,data/frog1a_R2.txt |
| 92 | +frog_1,,data/frog1b_R1.txt,data/frog1b_R2.txt |
| 93 | +frog_1,,data/frog1c_R1.txt,data/frog1c_R2.txt |
| 94 | +frog_2,anySampleType,data/frog2a_R1.txt,data/frog2a_R2.txt |
| 95 | +frog_2,,data/frog2b_R1.txt,data/frog2b_R2.txt |
| 96 | +``` |
| 97 | + |
| 98 | +Now, this is the `sample_table`, but the `sample_name` column is not unique -- the first row has values for any single-valued attributes (`library` in the example), and then other columns can be specified in multiple rows. The PEP processor will integrate these samples in correctly represent this as two samples. |
| 99 | + |
| 100 | +## Further thoughts |
| 101 | + |
| 102 | +- You can find more examples of multi-valued attributes in the examples at [PEP in practice](pep_in_practice.md). |
| 103 | + |
| 104 | +- There are pros and cons to each of these approaches. Representing a single sample with multiple rows in the sample table can cause conceptual and analytical challenges. There is value in the simplicity of being able to see each row as a separate sample, which is preserved in the subsample table approach. However, this leads to the requirement of an additional file to represent samples, which can be problematic in other settings. One advantage of PEP is that either approach can be used, so you can use the approach that fits your situation best. |
| 105 | + |
| 106 | +- Subsample tables are intended to handle multiple values *of the same type*. To handle different *classes* of input files, like read1 and read2, these are *not* the same type, and are therefore *not* put into a subsample table. Instead, these should be handled as different columns in the main sample annotation sheet (and therefore different arguments to the pipeline). It is possible that you will want to have read1 and read2, and then each of these could have multiple inputs, which would then be placed in the subsample table. |
| 107 | + |
| 108 | +- If your project has some samples with subsample entries, but others without, then you only need to include samples in the subsample table if they have subsample attributes. Other samples can just be included in the primary annotation table. However, this means you'll need to make sure you provide the correct columns in the primary `sample_table` sheet; the simple example above assumes every sample has subsample attributes, so it doesn't need to define `file` in the `sample_table`. If you had samples without attributes specified in the subsample table, you'd need to specify that column in the primary sheet. |
| 109 | + |
| 110 | +- In practice, we've found that most projects can be solved using wildcards, and subsample tables are not necessary. If you start to think about how to use subsample tables, first double-check that you can't solve the problem using a simple wildcard; that makes it much easier to think about, and it should be possible as long as the files are named systematically. |
| 111 | + |
| 112 | +- **Don't combine these approaches**. Using both wildcards and a subsample table simultaneously for the same sample can lead to recursion. Using a subsample table requires your sample table to have unique sample names, which precludes the possibility of multi-row samples. So, just pick an approach and stick to it. |
| 113 | + |
| 114 | +- Keep in mind: **subsample tables is for files of the same type**. A paired-end sequencing experiment, for example, yields two input files for each sample (read1 and read2). These are *not equivalent*, so **you do not use subsample tables to put read1 and read2 files into the same attribute**; instead, you would list in a subsample table all files (perhaps from different lanes) for read1, and then, separately, all files for read2. This is two separate attributes, which each |
0 commit comments