workflow: give input data files unique names #85

ktmeaton · 2022-06-28T18:13:05Z

Description of proposed changes

I'm running into a problem when running multiple config files on different input data (e.g. hmpxv1 vs. mpxv, or Nextstrain vs. LAPIS). Because the input data paths are hard-coded to data/sequences.fasta and data/metadata.tsv, it is difficult to run different inputs without conflict.

One option could be to add {build_name} to the input filenames to make them unique. This is an example of the changes I've made:

rule download:
    message: "Downloading sequences and metadata from data.nextstrain.org"
    output:
        sequences = "data/{build_name}_sequences.fasta.xz",
        metadata = "data/{build_name}_metadata.tsv.gz"
	...

Testing

Running the following commands produces distinct outputs in data/ and results/:

snakemake -c 1 results/mpxv/filtered.fasta --configfile config/config_mpxv.yaml
snakemake -c 1 results/hmpxv1/filtered.fasta --configfile config/config_hmpxv1.yaml

data:
- hmpxv1_metadata.tsv
- hmpxv1_metadata.tsv.gz
- hmpxv1_sequences.fasta
- hmpxv1_sequences.fasta.xz
- mpxv_metadata.tsv
- mpxv_metadata.tsv.gz
- mpxv_sequences.fasta
- mpxv_sequences.fasta.xz
results:
- hmpxv1/
- hmpxv1_metadata.tsv
- mpxv/
- mpxv_metadata.tsv
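
As a rough illustration of why the listing above has no collisions, here is a minimal Python sketch that expands the filename patterns for each build. It is not part of the PR: `DATA_PATTERNS`, `data_files`, and `conflicts` are hypothetical names, and `str.format` only stands in for Snakemake's wildcard substitution. The two uncompressed patterns are inferred from the listing, not from the rule shown earlier.

```python
# Hypothetical helper, not part of the workflow: expand the per-build
# filename patterns and check whether two builds would share any path.
DATA_PATTERNS = [
    "data/{build_name}_metadata.tsv",
    "data/{build_name}_metadata.tsv.gz",
    "data/{build_name}_sequences.fasta",
    "data/{build_name}_sequences.fasta.xz",
]


def data_files(build_name):
    """Return the data/ paths for one build.

    str.format is used here as a stand-in for Snakemake wildcard expansion.
    """
    return [pattern.format(build_name=build_name) for pattern in DATA_PATTERNS]


def conflicts(build_a, build_b):
    """Return the sorted paths that both builds would write to."""
    return sorted(set(data_files(build_a)) & set(data_files(build_b)))


if __name__ == "__main__":
    for build in ("hmpxv1", "mpxv"):
        print(build, data_files(build))
    print("shared paths:", conflicts("hmpxv1", "mpxv"))
```

With the hard-coded names data/sequences.fasta and data/metadata.tsv, every build would map to the same two paths, which is the conflict described in the PR description.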

To compare different data sources, I add the data source to the build name. For example:

#config_hmpxv1_nextstrain.yaml
build_name: "hmpxv1_nextstrain"
auspice_name: "monkeypox_hmpxv1_nextstrain"

#config_hmpxv1_lapis.yaml
build_name: "hmpxv1_lapis"
auspice_name: "monkeypox_hmpxv1_lapis"

I quite like this approach, since it mirrors the output structure of https://github.com/nextstrain/ncov. But I would love to know more about how you're implementing multiple "builds", without invoking the full input/build logic from the ncov pipeline. Thanks!

corneliusroemer · 2022-06-28T18:35:09Z

Can you share the workflow in which you're having issues with input files? I think data/ is supposed to contain all sequences. I could imagine naming them lapis_sequences.fasta and gisaid_sequences.fasta etc., but giving them names by builds would be confusing - maybe I don't understand the problem you're having.

ktmeaton · 2022-06-28T22:00:58Z

That actually clarifies things a lot, thanks! Is my understanding of the current workflow correct?

- data/sequences.fasta should contain all possible sequences.
  - Which might include sequences from lapis, gisaid, local assemblies, etc.
- A build is specified with config_{build_name}.yaml, and customized with filter options, for example:
```
## filter
min_date: 2017
min_length: 10000
filters: "--exclude-where clade!=hMPXV-1"
```
If I just wanted to make a lapis+local sequences build, maybe I could make a data_source column in data/metadata.tsv, and then do something like:
```
min_date: 2017
min_length: 10000
filters: --query "(data_source == 'lapis') | (data_source == 'local')"
```
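
The data_source idea above could be sketched roughly as follows. This is only an illustration in plain Python, assuming each source's metadata is available as TSV text: `tag_rows` and `keep_sources` are hypothetical helpers, the sample records are made up, and the filtering here mimics the intent of the query expression and is not how augur filter evaluates it.

```python
import csv
import io


def tag_rows(tsv_text, data_source):
    """Parse TSV text and add a data_source column to every row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{**row, "data_source": data_source} for row in reader]


def keep_sources(rows, allowed):
    """Keep only rows whose data_source is in the allowed set."""
    return [row for row in rows if row["data_source"] in allowed]


# Made-up example records; real metadata would have many more columns.
LAPIS_TSV = "strain\tdate\nsample_A\t2022-05-20\n"
GISAID_TSV = "strain\tdate\nsample_B\t2022-05-25\n"
LOCAL_TSV = "strain\tdate\nsample_C\t2022-06-01\n"

if __name__ == "__main__":
    # Combine all sources into one table, tagging each row with its origin.
    merged = (
        tag_rows(LAPIS_TSV, "lapis")
        + tag_rows(GISAID_TSV, "gisaid")
        + tag_rows(LOCAL_TSV, "local")
    )
    # Same selection as (data_source == 'lapis') | (data_source == 'local').
    selected = keep_sources(merged, {"lapis", "local"})
    for row in selected:
        print(row["strain"], row["data_source"])
```

The point of the design is that data/metadata.tsv stays a single combined file, and each build selects its subset through the config's filter options instead of through separate input files.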

workflow: give downloaded data files unique names

28b4103
