workflow: give input data files unique names #85

Open
wants to merge 1 commit into
base: master

ktmeaton

Description of proposed changes

I'm running into a problem when running multiple config files on different input data (e.g. hmpxv1 vs. mpxv, or Nextstrain vs. LAPIS). Because the input data paths are hard-coded to data/sequences.fasta and data/metadata.tsv, every build reads and writes the same files, so different inputs cannot be run without conflict.

One option could be to add the {build_name} into input filenames to make them unique. This is an example of the changes I've made:

rule download:
    message: "Downloading sequences and metadata from data.nextstrain.org"
    output:
        sequences = "data/{build_name}_sequences.fasta.xz",
        metadata = "data/{build_name}_metadata.tsv.gz"
    ...
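
To illustrate why build-specific names avoid the conflict, here is a minimal Python sketch. It is not part of the workflow: `download_outputs` and `find_collisions` are hypothetical helpers, and `str.format` only stands in for however the workflow substitutes `build_name` into the patterns.

```python
# Hypothetical helpers, not part of the workflow. The two patterns are
# copied from the outputs of the `download` rule above.
OUTPUT_PATTERNS = {
    "sequences": "data/{build_name}_sequences.fasta.xz",
    "metadata": "data/{build_name}_metadata.tsv.gz",
}


def download_outputs(build_name):
    """Return the concrete output paths for one build."""
    return {
        key: pattern.format(build_name=build_name)
        for key, pattern in OUTPUT_PATTERNS.items()
    }


def find_collisions(build_names):
    """Return the paths that more than one build would write to."""
    owner = {}
    collisions = set()
    for name in build_names:
        for path in download_outputs(name).values():
            if path in owner and owner[path] != name:
                collisions.add(path)
            owner.setdefault(path, name)
    return collisions


if __name__ == "__main__":
    print(download_outputs("hmpxv1"))
    print(find_collisions(["mpxv", "hmpxv1"]))
```

With the hard-coded names, both builds would map to the same two paths; with the `{build_name}` prefix, the set of colliding paths for `mpxv` and `hmpxv1` is empty.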

Testing

Running the following commands will produce distinct outputs in data and results:

snakemake -c 1 results/mpxv/filtered.fasta --configfile config/config_mpxv.yaml
snakemake -c 1 results/hmpxv1/filtered.fasta --configfile config/config_hmpxv1.yaml
  • data:
    • hmpxv1_metadata.tsv
    • hmpxv1_metadata.tsv.gz
    • hmpxv1_sequences.fasta
    • hmpxv1_sequences.fasta.xz
    • mpxv_metadata.tsv
    • mpxv_metadata.tsv.gz
    • mpxv_sequences.fasta
    • mpxv_sequences.fasta.xz
  • results:
    • hmpxv1/
    • hmpxv1_metadata.tsv
    • mpxv/
    • mpxv_metadata.tsv

To compare different data sources, I add the data source to the build name. For example:

# config_hmpxv1_nextstrain.yaml
build_name: "hmpxv1_nextstrain"
auspice_name: "monkeypox_hmpxv1_nextstrain"

# config_hmpxv1_lapis.yaml
build_name: "hmpxv1_lapis"
auspice_name: "monkeypox_hmpxv1_lapis"

I quite like this approach, since it mirrors the output structure of https://github.com/nextstrain/ncov. But I would love to know more about how you're implementing multiple "builds", without invoking the full input/build logic from the ncov pipeline. Thanks!

@corneliusroemer
Member

Can you share the workflow in which you're having issues with input files? I think data/ is supposed to contain all sequences. I could imagine naming them lapis_sequences.fasta and gisaid_sequences.fasta etc., but naming them by build would be confusing. Maybe I don't understand the problem you're having.

@ktmeaton
Author

That actually clarifies things a lot, thanks! Is my understanding of the current workflow correct?

  • data/sequences.fasta should contain all possible sequences.
    • Which might include sequences from lapis, gisaid, local assemblies, etc
  • A build is specified with config_{build_name}.yaml, and customized with filter options, example:
    ## filter
    min_date: 2017
    min_length: 10000
    filters: "--exclude-where clade!=hMPXV-1"
  • If I just wanted to make a lapis+local sequences build, maybe I could make a data_source column in data/metadata.tsv, and then do something like:
    min_date: 2017
    min_length: 10000
    filters: --query "(data_source == 'lapis') | (data_source == 'local')"
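
As a rough sketch of the data_source idea, the Python below tags each row with its source while combining metadata, then keeps only the lapis and local rows, which is the selection the query above is meant to express. The helper names and the toy TSV contents are made up for illustration; only the `data_source` column name and the source labels come from the discussion.

```python
import csv
import io


def tag_rows(tsv_text, source):
    """Parse one TSV and add a data_source column to every row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{**row, "data_source": source} for row in reader]


def select_sources(rows, wanted):
    """Keep only the rows whose data_source is in `wanted`."""
    return [row for row in rows if row["data_source"] in wanted]


# Toy metadata for illustration only; real metadata has more columns.
lapis = "strain\tdate\nA\t2022-06-01\n"
gisaid = "strain\tdate\nB\t2022-06-02\n"
local = "strain\tdate\nC\t2022-06-03\n"

# Combined table, analogous to a single data/metadata.tsv.
rows = (
    tag_rows(lapis, "lapis")
    + tag_rows(gisaid, "gisaid")
    + tag_rows(local, "local")
)

kept = select_sources(rows, {"lapis", "local"})
print([row["strain"] for row in kept])  # ['A', 'C']
```

Keeping one combined table with a source column means a new build only needs a different filter, not a separate set of input files.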
