Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2024 STAMPS tutorials - "running k-mer analyses with sourmash" venn & upset diagrams, ordination, and presence/absence diagram #3256

Open
ctb opened this issue Jul 21, 2024 · 3 comments

Comments

@ctb
Copy link
Contributor

ctb commented Jul 21, 2024

https://hackmd.io/9ORFRJGaTOiOdEAY-Aih2A?view


open lab / Sat + Monday , Jul 20 & 22, 2024 / STAMPS 2024

Suggestion for open lab:

  • run the commands below on the "canned" data;
  • diagram out where the information is coming from:
    • sourmash sketch dna creates files containing DNA k-mers;
    • the various comparisons look at overlaps and so on for the provided k-mer collections;
  • skim the docs on sourmash and the betterplot plugin and think about the various options;
  • ask questions about:
    • conda for software installation
    • trying the analyses on other public (or private) data;
    • doing other analyses (overlap, containment, etc.) with the k-mer sketches;

Getting started

Start up RStudio on your instance, and click on Terminal.

Setup (do these once)

Create a working directory

mkdir ~/smash-data/

Install sourmash and various plugins with conda (also see a conda tutorial; you'll need to install miniforge if you're not running conda/mamba on your Jetstream computer.)

Then run:

conda env create -f /opt/shared/sourmash-data/env-smash.yml \
    -n smash

This installs:

Change directory and activate environment (each time you log in)

Change to sourmash working directory and activate sourmash software environment:

cd ~/smash-data/
conda activate smash

Might as well check for updates, sourmash is fast moving ;).

pip install -U sourmash_plugin_betterplot

Convert some metagenomes into k-mer signatures

Use sourmash sketch to turn metagenomes into k-mers:

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz \
    -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz \
    -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz \
    -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181

This will take about 5 minutes and create three SRR*.sig.zip files.

(You can grab any metagenome you want from ENA, e.g. SRR7947181; use curl -JLO <read URL to download the FASTQ file(s) and sketch them as above!)

Sketch some genomes into k-mer signatures

Use sourmash sketch to turn genomes into k-mers:

for i in /opt/shared/sourmash-data/10sketches/*.fa
do
    sourmash sketch dna $i -o genome-$(basename $i .fa).sig.zip \
        --name-from-first
done

sourmash sig cat genome*.sig.zip -o 10sketches.sig.zip

Run various analyses on the k-mer sketches!

Genome comparisons via venn and upset

Run:

sourmash scripts venn genome-{2,47,63}.sig.zip \
    -o genomes-venn.png --ident

to produce a file genomes-venn.png:

image

Run:

sourmash scripts upset genome-{2,47,63}.sig.zip \
    --show-singletons -o genomes-upset.png

to produce a file genomes-upset.png:

image

Genome distance matrix, dendrograms, and ordination

First compare the genomes:

sourmash compare 10sketches.sig.zip -o 10sketches.cmp \
    --labels-to 10sketches.labels_to.csv

Now build a matrix+dendrogram view:

sourmash scripts plot2 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mat.png

to produce 10sketches.mat.png:

image

You can produce a metric MDS plot too, colored by species:

sourmash scripts mds 10sketches.cmp 10sketches.labels_to.csv \
    -o 10sketches.mds.png \
    -C /opt/shared/sourmash-data/10sketches/10sketches-categories.csv

to produce 10sketches.mds.png:

image

Flat (no abund) metagenome comparisons via venn and upset

Run a venn comparison of metagenomes =>
metag-venn.png.

sourmash scripts venn SRR*.sig.zip \
    -o metag-venn.png

image

Run an upset comparison of metagenomes => metag-upset.png

sourmash scripts upset SRR*.sig.zip \
    --show-singletons -o metag-upset.png

image

Genome presence/absence

Calculate which GTDB genomes are in a metagenome:

sourmash scripts fastgather SRR7947178.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947178.x.gtdb-rs214.csv

(will take ~3 minutes)

Make the detection/abundance plot:

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.detection.png

=> SRR7947178.detection.png:

image

sourmash scripts presence_filter SRR7947178.x.gtdb-rs214.csv \
    -o SRR7947178.ani.png \
    --ani

=> SRR7947178.ani.png

image

The three columns being plotted from SRR7947178.x.gtdb-rs214.csv are:

  • f_match_orig - the detection / containment of the match in the metagenome;
  • match_containment_ani - the containment-based ANI of the match;
  • average_abund - abundance of matching k-mers in the metagenome;

Generating taxonomic classifications for metagenomes with sourmash

The basic workflow is to first run sourmash gather as above, and then run sourmash tax metagenome.

Run gather for each metagenome:

sourmash scripts fastgather SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947179.x.gtdb-rs214.csv
    
sourmash scripts fastgather SRR7947181.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip \
    -o SRR7947181.x.gtdb-rs214.csv

Then you can run taxonomic analyses like so:

sourmash tax metagenome -g *.x.gtdb-rs214.csv  \
    -t /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb \
    -r order -F human

See the sourmash tax documentations for various output options!

Getting % of metagenome covered from gather

If you've run a fastgather above, you will still need to run sourmash gather to get nice human-readable output; you can do this quickly by using the fastgather output as a picklist to limit the results to the previous calculated output:

sourmash gather --picklist SRR7947179.x.gtdb-rs214.csv::prefetch \
    SRR7947179.sig.zip \
    /opt/shared/sourmash-data/gtdb-rs214-k31.zip

and you should then see at the end:

found 849 matches total;
the recovered matches hit 83.1% of the abundance-weighted query.
the recovered matches hit 68.7% of the query k-mers (unweighted).

Running abundhist to generate metagenome abundance histograms

After activating the smash conda environment, you can install the abundhist plugin like so:

pip install sourmash_plugin_abundhist

and run it on any metagenome or mixture that has been sketched with -p abund like so:

sourmash scripts abundhist SRR7947178.sig.zip -M 100 \
    --figure SRR7947178.abundhist.png

Other topics we can discuss!

  • Building your own sourmash database / extending the existing databases with your own private/unpublished genomes

Appendix 1: UNIX Command Line!

Totally new to the command-line, or want to strengthen your foundation? Do this Unix Crash Course.

If you want some experience in other aspects, consider going through these:

Appendix 2: Conda tutorial

Please see my conda tutorial; you'll need to install miniforge if you're not running things on your Jetstream computer.

Appendix 3: examining assembly overlap with k-mers

See my in-class notes.

Contact info

Contact Titus at [email protected].

@ctb
Copy link
Contributor Author

ctb commented Jul 21, 2024

TODO: make setup instructions / files available somewhere!

@ctb
Copy link
Contributor Author

ctb commented Jul 21, 2024

installation stuff/sourmash

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

conda env create -n smash-foo -f /opt/shared/sourmash-data/env-smash.yml

Appendix: sourmash env file

cat > env-smash.yml <<EOF
name: smash
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - sourmash>=4.8.10,<5
    - seaborn
    - scikit-learn
    - seaborn
    - sourmash_plugin_branchwater==0.9.5
    - pip
    - pip:
      - sourmash_plugin_betterplot
      - sourmash_plugin_abundhist
EOF

Appendix 2

PatientHNC_06_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/008/SRR7947178/SRR7947178_2.fastq.gz

PatientHNC_03_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/009/SRR7947179/SRR7947179_2.fastq.gz

PatientHNC_10_illumina

curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_1.fastq.gz
curl -JLO ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR794/001/SRR7947181/SRR7947181_2.fastq.gz

Sketching

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947178_?.fastq.gz -p k=31,abund -o SRR7947178.sig.zip --name SRR7947178

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947179_?.fastq.gz -p k=31,abund -o SRR7947179.sig.zip --name SRR7947179

sourmash sketch dna /opt/shared/sourmash-data/ihmp-shotgun-data/SRR7947181_?.fastq.gz -p k=31,abund -o SRR7947181.sig.zip --name SRR7947181

@ctb
Copy link
Contributor Author

ctb commented Jul 23, 2024

thoughts for next year -

  • emphasize "coverage of metagenome by references"
  • ...and then provide alternative suggestions for what to do if coverage is poor :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant