A Kraken2 + Bracken based pipeline for classifying and quantifying microbial reads from GDC TCGA WGS data.
Feature highlights:
- Supports Kraken2 or KrakenUniq for read classification.
- Supports HISAT2 or Bowtie2 for host filtering.
- Automatically builds the latest MicrobialDB nucleotide and protein Kraken2 databases (for KrakenUniq nucleotide only). MicrobialDB consists of archaea, bacteria, viral, human, UniVec_Core, and eukaryotic pathogen genomes (EuPathDBv54) with contaminants removed.
- Includes the (default enabled) option to do a second pass Kraken2 protein translated search of the unclassified reads from the Kraken2 first pass and combining the report results before feeding into Bracken.
- Properly handles TCGA WGS merged BAMs with mixed PE/SE reads and multiple read lengths by splitting them into read group level FASTQs and processing data through the pipeline at read group level before aggregating the Bracken count results back to merged BAM level.
See References for the general basis for this pipeline and more information.
A high-level pipeline summary:
1) GDC TCGA WGS Unmapped Read BAMs
2) Biobambam2 Unmapped Read FASTQs (split by read group when BAM has mixed PE/SE or read lengths)
3) HISAT2 (or Bowtie2) Host Filtering (with T2T-CHM13v2.0)
4) Kraken2 (or KrakenUniq) Nucleotide Read Classification (with MicrobialDB)
5) Kraken2 Translated Search Read Classification of Unclassified Reads (with protein MicrobialDB)
6) KrakenTools Combine Nucleotide and Protein Reports
7) Bracken Read Quantification
8) Aggregate Read Group Level Bracken Counts/Abundances
9) Count/Abundance Data Matrix
The project was developed under GNU Linux and MacOS X and assumes the use of a Unix command line shell. Both Linux and MacOS provide a command line shell by default. Other needed tools will be installed by the instructions below.
Install and set up Miniforge3
Obtain the project source and create a conda environment with the tools needed to run the project:
git clone https://github.com/hermidalc/tcga-wgs-kraken-microbial-quant.git
cd tcga-wgs-kraken-microbial-quant
mamba env create -f envs/tcga-wgs-kraken-microbial-quant.yaml
mamba activate tcga-wgs-kraken-microbial-quant
Test that the installation is working by doing a dry run (if you don't
have a GDC token yet and wish to test your install do
GDC_TOKEN='' snakemake --dry-run
). Below are the job statistics you
would see for an analysis of all TCGA WGS primary tumors using Kraken2
with the optional addtional step of Kraken2 protein translated search of
unclassified reads:
$ snakemake --dry-run
Building DAG of jobs...
Job stats:
job count
----------------------------- -------
all 1
bracken_combined_rg_counts 82
bracken_count_matrix 1
bracken_db_kmer_distr 7
bracken_db_kraken_classif 1
bracken_read_quant 10756
eupathdb_fasta_archive 1
eupathdb_fastas 2
eupathdb_merged_fasta 2
eupathdb_nucl_seqid2taxid_map 1
gdc_rg_unmapped_fastqs 82
gdc_sg_unmapped_fastq_pe 10735
gdc_sg_unmapped_fastq_se 21
gdc_unmapped_bam 10838
host_filtered_fastq_pe 10735
host_filtered_fastq_se 21
host_genome_fasta 1
host_genome_index 1
kraken2_combined_report 10756
kraken2_db 2
kraken2_db_taxonomy 2
kraken2_eupathdb_nucl_library 1
kraken2_eupathdb_prot_library 1
kraken2_nucl_db_library 5
kraken2_nucl_read_classif_pe 10735
kraken2_nucl_read_classif_se 21
kraken2_prot_db_library 4
kraken2_prot_read_classif_pe 10735
kraken2_prot_read_classif_se 21
total 75571
- Note: this Snakemake workflow uses a
checkpoint
because we do not know ahead of time how many read group level unmapped read FASTQs need to be generated from mixed PE/SE or mixed read length unmapped read GDC BAMs. Therefore the dry run job stats above do not reflect these additional jobs determined dynamically at runtime and re-evaluation of the workflow DAG. As of GDC Data Release v40 there were82
mixed PE/SE or mixed read length TCGA BAMs requiring splitting into3,197
read group level FASTQs, resulting in a total13,953
FASTQs processed through the pipeline and94,667
total jobs.
Given the compute intensive nature of this pipeline and the large number of jobs required to execute it we highly recommend running it on an HPC cluster.
Set your GDC controlled access authentication token in the environment
variable GDC_TOKEN
or GDC_TOKEN_FILE
, or the file ~/.gdc_token
so the pipeline can get the token.
Run the workflow:
./scripts/run_snakemake.sh
Run the workflow on a cluster:
./scripts/run_snakemake_slurm.sh \
--workflow-profile workflow/profiles/biowulf \
--sbatch-opts="--time=3-00:00:00 --cpus-per-task=28 --mem=10248"
I've provided a SLURM cluster configuration for the NIH HPC cluster, though it is straightforward to create a Snakemake v8+ cluster config for your particular needs.
The pipeline is configured to not require much storage, as intermediate
files are flagged as temporary and deleted when they are no longer
needed as the pipeline is running, except for the Kraken2 classification
and Bracken quantification results. If you would like to keep intermediate
files for other uses, specifiy the --notemp
snakemake option in the
workflow run command above.
- Lu et al. Metagenome analysis using the Kraken software suite. Nat Protoc. 2022 Dec;17(12):2815-2839. doi: 10.1038/s41596-022-00738-y
- Ge et al. Comprehensive analysis of microbial content in whole-genome sequencing samples from The Cancer Genome Atlas project. bioRxiv 2024.05.24.595788