-
Notifications
You must be signed in to change notification settings - Fork 18
Analyzing PacBio HiFi Mock Community 16S Data with QIIME 2
-
Updated 2022-3-26: First upload
-
Updated 2022-4-13: Example instruction to use other databases
-
This tutorial contains a step-by-step guide on using
QIIME 2
to analyze PacBio's HiFi 16S data. Thepb-16S-nf
pipeline simplify these steps into a single easy-to-use pipeline.
QIIME 2
is traditionally the tool of choice for 16S researcher due to the large amount of plugins and its ease of use. Almost everything that one will need to analyze complex communites based on 16S rRNA sequencing can be done within the QIIME 2 platform. However, with PacBio full-length 16S, the analysis workflow typically requires one to use tool such as DADA2
in R first, then export the results into QIIME 2 format.
Thankfully, the recent release of QIIME 2
version 2022-2 has now integrated a wrapper for DADA2 that allows one to process CCS data directly without the need of using R (Although behind the scene, it's just running DADA2 which is based on R!). For this tutorial, we will use the public demo dataset ATCC-MSA1003 16S sequencing provided on PacBio's website to demonstrate a typical analysis workflow taking FASTQ files to taxonomy barplot.
QIIME 2
documentation page provides many useful and comprehensive tutorials that we highly encourage any 16S users to go through. The tutorial we provide here is created as an example of how you can reconstruct the mock community dataset using a publicly available HiFi 16S dataset.
A mock community ATCC-MSA1003 has been sequenced on a Sequel II system. The same sample is replicated 192 times to simulate a 192-plex multiplexed experiment, commonly done for 16S rRNA sequencing due to the high throughput of Sequel II. This mock community contains 20 strains of bacteria mixed in a staggered fashion, where the bacteria abundance ranges from 0.02% to 18%. For details on the mock community, please refer to ATCC website: https://www.atcc.org/products/msa-1003.
The full 192-plex dataset can be downloaded here: DevNet. We will use 8 of the 192 barcodes pairs to save on the computational time for this demo. We will use these 8 FASTQ files for the tutorial and results below:
demultiplex.16S_For_bc1005--16S_Rev_bc1056.hifi_reads.fastq.gz
demultiplex.16S_For_bc1005--16S_Rev_bc1057.hifi_reads.fastq.gz
demultiplex.16S_For_bc1005--16S_Rev_bc1062.hifi_reads.fastq.gz
demultiplex.16S_For_bc1005--16S_Rev_bc1075.hifi_reads.fastq.gz
demultiplex.16S_For_bc1005--16S_Rev_bc1100.hifi_reads.fastq.gz
demultiplex.16S_For_bc1007--16S_Rev_bc1075.hifi_reads.fastq.gz
demultiplex.16S_For_bc1020--16S_Rev_bc1059.hifi_reads.fastq.gz
demultiplex.16S_For_bc1024--16S_Rev_bc1111.hifi_reads.fastq.gz
where each FASTQ file was the results of demultiplexing the full 192-plex dataset. Create a 8-plex folder and put the FASTQ files inside.
- Additional homework: What is the average quality of the sequences? Can you use tools such as
seqkit
anddatamash
to find out?
QIIME 2
is straightforward to install it via Anaconda. Please refer to QIIME 2
webpage here for installation instructions. After installing QIIME 2
, activate the environment.
To test if QIIME 2
has been installed, type qiime --help
and you will see a block of texts documenting the usage of QIIME 2
and the various functions or plugins available. The first tool we will use is qiime tools import
that allows us to import the FASTQ file into a QIIME 2 qza
format.
In order to import the data, we will need a sample "manifest" file that allows us to give each of the FASTQ file a sample name. For example, to name the file according to the barcodes pair in the file names:
echo -e 'sample-id\tabsolute-filepath' > sample.manifest
for i in 8-plex/*.fastq.gz; do fname=$(basename ${i}); sample=$(echo ${fname} |\
awk '{print gensub(/.*_(bc.*)--.*(bc.*).hifi_reads.fastq.gz/, "\\1_\\2", "g", $0)}');\
echo -e "${sample}\t$(readlink -f ${i})" >> sample.manifest;\
done
You can open the sample manifest file to look at how it is formatted. It's a simple two-columns TSV file giving a sample name for each of the FASTQ file that we are going to analyze. To import the FASTQ file, we can type:
qiime tools import --type 'SampleData[SequencesWithQuality]' \
--input-path sample.manifest \
--output-path samples.qza \
--input-format SingleEndFastqManifestPhred33V2
which tells QIIME 2
to import FASTQ file using the sample manifest file into a file called samples.qza
. It also specifies the import file format and type. This should take only a few seconds to complete. Once completed, you will see an output saying:
Imported sample.manifest as SingleEndFastqManifestPhred33V2 to samples.qza
and if you type ls
now you will see a new file created called samples.qza
. This file can now be used for all the downstream analysis. You can summarize the read statistics of each barcode pair by running:
qiime demux summarize --i-data samples.qza \
--o-visualization samples.demux.summary.qzv
which again, should finish very quickly (< 1min). This will create a file called samples.demux.summary.qzv
that you can download to your local computer, e.g.(on your own computer):
scp [email protected]:/path/to/samples.demux.summary.qzv .
Usually, in an 16S experiment, one will also have multiple conditions for different groups of samples. QIIME 2
can use a file specifying the samples conditions for downstream analysis and visualization later on. Let's create two artificial groups here where one group is called RepA and one is called RepB. The samples are all identical, so this is just artificial grouping:
awk 'FNR==1{print "sample_name","condition"}; FNR>1 && FNR <6 {print $1,"RepA"}; FNR>5 {print $1, "RepB"}' OFS=$'\t' sample.manifest > metadata.tsv
One of the nice things about QIIME 2
is that it comes with a web browser that allows you to visualize many of the plots/statistics produced by the QIIME 2
package (typically ending with a .qzv
extension, similar to what we have just generated above for the reads statistics). You can use your web browser and navigate to https://view.qiime2.org/. Next, drag and drop the file samples.demux.summary.qzv
into the browser and you can now see the read statistics for the dataset we've imported (note that with PacBio data there's no forward and reverse sequence, so all reads will be seen under "forward sequence"). Here, each barcode pair has on average 13k HiFi reads.
As discussed above, the current recommendations for analyzing full-length 16S HiFi reads is to use DADA2
(Original paper which is a tool to infer high resolution amplicon sequence variants (ASV) from the reads. It is able to resolve even down to 1-2 nucleotides difference, and hence is able to take advantage of the high accuracy of HiFi reads. For example, you can run DADA2
using R script following the tutorial provided by DADA2
author, e.g. On the Zymo mock community.
The latest version of QIIME 2
has now incorporated a function called dada2 denoise-ccs
that can be used to process HiFi reads. Let's use this command to infer ASVs from our HiFi reads (This should take ~ 5 to 8 mins with 8 threads):
qiime dada2 denoise-ccs --i-demultiplexed-seqs ./samples.qza \
--o-table dada2-ccs_table.qza \
--o-representative-sequences dada2-ccs_rep.qza \
--o-denoising-stats dada2-ccs_stats.qza \
--p-min-len 1000 --p-max-len 1600 \
--p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT \
--p-n-threads 8
A few notable option for dada2 denoise-ccs
(Type qiime dada2 denoise-ccs
for an exhaustive list):
-
--p-min-len 1000
and--p-max-len 1600
specifies the minimum and maximum sequence length. For 16S sequences,QIIME 2
suggests using 1000 and 1600, respectively. -
--p-front AGRGTTYGATYMTGGCTCAG
and--p-adapter RGYTACCTTGTTACGACTT
which are the full length 16S forward (F27) and reverse (R1492) primers. This will allow DADA2 to trim away the primer region. -
--p-n-threads 8
specifies that the software can use up to 8 threads.
After denoising, it is important to check the read statistics to see how many reads have been filtered and what is the amount of chimeric reads (even after filtering). QIIME 2
can do this using:
qiime metadata tabulate --m-input-file ./dada2-ccs_stats.qza \
--o-visualization ./dada2-ccs_stats.qzv
Again, you can use QIIME 2 View
to view the table (filename dada2-ccs_stats.qzv
) in your web browser. You should observe around 70% of reads categorized as non-chimeric reads. This number can vary from sample to sample.
sample-id | input | primer-removed | percentage of input primer-removed | filtered | percentage of input passed filter | denoised | non-chimeric | percentage of input non-chimeric |
---|---|---|---|---|---|---|---|---|
bc1005_bc1056 | 13682 | 12017 | 87.83 | 9823 | 71.8 | 9746 | 9746 | 71.23 |
bc1005_bc1057 | 14009 | 12363 | 88.25 | 10128 | 72.3 | 10042 | 10040 | 71.67 |
bc1005_bc1062 | 13051 | 11523 | 88.29 | 9561 | 73.26 | 9499 | 9497 | 72.77 |
bc1005_bc1075 | 13626 | 12003 | 88.09 | 9825 | 72.1 | 9740 | 9740 | 71.48 |
bc1005_bc1100 | 12494 | 11019 | 88.19 | 9097 | 72.81 | 9025 | 9021 | 72.2 |
bc1007_bc1075 | 14072 | 12497 | 88.81 | 10190 | 72.41 | 10098 | 10098 | 71.76 |
bc1020_bc1059 | 13717 | 12050 | 87.85 | 9860 | 71.88 | 9801 | 9801 | 71.45 |
bc1024_bc1111 | 12887 | 11315 | 87.8 | 9301 | 72.17 | 9215 | 9213 | 71.49 |
Another tables that summarize the features (ASVs here) observed for each sample can be generated using:
qiime feature-table summarize --i-table ./dada2-ccs_table.qza \
--m-sample-metadata-file metadata.tsv \
--o-visualization ./dada2-ccs_table.qzv
which also makes use of the metadata file we generated above. A useful table here would be the table that summarizes the number of reads assigned to each feature.
Frequency | |
---|---|
Minimum frequency | 4 |
1st quartile | 84 |
Median frequency | 147 |
3rd quartile | 2,201.00 |
Maximum frequency | 11,733.00 |
Mean frequency | 1,455.77 |
The second tab of this visualization allows you to interactively select a cut-off where you can see how many samples "drop-out" at a certain cut-off. Finally, you can find a table on the third tab that provides detailed information on how many samples are the ASVs detected in, and how many reads in total for each of these ASVs. For example, the first 5 ASVs here are:
Frequency | # of Samples Observed In | |
---|---|---|
b80a1bd6b19379171cafb03a4ddb26a3 | 11,733 | 8 |
61de72098cbf9f693f09c863d1c5a586 | 11,111 | 8 |
a0530711bdec8d611872863731f96dde | 8,723 | 8 |
bc7d5e82dae59e2e5e8321bb5c5eae4b | 5,208 | 8 |
13bc41e35faa4f40a01de5a972c0c4ac | 4,936 | 8 |
f3cfb383dcfb753cad226072f6665784 | 4,400 | 8 |
7dcace79bc8188077e07b0085aa57b7f | 3,824 | 8 |
09a8fa70427aa68f16934c729b826071 | 2,459 | 8 |
One of the very first analysis for 16S experiment is to plot a rarefaction curve. This would allow us to understand if we're saturating the amount of features (or ASVs) that can be discovered from the sample. Let's do that with a simple command here:
qiime diversity alpha-rarefaction --i-table dada2-ccs_table.qza \
--m-metadata-file metadata.tsv \
--o-visualization ./alpha-rarefaction-curves.qzv \
--p-min-depth 10 --p-max-depth 10000
At what sequencing depth do you think we are saturating the experiment? It is important to note that we have a very simple mock community here. In real samples, the diversity and number of ASVs are often much higher than what we observe here. In addition, there's another plot that allows us to check for the number of samples that drop-out beyond a certain depth (useful for QC if you have a large amount of samples!).
A popular choice of classifier for 16S sequences is the naive Bayesian based qiime feature-classifier classify-sklearn
. However, in our experience, this is not very sensitive for our full-length mock community ASVs. DADA2
comes with a tool called assignTaxonomy
that uses a similar naive Bayesian classifier approach which works well (in the mock community), but requires one to use DADA2
in R.
With QIIME 2
, we are going to use VSEARCH
to classify the taxonomy of each of the ASVs from DADA2. VSEARCH
is a significant faster alternative to BLAST
and is able to classify almost all of the species in our mock community here. We will use the SILVA
version 138 database here for classification. It is debatable which database is the best, as they can all perform differently depending on the communities. We chose SILVA
here as it's continuously updated. The database formatted for QIIME 2
can be downloaded from QIIME 2
resources webpage.
qiime feature-classifier classify-consensus-vsearch --o-classification ./taxonomy.vsearch.qza \
--i-query dada2-ccs_rep.qza \
--i-reference-reads references/silva-138-99-seqs.qza \
--i-reference-taxonomy references/silva-138-99-tax.qza \
--p-threads 8 \
--p-maxrejects 100 --p-maxaccepts 10 --p-perc-identity 0.97
This should take < 1 min with 8 threads. Three important parameters that we changed for the classifier are --p-maxrejects 100 --p-maxaccepts 10 --p-perc-identity 0.97
. --p-max-rejects 100
specify that the search for taxonomy hit will stop after 100 non-matching sequences, and --p-maxaccepts 10
will stop the search if there's 10 taxonomy hits. In combination with --p-min-consensus
which has a default value of 0.51, more than half of the "hits" (set to be 10 here) need to have the same assignment at any specific taxonomy level for the sequences to be classified confidently. Finally, --p-perc-identity 0.97
specifies that the taxonomy hits need to be at least 97% identical. These options ensure that VSEARCH
runs very fast (Certainly much faster than the traditional BLAST
approach) and keeps only highly identical hits.
Let's look at what are the taxonomies assigned. You can use the first command to export it into the usual qzv
form for QIIME 2 View
, or you can use the second command to export it into a TSV format (useful for programmatic manipulation with R/Python etc):
qiime metadata tabulate --m-input-file ./taxonomy.vsearch.qza \
--o-visualization taxonomy.vsearch.qzv
qiime tools export --input-path taxonomy.vsearch.qza \
--output-path tax_export
The top 5 rows in QIIME 2 View
for example showed the ASVs IDs and the taxonomy of the ASV, in addition to the confidence of the assignment based on the number of top 10 hits (discussed above). As you can see, some ASVs cannot be assigned to species level and will end at the genus level for its taxonomic classification.
Feature ID | Taxon | Consensus |
---|---|---|
#q2:types | categorical | categorical |
021408f68ac3d943f7e16261373b2a61 | d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium_sensu_stricto_1 | 1 |
04d3de55f475a0888205c9507901a096 | d__Bacteria; p__Actinobacteriota; c__Actinobacteria; o__Actinomycetales; f__Actinomycetaceae; g__Actinomyces | 1 |
070e52443d30c5d9e99c8cb06412db1a | d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium_sensu_stricto_1; s__Clostridium_beijerinckii | 0.6 |
07b2978940b50886f3866b04b7f982cf | d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__Streptococcus_agalactiae | 1 |
09a8fa70427aa68f16934c729b826071 | d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__Escherichia_coli | 0.7 |
0bf3e0641c88ec588bc66b567c3fa63d | d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__Escherichia_coli | 0.8 |
0ccaf57fc10e3ca943b9911e91c05de1 | d__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium_sensu_stricto_1; s__Clostridium_beijerinckii | 0.8 |
0ed2d53fd823b2ef922b6f349cade1cd | d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Burkholderiales; f__Neisseriaceae; g__Neisseria; s__Neisseria_meningitidis | 1 |
- Additional knowledge: If you use DADA2 to classify, the default recommendation for species assignment is to use
assignSpecies
which requires perfect match between the ASVs and the reference sequences. While this is important for shorter amplicons such as V3-V4 to reduce false-positives, it may not work well with full-length 16S sequencing. See this GitHub issue for example: https://github.com/benjjneb/dada2/issues/990 - Tips:
qiime feature-table filter-features
can be used to filter out features (ASVs) that have extremely low abundance (e.g. singletons) - Bonus subject: If you wish to use other database such as GTDB for classification, this is an example of importing the GTDB genome FASTA and taxonomy file into
qza
format to use withVSEARCH
. Note that the instruction below was tested with GTDB r202, so it may require changes if you wish to use other versions of the database.
# Here we will use the all genomes FASTA file from GTDB. You can also use
# the GTDB representative sequences but the amount of sequences are already
# collapsed in that set of data, so the consensus method of VSEARCH
# classification may have low "confidence" due to low number of hits and
# will require you to optimize the parameter.
# Extract the taxonomy sequences from the GTDB FASTA header using awk. Take
# a look at how the taxonomies are formatted and you can similarly
# generate such a taxonomy file for any database.
grep "^>" ssu_all_r202.fna | \
awk '{print gensub(/^>(.*)/, "\\1", "g", $1),gensub(/^>.*\ (d__.*)\ \[.*\[.*\[.*/, "\\1", "g", $0)}' OFS=$'\t' > ssu_all_r202.taxonomy.tsv
# Import the taxonomy
qiime tools import --type 'FeatureData[Taxonomy]' \
--input-path ssu_all_r202.taxonomy.tsv \
--output-path ssu_all_r202.taxonomy.qza \
--input-format HeaderlessTSVTaxonomyFormat
# Import the genome FASTA
qiime tools import --type 'FeatureData[Sequence]' \
--input-path ssu_all_r202.fna \
--output-path ssu_all_r202.qza
# Classify with VSEARCH. You may have to play around with the parameters to find out what
# works best for your sample
qiime feature-classifier classify-consensus-vsearch \
--o-classification ./taxonomy.vsearch.qza \
--i-query ../dada2-ccs_rep.qza \
--i-reference-reads bac120_ssu_reps_r202.qza \
--i-reference-taxonomy bac120_taxonomy.qza \
--p-threads 8 \
--p-maxrejects 100 --p-maxaccepts 20 \
--p-perc-identity 0.97
Another one of the most common plots we can see in 16S studies is the taxonomy bar plot. We can generate this using the QIIME 2
command:
qiime taxa barplot --i-table dada2-ccs_table.qza \
--i-taxonomy taxonomy.vsearch.qza \
--m-metadata-file ./metadata.tsv \
--o-visualization ./taxa_barplot.qzv
Visualize this in QIIME 2 View
, and tinker around with the functions available to look at the different taxonomies in your sample. Did we find all the taxonomies in ATCC-MSA1003? How many ASVs were assigned to species level?
To evaluate how well the results compared to the ground truth in a more systematic way, you can make use of the results from QIIME 2
using your favourite data analysis language. Here's a plot (Plotted using R
with the results here) showing the abundance comparing to the true abundance from ATCC-MSA1003. Note that the abundance for all 8 samples are averaged for this plot. From the QIIME 2 View
barplots visualization, you should also see that from sample to sample, the variation in abundance is minimal.
It looks like we did really well except for two missing bacteria, Bifidobacterium adolescentis and Schaalia odolyntica. At 0.02%, there's only on average 2 reads out of 10k reads that come from those species. Coupled with uneven sequencing of different species (but not uncommon due to the complexity of microbes in a complex community), it is not surprising to miss them at this sequencing depth. In fact, if you combine the full 192-plex datasets, you will be able to detect these two species. The plot also showed extremely good correlation of abundance calculated using the number of CCS reads assigned by DADA2
to each ASV compared to the true abundance across different range of abundances.
The power of QIIME 2
, as mentioned, is that it can carry out many useful analysis routinely done in 16S experiment. For example, we can construct a phylogenetic tree of all the ASVs generated by using:
qiime phylogeny align-to-tree-mafft-fasttree --i-sequences dada2-ccs_rep.qza \
--output-dir phylogeny-align-to-tree-mafft-fasttree
Note that by default, qiime phylogeny align-to-tree-mafft-fasttree
will create both an unrooted and a rooted (midpoint rooting) tree. The .qza
tree format can be visualized directly in iToL website: https://itol.embl.de/upload.cgi. After uploading the phylogenetic tree in .qza
format, you can also upload the taxonomy file from QIIME 2
generated from above (taxonomy.vsearch.qza
) to annotate the tree with the taxonomy (top right corner, Datasets
followed by Upload annotation
). Here's a tree generated with our data:
Often, we want to understand the diversity of the community we're sequencing and alpha/beta diversity are two common approaches. In addition, a PCoA (Principal coordinates analysis) based on these diversity metrics can be used for visualization purpose. Lucky for us, QIIME 2
provides a convenient one-liner to generate all we need:
qiime diversity core-metrics-phylogenetic --i-table ./dada2-ccs_table.qza \
--i-phylogeny ./phylogeny-align-to-tree-mafft-fasttree/rooted_tree.qza \
--m-metadata-file ./metadata.tsv \
--p-sampling-depth 9000 \
--output-dir ./core-metrics-results
This command will create a folder called core-metrics-results
containing various metrics of diversity. The "Moving Pictures" tutorial from QIIME 2
is a great resource for understanding what each of these metrics represent. We know our samples are all replicates of each other and we do not expect the groups that we created in the "metadata.tsv" file to be meaningful. Let's test this by comparing the beta-diversity between the two groups using Bray-Curtis distance:
qiime diversity beta-group-significance --i-distance-matrix core-metrics-results/bray_curtis_distance_matrix.qza \
--m-metadata-file metadata.tsv --m-metadata-column condition \
--o-visualization core-metrics-results/bray_curtis_bet_div_comp.qzv \
--p-pairwise
The --p-pairwise
parameter asks QIIME 2
to compare between all groups (in our case there's only two groups) in the column condition
in our metadata TSV file. The results show a p-value that's around 0.08. Purely by chance, it appears that our random label almost resulted into two groups that's a little different! (If you look at the rarefaction curve just now, one group appears to resolve into more ASVs).
PERMANOVA results | |
---|---|
method name | PERMANOVA |
test statistic name | pseudo-F |
sample size | 8 |
number of groups | 2 |
test statistic | 1.60154 |
p-value | 0.079 |
number of permutations | 999 |
Question: How did we determine --p-sampling-depth
?