docs/taxonomy.rst
+57 -29
@@ -86,80 +86,108 @@ Taxonomy databases are built with the ``amptk database`` command. This command
**Fungal ITS DB**
-
These databases were created from Unite v7.2.2 (released June 28th, 2017), first downloading two databases from the UNITE website. First the General FASTA release of the DB `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`_, and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`_. Then the Full UNITE+INSD database `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`_. For the general FASTA releases, the 'developer' fasta files are used. The taxonomy information is then reformated and databases produced as follows:
+
These databases were created from UNITE v8.0 by first downloading two databases from the UNITE website: the General FASTA release of the DB, `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`__ and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`__, and the Full UNITE+INSD database, `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`__. For the general FASTA releases, the 'developer' FASTA files are used. The taxonomy information is then reformatted and the databases are produced as follows:
.. code-block:: none
#Create full length ITS USEARCH Database, convert taxonomy, and create USEARCH database
--install --source UNITE:8.0 --primer_required for
**Arthropod/Chordate mtCOI DB**
These data were pulled from the `BOLDv4 database <http://v4.boldsystems.org>`_. Since most studies using mtCOI regions are interested in identifying insects in the diets of animals, the BOLD database was queried as follows. All Chordata sequences were downloaded by querying the `BIN database using the search term Chordata <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Chordata&searchBIN=Search+BINs>`_. Similarly, the Arthropoda sequences were downloaded by querying the `BIN database using the search term Arthropoda <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Arthropoda&searchBIN=Search+BINs>`_. All data were then downloaded as TSV output.
The TSV output files (~6 GB) were then each formatted using the following method, which reformats the taxonomy information, pulls the sequences that are annotated in BINs, and then clusters the sequences in each BIN at 99% identity.
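
As a rough illustration of the BIN-filtering step only, the sketch below groups sequences from a BOLD-style TSV by BIN and skips records that lack a BIN or a sequence. It is not the code used by ``bold2utax.py``; the function name ``group_by_bin`` and the column names ``bin_uri`` and ``nucleotides`` are assumptions about the TSV layout, and the 99% clustering within each BIN is not shown.

```python
import csv
import io
from collections import defaultdict


def group_by_bin(tsv_handle, bin_col="bin_uri", seq_col="nucleotides"):
    """Group sequences by BIN, skipping records with no BIN or no sequence.

    The column names are assumptions about the BOLD TSV layout.
    """
    bins = defaultdict(list)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        bin_id = (row.get(bin_col) or "").strip()
        # Drop alignment gaps and normalise case before storing the sequence.
        seq = (row.get(seq_col) or "").replace("-", "").strip().upper()
        if bin_id and seq:
            bins[bin_id].append(seq)
    return dict(bins)


# Tiny made-up table standing in for the real (~6 GB) download.
example = (
    "processid\tbin_uri\tnucleotides\n"
    "A1\tBOLD:AAA0001\tACGT-ACGT\n"
    "A2\t\tACGTACGT\n"
    "A3\tBOLD:AAA0001\tacgtacga\n"
    "A4\tBOLD:AAA0002\t\n"
)
grouped = group_by_bin(io.StringIO(example))
```

Reading the file row by row keeps memory use tied to the number of retained sequences rather than to the size of the raw TSV.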
+
Since it can literally take days to download the arthropod dataset, if you'd like to experiment with the data you can get a copy here: `chordates <https://osf.io/9bh2f/download?version=1>`_ and `arthropods <https://osf.io/aqrey/download?version=1>`_.
--derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \
+
--install --source BOLD:20190219
-
The data is then further processed with a second script that will search for priming sites and then randomly subsample the data down to a number of records that can be used to train UTAX and then database was created.
+
The second set of output files from ``bold2utax.py`` carries the suffix ``.BIN-consensus.fa``; these files are the result of 99% clustering within each BIN. We will combine them for the two datasets and then use the combined data to generate the SINTAX and UTAX databases.
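
One way to combine the two consensus files is sketched below in Python; a plain ``cat`` would do the same job. The file names ``chordates.BIN-consensus.fa``, ``arthropods.BIN-consensus.fa`` and ``combined.fa`` and the helper ``combine_fasta`` are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def combine_fasta(inputs, output):
    """Concatenate FASTA files into one file and return the record count."""
    count = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        count += 1
                    # Always write a newline so records from different
                    # files never run together.
                    out.write(line + "\n")
    return count


# Demonstration on two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "chordates.BIN-consensus.fa").write_text(">c1\nACGT")
    (tmp / "arthropods.BIN-consensus.fa").write_text(">a1\nTTGA\n>a2\nGGCC\n")
    inputs = sorted(tmp.glob("*.BIN-consensus.fa"))
    combined = tmp / "combined.fa"
    n_records = combine_fasta(inputs, combined)
    combined_text = combined.read_text()
```

Writing each line with an explicit newline guards against an input file that lacks a trailing newline, which would otherwise glue the next header onto the previous sequence.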
.. code-block:: none
-
#searches for priming sites and subsamples to 90,000 records
The fungal 28S database (LSU) was downloaded from `RDP <http://rdp.cme.msu.edu/download/current_Fungi_unaligned.fa.gz>`_. The sequences were then converted into AMPtk databases as follows:
To generate a training set for UTAX, the sequences were first dereplicated, and clustered at 97% to get representative sequences for training. This training set was then converted to a UTAX database:
.. code-block:: none
-
amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk. Note there is room for substantial improvement here, I just don't typically work on 16S - so please let me know if you want some suggestions on what to do here.
+
This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk. Note there is room for substantial improvement here; I just don't typically work on 16S, so please let me know if you want some suggestions on what to do here. Here I reformatted the "domain" taxonomy level to "kingdom" for simplicity (even though I know it is taxonomically incorrect).
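
The domain-to-kingdom relabelling can be sketched as a small header rewrite. This is only an illustration, assuming the FASTA headers carry a UTAX-style annotation of the form ``tax=d:Bacteria,p:...;``; the function name ``domain_to_kingdom`` and the example header are hypothetical, and this is not necessarily how the AMPtk database was actually reformatted.

```python
import re

# Matches the domain prefix at the start of a UTAX-style "tax=" annotation.
TAX_DOMAIN = re.compile(r"(tax=)d:")


def domain_to_kingdom(header):
    """Relabel the domain rank (d:) as kingdom (k:) in a FASTA header."""
    return TAX_DOMAIN.sub(r"\1k:", header, count=1)


# Made-up header in the assumed format.
relabelled = domain_to_kingdom(">AB243007;tax=d:Bacteria,p:Firmicutes;")
untouched = domain_to_kingdom(">seq1;tax=k:Fungi,p:Ascomycota;")
```

Anchoring the pattern on ``tax=`` keeps the substitution from touching a ``d:`` that happens to appear elsewhere in the header.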