Commit fa2788c

author
Jon Palmer
committed
final update for v1.4.0...hopefully
1 parent ed76cc7 commit fa2788c

File tree

3 files changed: +67 -170 lines changed

docs/taxonomy.rst

+57 -29
@@ -86,80 +86,108 @@ Taxonomy databases are built with the ``amptk database`` command. This command

  **Fungal ITS DB**

- These databases were created from Unite v7.2.2 (released June 28th, 2017), first downloading two databases from the UNITE website. First the General FASTA release of the DB `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`_, and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`_. Then the Full UNITE+INSD database `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`_. For the general FASTA releases, the 'developer' fasta files are used. The taxonomy information is then reformated and databases produced as follows:
+ These databases were created from Unite v8.0, first downloading two databases from the UNITE website. First the General FASTA release of the DB `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`_, and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`_. Then the Full UNITE+INSD database `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`_. For the general FASTA releases, the 'developer' fasta files are used. The taxonomy information is then reformated and databases produced as follows:

  .. code-block:: none

  #Create full length ITS USEARCH Database, convert taxonomy, and create USEARCH database
  amptk database -i UNITE_public_all_02.02.2019.fasta -f ITS1-F -r ITS4 \
- --primer_required none -o ITS --create_db usearch --install
+ --primer_required none -o ITS --create_db usearch --install --source UNITE:8.0
+
+ #create SINTAX database
+ amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
+ -o ITS_SINTAX --create_db utax -f ITS1-F -r ITS4 --derep_fulllength \
+ --install --source UNITE:8.0 --primer_required none

  #Create UTAX Databases
- amptk database -i sh_general_release_dynamic_28.06.2017_dev.fasta \
- -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 --keep_all
- --derep_fulllength --lca --install
+ amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
+ -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 \
+ --derep_fulllength --install --source UNITE:8.0 --primer_required none

- amptk database -i sh_general_release_dynamic_s_28.06.2017_dev.fasta \
- -o ITS1_UTAX --create_db utax -f ITS1-F -r ITS2 --keep_all
- --derep_fulllength --lca --install
+ amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
+ -o ITS1_UTAX -f ITS1-F -r ITS2 --primer_required rev --derep_fulllength \
+ --create_db utax --install --subsample 65000 --source UNITE:8.0

- amptk database -i sh_general_release_dynamic_s_28.06.2017_dev.fasta \
+ amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
  -o ITS2_UTAX --create_db utax -f fITS7 -r ITS4 --derep_fulllength \
- --lca --install
+ --install --source UNITE:8.0 --primer_required for

  **Arthropod/Chordate mtCOI DB**

  These data were pulled from the `BOLDv4 database <http://v4.boldsystems.org>`_ Since most studies using mtCOI regions are interested in identification of insects in the diets of animals, the BOLD database was queried as follows. All Chordata sequences were downloaded by querying the `BIN database using the search term Chordata <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Chordata&searchBIN=Search+BINs>`_. Similarly, the Arthropods were searched by querying the `BIN databases using the search term Arthropoda <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Arthropoda&searchBIN=Search+BINs>`_. All data was then downloaded as TSV output.

  The TSV output files (~ 6GB) where then each formatted using the following method, which reformats the taxonomy information and pulls sequences that are annotated in BINS and then clusters sequences in each bin to 99%.

+ Since it can literally take days to download the arthropod dataset, if you'd like to experiment with the data you can get a copy here: `chordates <https://osf.io/9bh2f/download?version=1>`_ and `arthropods <https://osf.io/aqrey/download?version=1>`_.
+
  .. code-block:: none

  #reformat taxonomy
- bold2utax.py -i Arthropoda_bold_data.txt -o arthropoda.bold.bins.fa
- bold2utax.py -i Chordata_bold_data.txt -o chordata.bold.bins.fa
+ bold2utax.py -i arthropods.bold.02092019.txt -o chordates --cluster 99 --drop_suppressed
+ bold2utax.py -i arthropods.bold.02092019.txt -o arthropods --cluster 99 --drop_suppressed

- #combine datasets
- cat arthropoda.bold.bins.fa chordata.bold.bins.fa > all.data.bins.fa
+ #combine datasets for usearch
+ cat arthropods.bold-reformated.fa chordates.bold-reformated.fa > arth-chord.bold-reformated.fasta

  #generate global alignment database
- amptk database -i all.data.bins.fa --skip_trimming --keep_all --min_len 125 \
- --derep_fulllength --create_db usearch -o COI --format off --install
+ amptk database -i arth-chord.bold.reformated.fasta -f LCO1490 -r mlCOIintR --primer_required none \
+ --derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \
+ --install --source BOLD:20190219

- The data is then further processed with a second script that will search for priming sites and then randomly subsample the data down to a number of records that can be used to train UTAX and then database was created.
+ The second set of output files from `bold2utax.py` are named with `.BIN-consensus.fa` which are the result of 99% clustering for each BIN. We will combine those for the two datasets and then use those data to generate the SINTAX and UTAX databases.

  .. code-block:: none

- #searches for priming sites and subsamples to 90,000 records
- bold2amptk.py -i all.data.bins.fa -o arthropods.chordates
+ #combine datasets
+ cat arthropods.BIN-consensus.fa chordates.BIN-consensus.fa > arth-chord.bold.BIN-consensus.fasta

- #generate utax database
- amptk database -i arthropods.chordates.genus4utax.fa -o COI_UTAX \
- --format off --create_db utax --skip_trimming --install
+ #generate SINTAX database
+ amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required none \
+ --derep_fulllength --format off --primer_mismatch 4 -o COI_SINTAX --min_len 200 --create_db sintax \
+ --install --source BOLD:20190219
+
+ #generate UTAX database, need to subsample for memory issues with 32 bit usearch and we require rev primer match here
+ amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required rev \
+ --derep_fulllength --format off --subsample 30000 --primer_mismatch 4 -o COI_UTAX --min_len 200 \
+ --create_db utax --install --source BOLD:20190219

  **LSU database**

  The fungal 28S database (LSU) was downloaded from `RDP <http://rdp.cme.msu.edu/download/current_Fungi_unaligned.fa.gz>`_. The sequences were then converted into AMPtk databases as follows:

  .. code-block:: none

- amptk database -i fungi.unaligned.fa -o LSU --format rdp2utax \
- --skip_trimming --create_db usearch --derep_fulllength --keep_all --install
+ amptk database -i RDP_v8.0_fungi.fa -o LSU --format rdp2utax --primer_required none \
+ --skip_trimming --create_db usearch --derep_fulllength --install --source RDP:8
+
+ amptk database -i RDP_v8.0_fungi.fa -o LSU_SINTAX --format rdp2utax --primer_required none \
+ --skip_trimming --create_db sintax --derep_fulllength --install --source RDP:8

+ amptk database -i RDP_v8.0_fungi.fa -o LSU_UTAX --format rdp2utax --primer_required none \
+ --skip_trimming --create_db utax --derep_fulllength --install --source RDP:8 --subsample 4500
+
+
  To generate a training set for UTAX, the sequences were first dereplicated, and clustered at 97% to get representative sequences for training. This training set was then converted to a UTAX database:

  .. code-block:: none

- amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
- --skip_trimming --create_db utax --keep_all --install
+ amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
+ --skip_trimming --create_db utax --keep_all --install

  **16S database**
- This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk. Note there is room for substantial improvement here, I just don't typically work on 16S - so please let me know if you want some suggestions on what to do here.
+ This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk. Note there is room for substantial improvement here, I just don't typically work on 16S - so please let me know if you want some suggestions on what to do here. Here I reformatted the "domain" taxonomy level to "kingdom" for simplicity (even though I know it is taxonomically incorrect).

  .. code-block:: none

- amptk database -i rdp_v16.fa -o 16S --format off --create_db utax \
- --skip_trimming --keep_all --install
+ amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S --format off --create_db usearch \
+ --skip_trimming --install --primer_required none --derep_fulllength
+
+ amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S --format off --create_db sintax \
+ -f 515FB -r 806RB --install --primer_required for --derep_fulllength
+
+
+
+

  Checking Installed Databases
  -------------------------------------

scripts/bold2amptk.py

-137
This file was deleted.

scripts/bold2utax.py

+10 -4
@@ -112,8 +112,8 @@ def __init__(self,prog):
  else:
  continue
  if args.drop_suppressed:
- if 'SUPPRESSED' in GB:
- continue
+ if 'SUPPRESSED' in GB:
+ continue
  #clean up sequence, remove any gaps, remove terminal N's
  Seq = col[seqid].replace('-', '')
  Seq = Seq.strip('N')
@@ -168,11 +168,15 @@ def __init__(self,prog):
  print("Updating taxonomy")
  #finally loop through centroids and get taxonomy from dictionary
  finalcount = 0
+ seen = set()
  with open(args.out+'.BIN-consensus.fa', 'w') as outputfile:
  for file in os.listdir(tmp):
  if file.endswith('.consensus.fa'):
  for record in FastaIterator(open(os.path.join(tmp, file))):
- record.id = record.id.replace('consensus=', '')
+ if 'consensus=' in record.id:
+ record.id = record.id.replace('consensus=', '')
+ elif 'centroid=' in record.id:
+ record.id = record.id.replace('centroid=', '')
  finalcount += 1
  fullname = record.id.split(';')[0]
  ID = fullname.split('_')[0]
@@ -181,7 +185,9 @@ def __init__(self,prog):
  else:
  print('{:} not found in taxonomy dictionary'.format(ID))
  continue
- outputfile.write('>{:};tax={:}\n{:}\n'.format(ID, tax, amptklib.softwrap(str(record.seq))))
+ if not fullname in seen:
+ outputfile.write('>{:};tax={:}\n{:}\n'.format(fullname, tax, amptklib.softwrap(str(record.seq))))
+ seen.add(fullname)

  print("Wrote %i consensus seqs for each BIN to %s" % (finalcount, args.out+'.BIN-consensus.fa'))
  shutil.rmtree(tmp)
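
The header handling this commit adds to `scripts/bold2utax.py` (strip either a `consensus=` or a `centroid=` prefix, then write each full name only once) can be sketched in isolation. This is a minimal illustration, not the script itself: the function name `dedupe_consensus`, the plain `(id, sequence)` tuples standing in for Biopython records, the dictionary lookup, and the example IDs are all assumptions made for the sketch.

```python
def dedupe_consensus(records, taxonomy):
    """Yield one FASTA-style entry per unique full name.

    records:  iterable of (record_id, sequence) tuples
              (hypothetical stand-in for Biopython records)
    taxonomy: dict mapping the ID before the first underscore
              to a taxonomy string (assumed lookup structure)
    """
    seen = set()
    for record_id, seq in records:
        # Strip whichever prefix the clustering step emitted.
        for prefix in ('consensus=', 'centroid='):
            if prefix in record_id:
                record_id = record_id.replace(prefix, '')
                break
        fullname = record_id.split(';')[0]
        bin_id = fullname.split('_')[0]
        tax = taxonomy.get(bin_id)
        if tax is None:
            # Mirrors the "not found in taxonomy dictionary" branch.
            continue
        if fullname not in seen:
            seen.add(fullname)
            yield '>{};tax={}\n{}\n'.format(fullname, tax, seq)


# Made-up IDs, for illustration only.
example_records = [
    ('consensus=BOLD:AAA0001_1;size=3', 'ACGT'),
    ('centroid=BOLD:AAA0001_1;size=1', 'ACGT'),   # duplicate full name
    ('centroid=BOLD:AAB0002_1;size=2', 'GGCC'),
    ('centroid=BOLD:ZZZ9999_1;size=1', 'TTAA'),   # no taxonomy entry
]
example_taxonomy = {
    'BOLD:AAA0001': 'k:Animalia',
    'BOLD:AAB0002': 'k:Animalia,p:Chordata',
}
entries = list(dedupe_consensus(example_records, example_taxonomy))
```

Note that in the diff above `finalcount` is incremented before the `seen` check, so the total reported by the final `print` may include records that were skipped as duplicates or for missing taxonomy.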
