final update for v1.4.0...hopefully

Jon Palmer · Jon Palmer · commit fa2788cb4d48 · 2019-02-20T05:14:55.000-08:00
diff --git a/docs/taxonomy.rst b/docs/taxonomy.rst
@@ -86,80 +86,108 @@ Taxonomy databases are built with the ``amptk database`` command.  This command
 
 **Fungal ITS DB**
 
-These databases were created from Unite v7.2.2 (released June 28th, 2017), first downloading two databases from the UNITE website.  First the General FASTA release of the DB `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`_, and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`_.  Then the Full UNITE+INSD database `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`_.  For the general FASTA releases, the 'developer' fasta files are used. The taxonomy information is then reformated and databases produced as follows:
+These databases were created from UNITE v8.0 by first downloading two databases from the UNITE website: the General FASTA release of the DB, available `here <https://unite.ut.ee/sh_files/sh_general_release_28.06.2017.zip>`_ and `here <https://unite.ut.ee/sh_files/sh_general_release_s_28.06.2017.zip>`_, and the Full UNITE+INSD database, available `here <https://unite.ut.ee/sh_files/UNITE_public_28.06.2017.fasta.zip>`_.  For the general FASTA releases, the 'developer' fasta files are used. The taxonomy information is then reformatted and the databases are produced as follows:
 
 .. code-block:: none
 
     #Create full length ITS USEARCH Database, convert taxonomy, and create USEARCH database
 	amptk database -i UNITE_public_all_02.02.2019.fasta -f ITS1-F -r ITS4 \
-		--primer_required none -o ITS --create_db usearch --install
+		--primer_required none -o ITS --create_db usearch --install --source UNITE:8.0
+		
+	#create SINTAX database
+    amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
+        -o ITS_SINTAX --create_db sintax -f ITS1-F -r ITS4 --derep_fulllength \
+        --install --source UNITE:8.0 --primer_required none
 
     #Create UTAX Databases
-    amptk database -i sh_general_release_dynamic_28.06.2017_dev.fasta  \
-        -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 --keep_all
-        --derep_fulllength --lca --install 
+    amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta  \
+        -o ITS_UTAX --create_db utax -f ITS1-F -r ITS4 \
+        --derep_fulllength --install --source UNITE:8.0 --primer_required none
         
-    amptk database -i sh_general_release_dynamic_s_28.06.2017_dev.fasta \
-        -o ITS1_UTAX --create_db utax -f ITS1-F -r ITS2 --keep_all
-        --derep_fulllength --lca --install 
+	amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
+		-o ITS1_UTAX -f ITS1-F -r ITS2 --primer_required rev --derep_fulllength \
+		--create_db utax --install --subsample 65000 --source UNITE:8.0
         
-    amptk database -i sh_general_release_dynamic_s_28.06.2017_dev.fasta \
+    amptk database -i sh_general_release_dynamic_all_02.02.2019_dev.fasta \
         -o ITS2_UTAX --create_db utax -f fITS7 -r ITS4 --derep_fulllength \
-        --lca --install 
+         --install --source UNITE:8.0 --primer_required for
 
 **Arthropod/Chordate mtCOI DB**
 
 These data were pulled from the `BOLDv4 database <http://v4.boldsystems.org>`_  Since most studies using mtCOI regions are interested in identification of insects in the diets of animals, the BOLD database was queried as follows.  All Chordata sequences were downloaded by querying the `BIN database using the search term Chordata <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Chordata&searchBIN=Search+BINs>`_.  Similarly, the Arthropods were searched by querying the `BIN databases using the search term Arthropoda <http://v4.boldsystems.org/index.php/Public_BINSearch?query=Arthropoda&searchBIN=Search+BINs>`_.  All data was then downloaded as TSV output.
 
 The TSV output files (~ 6GB) where then each formatted using the following method, which reformats the taxonomy information and pulls sequences that are annotated in BINS and then clusters sequences in each bin to 99%.
 
+Because downloading the arthropod dataset can take days, copies of the data are available if you'd like to experiment with it: `chordates <https://osf.io/9bh2f/download?version=1>`_ and `arthropods <https://osf.io/aqrey/download?version=1>`_.
+
 .. code-block:: none
 
     #reformat taxonomy
-    bold2utax.py -i Arthropoda_bold_data.txt -o arthropoda.bold.bins.fa
-    bold2utax.py -i Chordata_bold_data.txt -o chordata.bold.bins.fa
+    bold2utax.py -i chordates.bold.02092019.txt -o chordates --cluster 99 --drop_suppressed
+    bold2utax.py -i arthropods.bold.02092019.txt -o arthropods --cluster 99 --drop_suppressed
 
-    #combine datasets
-    cat arthropoda.bold.bins.fa chordata.bold.bins.fa > all.data.bins.fa
+    #combine datasets for usearch
+    cat arthropods.bold-reformated.fa chordates.bold-reformated.fa > arth-chord.bold-reformated.fasta
     
     #generate global alignment database
-    amptk database -i all.data.bins.fa --skip_trimming --keep_all --min_len 125 \
-        --derep_fulllength --create_db usearch -o COI --format off --install
+    amptk database -i arth-chord.bold-reformated.fasta -f LCO1490 -r mlCOIintR --primer_required none \
+        --derep_fulllength --format off --primer_mismatch 4 -o COI --min_len 200 --create_db usearch \
+        --install --source BOLD:20190219
 
-The data is then further processed with a second script that will search for priming sites and then randomly subsample the data down to a number of records that can be used to train UTAX and then database was created.
+The second set of output files from ``bold2utax.py`` is named with the suffix ``.BIN-consensus.fa``; these files contain the result of 99% clustering within each BIN. We combine them for the two datasets and then use the combined file to generate the SINTAX and UTAX databases.
 
 .. code-block:: none
 
- #searches for priming sites and subsamples to 90,000 records
- bold2amptk.py -i all.data.bins.fa -o arthropods.chordates
+ 	#combine datasets
+ 	cat arthropods.BIN-consensus.fa chordates.BIN-consensus.fa > arth-chord.bold.BIN-consensus.fasta
  
- #generate utax database
- amptk database -i arthropods.chordates.genus4utax.fa -o COI_UTAX \
-    --format off --create_db utax --skip_trimming --install
+ 	#generate SINTAX database
+	amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required none \
+  		--derep_fulllength --format off --primer_mismatch 4 -o COI_SINTAX --min_len 200 --create_db sintax \
+  		--install --source BOLD:20190219
+  		
+        #generate UTAX database; subsample to avoid memory limits of 32-bit usearch, and require a reverse primer match
+	amptk database -i arth-chord.bold.BIN-consensus.fasta -f LCO1490 -r mlCOIintR --primer_required rev \
+		--derep_fulllength --format off --subsample 30000 --primer_mismatch 4 -o COI_UTAX --min_len 200 \
+		--create_db utax --install --source BOLD:20190219
 
 **LSU database**
 
 The fungal 28S database (LSU) was downloaded from `RDP <http://rdp.cme.msu.edu/download/current_Fungi_unaligned.fa.gz>`_.  The sequences were then converted into AMPtk databases as follows:
 
 .. code-block:: none
 
- amptk database -i fungi.unaligned.fa -o LSU --format rdp2utax \
-    --skip_trimming --create_db usearch --derep_fulllength --keep_all --install
+ 	amptk database -i RDP_v8.0_fungi.fa -o LSU --format rdp2utax --primer_required none \
+    	--skip_trimming --create_db usearch --derep_fulllength --install --source RDP:8
+
+ 	amptk database -i RDP_v8.0_fungi.fa -o LSU_SINTAX --format rdp2utax --primer_required none \
+    	--skip_trimming --create_db sintax --derep_fulllength --install --source RDP:8
 
+ 	amptk database -i RDP_v8.0_fungi.fa -o LSU_UTAX --format rdp2utax --primer_required none \
+    	--skip_trimming --create_db utax --derep_fulllength --install --source RDP:8 --subsample 4500
+    	
+    	  	
 To generate a training set for UTAX, the sequences were first dereplicated, and clustered at 97% to get representative sequences for training.  This training set was then converted to a UTAX database:
 
 .. code-block:: none
 
- amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
-    --skip_trimming --create_db utax --keep_all --install
+ 	amptk database -i fungi.trimmed.fa -o LSU_UTAX --format off \
+    	--skip_trimming --create_db utax --keep_all --install
 
 **16S database**
-This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk.  Note there is room for substantial improvement here, I just don't typically work on 16S - so please let me know if you want some suggestions on what to do here.
+This is downloaded from `R. Edgar's website <http://drive5.com/utax/data/rdp_v16.tar.gz>`_ and then formatted for AMPtk.  Note there is room for substantial improvement here, I just don't typically work on 16S - so please let me know if you want some suggestions on what to do here.  Here I reformatted the "domain" taxonomy level to "kingdom" for simplicity (even though I know it is taxonomically incorrect).
 
 .. code-block:: none
 
- amptk database -i rdp_v16.fa -o 16S --format off --create_db utax \
-    --skip_trimming --keep_all --install
+ 	amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S --format off --create_db usearch \
+    	--skip_trimming --install --primer_required none --derep_fulllength
+    	
+    amptk database -i rdp_16s_v16_sp.kingdom.fa -o 16S_SINTAX --format off --create_db sintax \
+        -f 515FB -r 806RB --install --primer_required for --derep_fulllength
+    	
+    
+    
+
 
 Checking Installed Databases
 -------------------------------------
diff --git a/scripts/bold2amptk.py b/scripts/bold2amptk.py
diff --git a/scripts/bold2utax.py b/scripts/bold2utax.py
@@ -112,8 +112,8 @@ def __init__(self,prog):
                     else:
                         continue
                 if args.drop_suppressed:
-                	if 'SUPPRESSED' in GB:
-                		continue
+                    if 'SUPPRESSED' in GB:
+                        continue
                 #clean up sequence, remove any gaps, remove terminal N's
                 Seq = col[seqid].replace('-', '')
                 Seq = Seq.strip('N')
@@ -168,11 +168,15 @@ def __init__(self,prog):
         print("Updating taxonomy")
         #finally loop through centroids and get taxonomy from dictionary
         finalcount = 0
+        seen = set()
         with open(args.out+'.BIN-consensus.fa', 'w') as outputfile:
             for file in os.listdir(tmp):
                 if file.endswith('.consensus.fa'):
                     for record in FastaIterator(open(os.path.join(tmp, file))):
-                        record.id = record.id.replace('consensus=', '')
+                        if 'consensus=' in record.id:
+                            record.id = record.id.replace('consensus=', '')
+                        elif 'centroid=' in record.id:
+                            record.id = record.id.replace('centroid=', '')
                         finalcount += 1
                         fullname = record.id.split(';')[0]
                         ID = fullname.split('_')[0]
@@ -181,7 +185,9 @@ def __init__(self,prog):
                         else:
                             print('{:} not found in taxonomy dictionary'.format(ID))
                             continue
-                        outputfile.write('>{:};tax={:}\n{:}\n'.format(ID, tax, amptklib.softwrap(str(record.seq))))
+                        if fullname not in seen:
+                            outputfile.write('>{:};tax={:}\n{:}\n'.format(fullname, tax, amptklib.softwrap(str(record.seq))))
+                            seen.add(fullname)
 
         print("Wrote %i consensus seqs for each BIN to %s" % (finalcount, args.out+'.BIN-consensus.fa'))
     shutil.rmtree(tmp)
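
The header cleanup and de-duplication added to ``bold2utax.py`` in the last two hunks can be sketched in isolation. This is a minimal sketch, not the script's actual code: the function names, the plain-dict taxonomy lookup, and the example record IDs are assumptions made for illustration only.

```python
def strip_cluster_prefix(record_id):
    """Remove a 'consensus=' or 'centroid=' label from a FASTA record ID."""
    for prefix in ("consensus=", "centroid="):
        if prefix in record_id:
            return record_id.replace(prefix, "")
    return record_id


def dedupe_records(records, taxonomy):
    """Keep the first record seen for each full name that has taxonomy.

    records  -- iterable of (record_id, sequence) pairs (hypothetical input shape)
    taxonomy -- dict mapping a BIN-style ID to a taxonomy string (assumed lookup)
    """
    seen = set()
    kept = []
    for record_id, seq in records:
        # Full name is everything before the first ';' annotation.
        fullname = strip_cluster_prefix(record_id).split(";")[0]
        # The taxonomy key is the part of the name before the first '_'.
        bin_id = fullname.split("_")[0]
        if bin_id not in taxonomy:
            continue  # no taxonomy available, skip the record
        if fullname not in seen:
            seen.add(fullname)
            kept.append((fullname, taxonomy[bin_id], seq))
    return kept


# Hypothetical example data; real IDs produced by the clustering step may differ.
records = [
    ("consensus=BOLD:AAA0001_1;size=3", "ACGT"),
    ("centroid=BOLD:AAA0001_1;size=1", "ACGT"),
    ("centroid=BOLD:AAA0002_1;size=2", "GGCC"),
    ("centroid=BOLD:ZZZ0009_1", "TTAA"),
]
taxonomy = {
    "BOLD:AAA0001": "k:Animalia,p:Arthropoda",
    "BOLD:AAA0002": "k:Animalia,p:Chordata",
}
kept = dedupe_records(records, taxonomy)
for name, tax, seq in kept:
    print(">{};tax={}\n{}".format(name, tax, seq))
```

Counting the entries in ``kept`` gives the number of records actually written, which can differ from a counter incremented before the taxonomy and duplicate checks.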