Commit 5001dc8

committed
major code refactoring
1 parent 12fc9a2 commit 5001dc8

34 files changed, +6447 -2608 lines changed

DAS_Tool

+2-435
Large diffs are not rendered by default.

README.md

+59-78
@@ -11,65 +11,61 @@ Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, M
1111
# Usage
1212

1313
```
14-
DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
15-
-l methodA,...,methodN -c contigs.fa -o myOutput
16-
17-
-i, --bins Comma separated list of tab separated scaffolds to bin tables.
18-
-c, --contigs Contigs in fasta format.
19-
-o, --outputbasename Basename of output files.
20-
-l, --labels Comma separated list of binning prediction names. (optional)
21-
--search_engine Engine used for single copy gene identification [blast/diamond/usearch].
22-
(default: usearch)
23-
--write_bin_evals Write evaluation for each input bin set [0/1]. (default: 1)
24-
--create_plots Create binning performance plots [0/1]. (default: 1)
25-
--write_bins Export bins as fasta files [0/1]. (default: 0)
26-
--proteins Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
27-
Gene prediction step will be skipped if given. (optional)
28-
--score_threshold Score threshold until selection algorithm will keep selecting bins [0..1].
29-
(default: 0.5)
30-
--duplicate_penalty Penalty for duplicate single copy genes per bin (weight b).
31-
Only change if you know what you're doing. [0..3]
32-
(default: 0.6)
33-
--megabin_penalty Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
34-
(default: 0.5)
35-
--db_directory Directory of single copy gene database. (default: install_dir/db)
36-
--resume Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
37-
--debug Write debug information to log file.
38-
-t, --threads Number of threads to use. (default: 1)
39-
-v, --version Print version number and exit.
40-
-h, --help Show this message.
14+
DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>
15+
16+
Options:
17+
-i --bins=<contig2bin> Comma separated list of tab separated contigs to bin tables.
18+
-c --contigs=<contigs> Contigs in fasta format.
19+
-o --outputbasename=<outputbasename> Basename of output files.
20+
-l --labels=<labels> Comma separated list of binning prediction names.
21+
--search_engine=<search_engine> Engine used for single copy gene identification (blast/diamond/usearch) [default: diamond].
22+
-p --proteins=<proteins> Predicted proteins (optional) in prodigal fasta format (>contigID_geneNo).
23+
Gene prediction step will be skipped.
24+
--write_bin_evals Write evaluation of input bin sets.
25+
--write_bins Export bins as fasta files.
26+
--write_unbinned Export unbinned contigs as fasta file (--write_bins needs to be set).
27+
-t --threads=<threads> Number of threads to use [default: 1].
28+
--score_threshold=<score_threshold> Score threshold until selection algorithm will keep selecting bins (0..1) [default: 0.5].
29+
--duplicate_penalty=<duplicate_penalty> Penalty for duplicate single copy genes per bin (weight b).
30+
Only change if you know what you are doing (0..3) [default: 0.6].
31+
--megabin_penalty=<megabin_penalty> Penalty for megabins (weight c). Only change if you know what you are doing (0..3) [default: 0.5].
32+
--dbDirectory=<dbDirectory> Directory of single copy gene database [default: db].
33+
--resume Use existing predicted single copy gene files from a previous run.
34+
--debug Write debug information to log file.
35+
-v --version Print version number and exit.
36+
-h --help Show this.
4137
4238
```
4339

4440

4541
### Input file format
46-
- Bins [\--bins, -i]: Tab separated files of scaffold-IDs and bin-IDs.
47-
Scaffolds to bin file example:
42+
- Bins [\--bins, -i]: Tab separated files of contig-IDs and bin-IDs.
43+
Contigs to bin file example:
4844
```
49-
Scaffold_1 bin.01
50-
Scaffold_8 bin.01
51-
Scaffold_42 bin.02
52-
Scaffold_49 bin.03
45+
Contig_1 bin.01
46+
Contig_8 bin.01
47+
Contig_42 bin.02
48+
Contig_49 bin.03
5349
```
5450
- Contigs [\--contigs, -c]: Assembled contigs in fasta format:
5551
```
56-
>Scaffold_1
52+
>Contig_1
5753
ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA...
58-
>Scaffold_2
54+
>Contig_2
5955
GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG...
6056
```
6157

62-
- Proteins (optional) [\--proteins]: Predicted proteins in prodigal fasta format. Header contains scaffold-ID and gene number:
58+
- Proteins (optional) [\--proteins]: Predicted proteins in prodigal fasta format. Header contains contig-ID and gene number:
6359
```
64-
>Scaffold_1_1
60+
>Contig_1_1
6561
MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG...
66-
>Scaffold_1_2
62+
>Contig_1_2
6763
MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN...
6864
```
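The three input formats above have to agree on contig IDs: every ID in a contigs-to-bin table should also appear as a header in the contigs fasta file. The following is a minimal sketch of such a consistency check, not part of DAS Tool itself. The function names are hypothetical, and it assumes that a contig ID is the first whitespace-delimited token of a fasta header.

```python
import io


def read_fasta_ids(handle):
    """Return the set of sequence IDs found in a fasta file.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def read_contigs2bin(handle):
    """Parse a tab separated contig-ID / bin-ID table into a dict."""
    table = {}
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab separated fields, got {len(fields)}"
            )
        contig_id, bin_id = fields
        table[contig_id] = bin_id
    return table


def missing_contigs(table, fasta_ids):
    """Return binned contig IDs that are absent from the fasta file."""
    return sorted(contig for contig in table if contig not in fasta_ids)


if __name__ == "__main__":
    # Toy data modelled on the README examples.
    contigs2bin = io.StringIO("Contig_1\tbin.01\nContig_8\tbin.01\nContig_42\tbin.02\n")
    fasta = io.StringIO(">Contig_1\nATCATCG\n>Contig_2\nGATCGTC\n")
    table = read_contigs2bin(contigs2bin)
    print(missing_contigs(table, read_fasta_ids(fasta)))
```

On real data, the `io.StringIO` objects would be replaced by file handles opened on the contigs2bin table and the contigs fasta file.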
6965

7066
### Output files
7167
- Summary of output bins including quality and completeness estimates (DASTool_summary.txt).
72-
- Scaffolds to bin file of output bins (DASTool_scaffolds2bin.txt).
68+
- Contigs to bin file of output bins (DASTool_contigs2bin.txt).
7369
- Quality and completeness estimates of input bin sets, if ```--write_bin_evals 1``` is set ([method].eval).
7470
- Plots showing the amount of high quality bins and score distribution of bins per method, if ```--create_plots 1``` is set (DASTool_hqBins.pdf, DASTool_scores.pdf).
7571
- Bins in fasta format if ```--write_bins 1``` is set (DASTool_bins).
@@ -80,26 +76,26 @@ MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN...
8076

8177
**Example 1:** Run DAS Tool on binning predictions of MetaBAT, MaxBin, CONCOCT and tetraESOMs. Output files will start with the prefix *DASToolRun1*:
8278
```
83-
$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,
84-
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,
85-
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,
86-
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv
79+
$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_contigs2bin.tsv,
80+
sample_data/sample.human.gut_maxbin2_contigs2bin.tsv,
81+
sample_data/sample.human.gut_metabat_contigs2bin.tsv,
82+
sample_data/sample.human.gut_tetraESOM_contigs2bin.tsv
8783
-l concoct,maxbin,metabat,tetraESOM
8884
-c sample_data/sample.human.gut_contigs.fa
8985
-o sample_output/DASToolRun1
9086
```
9187

92-
**Example 2:** Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step, disable writing of bin evaluations, set the number of threads to 2 and score threshold to 0.6. Output files will start with the prefix *DASToolRun2*:
88+
**Example 2:** Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step, output evaluations of input bins, set the number of threads to 2 and score threshold to 0.6. Output files will start with the prefix *DASToolRun2*:
9389
```
94-
$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,
95-
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,
96-
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,
97-
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv
90+
$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_contigs2bin.tsv,
91+
sample_data/sample.human.gut_maxbin2_contigs2bin.tsv,
92+
sample_data/sample.human.gut_metabat_contigs2bin.tsv,
93+
sample_data/sample.human.gut_tetraESOM_contigs2bin.tsv
9894
-l concoct,maxbin,metabat,tetraESOM
9995
-c sample_data/sample.human.gut_contigs.fa
10096
-o sample_output/DASToolRun2
10197
--proteins sample_output/DASToolRun1_proteins.faa
102-
--write_bin_evals 0
98+
--write_bin_evals
10399
--threads 2
104100
--score_threshold 0.6
105101
```
@@ -108,15 +104,15 @@ $ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,
108104
# Dependencies
109105

110106
- R (>= 3.2.3): https://www.r-project.org
111-
- R-packages: data.table (>= 1.9.6), doMC (>= 1.3.4), ggplot2 (>= 2.1.0)
107+
- R-packages: data.table (>= 1.9.6), magrittr (>= 2.0.1), docopt (>= 0.7.1)
112108
- ruby (>= v2.3.1): https://www.ruby-lang.org
113109
- Pullseq (>= 1.0.2): https://github.com/bcthomas/pullseq
114110
- Prodigal (>= 2.6.3): https://github.com/hyattpd/Prodigal
115111
- coreutils (only macOS/ OS X): https://www.gnu.org/software/coreutils
116112
- One of the following search engines:
117-
- USEARCH* (>= 8.1): http://www.drive5.com/usearch/download.html
118113
- DIAMOND (>= 0.9.14): https://ab.inf.uni-tuebingen.de/software/diamond
119114
- BLAST+ (>= 2.5.0): https://blast.ncbi.nlm.nih.gov/Blast.cgi
115+
- USEARCH* (>= 8.1): http://www.drive5.com/usearch/download.html
120116

121117
\*) The free version of USEARCH only can use up to 4Gb RAM. Therefore, the use of DIAMOND or BLAST+ is recommended for big datasets.
122118

@@ -128,9 +124,6 @@ $ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,
128124
unzip DAS_Tool-1.x.x.zip
129125
cd ./DAS_Tool-1.x.x
130126
131-
# Install R-packages:
132-
R CMD INSTALL ./package/DASTool_1.x.x.tar.gz
133-
134127
# Unzip SCG database:
135128
unzip ./db.zip -d db
136129
@@ -143,24 +136,12 @@ Installation of dependent R-packages:
143136
```
144137
$ R
145138
> repo='http://cran.us.r-project.org' #select a repository
146-
> install.packages('doMC', repos=repo, dependencies = T)
147139
> install.packages('data.table', repos=repo, dependencies = T)
148-
> install.packages('ggplot2', repos=repo, dependencies = T)
149-
> q() #quit R-session
150-
```
151-
152-
After installing all dependent R-packages, the DAS Tool R-functions can be installed in a bash terminal:
153-
```
154-
$ R CMD INSTALL ./package/DASTool_1.x.x.tar.gz
155-
```
156-
...or in an R-session:
157-
```
158-
$ R
159-
> install.packages('package/DASTool_1.x.x.tar.gz')
140+
> install.packages('magrittr', repos=repo, dependencies = T)
141+
> install.packages('docopt', repos=repo, dependencies = T)
160142
> q() #quit R-session
161143
```
162144

163-
164145
# Installation using conda or homebrew
165146
DAS Tool now can also be installed via bioconda and homebrew.
166147

@@ -191,46 +172,46 @@ brew install brewsci/bio/das_tool
191172

192173
# Preparation of input files
193174

194-
Not all binning tools provide results in a tab separated file of scaffold-IDs and bin-IDs. A helper script can be used to convert a set of bins in fasta format to tabular scaffolds2bin file, which can be used as input for DAS Tool: `src/Fasta_to_Scaffolds2Bin.sh -h`.
175+
Not all binning tools provide results in a tab separated file of contig-IDs and bin-IDs. A helper script can be used to convert a set of bins in fasta format to tabular contigs2bin file, which can be used as input for DAS Tool: `src/Fasta_to_Contigs2Bin.sh -h`.
195176

196177
### Usage:
197178
```
198-
Fasta_to_Scaffolds2Bin: Converts genome bins in fasta format to scaffolds-to-bin table.
179+
Fasta_to_Contigs2Bin: Converts genome bins in fasta format to contigs-to-bin table.
199180
200-
Usage: Fasta_to_Scaffolds2Bin.sh -e fasta > my_scaffolds2bin.tsv
181+
Usage: Fasta_to_Contigs2Bin.sh -e fasta > my_contigs2bin.tsv
201182
202183
-e, --extension Extension of fasta files. (default: fasta)
203184
-i, --input_folder Folder with bins in fasta format. (default: ./)
204185
-h, --help Show this message.
205186
```
206187

207-
### Example: Converting MaxBin fasta output into tab separated scaffolds2bin file:
188+
### Example: Converting MaxBin fasta output into tab separated contigs2bin file:
208189
```
209190
$ ls /maxbin/output/folder
210191
maxbin.001.fasta maxbin.002.fasta maxbin.003.fasta...
211192
212-
$ src/Fasta_to_Scaffolds2Bin.sh -i /maxbin/output/folder -e fasta > maxbin.scaffolds2bin.tsv
193+
$ src/Fasta_to_Contigs2Bin.sh -i /maxbin/output/folder -e fasta > maxbin.contigs2bin.tsv
213194
214-
$ head gut_maxbin2_scaffolds2bin.tsv
195+
$ head gut_maxbin2_contigs2bin.tsv
215196
NODE_10_length_127450_cov_375.783524 maxbin.001
216197
NODE_27_length_95143_cov_427.155298 maxbin.001
217198
NODE_51_length_78315_cov_504.322425 maxbin.001
218199
NODE_84_length_66931_cov_376.684775 maxbin.001
219200
NODE_87_length_65653_cov_460.202156 maxbin.001
220201
```
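For readers who prefer to see the conversion spelled out, here is a rough Python sketch of the same idea as the helper script. It is not the script's actual implementation. It assumes the bin ID is the file name without its extension (as in `maxbin.001.fasta` giving `maxbin.001` above) and that the contig ID is the first whitespace-delimited token of each fasta header; the function name is hypothetical.

```python
import tempfile
from pathlib import Path


def fasta_to_contigs2bin(input_folder, extension="fasta"):
    """Yield (contig_id, bin_id) pairs for every bin file in input_folder."""
    for fasta_path in sorted(Path(input_folder).glob(f"*.{extension}")):
        # Assumption: bin ID = file name with ".<extension>" stripped.
        bin_id = fasta_path.name[: -(len(extension) + 1)]
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    if fields:
                        yield fields[0], bin_id


if __name__ == "__main__":
    # Self-contained demo on a temporary folder with one toy bin.
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "maxbin.001.fasta").write_text(">NODE_10\nACGT\n>NODE_27\nGGCC\n")
        for contig_id, bin_id in fasta_to_contigs2bin(folder):
            print(f"{contig_id}\t{bin_id}")
```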
221202

222-
Some binning tools (such as CONCOCT) provide a comma separated tabular output. To convert a comma separated file into a tab separated file a one liner can be used: `perl -pe "s/,/\t/g;" scaffolds2bin.csv > scaffolds2bin.tsv`.
203+
Some binning tools (such as CONCOCT) provide a comma separated tabular output. To convert a comma separated file into a tab separated file a one liner can be used: `perl -pe "s/,/\t/g;" contigs2bin.csv > contigs2bin.tsv`.
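The same conversion can be sketched in Python for pipelines that avoid perl. This is an illustrative alternative, not something shipped with DAS Tool. It assumes a two-column csv without a header row, and unlike the one-liner it keeps only the first two columns instead of replacing every comma. The function name and the optional `prefix` parameter are hypothetical.

```python
import csv
import io


def csv_to_contigs2bin(csv_handle, tsv_handle, prefix=""):
    """Write 'contig<TAB>prefix+cluster' lines for each csv row."""
    for row in csv.reader(csv_handle):
        if len(row) < 2:
            continue  # skip empty or malformed rows
        contig_id, cluster_id = row[0], row[1]
        tsv_handle.write(f"{contig_id}\t{prefix}{cluster_id}\n")


if __name__ == "__main__":
    source = io.StringIO("NODE_2_length_147519_cov_33.166976,42\n")
    target = io.StringIO()
    csv_to_contigs2bin(source, target, prefix="concoct.")
    print(target.getvalue(), end="")
```

Passing a prefix such as `concoct.` turns bare cluster numbers into bin IDs that remain distinguishable once several binning methods are combined.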
223204

224-
### Example: Converting CONCOCT csv output into tab separated scaffolds2bin file:
205+
### Example: Converting CONCOCT csv output into tab separated contigs2bin file:
225206
```
226207
$ head concoct_clustering_gt1000.csv
227208
NODE_2_length_147519_cov_33.166976,42
228209
NODE_3_length_141012_cov_38.678171,42
229210
NODE_4_length_139685_cov_35.741896,42
230211
231-
$ perl -pe "s/,/\tconcoct./g;" concoct_clustering_gt1000.csv > concoct.scaffolds2bin.tsv
212+
$ perl -pe "s/,/\tconcoct./g;" concoct_clustering_gt1000.csv > concoct.contigs2bin.tsv
232213
233-
$ head concoct.scaffolds2bin.tsv
214+
$ head concoct.contigs2bin.tsv
234215
NODE_2_length_147519_cov_33.166976 concoct.42
235216
NODE_3_length_141012_cov_38.678171 concoct.42
236217
NODE_4_length_139685_cov_35.741896 concoct.42

package/DASTool_1.1.2.tar.gz

-8.77 KB
Binary file not shown.

sample_data/sample.human.gut_metabat_scaffolds2bin.tsv sample_data/sample.human.gut_metabat_contigs2bin.tsv

+38
@@ -280,3 +280,41 @@ Ley3_66761_scaffold_13347 metabat.85
280280
Ley3_66761_scaffold_14239 metabat.85
281281
Ley3_66761_scaffold_16210 metabat.7
282282
Ley3_66761_scaffold_23267 metabat.8
283+
Ley3_66761_scaffold_1663 TESTETEST.17
284+
Ley3_66761_scaffold_1761 TESTETEST.17
285+
Ley3_66761_scaffold_1820 TESTETEST.17
286+
Ley3_66761_scaffold_1855 TESTETEST.17
287+
Ley3_66761_scaffold_2133 TESTETEST.17
288+
Ley3_66761_scaffold_2244 TESTETEST.17
289+
Ley3_66761_scaffold_2271 TESTETEST.17
290+
Ley3_66761_scaffold_2442 TESTETEST.17
291+
Ley3_66761_scaffold_2621 TESTETEST.17
292+
Ley3_66761_scaffold_2637 TESTETEST.17
293+
Ley3_66761_scaffold_2738 TESTETEST.17
294+
Ley3_66761_scaffold_2826 TESTETEST.17
295+
Ley3_66761_scaffold_2847 TESTETEST.17
296+
Ley3_66761_scaffold_2910 TESTETEST.17
297+
Ley3_66761_scaffold_3497 TESTETEST.17
298+
Ley3_66761_scaffold_3760 TESTETEST.17
299+
Ley3_66761_scaffold_3927 TESTETEST.17
300+
Ley3_66761_scaffold_4346 TESTETEST.17
301+
Ley3_66761_scaffold_4858 TESTETEST.17
302+
Ley3_66761_scaffold_4971 TESTETEST.17
303+
Ley3_66761_scaffold_5019 TESTETEST.17
304+
Ley3_66761_scaffold_5117 TESTETEST.17
305+
Ley3_66761_scaffold_5790 TESTETEST.17
306+
Ley3_66761_scaffold_7034 TESTETEST.17
307+
Ley3_66761_scaffold_8624 TESTETEST.17
308+
Ley3_66761_scaffold_8800 TESTETEST.17
309+
Ley3_66761_scaffold_9985 TESTETEST.17
310+
Ley3_66761_scaffold_11091 TESTETEST.17
311+
Ley3_66761_scaffold_8 TESTETEST.83
312+
Ley3_66761_scaffold_42 TESTETEST.83
313+
Ley3_66761_scaffold_99 TESTETEST.83
314+
Ley3_66761_scaffold_127 TESTETEST.83
315+
Ley3_66761_scaffold_194 TESTETEST.83
316+
Ley3_66761_scaffold_215 TESTETEST.83
317+
Ley3_66761_scaffold_226 TESTETEST.83
318+
Ley3_66761_scaffold_359 TESTETEST.83
319+
Ley3_66761_scaffold_376 TESTETEST.83
320+
Ley3_66761_scaffold_386 TESTETEST.83

sample_output/DASToolRun1_DASTool.log

+37
@@ -0,0 +1,37 @@
1+
DAS Tool 1.1
2+
3+
4+
Parameters:
5+
--bins sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv
6+
--contigs sample_data/sample.human.gut_contigs.fa
7+
--outputbasename sample_output/DASToolRun1
8+
--labels concoct,maxbin,metabat,tetraESOM
9+
--search_engine diamond
10+
--proteins NULL
11+
--write_bin_evals FALSE
12+
--write_bins FALSE
13+
--write_unbinned FALSE
14+
--threads 1
15+
--score_threshold 0.5
16+
--duplicate_penalty 0.6
17+
--megabin_penalty 0.5
18+
--dbDirectory db
19+
--resume FALSE
20+
--debug FALSE
21+
--version FALSE
22+
--help FALSE
23+
--create_plots FALSE
24+
25+
26+
Dependencies:
27+
prodigal /usr/bin/prodigal
28+
diamond /usr/bin/diamond
29+
pullseq /usr/bin/pullseq
30+
ruby /usr/bin/ruby
31+
usearch
32+
blastp
33+
34+
Analyzing assembly
35+
Predicting genes
36+
Annotating single copy genes
37+
Dereplicating, aggregating, and scoring bins
