mmseqs databases GTDB setup fails #561

Closed
AroneyS opened this issue May 9, 2022 · 14 comments

Comments

@AroneyS

AroneyS commented May 9, 2022

Expected Behavior

Completes databases workflow, creating GTDB database.

Current Behavior

Error occurs near the end of the workflow.
Files created: gtdb gtdb.dbtype gtdb_h gtdb_h.dbtype gtdb_h.index gtdb.index gtdb.source tmp

Steps to Reproduce (for bugs)

mmseqs databases GTDB gtdb tmp

MMseqs Output (for bugs)

Create directory tmp
databases GTDB gtdb tmp 

MMseqs Version:              	13.45111
Force restart with latest tmp	false
Remove temporary files       	false
Compressed                   	0
Threads                      	128
Verbosity                    	3

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43.1M  100 43.1M    0     0  4930k      0  0:00:08  0:00:08 --:--:-- 6278k

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   334  100   334    0     0   6185      0 --:--:-- --:--:-- --:--:--  6185
tar2db tmp/18048483634160780024/gtdb.tar.gz tmp/18048483634160780024/tardb --tar-include faa$ --threads 128 -v 3 

Time for merging to tardb: 0h 0m 0s 113ms
Time for merging to tardb.lookup: 0h 0m 0s 110ms
Time for processing: 0h 0m 0s 443ms
createdb tmp/18048483634160780024/tardb gtdb --compressed 0 -v 3 

Converting sequences

Time for merging to gtdb_h: 0h 0m 0s 25ms
Time for merging to gtdb: 0h 0m 0s 24ms
Database type: Nucleotide
The input files have no entry:  - tmp/18048483634160780024/tardb
Please check your input files. Only files in fasta/fastq[.gz|bz2] are supported
Error: createdb died
The process 'mmseqs' has failed.

Context

Downloading GTDB db.

Your Environment

Using conda installation of mmseqs (MMseqs2 Version: 13.45111)
128cpu/1000GB mem. Support for AVX2

@dariogf

dariogf commented May 9, 2022

Same here. I have tracked it down to this command:

mmseqs tar2db /localscratch/users/latest/gtdb.tar.gz /localscratch/users/latest/tardb --tar-include faa$

The problem is with the regular expression passed to the --tar-include option. I cannot figure out why, but with the $ at the end it doesn't work. If you remove the $ it works, but obviously it will accept more files than desired.

I have tried different regular expressions, escaping the $, quoting it, etc. None of them works. It seems that tar2db fails silently when given any regular expression, but works when given a simple string.

@dariogf

dariogf commented May 9, 2022

Listing gtdb.tar.gz, it seems that removing the $ will not affect the number of files used in the tardb, because all files are .faa. So I have edited the download.sh script and am building the database this way until a fix is released:

/localscratch/users/latest # diff download.sh ../*bkp/download.sh
6c6
< MMSEQS=mmseqs # add this or define it as an ENV variable.
---
> 
374,376c374
<         #"${MMSEQS}" tar2db "${TMP_PATH}/gtdb.tar.gz" "${TMP_PATH}/tardb" --tar-include 'faa$' ${THREADS_PAR} \
< 
<         "${MMSEQS}" tar2db "${TMP_PATH}/gtdb.tar.gz" "${TMP_PATH}/tardb" --tar-include 'faa' ${THREADS_PAR} \
---
>         "${MMSEQS}" tar2db "${TMP_PATH}/gtdb.tar.gz" "${TMP_PATH}/tardb" --tar-include 'faa$' ${THREADS_PAR} \

It also seems to fail on servers with 512 GB of RAM due to some mapping of the files into memory, so I have used one with 2 TB. It seems that the program is using

Using MMseqs2 Version: 92deb92
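
The check described above (whether dropping the $ changes which archive members are selected) can be repeated with a short sketch like the following. It is only an illustration: count_tar_members is a hypothetical helper name, not part of MMseqs2, and it assumes a tar that accepts -z for gzip archives.

```shell
# Count archive members whose names match / do not match a regex.
# Usage: count_tar_members ARCHIVE REGEX
# Hypothetical helper for illustration; not part of MMseqs2.
count_tar_members() {
    archive=$1
    pattern=$2
    # tar -t lists member names without extracting anything.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    matched=$(tar -tzf "$archive" | grep -Ec -- "$pattern" || true)
    total=$(tar -tzf "$archive" | wc -l | tr -d ' ')
    echo "matched=$matched other=$((total - matched))"
}
```

Running it once with 'faa$' and once with 'faa' on gtdb.tar.gz should print the same counts if every member really ends in .faa. Note that each call reads the whole archive twice, which is slow for a download of this size.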

@milot-mirdita
Member

Which Linux distribution and version are you using? I am trying to get 'faa$' to fail, but can't manage to on my machine.

@milot-mirdita
Member

I think I found the issue. It's mostly unrelated to the regex itself.

When an entry is skipped, we don't correctly update the data offset for the next tar entry.
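
To illustrate the kind of bookkeeping involved, here is a minimal sketch of walking a tar archive by hand. It is not the MMseqs2 implementation; walk_tar is a hypothetical name, and the sketch assumes an uncompressed ustar archive with short member names (no pax or GNU long-name records). The point is the last line of the loop: the offset must move past the entry's data whether or not the entry was selected.

```shell
# Print "name size" for tar members whose names match REGEX by reading raw headers.
# Usage: walk_tar ARCHIVE.tar REGEX
# Illustration only; assumes a plain ustar archive with short names.
walk_tar() {
    tarfile=$1
    pattern=$2
    total=$(wc -c < "$tarfile" | tr -d ' ')
    offset=0
    while [ $((offset + 512)) -le "$total" ]; do
        # Bytes 0-99 of each 512-byte header hold the NUL-padded member name.
        name=$(dd if="$tarfile" bs=1 skip="$offset" count=100 2>/dev/null | tr -d '\000')
        # An all-zero header block marks the end of the archive.
        [ -n "$name" ] || break
        # Bytes 124-135 hold the data size as octal digits.
        size_oct=$(dd if="$tarfile" bs=1 skip=$((offset + 124)) count=12 2>/dev/null | tr -cd '0-7')
        # The leading 0 makes the shell read the digits as octal.
        size=$((0$size_oct))
        if printf '%s\n' "$name" | grep -Eq -- "$pattern"; then
            printf '%s %s\n' "$name" "$size"
        fi
        # Selected or skipped, the next header starts after this header
        # plus the data rounded up to whole 512-byte blocks.
        offset=$((offset + 512 + (size + 511) / 512 * 512))
    done
}
```

If the offset update were done only for selected entries, the loop would read file contents as if they were the next header as soon as one entry was skipped.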

@dariogf

dariogf commented Jul 4, 2022

OK. In any case, we use openSUSE Leap 15.2.

milot-mirdita added a commit that referenced this issue Jul 4, 2022
We were not correctly updating the position in the tar file, when files were skipped
@mmpust

mmpust commented Aug 4, 2022

I have the same problem.

Expected Behavior

I want to download and use the GTDB database with

mkdir databases
mmseqs databases GTDB databases/gtdb databases/tmp_gtdb

Current Behavior

The process is killed and the output remains empty.

total 12K
-rw-r--r-- 1 user user    0 Aug  4 15:19 gtdb
-rw-r--r-- 1 user user    4 Aug  4 15:19 gtdb.dbtype
-rw-r--r-- 1 user user    0 Aug  4 15:19 gtdb_h
-rw-r--r-- 1 user user    4 Aug  4 15:19 gtdb_h.dbtype
-rw-r--r-- 1 user user    0 Aug  4 15:19 gtdb_h.index
-rw-r--r-- 1 user user    0 Aug  4 15:19 gtdb.index
-rw-r--r-- 1 user user    0 Aug  4 15:19 gtdb.source
drwxr-xr-x 3 user user 4.0K Aug  4 13:37 tmp_gtdb
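
A failed run like the one listed above can be detected from a script with a small check along these lines. This is a sketch under an assumption: that a successfully built sequence database has non-empty data and index files for both sequences and headers. check_db_files is a hypothetical helper name, not an MMseqs2 command.

```shell
# Report which core files of a sequence database are missing or empty.
# Usage: check_db_files PREFIX   (for example: check_db_files databases/gtdb)
# Hypothetical helper; the file list is taken from the directory listing above.
check_db_files() {
    prefix=$1
    rc=0
    for f in "$prefix" "$prefix.index" "${prefix}_h" "${prefix}_h.index"; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

The .dbtype files are left out of the check on purpose, since they are 4 bytes long even in the failed run shown here.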

MMseqs Output (for bugs)

'''
Create directory databases/tmp_gtdb
databases GTDB databases/gtdb databases/tmp_gtdb

MMseqs Version: 13.45111
Force restart with latest tmp false
Remove temporary files false
Compressed 0
Threads 32
Verbosity 3

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 30 100 30 0 0 21 0 0:00:01 0:00:01 --:--:-- 21
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 40.2G 100 40.2G 0 0 6909k 0 1:41:44 1:41:44 --:--:-- 5539k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 43.1M 100 43.1M 0 0 4473k 0 0:00:09 0:00:09 --:--:-- 8160k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 334 100 334 0 0 190 0 0:00:01 0:00:01 --:--:-- 190
tar2db databases/tmp_gtdb/14203905869062371748/gtdb.tar.gz databases/tmp_gtdb/14203905869062371748/tardb --tar-include faa$ --threads 32 -v 3

Time for merging to tardb: 0h 0m 0s 1ms
Time for merging to tardb.lookup: 0h 0m 0s 2ms
Time for processing: 0h 0m 0s 36ms
createdb databases/tmp_gtdb/14203905869062371748/tardb databases/gtdb --compressed 0 -v 3

Converting sequences

Time for merging to gtdb_h: 0h 0m 0s 2ms
Time for merging to gtdb: 0h 0m 0s 1ms
Database type: Nucleotide
The input files have no entry: - databases/tmp_gtdb/14203905869062371748/tardb
Please check your input files. Only files in fasta/fastq[.gz|bz2] are supported
Error: createdb died
'''

Your Environment

  • Debian 11 Bullseye
  • 32 vCPU
  • 212992 MiB

@milot-mirdita
Member

Please download the latest precompiled static binaries from https://mmseqs.com/latest

The GTDB download should work with these.

@mmpust

mmpust commented Aug 5, 2022

It works like a charm with the latest precompiled static binaries, thanks.

@martin-steinegger
Member

This should now be available in our newest release: https://github.com/soedinglab/MMseqs2/releases/tag/14-7e284

@shaodongyan

It does not work with MMseqs2 390457d.

@Biofarmer

Biofarmer commented Jun 13, 2023

Hi, it still does not work with MMseqs2 version 14.7e284. Additionally, the VERSION file cannot be downloaded successfully by the workflow. It can be downloaded manually, which suggests the internet connection is fine.

@Biofarmer

Any suggestions? Thanks.

@csm276

csm276 commented Aug 8, 2023

Hi,
This article might help you a little: https://blog.csdn.net/Desinty_/article/details/132166492

@milot-mirdita
Member

This should work again in MMseqs2 release 15. Please open a new issue if it's still failing.
