Clusterupdate : clustering of deleted sequences and conversion to tsv file #272

ApollineBruley · 2020-02-10T16:35:16Z

Expected Behavior

I want to update my clusters after a database update (in which I add new sequences but also delete sequences compared to the old database).
The clusterupdate command works, but when I try to convert the cluster database to a tsv file, I get an error message related to the index (see below).

I tried the same thing on a new database where I just added sequences and it worked perfectly, so I assume the problem comes from the fact that I remove sequences from the old database?

Current Behavior

Error when trying to generate the tsv file.
In the cluster database obtained after clusterupdate ('CLU_updated') the removed sequences still appear, but they are absent from the updated sequence database ('DB_updated').

Steps to Reproduce (for bugs)

Creation of old DB (oldDB.fa : 17 amino acid sequences)
mmseqs createdb oldDB.fa DB_old
Clustering of old DB
mmseqs cluster DB_old CLU_old tmp
Creation of new DB (newDB.fa : 13 sequences are identical to those in the old DB, 4 were removed, 4 were added)
mmseqs createdb newDB.fa DB_new
Cluster update
mmseqs clusterupdate DB_old DB_new CLU_old DB_updated CLU_updated tmp
No error there, but even though the sequences with numeric identifiers 12, 11, 16 and 15 in the old DB have been removed, they still appear in the CLU_updated file. They do not appear in the DB_updated files.
Conversion of the cluster DB to tsv:
mmseqs createtsv DB_updated DB_updated CLU_updated clusters.tsv
=> Error message, generation of empty files : clusters.tsv.1 ... clusters.tsv.7 and clusters.tsv.index.1 ... clusters.tsv.index.7
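
The mismatch described in these steps can be checked by hand. Below is a minimal Python sketch, not part of MMseqs2. It assumes an uncompressed, unsplit database: a data file whose entries are terminated by a NUL byte, a `.index` file with tab-separated key, offset and length columns, and cluster entries that list one member key per line. The function names are hypothetical.

```python
from pathlib import Path


def read_index_keys(index_text):
    """Return the set of keys found in the text of a .index file.

    Assumption: tab-separated columns key, offset, length.
    """
    keys = set()
    for line in index_text.splitlines():
        if line.strip():
            keys.add(line.split("\t")[0])
    return keys


def read_cluster_members(data_bytes):
    """Return the set of member keys listed in a cluster data file.

    Assumption: NUL-terminated entries with one member key per line.
    """
    members = set()
    for entry in data_bytes.split(b"\0"):
        for line in entry.decode("ascii").splitlines():
            if line.strip():
                members.add(line.split("\t")[0])
    return members


def stale_members(seq_index_text, cluster_data_bytes):
    """Keys referenced by the clustering but missing from the sequence DB."""
    return read_cluster_members(cluster_data_bytes) - read_index_keys(seq_index_text)


if __name__ == "__main__":
    seq_index = Path("DB_updated.index")
    clu_data = Path("CLU_updated")
    if seq_index.exists() and clu_data.exists():
        print(sorted(stale_members(seq_index.read_text(), clu_data.read_bytes())))
```

If the assumptions hold for the databases above, the printed list would be expected to contain the keys of the removed sequences.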

MMseqs Output (for bugs)

Program call:
createtsv DB_updated DB_updated CLU_updated clusters.tsv 

MMseqs Version:                  	2f66ae897fc813450fa5ef0c78123bd3c41c4717
first sequence as respresentative	false
Target column                    	1
Add Full Header                  	false
Database Output                  	false
Threads                          	8
Compressed                       	0
Verbosity                        	3

Query database: DB_updated
Touch data file DB_updated_h ... Done.
Result database: CLU_updated
Start writing to clusters.tsv
Invalid database read for database data file=DB_updated_h, database index=DB_updated_h.index
getData: local id (4294967295) >= db size (17)
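
A note on the last log line: 4294967295 equals 2^32 - 1, the largest unsigned 32-bit integer. One plausible reading, which is an assumption and not confirmed in this thread, is that it is a "not found" sentinel produced when a key from CLU_updated is looked up in DB_updated_h. The Python sketch below illustrates that pattern with hypothetical helpers `lookup_id` and `get_data`; it is not MMseqs2 code.

```python
NOT_FOUND = 2**32 - 1  # 4294967295, the largest unsigned 32-bit value


def lookup_id(keys, key):
    """Hypothetical lookup: position of `key` in `keys`, or NOT_FOUND."""
    try:
        return keys.index(key)
    except ValueError:
        return NOT_FOUND


def get_data(keys, local_id):
    """Mimic a bounds check like the one behind the error message above."""
    if local_id >= len(keys):
        raise IndexError(
            f"getData: local id ({local_id}) >= db size ({len(keys)})"
        )
    return keys[local_id]
```

With 17 keys and a lookup for a key that is not among them, `get_data(keys, lookup_id(keys, missing_key))` raises with the same wording as the log.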

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Git commit used: 2f66ae8
Which MMseqs version was used: Compilation from source
Cmake versions used: cmake version 3.5.1
Operating system and version: Ubuntu 16.04 LTS

Thank you in advance for your help :)


milot-mirdita · 2020-02-13T14:20:27Z

Sorry for the delay, would you mind uploading the two FASTA files (with the 17 seq each)?

I started refactoring the code and think I know what's wrong.

ApollineBruley · 2020-02-13T15:42:33Z

Here are the FASTA files.

Thanks for your help

fastas.zip

milot-mirdita · 2020-02-13T16:06:04Z

Thanks, I can reproduce the issue. I'll have to investigate what's going wrong.

Meanwhile, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your postal address to milot at mirdita de.

milot-mirdita · 2020-02-17T16:04:49Z

I've found out that we are not dealing well with deleted sequences.
Their presence in the clustering is resulting in the error you are seeing. I am refactoring some code, but it turns out to be a bit more work.

jessicaparks · 2020-05-07T18:17:59Z

@milot-mirdita I'm running into this same issue. Any update on progress?

nick-youngblut · 2020-07-22T14:22:31Z

I appear to be getting a similar error:

$ mmseqs clusterupdate --min-seq-id 0.9 -c 0.8 ../tests/output_n10/genes/cluster/genes_db /ebio/abt3_scratch/nyoungblut/user_genes/genes_db ../tests/output_n10/genes/cluster/clusters_db /ebio/abt3_scratch/nyoungblut/cluster_updated/genes_db /ebio/abt3_scratch/nyoungblut/cluster_updated/clusters_db.0 /ebio/abt3_scratch/nyoungblut/cluster_update
clusterupdate --min-seq-id 0.9 -c 0.8 ../tests/output_n10/genes/cluster/genes_db /ebio/abt3_scratch/nyoungblut/user_genes/genes_db ../tests/output_n10/genes/cluster/clusters_db /ebio/abt3_scratch/nyoungblut/cluster_updated/genes_db /ebio/abt3_scratch/nyoungblut/cluster_updated/clusters_db.0 /ebio/abt3_scratch/nyoungblut/cluster_update

MMseqs Version:                     	11.e1a1c
Seq. id. threshold                  	0.9
Coverage threshold                  	0.8

===================================================
=== Update the new sequences with the old keys ====
===================================================
===================================================
====== Filter out the new from old sequences ======
===================================================
===================================================
======= Extract representative sequences ==========
===================================================
result2repseq /ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/genes_db /ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db /ebio/abt3_scratch/nyoungblut/Struo2_122419461619/cluster_update/7316799743718053916/OLDDB.repSeq

Invalid database read for database data file=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db, database index=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db.index
Invalid database read for database data file=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db, database index=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db.index

[... a lot of output ...]

Size of data: 363542
Requested offset: 412570
Requested offset: 399738
Requested offset: 367758
Requested offset: 408364
Requested offset: 386682
Requested offset: 393723
Requested offset: 403458
Requested offset: 381782
Requested offset: 413970
Requested offset: 406964
Requested offset: 398528
Requested offset: 367053
Requested offset: 415370
Error: result2repseq died
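
Every "Requested offset" in the log above is larger than the reported "Size of data" (363542), so the reads point past the end of the data file. One thing that can be checked independently is whether the `.index` file itself is consistent with the data file size. A minimal Python sketch, assuming the `.index` file has tab-separated key, offset and length columns; `entries_past_end` is a hypothetical helper, not an MMseqs2 function:

```python
def entries_past_end(index_text, data_size):
    """Return (key, offset, length) tuples whose span exceeds data_size bytes.

    Assumption: each index line is "key<TAB>offset<TAB>length".
    """
    bad = []
    for line in index_text.splitlines():
        if not line.strip():
            continue
        key, offset, length = line.split("\t")[:3]
        offset, length = int(offset), int(length)
        if offset + length > data_size:
            bad.append((key, offset, length))
    return bad
```

The function takes the text of the `.index` file and the size of the data file in bytes (for example from `os.path.getsize`). An empty result would suggest the index on disk is consistent and the bad offsets come from elsewhere.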

conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py38h32f6830_0    conda-forge
fasta-splitter            0.2.6                         0    bioconda
gawk                      5.1.0                h516909a_0    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
ld_impl_linux-64          2.34                 h53a641e_7    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libidn2                   2.3.0                h516909a_0    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libopenblas               0.3.10          pthreads_hb3c22a3_3    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libunistring              0.9.10               h14c3975_0    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
mmseqs2                   11.e1a1c             h2d02072_0    bioconda
ncurses                   6.2                  he1b5a44_1    conda-forge
numpy                     1.19.0           py38h8854b6b_0    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
perl-constant             1.33                    pl526_1    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-file-util            4.161950                pl526_3    bioconda
perl-lib                  0.63                    pl526_1    bioconda
pigz                      2.3.4                hed695b0_1    conda-forge
pip                       20.1.1                     py_1    conda-forge
prodigal                  2.6.3                h516909a_2    bioconda
python                    3.8.4           cpython_h425cb1d_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
seqkit                    0.13.2                        0    bioconda
setuptools                49.2.0           py38h32f6830_0    conda-forge
sqlite                    3.32.3               hcee41ef_1    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
vsearch                   2.15.0               h2d02072_0    bioconda
wget                      1.20.1               h22169c7_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

milot-mirdita · 2020-07-22T14:40:14Z

Dealing with deleted sequences is currently still broken. I had begun working on it, but didn't have time to finish up the work.

nick-youngblut · 2020-07-22T14:45:29Z

Thanks for the quick update! FYI: there doesn't seem to be any documentation about the differences between concatdbs and mergedbs

milot-mirdita · 2020-08-31T22:55:01Z

This took some time, but dealing with deleted sequences should hopefully work correctly now.

ApollineBruley changed the title ~~Generation of tsv file after updating clusters~~ Clusterupdate : clustering of deleted sequences and conversion to tsv file Feb 12, 2020

martin-steinegger assigned milot-mirdita May 8, 2020

milot-mirdita closed this as completed in b5a0883 Aug 31, 2020
