Clusterupdate : clustering of deleted sequences and conversion to tsv file #272

Closed
ApollineBruley opened this issue Feb 10, 2020 · 9 comments

ApollineBruley commented Feb 10, 2020

Expected Behavior

I want to update my clusters after a database update (in which I add new sequences but also delete sequences compared to the old database).
The clusterupdate command works, but when I try to convert the cluster database to a tsv file, I get an error message related to the index (see below).

I tried the same thing on a new database where I just added sequences and it worked perfectly, so I assume the problem comes from the fact that I remove sequences from the old database?

Current Behavior

Error when trying to generate the tsv file.
In the cluster database obtained after clusterupdate ('CLU_updated') the removed sequences still appear, but they are absent from the updated sequence database ('DB_updated').

Steps to Reproduce (for bugs)

  1. Creation of old DB (oldDB.fa : 17 amino acid sequences)
    mmseqs createdb oldDB.fa DB_old

  2. Clustering of old DB
    mmseqs cluster DB_old CLU_old tmp

  3. Creation of new DB (newDB.fa : 13 sequences are identical to those in the old DB, 4 were removed, 4 were added)
    mmseqs createdb newDB.fa DB_new

  4. Cluster update
    mmseqs clusterupdate DB_old DB_new CLU_old DB_updated CLU_updated tmp
    No error there, but even though the sequences with numeric identifiers 12, 11, 16 and 15 in the old DB have been removed, they still appear in the CLU_updated file. They do not appear in the DB_updated files.

  5. Conversion of the cluster DB to tsv:
    mmseqs createtsv DB_updated DB_updated CLU_updated clusters.tsv
    => Error message, generation of empty files : clusters.tsv.1 ... clusters.tsv.7 and clusters.tsv.index.1 ... clusters.tsv.index.7
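The mismatch described in steps 4 and 5 can be checked without running createtsv. Below is a minimal sketch, not part of MMseqs2, under two assumptions: that each `.index` file holds one `key<TAB>offset<TAB>length` line per entry, and that each cluster entry is a NUL-terminated list of member keys, one per line. The function names `read_index_keys` and `stale_members` are hypothetical.

```python
from typing import Dict, List, Set


def read_index_keys(index_text: str) -> Set[int]:
    """Collect entry keys from index text (assumed layout: key<TAB>offset<TAB>length)."""
    keys = set()
    for line in index_text.splitlines():
        if line.strip():
            keys.add(int(line.split("\t")[0]))
    return keys


def stale_members(cluster_index_text: str, cluster_data: bytes,
                  seq_keys: Set[int]) -> Dict[int, List[int]]:
    """Map each representative key to its member keys that are missing from seq_keys."""
    stale = {}
    for line in cluster_index_text.splitlines():
        if not line.strip():
            continue
        rep, offset, length = (int(x) for x in line.split("\t")[:3])
        # Assumed entry layout: member keys separated by newlines, ending in a NUL byte.
        entry = cluster_data[offset:offset + length].rstrip(b"\0")
        members = [int(m) for m in entry.decode().split()]
        missing = [m for m in members if m not in seq_keys]
        if missing:
            stale[rep] = missing
    return stale


# Toy data: the sequence DB holds keys 0, 1, 2; key 3 was deleted
# but is still listed as a member of the cluster represented by key 0.
seq_keys = read_index_keys("0\t0\t5\n1\t5\t5\n2\t10\t5\n")
cluster_data = b"0\n1\n3\n\0" + b"2\n\0"
cluster_index = "0\t0\t7\n2\t7\t3\n"
print(stale_members(cluster_index, cluster_data, seq_keys))
```

On real files, the index text would come from `DB_updated.index` and `CLU_updated.index`, and the data bytes from `CLU_updated`; an uncompressed, single-file database is assumed.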

MMseqs Output (for bugs)

Program call:
createtsv DB_updated DB_updated CLU_updated clusters.tsv 

MMseqs Version:                  	2f66ae897fc813450fa5ef0c78123bd3c41c4717
first sequence as respresentative	false
Target column                    	1
Add Full Header                  	false
Database Output                  	false
Threads                          	8
Compressed                       	0
Verbosity                        	3

Query database: DB_updated
Touch data file DB_updated_h ... Done.
Result database: CLU_updated
Start writing to clusters.tsv
Invalid database read for database data file=DB_updated_h, database index=DB_updated_h.index
getData: local id (4294967295) >= db size (17)
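The local id in the last line, 4294967295, equals 2^32 - 1, the largest unsigned 32-bit value. A plausible reading, which is an assumption and not confirmed in this thread, is that the key lookup returns this value as a "not found" marker and the bounds check then rejects it. The toy model below illustrates that reading; `get_id`, `get_data` and the key list are hypothetical stand-ins, not MMseqs2 code.

```python
import bisect

UINT_MAX = 2**32 - 1  # 4294967295, i.e. (unsigned int)-1


def get_id(sorted_keys, key):
    """Return the position of key in sorted_keys, or UINT_MAX if it is absent (assumed behaviour)."""
    pos = bisect.bisect_left(sorted_keys, key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return UINT_MAX


def get_data(entries, local_id):
    """Fetch an entry by position, rejecting positions beyond the database size."""
    if local_id >= len(entries):
        raise IndexError(f"getData: local id ({local_id}) >= db size ({len(entries)})")
    return entries[local_id]


# Hypothetical key set: 17 entries, with keys 11, 12, 15 and 16 missing.
keys = [k for k in range(21) if k not in (11, 12, 15, 16)]
entries = [f"header_{k}" for k in keys]

try:
    get_data(entries, get_id(keys, 12))  # key 12 is referenced but no longer present
except IndexError as err:
    message = str(err)
    print(message)
```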

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

  • Git commit used: 2f66ae8
  • Which MMseqs version was used: Compilation from source
  • Cmake versions used: cmake version 3.5.1
  • Operating system and version: Ubuntu 16.04 LTS

Thank you in advance for your help :)

@ApollineBruley changed the title from "Generation of tsv file after updating clusters" to "Clusterupdate : clustering of deleted sequences and conversion to tsv file" on Feb 12, 2020
@milot-mirdita
Member

Sorry for the delay. Would you mind uploading the two FASTA files (with 17 sequences each)?

I started refactoring the code and think I know what's wrong.

@ApollineBruley
Author

Here are the FASTA files.

Thanks for your help

fastas.zip

@milot-mirdita
Member

Thanks, I can reproduce the issue. I'll have to investigate what's going wrong.

Meanwhile, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your postal address to milot at mirdita de.

@milot-mirdita
Member

I've found out that we are not dealing well with deleted sequences.
Their presence in the clustering is resulting in the error you are seeing. I am refactoring some code, but it turns out to be a bit more work.

@jessicaparks

@milot-mirdita I'm running into this same issue. Any update on progress?

@nick-youngblut

I appear to be getting a similar error:

$ mmseqs clusterupdate --min-seq-id 0.9 -c 0.8 \
    ../tests/output_n10/genes/cluster/genes_db \
    /ebio/abt3_scratch/nyoungblut/user_genes/genes_db \
    ../tests/output_n10/genes/cluster/clusters_db \
    /ebio/abt3_scratch/nyoungblut/cluster_updated/genes_db \
    /ebio/abt3_scratch/nyoungblut/cluster_updated/clusters_db.0 \
    /ebio/abt3_scratch/nyoungblut/cluster_update
clusterupdate --min-seq-id 0.9 -c 0.8 ../tests/output_n10/genes/cluster/genes_db /ebio/abt3_scratch/nyoungblut/user_genes/genes_db ../tests/output_n10/genes/cluster/clusters_db /ebio/abt3_scratch/nyoungblut/cluster_updated/genes_db /ebio/abt3_scratch/nyoungblut/cluster_updated/clusters_db.0 /ebio/abt3_scratch/nyoungblut/cluster_update

MMseqs Version:                     	11.e1a1c
Seq. id. threshold                  	0.9
Coverage threshold                  	0.8

===================================================
=== Update the new sequences with the old keys ====
===================================================
===================================================
====== Filter out the new from old sequences ======
===================================================
===================================================
======= Extract representative sequences ==========
===================================================
result2repseq /ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/genes_db /ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db /ebio/abt3_scratch/nyoungblut/Struo2_122419461619/cluster_update/7316799743718053916/OLDDB.repSeq

Invalid database read for database data file=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db, database index=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db.index
Invalid database read for database data file=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db, database index=/ebio/abt3_projects/software/dev/Struo2/tests/output_n10/genes/cluster/clusters_db.index

[... a lot of output ...]

Size of data: 363542
Requested offset: 412570
Requested offset: 399738
Requested offset: 367758
Requested offset: 408364
Requested offset: 386682
Requested offset: 393723
Requested offset: 403458
Requested offset: 381782
Requested offset: 413970
Requested offset: 406964
Requested offset: 398528
Requested offset: 367053
Requested offset: 415370
Error: result2repseq died
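Every requested offset in the output above is larger than the reported data size (363542), which suggests index entries pointing past the end of the data file. The sketch below is one way to check for that condition; it is an illustration and not an MMseqs2 tool, it assumes the `key<TAB>offset<TAB>length` index layout, and `out_of_range_entries` is a hypothetical helper name. It does not identify why the index and data disagree.

```python
from typing import List, Tuple


def out_of_range_entries(index_text: str, data_size: int) -> List[Tuple[int, int, int]]:
    """Return (key, offset, length) for index entries that extend beyond data_size bytes."""
    bad = []
    for line in index_text.splitlines():
        if not line.strip():
            continue
        key, offset, length = (int(x) for x in line.split("\t")[:3])
        if offset + length > data_size:
            bad.append((key, offset, length))
    return bad


# Toy index: the key and length values are made up; the offset 412570 and
# the data size 363542 are taken from the output above.
toy_index = "1\t0\t10\n2\t412570\t12\n"
print(out_of_range_entries(toy_index, 363542))
```

For a real database, `data_size` could be obtained with `os.path.getsize` on the data file and `index_text` read from the matching `.index` file.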

conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py38h32f6830_0    conda-forge
fasta-splitter            0.2.6                         0    bioconda
gawk                      5.1.0                h516909a_0    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
ld_impl_linux-64          2.34                 h53a641e_7    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libidn2                   2.3.0                h516909a_0    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libopenblas               0.3.10          pthreads_hb3c22a3_3    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
libunistring              0.9.10               h14c3975_0    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
mmseqs2                   11.e1a1c             h2d02072_0    bioconda
ncurses                   6.2                  he1b5a44_1    conda-forge
numpy                     1.19.0           py38h8854b6b_0    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
perl-constant             1.33                    pl526_1    bioconda
perl-exporter             5.72                    pl526_1    bioconda
perl-file-util            4.161950                pl526_3    bioconda
perl-lib                  0.63                    pl526_1    bioconda
pigz                      2.3.4                hed695b0_1    conda-forge
pip                       20.1.1                     py_1    conda-forge
prodigal                  2.6.3                h516909a_2    bioconda
python                    3.8.4           cpython_h425cb1d_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
seqkit                    0.13.2                        0    bioconda
setuptools                49.2.0           py38h32f6830_0    conda-forge
sqlite                    3.32.3               hcee41ef_1    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
vsearch                   2.15.0               h2d02072_0    bioconda
wget                      1.20.1               h22169c7_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

@milot-mirdita
Member

Dealing with deleted sequences is currently still broken. I had begun working on it, but didn't have time to finish up the work.

@nick-youngblut

Thanks for the quick update! FYI: there doesn't seem to be any documentation about the differences between concatdbs and mergedbs.

@milot-mirdita
Member

This took some time, but dealing with deleted sequences should hopefully work correctly now.
