persistent result2flat died segmentation fault in many fastas #617

Closed
xonq opened this issue Oct 6, 2022 · 3 comments

xonq commented Oct 6, 2022

Expected Behavior

easy-cluster should cluster a FASTA input without crashing. I use easy-cluster rather than linclust because linclust sometimes removes important sequences.

Current Behavior

More than 50% of the tested FASTA files fail with a result2flat error:

<BASEDIR>/tmp//13463384132153814128/easycluster.sh: line 48: 36628 Segmentation fault      (core dumped) "$MMSEQS" result2flat "${TMP_PATH}/input" "${TMP_PATH}/input" "${TMP_PATH}/clu_seqs" "${TMP_PATH}/all_seqs.fasta" ${VERBOSITY_PAR}
Error: result2flat died

Steps to Reproduce (for bugs)

Cluster a FASTA file (link) with easy-cluster via a Python subprocess. Full paths are replaced with <BASE_DIR> in the log.
In this specific case:
easy-cluster <BASE_DIR>/cormil2.1_9109.fa <BASE_DIR>/working/cormil2.1_9109_c0.4_v0.65 <BASE_DIR>/working/tmp/ --min-seq-id 0.65 --threads 1 --compressed 1 --cov-mode 0 -c 0.4 -e 0.1 -s 7.5

MMseqs Output (for bugs)

Error log

Context

I'm running a pipeline that calls easy-cluster to truncate large FASTA files for phylogenetic reconstruction. More than 50% of these runs fail. I don't want to use linclust because I've observed that it occasionally drops important sequences from clusters.

Your Environment

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): MMseqs2 Version: 13.45111
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Compiled from miniconda via Bioconda channel
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU: 12 cores requested, Intel® Xeon® 'Cascade Lake'/'Skylake' RAM: 192GB/node, 57.6GB reserved for job. Supports AVX512, but "you must set the correct compiler flags to take advantage of it"
  • Operating system and version: Red Hat Enterprise Linux Server 7.9, Kernel 3.10.0-1160.71.1.el7.x86_64

@milot-mirdita (Member)

I found out what's wrong. It is a speed optimization gone wrong. The TL;DR is that your input FASTA file should end with a newline.

Why this is happening:

When a FASTA file is not in multi-line format, e.g.:

>1
A...\NEWLINE
G

and all entries are instead in single-line format (>1\NEWLINEA...G), we just symlink the FASTA file instead of creating a whole new database (thus potentially saving a lot of disk space for large inputs).

Without this optimization, we always ensure that there is a newline character at the end of every sequence. With the optimization, we skip that step, which breaks some other assumptions in the code.
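
To find out which inputs are likely to be affected, something like the following Python sketch could be run over the FASTA files before clustering. It only mirrors the description above (single-line entries, no final newline); the actual check inside MMseqs2 may differ, and both function names are hypothetical helpers, not part of MMseqs2.

```python
def ends_with_newline(path):
    """Return True if the last byte of the file is a newline (empty files give False)."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # jump to the end of the file
        if handle.tell() == 0:
            return False
        handle.seek(-1, 2)  # step back one byte
        return handle.read(1) == b"\n"


def is_single_line_fasta(path):
    """Return True if every record has exactly one sequence line.

    Assumption: this approximates the "single line format" described above;
    it is not MMseqs2's own detection logic.
    """
    seq_lines = None  # sequence lines seen in the current record
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            if line.startswith(">"):
                if seq_lines is not None and seq_lines != 1:
                    return False
                seq_lines = 0
            else:
                if seq_lines is None:
                    return False  # sequence data before any header
                seq_lines += 1
    return seq_lines == 1
```

A file for which `is_single_line_fasta` is true and `ends_with_newline` is false matches the case described in this comment.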

We'll try to figure out a fix. Until then, please make sure that your files end with a newline, or call easy-cluster with --createdb-mode 0.
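
Since the pipeline in this issue calls easy-cluster from Python, both workarounds can be sketched there. This is a minimal sketch, not an official fix: `ensure_trailing_newline` and `easy_cluster_command` are hypothetical helper names, and the executable name `mmseqs` is an assumption about how the binary is installed. Only the `--createdb-mode 0` flag comes from the comment above.

```python
def ensure_trailing_newline(path):
    """Append a newline to a non-empty file that lacks one.

    Returns True if the file was modified, False otherwise.
    """
    with open(path, "rb+") as handle:
        handle.seek(0, 2)  # jump to the end of the file
        if handle.tell() == 0:
            return False  # leave empty files untouched
        handle.seek(-1, 2)
        if handle.read(1) == b"\n":
            return False  # already ends with a newline
        handle.seek(0, 2)  # reposition explicitly before writing
        handle.write(b"\n")
        return True


def easy_cluster_command(fasta, out_prefix, tmp_dir, extra_args=(), mmseqs="mmseqs"):
    """Build an easy-cluster argument list that adds --createdb-mode 0.

    Assumption: `mmseqs` is the executable name on PATH; adjust as needed.
    """
    return [
        mmseqs, "easy-cluster",
        str(fasta), str(out_prefix), str(tmp_dir),
        "--createdb-mode", "0",
        *extra_args,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Either workaround alone should be enough according to the comment above; patching the file modifies the input in place, so the flag may be preferable when inputs are read-only.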

milot-mirdita added a commit that referenced this issue Oct 12, 2022
@milot-mirdita (Member)

Should be fixed in the newest release.


xonq commented Oct 13, 2022

thanks!
