persistent result2flat died segmentation fault in many fastas #617

xonq · 2022-10-06T15:47:10Z

Expected Behavior

Cluster a FASTA input using easy-cluster, because linclust sometimes removes important sequences.

Current Behavior

More than 50% of the tested FASTA files die with a result2flat error:

<BASEDIR>/tmp//13463384132153814128/easycluster.sh: line 48: 36628 Segmentation fault      (core dumped) "$MMSEQS" result2flat "${TMP_PATH}/input" "${TMP_PATH}/input" "${TMP_PATH}/clu_seqs" "${TMP_PATH}/all_seqs.fasta" ${VERBOSITY_PAR}
Error: result2flat died

Steps to Reproduce (for bugs)

Cluster a FASTA file (link) with easy-cluster via a Python subprocess. Full paths are changed to <BASE_DIR> in the log.
In this specific case:
easy-cluster <BASE_DIR>/cormil2.1_9109.fa <BASE_DIR>/working/cormil2.1_9109_c0.4_v0.65 <BASE_DIR>/working/tmp/ --min-seq-id 0.65 --threads 1 --compressed 1 --cov-mode 0 -c 0.4 -e 0.1 -s 7.5

MMseqs Output (for bugs)

Error log

Context

I'm running a pipeline that calls on easy-cluster to truncate large fastas for phylogenetic reconstruction. >50% of these runs fail with easy-cluster. I don't want to use linclust because I've observed that it throws out important sequences from clusters here and there.

Your Environment

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): MMseqs2 Version: 13.45111
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Compiled from miniconda via Bioconda channel
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU: 12 cores requested, Intel® Xeon® 'Cascade Lake'/'Skylake' RAM: 192GB/node, 57.6GB reserved for job. Supports AVX512, but "you must set the correct compiler flags to take advantage of it"
Operating system and version: Red Hat Enterprise Linux Server 7.9, Kernel 3.10.0-1160.71.1.el7.x86_64


milot-mirdita · 2022-10-11T08:51:17Z

I found out what's wrong: it is a speed optimization gone wrong. The TL;DR is that your input FASTA file should end with a newline.

Why this is happening:

When a FASTA file is not in multi-line format, where a sequence is split across several lines, e.g.:

>1
A...\NEWLINE
G

and every entry is instead in single-line format (>1\newlineA...G), we just symlink the FASTA file instead of creating a whole new database (thus potentially saving a lot of disk space for large inputs).

Without this optimization, we always ensure that there is a newline character at the end of every sequence. With the symlink we skip that step, which breaks some other assumptions in the code.

We'll try to figure out a fix. Until then, please make sure that your files end with a newline, or call easy-cluster with --createdb-mode 0.
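Since the pipeline in question drives easy-cluster from Python, the newline workaround can be sketched as a small pre-processing step. This is only an illustration, not part of MMseqs2: the helper names `ends_with_newline` and `ensure_trailing_newline` are made up for this example, and it assumes the input is an uncompressed, plain-text FASTA file that may be modified in place.

```python
import os


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            # Nothing to terminate in an empty file.
            return True
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"


def ensure_trailing_newline(path):
    """Append a newline if it is missing (hypothetical helper).

    Returns True if the file was modified, False if it was already fine.
    """
    if ends_with_newline(path):
        return False
    with open(path, "ab") as handle:
        handle.write(b"\n")
    return True
```

Calling `ensure_trailing_newline(fasta_path)` on each input before launching the easy-cluster subprocess should avoid the problematic case. If the inputs must not be modified, adding --createdb-mode 0 to the easy-cluster call is the alternative mentioned above.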


milot-mirdita · 2022-10-13T16:24:33Z

Should be fixed in the newest release.

xonq · 2022-10-13T18:35:19Z

thanks!

milot-mirdita added a commit that referenced this issue Oct 12, 2022

Fix FASTA input not ending with a newline resulting in invalid sequence db with --createdb-mode 1 (#617)

28b0088

milot-mirdita closed this as completed Oct 13, 2022

clb21565 mentioned this issue Oct 27, 2022

persisten segmentation fault error across different contig datasets #632

Closed
