Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqtk mergepe produces invalid FASTQ output on empty input sequences #109

Closed
tskir opened this issue Mar 14, 2018 · 5 comments
Closed

seqtk mergepe produces invalid FASTQ output on empty input sequences #109

tskir opened this issue Mar 14, 2018 · 5 comments

Comments

@tskir
Copy link

tskir commented Mar 14, 2018

When one of the sequences to merge is empty (has zero length), mergepe produces output which isn't a valid FASTQ file. Example:

1.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 1:N:0:GAGATTCC+CCTATCCT
GTGGGCACCTGTCATCCCAGCTACTTGGGAGGCTGAGTCAGGAGAATCACTGGAACCTAGGAGTTGGAGTTTGTAGTGAGCCAAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAATGAGAC
+
</<<B<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<FBFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 1:N:0:GAGATTCC+CCTATCCT
GGAAGGGGAAAGACAAGAAAGAGGGAGACGGAGAGAGAGAGAGAAGGAAAAGGAAGGATGACAAGACAGGAAGACAGAGAAAAGAAAAAGACTGAAGGAGGGAAGGAAGGAAAGAAGGGAAGG
+
B/<<BFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFBFFFFFFFFF/FFFFBFFFFFFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3606:29320 1:N:0:GAGATTCC+CCTATCCT
GGCCCACTCTGTCTGGCTCCCAACTTCCCTTTGAGGCTTCCCGGTCCTGGGGTAACAGCTGGCTTAGGGGAGCCAGTCGGGGCTGCCTTCCAGGCTTCCAGGTCCACTTTAAGGAGGGATGGGG
+
B<BBBFFFFFFFFFFFFFFFFBFFFFFFFFFFBB/<BFFFFFFFBFFFFFFFBFFF//FFFFFFFFFF/BFFFBFFFFBFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF7/B<<FFFFFB

2.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 2:N:0:GAGATTCC+CCTATCCT
GTTGTGCAGCCCTCAGCGCCACCCACCACCAGAACTCTCCCCACCTTTGACGCACGTGTGCAGAACTGAAGGCCACGCTGGAATTGTAGCCCACGTGCCACACTGCTGGCTGTGTCATGGATATC
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFF/FFBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFF<B//B<FFFFFFFFFFFFFFFFBFFFFF/BB/F/B/B7/<FFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT

+

@D00723:158:CAJ17ANXX:6:1103:3606:29320 2:N:0:GAGATTCC+CCTATCCT
GAGGACCAGCAGCTGGACATCCAGGTCATGGCAGAGGCCAGAGAGTCCTGGGACCTGGGCCTACAGGAGCAGGAGGGGCGGTACACACCTCTGCCCTTGGGCGGGAACAAGGAGCAAGCCATCT
+
/BBBBFFFFFFFFFFBFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFF<FFFFFBFFFFBFFFFFFFFFF<FFBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFBBFFFFFFFF<FFFFFFB/

Output of seqtk mergepe 1.fq 2.fq

@D00723:158:CAJ17ANXX:6:1103:3343:29327 1:N:0:GAGATTCC+CCTATCCT
GTGGGCACCTGTCATCCCAGCTACTTGGGAGGCTGAGTCAGGAGAATCACTGGAACCTAGGAGTTGGAGTTTGTAGTGAGCCAAGATCGCACCACTGCACTCCAGCCTGGGCGACAGAATGAGAC
+
</<<B<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFF<FBFFFFF<
@D00723:158:CAJ17ANXX:6:1103:3343:29327 2:N:0:GAGATTCC+CCTATCCT
GTTGTGCAGCCCTCAGCGCCACCCACCACCAGAACTCTCCCCACCTTTGACGCACGTGTGCAGAACTGAAGGCCACGCTGGAATTGTAGCCCACGTGCCACACTGCTGGCTGTGTCATGGATATC
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFF/FFBFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFF<B//B<FFFFFFFFFFFFFFFFBFFFFF/BB/F/B/B7/<FFFFF<
@D00723:158:CAJ17ANXX:6:1103:3362:29333 1:N:0:GAGATTCC+CCTATCCT
GGAAGGGGAAAGACAAGAAAGAGGGAGACGGAGAGAGAGAGAGAAGGAAAAGGAAGGATGACAAGACAGGAAGACAGAGAAAAGAAAAAGACTGAAGGAGGGAAGGAAGGAAAGAAGGGAAGG
+
B/<<BFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFBFFFFFFFFF/FFFFBFFFFFFFFFF<
>D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT

@D00723:158:CAJ17ANXX:6:1103:3606:29320 1:N:0:GAGATTCC+CCTATCCT
GGCCCACTCTGTCTGGCTCCCAACTTCCCTTTGAGGCTTCCCGGTCCTGGGGTAACAGCTGGCTTAGGGGAGCCAGTCGGGGCTGCCTTCCAGGCTTCCAGGTCCACTTTAAGGAGGGATGGGG
+
B<BBBFFFFFFFFFFFFFFFFBFFFFFFFFFFBB/<BFFFFFFFBFFFFFFFBFFF//FFFFFFFFFF/BFFFBFFFFBFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF7/B<<FFFFFB
@D00723:158:CAJ17ANXX:6:1103:3606:29320 2:N:0:GAGATTCC+CCTATCCT
GAGGACCAGCAGCTGGACATCCAGGTCATGGCAGAGGCCAGAGAGTCCTGGGACCTGGGCCTACAGGAGCAGGAGGGGCGGTACACACCTCTGCCCTTGGGCGGGAACAAGGAGCAAGCCATCT
+
/BBBBFFFFFFFFFFBFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFF<FFFFFBFFFFBFFFFFFFFFF<FFBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFBBFFFFFFFF<FFFFFFB/

Note how @D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT from the second file was transformed into >D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT in the output.

@tskir
Copy link
Author

tskir commented Mar 14, 2018

I tested this on the latest release, seqtk-1.2-r94, but the latest master appears to be affected by this as well.

@lh3
Copy link
Owner

lh3 commented Mar 14, 2018

seqtk doesn't and won't work with zero-length records.

@lh3 lh3 closed this as completed Mar 14, 2018
@tskir
Copy link
Author

tskir commented Mar 15, 2018

@lh3, OK, this is your right as the author of the software. However I just want to point out two things for the record:

  1. I did not tamper with the reads, they were taken directly out of an Illumina HiSeq (not sure which model, but one of the older ones) output after internal demultiplexing. Which means that seqtk doesn't support some real-world FASTQ files.
  2. When encountering zero-length records, seqtk doesn't crash or warn of what it considers to be invalid input, it just quietly produces invalid output which will break any downstream processing. This might not be ideal.

@shenwei356
Copy link

shenwei356 commented Mar 15, 2018

Since seqtk is open source, we can freely modify the code. If the change helps, a pull request is welcome to contribute it.

I don't write C/C++, but the seqtk source code is easy to read. Adding one line in front of https://github.com/lh3/seqtk/blob/master/seqtk.c#L1412 can easily solve this , by discarding reads with zero-length seq in one pair:

if (seq[0]->seq.l == 0 || seq[1]->seq.l == 0) continue; 

@lh3
Copy link
Owner

lh3 commented Mar 15, 2018

@tskir zero-length records directly from Illumina pipeline are extremely rare. This is also mostly Illumina's issue. I didn't say zero-length records are "invalid". Seqtk actually parsed the record, though outputted it as FASTA. Such output is still compatible with seqtk/bwa or other tools that use the kseq.h parser, as long as they work with 0-length sequences.

@shenwei356 thanks a lot for the suggestion, but this discards data and does not address the same issue in other part of seqtk. It is in theory possible to implement a more proper fix with some significant changes. However, given that this is a rare issue, I am not convinced that the efforts will be paid off. Generally, I fix issues that may affect a lot of users. I don't have the bandwidth to take care of every corner cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants