seqtk mergepe produces invalid FASTQ output on empty input sequences #109
Comments
I tested this on the latest release,
seqtk doesn't and won't work with zero-length records.
@lh3, OK, this is your right as the author of the software. However I just want to point out two things for the record:
Since seqtk is open source, we can freely modify the code, and a pull request is welcome if the change helps. I don't write C/C++, but the seqtk source code is easy to read. Adding one line in front of https://github.com/lh3/seqtk/blob/master/seqtk.c#L1412 can easily solve this by discarding read pairs in which either read has a zero-length sequence:
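The idea of dropping pairs with an empty mate can be illustrated outside seqtk. The following is only a Python sketch of that filtering step, not the one-line C change referred to above; the helper names `parse_fastq` and `merge_pairs_skip_empty` are hypothetical, and it assumes strict four-line FASTQ input with both files listing mates in the same order.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (name without '@', sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Parse strict four-line FASTQ records; zero-length sequences are allowed."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate stray blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual


def merge_pairs_skip_empty(r1: Iterable[str], r2: Iterable[str]) -> Iterator[str]:
    """Interleave two FASTQ streams, dropping any pair with an empty mate."""
    for a, b in zip(parse_fastq(r1), parse_fastq(r2)):
        if not a[1] or not b[1]:
            continue  # either read has zero length: discard the whole pair
        for name, seq, qual in (a, b):
            yield f"@{name}\n{seq}\n+\n{qual}\n"
```

Note that this approach throws away the non-empty mate as well, so it trades data loss for well-formed output.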
@tskir zero-length records coming directly from the Illumina pipeline are extremely rare. This is also mostly Illumina's issue. I didn't say zero-length records are "invalid". Seqtk actually parsed the record, though it output it as FASTA. Such output is still compatible with seqtk/bwa or other tools that use the kseq.h parser, as long as they work with 0-length sequences.
@shenwei356 thanks a lot for the suggestion, but this discards data and does not address the same issue in other parts of seqtk. It is in theory possible to implement a more proper fix with some significant changes. However, given that this is a rare issue, I am not convinced that the effort will pay off. Generally, I fix issues that may affect a lot of users. I don't have the bandwidth to take care of every corner case.
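One way a record can be parsed correctly yet written as FASTA is sketched below. This is an assumption about the general mechanism, not a description of seqtk's actual code: a writer that infers the output format from whether the quality string is non-empty cannot tell a zero-length FASTQ record from a FASTA record. Both function names are hypothetical.

```python
def format_record(name: str, seq: str, qual: str) -> str:
    """Writer that guesses the format from the quality string (assumed behavior)."""
    if qual:
        return f"@{name}\n{seq}\n+\n{qual}\n"
    # An empty quality string is indistinguishable from "no quality at all",
    # so a zero-length FASTQ record falls through to FASTA output.
    return f">{name}\n{seq}\n"


def format_record_explicit(name: str, seq: str, qual: str, is_fastq: bool) -> str:
    """Writer that is told the input format instead of inferring it."""
    if is_fastq:
        return f"@{name}\n{seq}\n+\n{qual}\n"
    return f">{name}\n{seq}\n"
```

Carrying an explicit format flag through the writer is the kind of broader change that would touch every output path, which matches the point that a proper fix needs more than one line.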
When one of the sequences to merge is empty (has zero length), mergepe produces output which isn't a valid FASTQ file. Example:
1.fq
2.fq
Output of seqtk mergepe 1.fq 2.fq
Note how @D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT from the second file was transformed into >D00723:158:CAJ17ANXX:6:1103:3362:29333 2:N:0:GAGATTCC+CCTATCCT in the output.
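To check merged output for this kind of problem, a small validator can help. The sketch below assumes strict four-line FASTQ (no wrapped sequence lines) and uses a hypothetical helper name, `check_fastq`; it reports any record whose header does not start with '@', which is how a FASTA-style '>' header in the middle of the stream would show up.

```python
from typing import List


def check_fastq(text: str) -> List[str]:
    """Return a list of problems found in strict four-line FASTQ text."""
    lines = text.splitlines()
    problems: List[str] = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            problems.append(f"line {i + 1}: expected '@' header, found {lines[i]!r}")
            i += 1  # resynchronize one line at a time
            continue
        if i + 3 >= len(lines):
            problems.append(f"line {i + 1}: truncated record")
            break
        seq, plus, qual = lines[i + 1 : i + 4]
        if not plus.startswith("+"):
            problems.append(f"line {i + 3}: expected '+' separator, found {plus!r}")
        if len(seq) != len(qual):
            problems.append(f"line {i + 4}: quality length differs from sequence length")
        i += 4
    return problems
```

A zero-length record written as four lines (header, empty sequence, '+', empty quality) passes this check, while the same record written with a '>' header does not.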