feat: do not require sorting .hap lines by line type #208
I'm marking this as a draft pull request because a warning message appears when I run the new test that I added. I'm not sure how to resolve this yet.
This seems to be some sort of bug in bgzip because it moves the comment line to the end of the file, which causes problems for tabix.
@mlamkin7 does all of this look good to you? I think you might first need to update the
Update (after 6b3f6e7)
Background
The first column of a .hap file stores the type of each line (ex: #, H, or V). Up until now, we've always told our users to sort lines in a .hap file by their line type (followed by the chrom, start, and end fields), even though tabix would only look at the chrom, start, and end fields and ignore the line type field. This was done to ensure that header lines in the .hap file were kept at the top of the file during sorting. The command for sorting was
The problem
Unfortunately, that requirement has led to larger problems. For example, we've wanted to add tandem repeats to .hap files for a while (see #164). But when we tried to add tandem repeats as a new line type, it made our .hap files un-indexable. Consider this sorted .hap file, where a repeat is represented with an R line type:
Currently, tabix would complain that this file is not properly sorted because STR_1 follows H2 in the file. If we switch those two lines, then the third field of the file (i.e. the start position) becomes properly sorted for tabix.
In essence, our sorting requirements have made it impossible for us to add new line types! And the flexibility of adding new line types was one of the first reasons we created this file format, so needless to say, this was all pretty disappointing.
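To make the conflict concrete, here is a minimal Python sketch. It is not code from this PR. The three data lines are hypothetical: only the IDs H2 and STR_1 come from the description above, while H1, the coordinates, and the helper names `position_key` and `is_position_sorted` are invented for illustration. It assumes tab-separated lines whose first four fields are type, chrom, start, and end.

```python
# Hypothetical .hap data lines (tab-separated): type, chrom, start, end, ID.
# The coordinates and the H1 line are invented for this illustration.
LINES = [
    "H\t1\t100\t200\tH1",
    "H\t1\t500\t600\tH2",
    "R\t1\t300\t400\tSTR_1",
]


def position_key(line):
    """Return (chrom, start, end) for a line, ignoring the line type field."""
    _line_type, chrom, start, end = line.split("\t")[:4]
    return (chrom, int(start), int(end))


def is_position_sorted(lines):
    """True if the lines are in non-decreasing (chrom, start, end) order."""
    keys = [position_key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Old requirement: sort by line type first, then by position.
by_type = sorted(LINES, key=lambda line: (line.split("\t")[0], position_key(line)))
# New requirement: sort by position only.
by_position = sorted(LINES, key=position_key)

print(is_position_sorted(by_type))      # False: STR_1 (start 300) follows H2 (start 500)
print(is_position_sorted(by_position))  # True
```

With the line type as the leading sort key, every R line lands after every H line, so the start positions can go backwards as soon as a repeat sits between two haplotypes.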
The solution
I've updated the sorting command in the documentation so that it no longer sorts by the line type field, and I've removed the requirement that lines in a .hap file be sorted by line type. Unfortunately, I had to use some ugly awk magic to get the header lines to still appear at the beginning of the file. Here's the new sorting command that we recommend:
awk '$0 ~ /^#/ {print; next} {print | "sort -k2,4"}'
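For readers less familiar with awk, the idea behind the command above can be sketched in Python: lines starting with # are passed through first in their original order, and everything else is sorted on fields 2 through 4. This is only an illustration, not code from this PR. The function name `sort_hap_lines` and the example lines are hypothetical, and the sketch assumes tab-separated fields with integer start and end values, which it compares numerically; the awk command leaves the comparison to `sort -k2,4`.

```python
def sort_hap_lines(lines):
    """Return header lines first (original order), then the rest sorted by fields 2-4."""
    headers = [line for line in lines if line.startswith("#")]
    data = [line for line in lines if not line.startswith("#")]

    def key(line):
        # Assumption: fields are tab-separated and fields 3 and 4 are integers.
        fields = line.split("\t")
        return (fields[1], int(fields[2]), int(fields[3]))

    return headers + sorted(data, key=key)


# Hypothetical, deliberately unsorted input.
example = [
    "H\t1\t500\t600\tH2",
    "# a hypothetical header line",
    "R\t1\t300\t400\tSTR_1",
    "H\t1\t100\t200\tH1",
]

for line in sort_hap_lines(example):
    print(line)
```

The header line comes out first, followed by H1, STR_1, and H2, regardless of their line types.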
Testing
This change shouldn't really affect much of our code. Most of our code already didn't assume that the input .hap file was sorted by line type because we needed to be able to handle unsorted .hap files. It's only when our code tries to take advantage of indexing of the .hap file that this becomes a concern. That's why I added a test to ensure that our code still works for indexed files that aren't sorted by line type.