Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: do not require sorting .hap lines by line type #208

Merged
merged 9 commits into from
Apr 14, 2023

Conversation

aryarm
Copy link
Member

@aryarm aryarm commented Apr 14, 2023

Background

The first column of a .hap file stores the type of each line (ex: #, H, or V). Up until now, we've always told our users to sort lines in a .hap files by their line type (followed by the chrom, start, and end fields) -- even though tabix would only look at the chrom, start, and end fields and ignore the line type field. This was done to ensure that header lines in the .hap file were kept at the top of the file during sorting. The command for sorting was

sort -k1,4

The problem

Unfortunately, that requirement has led to larger problems. For example, we've wanted to add tandem repeats to .hap files for a while (see #164). But when we tried to add tandem repeats as a new line type, it made our .hap files un-indexable. Consider this sorted .hap file, where a repeat is represented with an R line type:

H	chr1	1000	1001	H1
H	chr1	1010	1011	H2
R	chr1	1005	1009	STR_1
V	H1	1000	1001	rs1	A
V	H1	1010	1011	rs2	G

Currently, tabix would complain that this file is not properly sorted because STR_1 follows H2 in the file. If we switch those two lines, then the third field of the file (ie the start position) becomes properly sorted for tabix.

In essence, our sorting requirements have made it impossible for us to add new line types! And the flexibility of adding new line types was one of the first reasons we created this file format, so needless to say, this was all pretty disappointing.

The solution

I've updated the sorting command in the documentation so that it no longer sorts by the line type field, and I've removed the requirement that lines in a .hap file be sorted by line type. Unfortunately, I had to use some ugly awk magic to get the header lines to still appear at the beginning of the file. Here's the new sorting command that we recommend:

awk '$0 ~ /^#/ {print; next} {print | "sort -k2,4"}'

Testing

This change shouldn't really affect much of our code. Most of our code already didn't assume that the input .hap file was sorted by line type because we needed to be able to handle unsorted .hap files. It's only when our code tries to take advantage of indexing of the .hap file that this becomes a concern. That's why I added a test to ensure that our code still works for indexed files that aren't sorted by line type.

@aryarm
Copy link
Member Author

aryarm commented Apr 14, 2023

I'm marking this as a draft pull request because a warning message appears when I run the new test that I added.

[E::get_intv] Failed to parse TBX_GENERIC, was wrong -p [type] used?
The offending line was: "# this comment should be ignored"

I'm not sure how to resolve this yet.

this seems to be some sort of bug in bgzip b/c it moves the comment line to the end of the file, which causes problems for tabix
@aryarm aryarm requested a review from mlamkin7 April 14, 2023 15:41
@aryarm aryarm marked this pull request as ready for review April 14, 2023 15:42
@aryarm
Copy link
Member Author

aryarm commented Apr 14, 2023

@mlamkin7 does all of this look good to you? I think your .hap files with repeat lines should be tabix indexable now! 🤞

you might first need to update the sort() and to_str() methods of the Haplotypes class to ensure that the R and H lines get interleaved properly

Update (after 6b3f6e7)
I refactored some of the code in the Haplotypes class to make it easier to add new line types to the class. There is a new type_ids property that provides the indices of each line type within the data property. (It's a dictionary mapping strings to lists.) Whenever we reference Haplotype objects stored within the class, we now reference them first by accessing them through the type_ids object. (And a new index() method ensures that this type_ids property is set correctly.) This way, we can easily add 'R' lines to this dictionary later on.
In the future, we should make sure to document that an H line can never have the same ID as an R line, but an H (or R) line can have the same ID as a V line.

@mlamkin7 mlamkin7 merged commit f221397 into main Apr 14, 2023
@mlamkin7 mlamkin7 deleted the feat/hap-file-order branch April 14, 2023 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants