feat: do not require sorting `.hap` lines by line type #208

aryarm · 2023-04-14T06:02:20Z

Background

The first column of a .hap file stores the type of each line (e.g. #, H, or V). Until now, we've always told our users to sort the lines in a .hap file by their line type (followed by the chrom, start, and end fields), even though tabix only looks at the chrom, start, and end fields and ignores the line type field. This was done to ensure that header lines in the .hap file stayed at the top of the file during sorting. The command for sorting was

sort -k1,4

The problem

Unfortunately, that requirement has led to larger problems. For example, we've wanted to add tandem repeats to .hap files for a while (see #164). But when we tried to add tandem repeats as a new line type, it made our .hap files un-indexable. Consider this .hap file, sorted by line type as we previously required, where a repeat is represented with an R line type:

H	chr1	1000	1001	H1
H	chr1	1010	1011	H2
R	chr1	1005	1009	STR_1
V	H1	1000	1001	rs1	A
V	H1	1010	1011	rs2	G

Currently, tabix would complain that this file is not properly sorted because STR_1 (start 1005) follows H2 (start 1010) in the file. If we switch those two lines, then the third field of the file (i.e. the start position) becomes properly sorted for tabix.
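To make the ordering problem concrete, here is a minimal Python sketch of a position-order check. It is only an approximation of what an indexer needs (records on the same sequence in non-decreasing start order), not tabix's actual implementation, and the function name first_out_of_order is hypothetical.

```python
def first_out_of_order(lines):
    """Return the 0-based index of the first record whose start position is
    smaller than the previous record's on the same sequence, else None.

    Assumes tab-separated fields: line type, sequence name, start, end, ...
    """
    prev_seq, prev_start = None, None
    for idx, line in enumerate(lines):
        if line.startswith("#"):
            continue  # header/comment lines carry no coordinates
        fields = line.rstrip("\n").split("\t")
        seq, start = fields[1], int(fields[2])
        if seq == prev_seq and start < prev_start:
            return idx
        prev_seq, prev_start = seq, start
    return None


# The example file from above, sorted by line type
by_line_type = [
    "H\tchr1\t1000\t1001\tH1",
    "H\tchr1\t1010\t1011\tH2",
    "R\tchr1\t1005\t1009\tSTR_1",
    "V\tH1\t1000\t1001\trs1\tA",
    "V\tH1\t1010\t1011\trs2\tG",
]
# The same file with the H2 and STR_1 lines switched
by_position = [by_line_type[0], by_line_type[2], by_line_type[1]] + by_line_type[3:]

print(first_out_of_order(by_line_type))  # 2 (the STR_1 line)
print(first_out_of_order(by_position))   # None
```

The check ignores the first field entirely, which mirrors the point made above: only the sequence name and coordinates matter for indexing.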

In essence, our sorting requirements have made it impossible for us to add new line types! And the flexibility of adding new line types was one of the first reasons we created this file format, so needless to say, this was all pretty disappointing.

The solution

I've updated the sorting command in the documentation so that it no longer sorts by the line type field, and I've removed the requirement that lines in a .hap file be sorted by line type. Unfortunately, I had to use some ugly awk magic to get the header lines to still appear at the beginning of the file. Here's the new sorting command that we recommend:

awk '$0 ~ /^#/ {print; next} {print | "sort -k2,4"}'
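For readers who find the awk one-liner hard to parse, the same idea can be sketched in Python: pass lines starting with # through first, then sort everything else on fields 2 through 4 while ignoring the line type. This is an illustration only; sort_hap_lines is a hypothetical helper, not part of our code. Note one deliberate difference, which is an assumption on my part: the sketch compares start and end as integers, whereas a plain sort -k2,4 compares the key as text.

```python
def sort_hap_lines(lines):
    """Keep '#' lines first (in their original order), then sort the
    remaining lines by fields 2-4, ignoring the line type in field 1."""
    headers = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if not ln.startswith("#")]

    def key(line):
        fields = line.rstrip("\n").split("\t")
        # sequence name as text; start and end as integers (an assumption)
        return fields[1], int(fields[2]), int(fields[3])

    return headers + sorted(records, key=key)


unsorted_lines = [
    "V\tH1\t1010\t1011\trs2\tG",
    "# a header line",
    "H\tchr1\t1010\t1011\tH2",
    "R\tchr1\t1005\t1009\tSTR_1",
    "H\tchr1\t1000\t1001\tH1",
    "V\tH1\t1000\t1001\trs1\tA",
]

for line in sort_hap_lines(unsorted_lines):
    print(line)
# # a header line
# V	H1	1000	1001	rs1	A
# V	H1	1010	1011	rs2	G
# H	chr1	1000	1001	H1
# R	chr1	1005	1009	STR_1
# H	chr1	1010	1011	H2
```

Python compares strings by code point, so H1 sorts before chr1 here; a locale-aware sort may order the two sequence names differently. Either way, the R line now sits between the two H lines, and each sequence's records are contiguous and in position order.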

Testing

This change shouldn't affect much of our code. Most of it already made no assumption that the input .hap file was sorted by line type, because we needed to be able to handle unsorted .hap files. It's only when our code takes advantage of the .hap file's index that this becomes a concern. That's why I added a test to ensure that our code still works for indexed files that aren't sorted by line type.

aryarm · 2023-04-14T06:03:46Z

I'm marking this as a draft pull request because a warning message appears when I run the new test that I added.

[E::get_intv] Failed to parse TBX_GENERIC, was wrong -p [type] used?
The offending line was: "# this comment should be ignored"

I'm not sure how to resolve this yet.

This seems to be some sort of bug in bgzip, because it moves the comment line to the end of the file, which causes problems for tabix.

aryarm · 2023-04-14T15:43:00Z

@mlamkin7 does all of this look good to you? I think your .hap files with repeat lines should be tabix indexable now! 🤞

You might first need to update the sort() and to_str() methods of the Haplotypes class to ensure that the R and H lines get interleaved properly.

Update (after 6b3f6e7)
I refactored some of the code in the Haplotypes class to make it easier to add new line types to the class. There is a new type_ids property that provides the indices of each line type within the data property. (It's a dictionary mapping strings to lists.) Whenever we reference Haplotype objects stored within the class, we now reference them first by accessing them through the type_ids object. (And a new index() method ensures that this type_ids property is set correctly.) This way, we can easily add 'R' lines to this dictionary later on.
In the future, we should make sure to document that an H line can never have the same ID as an R line, but an H (or R) line can have the same ID as a V line.
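As a rough illustration of the type_ids idea, here is a minimal, self-contained Python sketch. It is not the real Haplotypes class: HapIndexSketch and the tuple layout of its records are hypothetical, and the only things taken from the description above are a data mapping keyed by ID, a type_ids dictionary mapping each line type to a list, and an index() method that rebuilds it.

```python
from collections import defaultdict


class HapIndexSketch:
    """Hypothetical stand-in for the Haplotypes class; not the real API."""

    def __init__(self):
        # record ID -> (line type, chrom, start, end); layout is an assumption
        self.data = {}
        # line type -> list of record IDs; rebuilt by index()
        self.type_ids = {}

    def index(self):
        """Rebuild type_ids from data. Call this after editing data."""
        type_ids = defaultdict(list)
        for rec_id, record in self.data.items():
            type_ids[record[0]].append(rec_id)
        self.type_ids = dict(type_ids)


haps = HapIndexSketch()
haps.data = {
    "H1": ("H", "chr1", 1000, 1001),
    "H2": ("H", "chr1", 1010, 1011),
    "STR_1": ("R", "chr1", 1005, 1009),
}
haps.index()
print(haps.type_ids)  # {'H': ['H1', 'H2'], 'R': ['STR_1']}

# Look up only the haplotype records by going through type_ids first
print([haps.data[i] for i in haps.type_ids["H"]])
```

Because data is keyed by ID in this sketch, an H record and an R record with the same ID would overwrite each other, which is one way to see why the two line types would need distinct IDs.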

aryarm added 5 commits April 13, 2023 16:22

remove the requirement that .hap files be sorted by their first field

0e5b64f

ensure header lines are kept at the beginning of the file

96ee23c

combine sort and bgzip lines

0a9157e

add test for unordered line type field

008186b

handle potentially unordered line type field

30eb8d2

remove comment line from test file to silence warning

cf4126f

this seems to be some sort of bug in bgzip b/c it moves the comment line to the end of the file, which causes problems for tabix

aryarm requested a review from mlamkin7 April 14, 2023 15:41

aryarm marked this pull request as ready for review April 14, 2023 15:42

aryarm added 3 commits April 14, 2023 11:36

index haplotypes classes via a mapping between line type and ID

bc5e5ba

copy edits to the Haplotypes class over to HaplotypesAncestry

5f805ee

reindex Haplotypes after making edits to data in the ld command

6b3f6e7

mlamkin7 merged commit f221397 into main Apr 14, 2023

mlamkin7 deleted the feat/hap-file-order branch April 14, 2023 18:41

github-actions bot mentioned this pull request Apr 14, 2023

chore(main): release 0.3.0 #202

Merged

aryarm mentioned this pull request Apr 17, 2023

support for a TR-based Haplotype class #164

Closed

aryarm added the refactor label Apr 17, 2023
