feat: `.hap` file format IO #43

aryarm · 2022-04-22T19:00:41Z

closes #25; see also #38
Note: please merge this branch after #45!

Overview

This PR adds support for reading and writing files that follow the .hap file format specification. This functionality is primarily handled by a new Haplotypes class.
Because it implements the Data abstract class, the Haplotypes class contains a method read() for reading the contents of the file into a data property of the class. There is also a write() method for writing the contents of the data property to the file.
Lines from the file are parsed into instances of the new Haplotype and Variant classes within the data module. These classes can be extended (subclassed) to support "extra" fields appended to the ends of each line.
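For readers unfamiliar with the pattern, here is a minimal sketch of how an abstract base class with read(), write(), and a load() convenience constructor might fit together. The `Data` class below is a hypothetical stand-in written for this example, and `Lines` is a toy subclass; neither is the actual code in this PR.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Data(ABC):
    """Hypothetical stand-in for an abstract base class like the one described above."""

    def __init__(self, fname):
        self.fname = Path(fname)
        self.data = None  # filled in by read()

    @classmethod
    def load(cls, fname):
        """Create an instance and read the whole file in one step."""
        obj = cls(fname)
        obj.read()
        return obj

    @abstractmethod
    def read(self):
        """Read the file's contents into self.data."""

    @abstractmethod
    def write(self):
        """Write the contents of self.data back to the file."""


class Lines(Data):
    """Toy subclass: stores each non-empty line of a text file."""

    def read(self):
        with open(self.fname) as f:
            self.data = [line.rstrip("\n") for line in f if line.strip()]

    def write(self):
        with open(self.fname, "w") as f:
            f.writelines(line + "\n" for line in self.data)
```

Because load() is defined once on the base class, every subclass gets a one-line way to read a file, while still being free to expose extra parameters on its own read().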

Usage and docs

I've added documentation for the .hap format specification here in a new section of the docs called "File Formats".

Usage of the Haplotypes class is documented in the API docs here. Examples showing how to sub-class the Haplotype and Variant classes are also in the API docs here and here, respectively.

As a concrete example, I've also started extending the Haplotype class to support the "local ancestry" and "effect size" fields within the haptools/haplotype.py module.

Details

The PR itself is quite large, so I'm going to try to break it up here and describe the relevant bits:

The `Haplotypes` class

Reading a file
Parsing a basic .hap file without any extra fields is as simple as it gets:

haplotypes = Haplotypes.load('tests/data/basic.hap')
haplotypes.data # returns a dictionary of Haplotype objects

The load() method initializes an instance of the Haplotypes class and calls its read() method for you. If the .hap file contains extra fields, you'll need to initialize the object and call read() yourself. You'll also need to create Haplotype and Variant subclasses that support the extra fields and then pass those classes when you initialize the Haplotypes object:

haplotypes = Haplotypes('tests/data/basic.hap', Haplotype, Variant)
haplotypes.read()
haplotypes.data # returns a dictionary of Haplotype objects

Both the load() and read() methods support region and haplotypes parameters that allow you to request a specific region or set of haplotype IDs to read from the file. These parameters only work if the file is indexed, since in that case the read() method can take advantage of the index to parse the file a bit faster. Otherwise, if the file isn't indexed, the read() method assumes the file could be unsorted and simply reads each line one by one. Although I haven't tested it yet, streams like stdin should be supported in this case.
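To make the meaning of the two filters concrete, here is a small sketch of what they select. It is purely illustrative: it scans in-memory tuples rather than using an index, the helper names (`Region`, `parse_region`, `overlaps`, `select`) are hypothetical, and the closed-interval overlap rule is an assumption rather than something the specification is known to require.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(region: str) -> Region:
    """Split a 'chrom:start-end' string into its three parts."""
    chrom, _, coords = region.rpartition(":")
    start, _, end = coords.partition("-")
    return Region(chrom, int(start), int(end))


def overlaps(region: Region, chrom: str, start: int, end: int) -> bool:
    """Assumed rule: same chromosome and intersecting closed intervals."""
    return chrom == region.chrom and start <= region.end and end >= region.start


def select(records, region=None, haplotypes=None):
    """Yield (chrom, start, end, hap_id) tuples that pass both optional filters."""
    reg = parse_region(region) if region else None
    for chrom, start, end, hap_id in records:
        if reg is not None and not overlaps(reg, chrom, start, end):
            continue
        if haplotypes is not None and hap_id not in haplotypes:
            continue
        yield chrom, start, end, hap_id
```

An indexed file lets the reader jump straight to the requested region instead of testing every record as this loop does.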
Iterating over a file
If you're worried that the contents of the .hap file will be large, you may opt to parse the file line-by-line instead of loading it all into memory at once. In cases like these, you can use the __iter__() method in a for-loop:

haplotypes = Haplotypes('tests/data/basic.hap')
for line in haplotypes:
    print(line)

You'll have to call __iter__() manually if you want to specify any function parameters:

haplotypes = Haplotypes('tests/data/basic.hap')
for line in haplotypes.__iter__(region='21:26928472-26941960', haplotypes={"chr21.q.3365*1"}):
    print(line)
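The reason for the explicit call is that Python's for-loop invokes __iter__() with no arguments. The toy class below (hypothetical, unrelated to the Haplotypes implementation) shows the same pattern with an optional filter parameter:

```python
class Numbers:
    """Toy container whose __iter__ accepts an optional filter."""

    def __init__(self, values):
        self.values = list(values)

    def __iter__(self, minimum=None):
        # Written as a generator, so values are produced one at a time.
        for value in self.values:
            if minimum is None or value >= minimum:
                yield value


nums = Numbers([1, 5, 10])
print(list(nums))                      # implicit call: no arguments, prints [1, 5, 10]
print(list(nums.__iter__(minimum=5)))  # explicit call with a parameter, prints [5, 10]
```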

Writing a file
To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property:

haplotypes = Haplotypes('tests/data/basic.hap')
haplotypes.data = {'H1': Haplotype('chr1', 0, 10, 'H1')}
haplotypes.write()

After discussing with @gymreklab, I've started considering a write_line() method that would allow for writing a file line-by-line, but I haven't thought of a clean way of doing this that would be compatible with the data module as a whole. (See both #44 and my comment about the Genotypes class accepting TextIO objects in #19 for more details.)

The `Haplotype` class

The Haplotype class stores haplotype lines from the .hap file. Each property in the object is a field in the line. A separate variants property stores a tuple of Variant objects belonging to this haplotype.
The Haplotypes class will initialize Haplotype objects in its read() and __iter__() methods. It uses a few methods within the Haplotype class for this:

Haplotype.from_hap_spec() - this static method initializes a Haplotype object from a line in the .hap file.
Haplotype.to_hap_spec() - this method converts a Haplotype object into a line in the .hap file.

To read "extra" fields from a .hap file, you only need to extend (subclass) the base Haplotype class and add the extra properties.
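As a rough illustration of the subclassing idea, the sketch below uses frozen dataclasses so that fields declared on a subclass are parsed automatically. Everything here is an assumption made for the example: the tab-separated layout with a leading line-type field, the field order, the `AncestryHaplotype` name, and its `ancestry` and `beta` fields. It also uses a classmethod where the PR describes a static method, and it omits the variants property for brevity.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Haplotype:
    """Hypothetical base record: chrom, start, end, id."""

    chrom: str
    start: int
    end: int
    id: str

    @classmethod
    def from_hap_spec(cls, line: str):
        """Build an instance from a tab-separated line, skipping the line-type field."""
        values = line.rstrip("\n").split("\t")[1:]
        # Convert each value using the annotated type of the matching field,
        # so fields declared by subclasses are picked up automatically.
        return cls(*(f.type(v) for f, v in zip(fields(cls), values)))

    def to_hap_spec(self) -> str:
        """Serialize the fields back into a tab-separated line."""
        values = (str(getattr(self, f.name)) for f in fields(self))
        return "\t".join(("H", *values))


@dataclass(frozen=True)
class AncestryHaplotype(Haplotype):
    """Subclass that appends two extra fields to the end of each line."""

    ancestry: str
    beta: float


line = "H\t21\t100\t200\thap1\tYRI\t0.5"
hap = AncestryHaplotype.from_hap_spec(line)
print(hap.beta)  # the extra field is parsed as a float
```

Freezing the dataclass also makes instances immutable and hashable, which matches the motivation given in one of the commits on this branch.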

The `Variant` class

The Variant class stores variant lines from the .hap file. Each property in the object is a field in the line.
The Haplotypes class will initialize Variant objects in its read() and __iter__() methods. It uses a few methods within the Variant class for this:

Variant.from_hap_spec() - this static method initializes a Variant object from a line in the .hap file.
Variant.to_hap_spec() - this method converts a Variant object into a line in the .hap file.

To read "extra" fields from a .hap file, you only need to extend (subclass) the base Variant class and add the extra properties.

Changes to the example data in `tests/`

I moved the ID field so that it is positioned after the first three fields of each line. That way, sorting the file is as simple as sorting the first four fields: sort -k1,4
I added header lines to each .hap file
I added a few non-indexed files so we can make sure we're reading those correctly, too
I added a few smaller example files, since those were easier to test
I added betas to the files that had ancestry info
I renamed some of the files to have shorter names

Other miscellanea

Most of these are just small formatting things or changes to the documentation. For example, I changed the "Execution" section of the docs into a "File formats" section, and I added a few more files from the data module (covariates.py and haplotypes.py) to the API docs. I also renamed the iterate() method of each class in the data module to __iter__() so that the classes follow Python's iterator protocol and can be used directly in for-loops.

Testing

I added the following tests to a new TestHaplotypes class in tests/test_data.py:

test_load()
Try the Haplotypes.load() method on a really basic file. This will also test the basic read functionality.
test_read_subset()
Try the Haplotypes.read() method with the region and/or haplotypes parameters specified.
test_read_extras()
Try to read a .hap file that has extras similar to those that would be needed for simphenotype.
test_read_extras_large()
Try to read a larger file with extras.
test_write()
Try to write a .hap file.
test_write_extras()
Try to write a .hap file that has extras similar to those that would be needed for simphenotype.

Future work

Next, I hope to integrate this code into the existing classes for the simphenotype subcommand.

I also want to add a few subcommands to support the new file format:

I think it would be convenient for users if there were an index command that simply wraps the sort, bgzip, and tabix commands that people would normally use to index the file. That way, they won't have to remember what the commands are.
In order to implement this, we may have to define __lt__() methods for the Variant and Haplotype classes so they can be sorted like this:
```
haplotypes.data = dict(sorted(haplotypes.data.items(), key=lambda item: item[1]))
```
and
```
haplotype.variants = tuple(sorted(haplotype.variants))
```
It would be useful to have a validate command that simply validates the .hap file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.
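The sorting idea mentioned above for the index command could look something like the following sketch. The `Haplotype` class here is a hypothetical stand-in, and the ordering (genomic position first, then ID) is an assumption about what a sensible __lt__() would compare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Haplotype:
    """Hypothetical stand-in with just the fields needed for ordering."""

    chrom: str
    start: int
    end: int
    id: str

    def __lt__(self, other):
        # Order by genomic position first, then by ID to break ties
        # (an assumed ordering, not one taken from the specification).
        return (self.chrom, self.start, self.end, self.id) < (
            other.chrom, other.start, other.end, other.id
        )


data = {
    "H2": Haplotype("chr1", 50, 60, "H2"),
    "H1": Haplotype("chr1", 0, 10, "H1"),
}
# sorted() only needs __lt__ to compare the dictionary's values.
data = dict(sorted(data.items(), key=lambda item: item[1]))
print(list(data))  # prints ['H1', 'H2']
```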


aryarm · 2022-05-11T02:49:38Z

ok, so I think this PR should be ready to go soon! I just want to merge #45 first.

In the meantime, please let me know if y'all have any other comments!

feat: `transform` subcommand

aryarm added 30 commits April 13, 2022 18:21

copy variant module from happler

dc3c3ce

start on work for haplotype parser

9a0fa20

continue implementing Haplotypes.read and Haplotypes.iterate methods

1391b0c

still need to account for extra fields and fix Haplotypes.write

create Haplotype and Variant classes for storing lines from .haps files

c51467b

see #25 (comment)

create specific section in docs for file formats

c8428fa

fix issues with commands not appearing in toc of docs

2015215

add docs for .hap haplotypes file format

1879e04

see #25 and #38

copy variant module from happler

0a2af60

start on work for haplotype parser

7c8d182

continue implementing Haplotypes.read and Haplotypes.iterate methods

99071f8

still need to account for extra fields and fix Haplotypes.write

create Haplotype and Variant classes for storing lines from .haps files

c5500fe

see #25 (comment)

create specific section in docs for file formats

32bd815

fix issues with commands not appearing in toc of docs

600e032

add docs for .hap haplotypes file format

8cb274a

see #25 and #38

Merge branch 'feat/haplotypes' of github.com:gymrek-lab/haptools into…

dc63ed2

… feat/haplotypes

rename hap data files

8f856b0

create new example hap files with beta added

91856b4

change allele to str in hap format spec

5aa0deb

correct type-hinting of return of Haplotypes.iterate

a62a03b

use fname property in Haplotypes.write

a784e6b

start handling extras in Haplotypes class

cf82d4f

store variants as tuple intead of list in Haplotype class

555deba

so that the Haplotype class can be immutable and hashable

rewrite from_hap_spec to automatically use properties from subclasses

ec69ae7

define new haplotype class for haptools

0eff78d

check header lines in Haplotypes.read

5ba8f78

add docs for usage of the .hap file

54a0617

fmt with black

4f7c7fa

rebuild api docs with haplotypes.py

b196736

add examples for Haplotypes class

3e2a426

validate that all extras are there in Haplotypes.check_ex_header

7e86aaf

aryarm added 12 commits May 9, 2022 17:04

retest genotypes module after changes

daaeadf

create transform subcommand

eb641c0

create TestGenotypes class in testing module

95a1619

test variant selection in Genotypes class

2e33f4a

refmt with black

e54143a

create Data.unset() to check if data is unset

60bda2b

add variants param to Genotypes.load()

56ea690

output from a file path in transform subcommand

e830119

create Genotypes class that also stores REF/ALT

c1b55ff

create Haplotype.transform function

db74659

create Haplotypes.transform function and add tests

b72d1d3

write Haplotypes to a VCF

e084ea8

aryarm mentioned this pull request May 11, 2022

feat: transform subcommand #45

Merged

aryarm added 11 commits May 11, 2022 10:54

refmt with black and get rid of HaplotypesGT class

2c1dc3c

clean up transform docs

9e83254

warn against importing at the top of __main__

6bad9d8

clean up duplicated code in Genotypes class

4384cb8

add Genotypes._prephased attr to ignore phasing while debugging

259aaee

allow for discarding samples that are missing genotypes

e72f2d3

add more docs and messages to Genotypes and Haplotypes classes

13c06e7

require GenotypeRefAlt instance as input to Haplotypes.transform

1410315

refmt with black

8ccb7d2

prelim code for other gts readers

75e75be

Merge pull request #45 from gymrek-lab/feat/transform

34a839d

feat: `transform` subcommand

aryarm merged commit 9b4393a into main May 14, 2022

aryarm deleted the feat/haplotypes branch May 14, 2022 05:24

This was referenced Jun 10, 2022

feat: a validate subcommand to check whether a .hap file is valid #47

Open

feat: an index subcommand to sort and index a .hap file #48

Closed

aryarm mentioned this pull request Jun 17, 2022

documentation for the data module #50

Closed
