Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: .hap file format IO #43

Merged
merged 71 commits into from
May 14, 2022
Merged

feat: .hap file format IO #43

merged 71 commits into from
May 14, 2022

Conversation

aryarm
Copy link
Member

@aryarm aryarm commented Apr 22, 2022

closes #25; see also #38
note: please merge this branch after #45 !

Overview

This PR adds support for reading and writing files that follow the .hap file format specification. This functionality is primarily handled by a new Haplotypes class.
Because it implements the Data abstract class, the Haplotypes class contains a method read() for reading the contents of the file into a data property of the class. There is also a write() method for writing the contents of the data property to the file.
Lines from the file are parsed into instances of the new Haplotype and Variant classes within the data module. These classes can be extended (subclassed) to support "extra" fields appended to the ends of each line.

Usage and docs

I've added documentation for the .hap format specification here in a new section of the docs called "File Formats".

Usage of the Haplotypes class is documented in the API docs here. Examples showing how to sub-class the Haplotype and Variant classes are also in the API docs here and here, respectively.

As a concrete example, I've also started extending the Haplotype class to support the "local ancestry" and "effect size" fields within the haptools/haplotype.py module.

Details

The PR itself is quite large, so I'm going to try to break it up here and describe the relevant bits:

The Haplotypes class

Reading a file
Parsing a basic .hap file without any extra fields is as simple as it gets:

haplotypes = Haplotypes.load('tests/data/basic.hap')
haplotypes.data # returns a dictionary of Haplotype objects

The load() method initializes an instance of the Haplotypes class and calls the Haplotypes.read() method, but if the .hap file contains extra fields, you'll need to call the read() method manually. You'll also need to create Haplotype and Variant subclasses that support the extra fields and then specify the names of the classes when you initialize the Haplotypes object:

haplotypes = Haplotypes('tests/data/basic.hap', Haplotype, Variant)
haplotypes.read()
haplotypes.data # returns a dictionary of Haplotype objects

Both the load() and read() methods support region and haplotypes parameters that allow you to request a specific region or set of haplotype IDs to read from the file. These parameters only work if the file is indexed, since in that case, the read() method can take advantage of the indexing to parse the file a bit faster. Otherwise, if the file isn't indexed, the read() method will assume the file could be unsorted and simply reads each line one-by-one. Although I haven't tested it yet, streams like stdin should be supported by this case.
Iterating over a file
If you're worried that the contents of the .hap file will be large, you may opt to parse the file line-by-line instead of loading it all into memory at once. In cases like these, you can use the __iter__() method in a for-loop:

haplotypes = Haplotypes('tests/data/basic.hap')
for line in haplotypes:
    print(line)

You'll have to call __iter()__ manually if you want to specify any function parameters:

haplotypes = Haplotypes('tests/data/basic.hap')
for line in haplotypes.__iter__(region='21:26928472-26941960', haplotypes={"chr21.q.3365*1"}):
    print(line)

Writing a file
To write to a .hap file, you must first initialize a Haplotypes object and then fill out the data property:

haplotypes = Haplotypes('tests/data/basic.hap')
haplotypes.data = {'H1': Haplotype('chr1', 0, 10, 'H1')}
haplotypes.write()

After discussing with @gymreklab, I've started considering a write_line() method that would allow for writing a file line-by-line but I haven't thought of a clean way of doing this that could be compatible with the data module as a whole. (See both #44 and my comment about the Genotypes class accepting TextIO objects in #19 for more details).

The Haplotype class

The Haplotype class stores haplotype lines from the .hap file. Each property in the object is a field in the line. A separate variants property stores a tuple of Variant objects belonging to this haplotype.
The Haplotypes class will initialize Haplotype objects in its read() and __iter__() methods. It uses a few methods within the Haplotype class for this:

  1. Haplotype.from_hap_spec() - this static method initializes a Haplotype object from a line in the .hap file.
  2. Haplotype.to_hap_spec() - this method converts a Haplotype object into a line in the .hap file
    To read "extra" fields from a .hap file, one needs only extend (sub-class) the base Haplotype class and add the extra properties.

The Variant class

The Variant class stores variant lines from the .hap file. Each property in the object is a field in the line.
The Haplotypes class will initialize Variant objects in its read() and __iter__() methods. It uses a few methods within the Variant class for this:

  1. Variant.from_hap_spec() - this static method initializes a Variant object from a line in the .hap file.
  2. Variant.to_hap_spec() - this method converts a Variant object into a line in the .hap file
    To read "extra" fields from a .hap file, one needs only extend (sub-class) the base Variant class and add the extra properties.

Changes to the example data in tests/

  1. I moved the ID field so that it is positioned after the first three fields of each line. That way, sorting the file is as simple as sorting the first four fields: sort -k1,4
  2. I added header lines to each .hap file
  3. I added a few non-indexed files so we can make sure we're reading those correctly, too
  4. I added a few smaller example files, since those were easier to test
  5. I added betas to the files that had ancestry info
  6. I renamed some of the files to have shorter names

Other miscellanea

Most of these are just small formatting things or changes to the documentation. For example, I changed the "Execution" section of the docs into a "File formats" section, and I added a few more files from the data module (covariates.py and haplotypes.py) to the API docs. I also renamed the iterate() method of each class in the data module to __iter__() to be consistent with the standards for that.

Testing

I added the following tests to a new TestHaplotypes class in tests/test_data.py:

  1. test_load()
    Try the Haplotypes.load() method on a really basic file. This will also test the basic read functionality.
  2. test_read_subset()
    Try the Haplotypes.read() method with the region and/or haplotypes parameters specified.
  3. test_read_extras()
    Try to read a .hap file that has extras similar to those that would be needed for simphenotype
  4. test_read_extras_large()
    Try to read a larger file with extras.
  5. test_write()
    Try to write a .hap file.
  6. test_write_extras()
    Try to write a .hap file that has extras similar to those that would be needed for simphenotype.

Future work

Next, I hope to integrate this code into the existing classes for the simphenotype subcommand.

I also want to add a few subcommands to support the new file format:

  1. I think it would be convenient for users if there was an index command that simply wraps the sort, bgzip, and tabix commands that people would normally use to index the file. That way, they won't have to remember what the commands are.
    In order to implement this, we may have to define __lt__() methods for the Variant and Haplotype classes so they can be sorted like this:
    haplotypes.data = dict(sorted(haplotypes.data.items(), key=lambda item: item[1]))
    
    and
    haplotype.variant = tuple(sorted(haplotype.variant))
    
  2. It would be useful to have a validate command that simply validates the .hap file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.

aryarm added 30 commits April 13, 2022 18:21
still need to account for extra fields and fix Haplotypes.write
still need to account for extra fields and fix Haplotypes.write
so that the Haplotype class can be immutable and hashable
@aryarm aryarm mentioned this pull request May 11, 2022
@aryarm
Copy link
Member Author

aryarm commented May 11, 2022

ok, so I think this PR should be ready to go soon! I just want to merge #45 first.

In the meantime, please let me know if y'all have any other comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can we merge the happler and haptools format specs for haplotypes?
2 participants