feat: `.hap` file format IO #43
Merged
still need to account for extra fields and fix Haplotypes.write
so that the Haplotype class can be immutable and hashable
ok, so I think this PR should be ready to go soon! I just want to merge #45 first. In the meantime, please let me know if y'all have any other comments!
feat: `transform` subcommand
This was referenced Jun 10, 2022
closes #25; see also #38
note: please merge this branch after #45 !
Overview
This PR adds support for reading and writing files that follow the `.hap` file format specification. This functionality is primarily handled by a new `Haplotypes` class.
Because it implements the `Data` abstract class, the `Haplotypes` class contains a `read()` method for reading the contents of the file into a `data` property of the class. There is also a `write()` method for writing the contents of the `data` property to the file.
Lines from the file are parsed into instances of the new `Haplotype` and `Variant` classes within the `data` module. These classes can be extended (subclassed) to support "extra" fields appended to the ends of each line.
Usage and docs
I've added documentation for the `.hap` format specification in a new section of the docs called "File Formats".
Usage of the `Haplotypes` class is documented in the API docs, which also include examples showing how to subclass the `Haplotype` and `Variant` classes.
As a concrete example, I've also started extending the `Haplotype` class to support the "local ancestry" and "effect size" fields within the `haptools/haplotype.py` module.
Details
The PR itself is quite large, so I'm going to try to break it up here and describe the relevant bits:
The `Haplotypes` class
Reading a file
Parsing a basic `.hap` file without any extra fields is as simple as calling `Haplotypes.load()`.
The `load()` method initializes an instance of the `Haplotypes` class and calls the `Haplotypes.read()` method. If the `.hap` file contains extra fields, however, you'll need to call the `read()` method manually. You'll also need to create `Haplotype` and `Variant` subclasses that support the extra fields and then specify the names of the classes when you initialize the `Haplotypes` object.
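As a rough illustration of the difference between the two paths, here is a minimal, standard-library-only sketch. It is not the code from this PR: `SketchHaplotypes`, `SketchHaplotype`, `AncestryHaplotype`, the `haplotype=` keyword, and the tab-separated field layout are all hypothetical stand-ins chosen for the example.

```python
import os
import tempfile
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SketchHaplotype:
    """Hypothetical base record; the field layout is assumed, not taken from the spec."""
    chrom: str
    start: int
    end: int
    id: str

    @classmethod
    def from_line(cls, line):
        values = line.rstrip("\n").split("\t")[1:]  # skip the leading "H" tag
        # Convert each value with the type annotated on the matching field;
        # zip() stops at the shorter input, so undeclared columns are dropped.
        return cls(*(f.type(v) for f, v in zip(fields(cls), values)))


@dataclass(frozen=True)
class AncestryHaplotype(SketchHaplotype):
    """Subclass that declares one hypothetical extra field."""
    ancestry: str


class SketchHaplotypes:
    def __init__(self, fname, haplotype=SketchHaplotype):
        self.fname = fname
        self.haplotype = haplotype  # class used to parse each "H" line
        self.data = None

    def read(self):
        with open(self.fname) as hap_file:
            self.data = {
                hap.id: hap
                for hap in (
                    self.haplotype.from_line(line)
                    for line in hap_file
                    if line.startswith("H\t")
                )
            }

    @classmethod
    def load(cls, fname):
        # Convenience path: default record class, read immediately.
        haps = cls(fname)
        haps.read()
        return haps


# Demo: write a tiny file with one extra column, then read it both ways.
with tempfile.NamedTemporaryFile("w", suffix=".hap", delete=False) as tmp:
    tmp.write("H\t21\t100\t200\thap1\tYRI\n")
basic = SketchHaplotypes.load(tmp.name)  # default class: extra column is not parsed
custom = SketchHaplotypes(tmp.name, haplotype=AncestryHaplotype)
custom.read()  # custom class: extra column becomes `ancestry`
os.remove(tmp.name)
```

In this sketch the default class silently drops the extra column; whether the real `load()` does the same is not stated here, so treat that detail as an artifact of the example.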
`load()` and `read()` methods support `region` and `haplotypes` parameters that allow you to request a specific region or set of haplotype IDs to read from the file. These parameters only work if the file is indexed, since in that case the `read()` method can take advantage of the index to parse the file a bit faster. Otherwise, if the file isn't indexed, the `read()` method will assume the file could be unsorted and simply read each line one by one. Although I haven't tested it yet, streams like stdin should be supported in this case.
Iterating over a file
If you're worried that the contents of the `.hap` file will be large, you may opt to parse the file line by line instead of loading it all into memory at once. In cases like these, you can use the `__iter__()` method in a for-loop.
You'll have to call `__iter__()` manually if you want to specify any function parameters.
Writing a file
To write to a `.hap` file, you must first initialize a `Haplotypes` object and then fill out the `data` property before calling `write()`.
After discussing with @gymreklab, I've started considering a `write_line()` method that would allow for writing a file line by line, but I haven't thought of a clean way of doing this that would be compatible with the `data` module as a whole. (See both #44 and my comment about the `Genotypes` class accepting TextIO objects in #19 for more details.)
`Haplotype` class
The `Haplotype` class stores haplotype lines from the `.hap` file. Each property in the object is a field in the line. A separate `variants` property stores a tuple of `Variant` objects belonging to this haplotype.
The `Haplotypes` class will initialize `Haplotype` objects in its `read()` and `__iter__()` methods. It uses a few methods within the `Haplotype` class for this:
`Haplotype.from_hap_spec()` - this static method initializes a `Haplotype` object from a line in the `.hap` file.
`Haplotype.to_hap_spec()` - this method converts a `Haplotype` object into a line in the `.hap` file.
To read "extra" fields from a `.hap` file, one needs only extend (subclass) the base `Haplotype` class and add the extra properties.
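To make the subclassing idea concrete, here is a small sketch of an immutable, hashable record with a `from_hap_spec()`/`to_hap_spec()` round trip and a subclass that adds two extra fields. The class names, the field layout, and the `ancestry` and `beta` names are assumptions for illustration. The sketch also uses a classmethod rather than a static method so that a subclass constructs itself, and it leaves out the `variants` property for brevity.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SketchHaplotype:
    """Hypothetical base record: one dataclass field per column after the "H" tag."""
    chrom: str
    start: int
    end: int
    id: str

    @classmethod
    def from_hap_spec(cls, line):
        values = line.rstrip("\n").split("\t")[1:]  # skip the leading "H" tag
        # Convert each column with the type annotated on the matching field.
        return cls(*(f.type(v) for f, v in zip(fields(cls), values)))

    def to_hap_spec(self):
        # Emit the fields in declaration order, so subclasses need no override.
        return "\t".join(["H"] + [str(getattr(self, f.name)) for f in fields(self)])


@dataclass(frozen=True)
class EffectHaplotype(SketchHaplotype):
    """Adds two hypothetical extra columns at the end of the line."""
    ancestry: str
    beta: float


line = "H\t21\t100\t200\thap1\tYRI\t0.25"
hap = EffectHaplotype.from_hap_spec(line)
round_trip = hap.to_hap_spec()

# frozen=True (with the default eq=True) makes instances hashable.
lookup = {hap: "usable as a dict key"}
```

Driving both directions from `dataclasses.fields()` is one way to let a subclass support extra columns just by declaring them; the PR's actual mechanism may differ.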
`Variant` class
The `Variant` class stores variant lines from the `.hap` file. Each property in the object is a field in the line.
The `Haplotypes` class will initialize `Variant` objects in its `read()` and `__iter__()` methods. It uses a few methods within the `Variant` class for this:
`Variant.from_hap_spec()` - this static method initializes a `Variant` object from a line in the `.hap` file.
`Variant.to_hap_spec()` - this method converts a `Variant` object into a line in the `.hap` file.
To read "extra" fields from a `.hap` file, one needs only extend (subclass) the base `Variant` class and add the extra properties.
Changes to the example data in `tests/`
Sorted the example `.hap` file with `sort -k1,4`.
Other miscellanea
Most of these are just small formatting things or changes to the documentation. For example, I changed the "Execution" section of the docs into a "File formats" section, and I added a few more files from the `data` module (`covariates.py` and `haplotypes.py`) to the API docs. I also renamed the `iterate()` method of each class in the `data` module to `__iter__()` to be consistent with Python's standard iteration protocol.
Testing
I added the following tests to a new `TestHaplotypes` class in `tests/test_data.py`:
`test_load()`: Try the `Haplotypes.load()` method on a really basic file. This will also test the basic read functionality.
`test_read_subset()`: Try the `Haplotypes.read()` method with the `region` and/or `haplotypes` parameters specified.
`test_read_extras()`: Try to read a `.hap` file that has extras similar to those that would be needed for `simphenotype`.
`test_read_extras_large()`: Try to read a larger file with extras.
`test_write()`: Try to write a `.hap` file.
`test_write_extras()`: Try to write a `.hap` file that has extras similar to those that would be needed for `simphenotype`.
Future work
Next, I hope to integrate this code into the existing classes for the `simphenotype` subcommand.
I also want to add a few subcommands to support the new file format:
`index`: a command that simply wraps the `sort`, `bgzip`, and `tabix` commands that people would normally use to index the file. That way, they won't have to remember what the commands are. In order to implement this, we may have to define `__lt__()` methods for the `Variant` and `Haplotype` classes so they can be sorted.
`validate`: a command that simply validates the `.hap` file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.
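The `__lt__()` idea mentioned for the `index` command could look roughly like the sketch below. The class name and the sort key (chromosome, then start, end, and ID) are assumptions made for illustration. Note that this key compares positions numerically, which is not necessarily the same order that a text-based `sort` would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchHaplotype:
    """Hypothetical record used only to demonstrate ordering."""
    chrom: str
    start: int
    end: int
    id: str

    def _sort_key(self):
        # Assumed ordering: chromosome name, then coordinates, then ID.
        return (self.chrom, self.start, self.end, self.id)

    def __lt__(self, other):
        if not isinstance(other, SketchHaplotype):
            return NotImplemented
        return self._sort_key() < other._sort_key()


haps = [
    SketchHaplotype("21", 300, 400, "hap_c"),
    SketchHaplotype("21", 100, 200, "hap_b"),
    SketchHaplotype("2", 500, 600, "hap_a"),
]

# sorted() and list.sort() only need __lt__() to be defined.
ordered_ids = [hap.id for hap in sorted(haps)]
```

If every field should take part in the comparison in declaration order, `@dataclass(order=True)` generates `__lt__()` and the other comparison methods automatically, which may be simpler than writing them by hand.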