fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225
A couple ideas:
1. Now that I'm thinking about it, naming it `lazy` might be confusing since `lazy` is also a parameter of `cyvcf2` and people might think that this is that parameter. What do you think about calling it something like `reorder_samples` and defaulting it to True?
2. Can you add type hints to be consistent? So `lazy=False` would become `lazy: bool = False`, for example.
3. The Genotypes classes also have an `__iter__()` method that is supposed to work similarly to the `read()` method, except it's supposed to read things line by line to reduce memory use. Unfortunately, we won't be able to use `subset()` to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of `__iter__()` to continue to align with `read()` as much as possible.
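To make the first two suggestions concrete, here is a minimal sketch of a reader with a type-hinted `reorder_samples` flag. The function `read_genotypes`, its in-memory inputs, and the samples-by-variants array layout are hypothetical stand-ins for illustration, not the haptools API.

```python
from __future__ import annotations

import numpy as np


def read_genotypes(
    file_samples: list[str],
    data: np.ndarray,
    samples: list[str] | None = None,
    reorder_samples: bool = True,
) -> tuple[list[str], np.ndarray]:
    """Hypothetical reader: keep only the requested samples (rows of data)."""
    if samples is None:
        return file_samples, data
    wanted = set(samples)  # a set gives O(1) membership checks while filtering
    # By default, the kept rows stay in file order, as a file reader would return them
    keep = [i for i, s in enumerate(file_samples) if s in wanted]
    if reorder_samples:
        # Rebuild the row list so it follows the caller's requested order instead
        position = {s: i for i, s in enumerate(file_samples)}
        keep = [position[s] for s in samples if s in position]
    return [file_samples[i] for i in keep], data[keep]
```

With `reorder_samples=False` the rows stay in file order, which avoids the extra bookkeeping when the caller does not care about ordering.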
@mlamkin7 and @gymreklab, I just thought of something that might be potentially super important. According to the numpy docs,
https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing
So we should probably also profile the memory of If it's true that the memory gets doubled, then my preference would be to make the
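The linked section of the NumPy docs states that advanced indexing returns a copy of the data, whereas basic slicing returns a view. A minimal sketch of that difference on a toy array (the array, its shape, and the variable names are made up for illustration):

```python
import numpy as np

gts = np.zeros((4, 3), dtype=np.uint8)  # toy array: 4 samples x 3 variants

view = gts[1:3]        # basic slicing: a view onto the same buffer
copy = gts[[2, 0, 1]]  # integer-array (advanced) indexing: a new array

shares_view = np.shares_memory(gts, view)  # the slice shares memory with gts
shares_copy = np.shares_memory(gts, copy)  # the reordered array does not
```

Until the original array is released, both it and the reordered copy are held in memory at the same time, which is why profiling the reordering step is worthwhile.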
Updated 1. and 2., but 3. does pose an issue, since I believe it returns one variant at a time? We may have to sort each time.
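One possible way to handle the line-by-line case, sketched under assumptions: compute the sample reordering once and apply it to each variant as it is yielded. The generator `iter_reordered` and its inputs are hypothetical and do not reflect how `__iter__()` is implemented in haptools.

```python
from __future__ import annotations

from typing import Iterable, Iterator

import numpy as np


def iter_reordered(
    variants: Iterable[np.ndarray],
    file_samples: list[str],
    samples: list[str],
) -> Iterator[np.ndarray]:
    """Yield each variant's genotypes in the order given by `samples`."""
    position = {s: i for i, s in enumerate(file_samples)}
    # Build the index once, up front, rather than sorting for every variant
    order = np.array([position[s] for s in samples if s in position], dtype=int)
    for gts in variants:
        yield gts[order]  # only one variant's genotypes are copied at a time
```

Because the index array is built once, the per-variant cost is a single indexing operation on one variant's genotypes.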
The readthedocs build is failing because the type hints refer to `cyvcf2`, but it isn't imported. Changing `cyvcf2.VCF` to `VCF` should fix that.
Confirming now that running

results
Running

reproducing this
In case we need it for later, here are the steps to reproduce this:

`fil-profile run tests/bench_genotypes_mem.py`

with this script

```python
#!/usr/bin/env python
from filprofiler.api import profile
from haptools.data import GenotypesPLINK

gt = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)
gt2 = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)


def read():
    # baseline: only load the genotypes
    gt.read()


def read_and_subset():
    # load the genotypes, then subset to the same samples in place
    gt2.read()
    gt2.subset(samples=gt2.samples, inplace=True)


profile(read, "fil-result/bench_read_GenotypesPLINK")
profile(read_and_subset, "fil-result/bench_read_subset_GenotypesPLINK")
```

using this file
Co-authored-by: Arya Massarat <[email protected]>
Tests passed! 🚀
Added a `lazy` option to all Genotypes classes which, when false (the default), sorts the output genotypes into the same order as the list of samples given (if one is given).
Also updated the `tr_harmonizer` dependency so it won't error when TRTools is installed.