fix: regression in multiallelic support for simgenotype
#195
Otherwise, there might be issues that are silently ignored because pgenlib collapses multiallelic variants into biallelic ones.
I'm keeping this as a draft PR because there are still a few things to do:
it's been a change I've wanted to make for a while and now is probably the best time, considering we're breaking the API, anyway
Melissa thinks it's best if we add support for writing missing genotypes and store those instead of removing the sample or variant. For handling unphased genotypes, we can keep things as they are right now.
ok, great - I'll add support for missing genotypes soon
ok, so we don't want to check that the reference panel is phased, then?
Correct
In PR #163, we started using the `GenotypesRefAlt` class in `simgenotype` to make it consistent with our other tools. Unfortunately, the `GenotypesRefAlt` class only supports biallelic genotypes. But since `simgenotype` had previously supported multi-allelic variants, this became a regression.

This PR adds support for multi-allelic variants in the `GenotypesRefAlt` class and its children by changing the `variants` property to store a variable-length list of alleles instead of assuming that there are only ever two alleles. This is officially a BREAKING change to the haptools data API, specifically for the `GenotypesRefAlt` class, the `GenotypesPLINK` class and the `GenotypesAncestry` class!

Also, the `GenotypesRefAlt` class will be officially renamed to `GenotypesVCF`! It's something I've wanted to do for a while and figured we might as well do it now while we're breaking everything, anyway.

Note that this PR does not add support for tandem repeats yet. You'll be able to read and write multi-allelic variants in the `GenotypesRefAlt` class but not much else besides that. For example, you shouldn't use this class for association analyses of multi-allelic variants because the `data` property of the class simply stores their index, not their dosage.
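
To illustrate the index-versus-dosage caveat, here is a minimal sketch in plain Python. It does not use the haptools classes: the `alleles` and `genotypes` lists and the `allele_dosage` helper are hypothetical stand-ins, assuming only that each variant carries a variable-length tuple of alleles and that each genotype entry is an index into that tuple.

```python
# Hypothetical stand-in for a variable-length allele list per variant
# (REF first, then any number of ALT alleles). Not the haptools API.
alleles = [
    ("T", "C"),        # biallelic variant
    ("A", "G", "GT"),  # multi-allelic variant with two ALT alleles
]

# Hypothetical genotypes indexed as [sample][variant][haplotype].
# Each entry is an allele INDEX into the tuple above, not a dosage.
genotypes = [
    [[0, 1], [2, 2]],
    [[1, 1], [0, 2]],
    [[0, 0], [1, 2]],
]


def allele_dosage(gts, variant_idx, allele_idx):
    """Count the copies of one specific allele that each sample carries."""
    return [
        sum(1 for idx in sample[variant_idx] if idx == allele_idx)
        for sample in gts
    ]


def naive_sum(gts, variant_idx):
    """Sum the raw indices per sample, as if they were dosages."""
    return [sum(sample[variant_idx]) for sample in gts]


# For the biallelic variant, summing indices matches the ALT dosage.
biallelic_naive = naive_sum(genotypes, 0)
biallelic_dosage = allele_dosage(genotypes, 0, 1)

# For the multi-allelic variant, the summed indices are not a dosage of anything.
multi_naive = naive_sum(genotypes, 1)
multi_dosage_gt = allele_dosage(genotypes, 1, 2)  # copies of "GT"
```

With biallelic variants the two computations agree, which is why treating indices as dosages used to be safe. Once a variant has a third allele, a per-allele count like the one above (or some other explicit encoding) is needed before running an association test.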