improvements to the Data module #19

aryarm · 2022-03-20T23:57:04Z

input files

would be nice if we could support the following inputs

Path objects representing paths to the files
- and files ending in gz
sys.stdout and sys.stdin
- this might be harder for the Genotypes class b/c I think cyvcf2 only accepts paths given as strings (see "Read VCF From StringIO or Buffer?", brentp/cyvcf2#47)
TextIO objects
- This definitely won't be possible for the Genotypes class but we could do it for the Phenotypes and Covariates classes?

one strategy would be to create a function in the Data abstract class that could detect each of these cases and handle them appropriately?
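
a rough sketch of what such a function could look like, for the text-based classes only (it doesn't solve the cyvcf2 case). The name `open_text()` and its signature are hypothetical and not part of the current Data class:

```python
import gzip
from contextlib import contextmanager
from pathlib import Path
from typing import IO, Iterator, Union


@contextmanager
def open_text(source: Union[str, Path, IO[str]]) -> Iterator[IO[str]]:
    """Hypothetical helper: yield a text handle for any supported input."""
    if hasattr(source, "read"):
        # already a file-like object (sys.stdin, io.StringIO, an open file);
        # the caller owns it, so it is not closed here
        yield source
        return
    path = Path(source)
    # gzip.open in "rt" mode decompresses and decodes transparently
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        yield handle
```

each read() method could then wrap its parsing in `with open_text(fname) as handle:` without caring which kind of input it was given.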

we should also ensure that most of the classes can work appropriately on streams of data
- and rewrite Genotypes.read to allow it to read data line by line

informative warnings

would also be nice if we could warn users when the regions or samples that they provided encompass zero variants
- and tell them to check that the chromosome prefix (ex: "chr") matches up, or attempt to fix it ourselves
for all warnings and errors, use the Logger module instead of raising assertions?
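
to make the zero-variants warning concrete, here's a minimal sketch that uses the standard `logging` module. The function names, the arguments, and the idea of returning the contig name to retry with are all assumptions for illustration:

```python
import logging
from typing import Iterable, Optional


def toggle_chr_prefix(chrom: str) -> str:
    """'chr1' -> '1' and '1' -> 'chr1'."""
    return chrom[3:] if chrom.startswith("chr") else "chr" + chrom


def check_region(
    chrom: str,
    file_chroms: Iterable[str],
    num_variants: int,
    log: Optional[logging.Logger] = None,
) -> Optional[str]:
    """Hypothetical check: warn (instead of asserting) when nothing matched.

    Returns the contig name to retry with if a prefix mismatch looks likely,
    otherwise None.
    """
    log = log or logging.getLogger(__name__)
    if num_variants > 0:
        return None
    contigs = set(file_chroms)
    alt = toggle_chr_prefix(chrom)
    if chrom not in contigs and alt in contigs:
        log.warning(
            "no variants found on '%s'; the file uses '%s', so check the "
            "chromosome prefix", chrom, alt,
        )
        return alt
    log.warning(
        "no variants found on '%s'; check the regions and samples", chrom
    )
    return None
```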

additional classes

for covariates (as a table of samples x covariates)

filtering of variants

by whether they're multi-allelic
automatically by the subset of samples contained in the intersection of the genotype and phenotype files
- note that this might be something we should only do within the code that utilizes the data module (for ex: happler)
by MAF
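
a small numpy sketch of the multi-allelic and MAF filters. It assumes the genotypes are stored as an integer array of shape (samples, variants, 2) holding allele indices; the function names are hypothetical:

```python
import numpy as np


def biallelic_mask(gts: np.ndarray) -> np.ndarray:
    """True for variants where every allele index is 0 or 1."""
    return (gts <= 1).all(axis=(0, 2))


def maf_mask(gts: np.ndarray, min_maf: float) -> np.ndarray:
    """True for variants with minor allele frequency >= min_maf.

    Only meaningful for biallelic variants (alleles coded as 0/1).
    """
    num_chromosomes = 2 * gts.shape[0]
    # alternate allele frequency: count of 1s across samples and strands
    aaf = gts.sum(axis=(0, 2)) / num_chromosomes
    maf = np.minimum(aaf, 1 - aaf)
    return maf >= min_maf
```

the masks index the variants axis, so `gts[:, biallelic_mask(gts)]` drops the multi-allelic variants before the MAF filter is applied.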

subclasses for different kinds of genotyping data

or just some way to type-hint the specific kind that you need

phased vs no restriction on phasing
biallelic vs no restriction on allele number
- filterable for above a certain MAF (only applies to biallelic)
contains TRs (potentially handled by trtools - see support for a TR-based GenotypesPLINK class #73)

new functions

iterate() - a generator function that reads the file line by line and yields named tuples, where each entry corresponds to a property of the class but holds the values for just a single row
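
as a toy illustration (not the data module's actual code), here's what iterate() could look like for a two-column, tab-separated phenotype file. The `PhenotypeRow` tuple and the file layout are assumptions:

```python
from collections import namedtuple
from typing import IO, Iterator

# hypothetical record type: one field per property of the class
PhenotypeRow = namedtuple("PhenotypeRow", ["sample", "value"])


def iterate(handle: IO[str]) -> Iterator[PhenotypeRow]:
    """Yield one named tuple per line without loading the whole file."""
    for line in handle:
        line = line.rstrip("\n")
        # skip blank lines and comment/header lines
        if not line or line.startswith("#"):
            continue
        sample, value = line.split("\t")
        yield PhenotypeRow(sample, float(value))
```

because it's a generator, it also works on streams like sys.stdin, where the data can only be read once.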

aryarm · 2022-05-14T06:13:50Z

For the Genotypes class

create some way to quickly obtain the index of a variant based on its ID
- add a dictionary
remove aaf from Genotypes.variants
- it was never really useful to begin with, anyway - just a bad idea from the start
- we can make it into a method, instead
also, remove the Genotypes.to_MAC() method
add subset() function to the Genotypes class
- by default, return a new Genotypes instance unless the inplace parameter is set to True
numpy-based subsetting and indexing
- implement __getitem__()
- implement __setitem__()
- implement __delitem__()?
a method to generate fake Genotypes
Compare genotypes read when _prephased=True and _prephased=False in the GenotypesPLINK class. Figure out why they're different
- update: I made an attempt at doing this in test: hidden phasing property in GenotypesPLINK class #153 but couldn't reproduce the issue
create a QC method for running all of the QC steps?
reduce memory usage by explicitly freeing memory after loading every chunk of a PGEN file
Right after this line of code, add the following lines:
```
# requires `import gc` at the top of the module
del data
gc.collect()
```
(source and source)
Update: current progress on this
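
For the index dictionary, subset(), and __getitem__() items above, here is a rough sketch on a toy class. `ToyGenotypes` is hypothetical: it assumes the data is a numpy array of shape (samples, variants, 2) and stores variant IDs as a plain tuple, which is simpler than the real Genotypes class.

```python
import numpy as np


class ToyGenotypes:
    """Hypothetical stand-in for Genotypes, for illustration only."""

    def __init__(self, data, samples, variant_ids):
        self._set(np.asarray(data), samples, variant_ids)

    def _set(self, data, samples, variant_ids):
        self.data = data
        self.samples = tuple(samples)
        self.variant_ids = tuple(variant_ids)
        # dictionary for constant-time lookup of a variant's index by ID
        self._var_idx = {vid: i for i, vid in enumerate(self.variant_ids)}

    def index(self, variant_id):
        return self._var_idx[variant_id]

    def subset(self, samples=None, variants=None, inplace=False):
        """Keep the given samples and/or variant IDs, in the given order."""
        samp_idx = {s: i for i, s in enumerate(self.samples)}
        if samples is None:
            samples = self.samples
        if variants is None:
            variants = self.variant_ids
        rows = [samp_idx[s] for s in samples]
        cols = [self._var_idx[v] for v in variants]
        # fancy indexing copies, so the original array is left untouched
        data = self.data[rows][:, cols]
        if inplace:
            self._set(data, samples, variants)
            return None
        return ToyGenotypes(data, samples, variants)

    def __getitem__(self, key):
        # forward numpy-style indexing to the underlying array
        return self.data[key]
```

The dictionary has to be rebuilt whenever the variants change, which is why the sketch routes both `__init__` and the in-place path through the same `_set()` helper.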

For the Phenotypes class

allow for storing multiple phenotypes in the Phenotypes object
support PLINK2-style .pheno files
- writing
- reading
make the Covariates class into a subclass of the Phenotypes object
add a method to generate fake Phenotypes
add subset() function for choosing a subset of samples
change the type of the samples argument for the Phenotypes class to a set (refactor: data.Phenotypes samples parameter to be of type set #152)
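
As a sketch of reading multiple phenotypes from a PLINK2-style .pheno file: this assumes a tab-delimited file whose header starts with `#IID` followed by the phenotype names, with missing values written as `NA`. The `read_pheno()` function is hypothetical, and it ignores variations of the format (such as a leading FID column).

```python
from typing import IO, List, Tuple


def read_pheno(handle: IO[str]) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return (samples, phenotype names, one row of values per sample)."""
    header = handle.readline().rstrip("\n").split("\t")
    if header[0] != "#IID":
        raise ValueError("expected the header to start with '#IID'")
    names = header[1:]
    samples: List[str] = []
    values: List[List[float]] = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"wrong number of columns for sample {fields[0]}")
        samples.append(fields[0])
        # represent missing values as NaN
        values.append([float("nan") if x == "NA" else float(x) for x in fields[1:]])
    return samples, names, values
```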

For the Haplotype and Haplotypes classes

use Genotypes.subset() within the transform() methods
- remove the samples arguments from each of the transform() methods
do not require an empty GenotypesRefAlt class as input to Haplotypes.transform()
add subset() function for choosing a subset of haplotypes after loading them
require that the haplotypes parameter of the Haplotypes.read() method be a set instead of a list

For the Haplotype and Variant classes

make it easier to extend the classes, so that extras don't have to be declared ahead of time?
- pros: it would make it easier to read files with multiple different sets of extra fields
- cons: it puts the burden of handling all of that on us, which could potentially be difficult to take on in the future

For the Haplotype class

create a method that will update the start and end coordinates according to the stored variants
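
A minimal sketch of such a method, using toy dataclasses. `ToyVariant`, `ToyHaplotype`, and `update_coords()` are hypothetical names, and the fields are simplified compared to the real classes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyVariant:
    start: int
    end: int
    id: str


@dataclass
class ToyHaplotype:
    chrom: str
    start: int
    end: int
    id: str
    variants: List[ToyVariant] = field(default_factory=list)

    def update_coords(self) -> None:
        """Shrink or grow start/end to span exactly the stored variants."""
        if not self.variants:
            return  # nothing to infer the coordinates from
        self.start = min(v.start for v in self.variants)
        self.end = max(v.end for v in self.variants)
```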

aryarm · 2024-03-12T17:58:06Z

we also discussed some potential breaking changes to the classes in the data library which we would like to implement this summer:

we could change the __init__ methods to accept the values of the class properties as parameters
and then the read and write methods would take file names as input, instead
and then we could remove the load method and add a check method instead which can run the other checks (like check_maf, etc)

this idea originally arose in #49
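
to illustrate the proposed layout, here's a sketch on a toy class. Everything in it is hypothetical: the `ToyPhenotypes` name, the simple tab-separated file layout, the `check_missing()` check, and the choice to make read() a classmethod are assumptions, not decisions that were made in the discussion.

```python
import math
from typing import Sequence


class ToyPhenotypes:
    """Hypothetical class following the proposed layout; not the current API."""

    def __init__(self, samples: Sequence[str], names: Sequence[str],
                 data: Sequence[Sequence[float]]):
        # __init__ receives the property values directly, not a file name
        self.samples = list(samples)
        self.names = list(names)
        self.data = [list(row) for row in data]

    @classmethod
    def read(cls, fname) -> "ToyPhenotypes":
        # the file name is an argument of read() instead of __init__()
        with open(fname) as handle:
            names = handle.readline().rstrip("\n").split("\t")[1:]
            rows = [ln.rstrip("\n").split("\t") for ln in handle if ln.strip()]
        values = [[float(x) for x in row[1:]] for row in rows]
        return cls([row[0] for row in rows], names, values)

    def write(self, fname) -> None:
        with open(fname, "w") as handle:
            handle.write("\t".join(["#IID", *self.names]) + "\n")
            for sample, row in zip(self.samples, self.data):
                handle.write("\t".join([sample, *map(str, row)]) + "\n")

    def check(self) -> None:
        """Single entry point that runs each of the individual checks."""
        self.check_missing()

    def check_missing(self) -> None:
        for sample, row in zip(self.samples, self.data):
            if any(math.isnan(x) for x in row):
                raise ValueError(f"sample {sample} has a missing phenotype")
```

with this layout, objects can be built in memory (ex: in tests) without touching the filesystem, and `ToyPhenotypes.read(fname).check()` covers the read-then-validate case.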

another idea:

change the __iter__ methods to yield chunk_size variants at a time instead of only one at a time. This could be useful for folks who don't want to load the entire genotype matrix into memory all at once
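
a minimal sketch of the chunking logic, written as a standalone generator over any iterable of records. The `iter_chunks()` name is hypothetical, and a real __iter__ would presumably yield arrays instead of lists:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(records: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield lists of up to chunk_size records; the last one may be shorter."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls at most chunk_size items, so only one chunk is
        # held in memory at a time
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```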

aryarm changed the title ~~allow for different file inputs in data module~~ allow for different types of file object inputs in data module Mar 21, 2022

aryarm added the enhancement label Mar 21, 2022

aryarm changed the title ~~allow for different types of file object inputs in data module~~ improvements to the Data module Mar 27, 2022

aryarm self-assigned this Apr 1, 2022

aryarm mentioned this issue Apr 2, 2022

feat: support filtering of multi-allelic variants and loading of covariates in data/ #20

Merged

aryarm added a commit that referenced this issue Apr 10, 2022

support gz and bz2 file extensions (see #19)

f71b853

aryarm added a commit that referenced this issue Apr 13, 2022

reduce memory in Genotypes.read (see #19)

7827e3d

aryarm added a commit that referenced this issue Apr 13, 2022

use logging instead of assertions in data module (see #19)

18f5a97

aryarm added a commit that referenced this issue Apr 13, 2022

add iterate function to data classes (see #19)

47f61e6

aryarm mentioned this issue Apr 24, 2022

feat: .hap file format IO #43

Merged

aryarm added a commit that referenced this issue Jun 16, 2022

simplify the transform method of the Haplotype and Haplotypes classes

f2a3017

see #19 (comment)

aryarm added a commit that referenced this issue Jul 1, 2022

support reading PLINK2-style pheno and covar files

afcdda5

and make covariates class a subclass of phenotypes see #19 for more details

aryarm mentioned this issue Jul 8, 2022

ref: simphenotype #64

Merged

aryarm added a commit that referenced this issue Oct 18, 2022

remove aaf field from Genotypes objects

94e1f61

see #19 (comment)

aryarm mentioned this issue Oct 19, 2022

feat: Genotypes.check_maf() method #124

Merged

aryarm mentioned this issue Nov 1, 2022

docs: cleanup and improve descriptions #126

Merged

aryarm mentioned this issue Dec 29, 2022

refactor: data.Phenotypes samples parameter to be of type set #152

Merged

aryarm mentioned this issue Apr 6, 2023

feat: new Phenotypes.subset() method #203

Merged

aryarm modified the milestone: mory Oct 4, 2023
