fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

mlamkin7 · 2023-10-03T02:57:57Z

Added lazy option to all Genotypes Classes which when false (default) sorts output genotypes in the same order as the list of samples given (if given).

Also updated tr_harmonizer dependency so it won't error when TRTools is installed.

… sort_str_samples

aryarm

A couple ideas:

Now that I'm thinking about it, naming it lazy might be confusing since lazy is also a parameter of cyvcf2 and people might think that this is that parameter. What do you think about calling it something like reorder_samples and defaulting it to True?
Can you add type hints to be consistent? So lazy=False would become lazy: bool = False, for example
The Genotypes classes also have an __iter__() method that is supposed to work similarly to the read() method except it's supposed to read things line by line to reduce memory-use. Unfortunately, we won't be able to use subset() to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of __iter__() to continue to align with read() as much as possible

haptools/data/genotypes.py

aryarm · 2023-10-04T02:09:00Z

@mlamkin7 and @gymreklab, I just thought of something that might be potentially super important.

According to the numpy docs, Genotypes.subset() will trigger a copy of the entire genotype matrix to be made in memory because we have to use advanced indexing to reorder the rows of the genotype matrix:

Advanced indexing always returns a copy of the data

https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing

So we should probably also profile the memory of read() before and after this change? I can look into a way to do that, if you'd like. I've been meaning to do it for something else anyway (see here, for ex)

If it's true that the memory gets doubled, then my preference would be to make the samples argument into a set instead of a list. (since I think the default behavior should be as quick and as lightweight as possible.) Or, at least, maybe we default the new parameter to not performing the sorting?

… sort_str_samples

mlamkin7 · 2023-10-05T15:52:09Z

A couple ideas:

Now that I'm thinking about it, naming it lazy might be confusing since lazy is also a parameter of cyvcf2 and people might think that this is that parameter. What do you think about calling it something like reorder_samples and defaulting it to True?

Can you add type hints to be consistent? So lazy=False would become lazy: bool = False, for example

The Genotypes classes also have an __iter__() method that is supposed to work similarly to the read() method except it's supposed to read things line by line to reduce memory-use. Unfortunately, we won't be able to use subset() to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of __iter__() to continue to align with read() as much as possible

Updated 1. and 2., but 3 does pose an issue since I believe it returns one variant at a time? we may have to sort each time.

aryarm

the readthedocs build is failing because the type hints refer to cyvcf2, but it isn't imported

changing cyvcf2.VCF to VCF should fix that

haptools/data/genotypes.py

aryarm · 2023-10-05T19:42:04Z

Confirming now that running subset() together with read() does indeed double the peak memory usage :(

results

Running read() and then subset() gives us a peak memory usage of 8.8 MiB. Judging by the width and color intensity of the flamegraph bars for the read() and subset() functions, roughly half of the 8.8 MiB is due to read() while the other half is due to subset().

Running just read() alone gives us a peak memory usage of 5.1 MiB. Most of this is due to the initial allocation of memory for the genotype matrix by read().

For more info on how to interpret the flamegraphs, take a look at the filprofiler docs

reproducing this

in case we need it for later, here are the steps to reproduce this
Using commit 56dddf4, run this command

fil-profile run tests/bench_genotypes_mem.py

with this script

#!/usr/bin/env python

from filprofiler.api import profile

from haptools.data import GenotypesPLINK

gt = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)
gt2 = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)

def read():
    gt.read()

def read_and_subset():
    gt2.read()
    gt2.subset(samples=gt2.samples, inplace=True)

profile(read, "fil-result/bench_read_GenotypesPLINK")
profile(read_and_subset, "fil-result/bench_read_subset_GenotypesPLINK")

using this file

Co-authored-by: Arya Massarat <[email protected]>

aryarm

Tests passed! 🚀

mlamkin7 added 5 commits September 20, 2023 13:50

Remove try except so always use our version of tr_harmonizer

d1d4f5c

Extended TRRecordHarmonizer to avoid installation of trtools issue

9da06fb

New TRRecordHarmonizerRegion class to extren TRRecordHarmonizer

6764f1f

Merge branch 'main' of https://github.com/CAST-genomics/haptools into…

babdbd6

… sort_str_samples

Added lazy option

146beac

mlamkin7 requested a review from aryarm October 3, 2023 02:57

Formatting

8499d28

aryarm reviewed Oct 3, 2023

View reviewed changes

haptools/data/genotypes.py Outdated Show resolved Hide resolved

haptools/data/genotypes.py Outdated Show resolved Hide resolved

haptools/data/genotypes.py Outdated Show resolved Hide resolved

mlamkin7 added 2 commits October 5, 2023 08:43

Changed lazy to reorder_samples

6ba15a8

Merge branch 'main' of https://github.com/CAST-genomics/haptools into…

e201cd4

… sort_str_samples

aryarm reviewed Oct 5, 2023

View reviewed changes

haptools/data/genotypes.py Outdated Show resolved Hide resolved

haptools/data/genotypes.py Outdated Show resolved Hide resolved

haptools/data/genotypes.py Outdated Show resolved Hide resolved

mlamkin7 and others added 3 commits October 5, 2023 14:02

Update haptools/data/genotypes.py

8d460f2

Co-authored-by: Arya Massarat <[email protected]>

Update haptools/data/genotypes.py

2d3ecf5

Co-authored-by: Arya Massarat <[email protected]>

Update haptools/data/genotypes.py

eef314f

Co-authored-by: Arya Massarat <[email protected]>

aryarm mentioned this pull request Nov 30, 2023

feat: convert samples arg in Genotypes classes into a set #230

Merged

feat: convert samples arg in Genotypes classes into a set (#230)

2f2dea8

aryarm approved these changes Dec 2, 2023

View reviewed changes

aryarm changed the title ~~feat: Added lazy option and fixed tr_harmonizer dependency~~ feat: convert Genotypes.samples argument into a set and fix tr_harmonizer bug arising when TRTools is also installed Dec 29, 2023

aryarm changed the title ~~feat: convert Genotypes.samples argument into a set and fix tr_harmonizer bug arising when TRTools is also installed~~ feat: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed Dec 29, 2023

Merge branch 'main' into sort_str_samples

894da1d

aryarm merged commit 06cc273 into main Jan 12, 2024
11 checks passed

aryarm deleted the sort_str_samples branch January 12, 2024 19:51

github-actions bot mentioned this pull request Jan 12, 2024

chore(main): release 0.4.0 #223

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

mlamkin7 commented Oct 3, 2023

aryarm left a comment •

edited

Loading

aryarm commented Oct 4, 2023 •

edited

Loading

mlamkin7 commented Oct 5, 2023

aryarm left a comment •

edited

Loading

aryarm commented Oct 5, 2023 •

edited

Loading

aryarm left a comment

fix: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed #225

fix: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed #225

Conversation

mlamkin7 commented Oct 3, 2023

aryarm left a comment • edited Loading

Choose a reason for hiding this comment

aryarm commented Oct 4, 2023 • edited Loading

mlamkin7 commented Oct 5, 2023

aryarm left a comment • edited Loading

Choose a reason for hiding this comment

aryarm commented Oct 5, 2023 • edited Loading

results

reproducing this

aryarm left a comment

Choose a reason for hiding this comment

fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

fix: convert `samples` argument in `Genotypes.read` into a set and fix `tr_harmonizer` bug arising when TRTools is also installed #225

aryarm left a comment •

edited

Loading

aryarm commented Oct 4, 2023 •

edited

Loading

aryarm left a comment •

edited

Loading

aryarm commented Oct 5, 2023 •

edited

Loading