Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed #225

Merged
merged 13 commits into from
Jan 12, 2024

Conversation

mlamkin7
Copy link
Collaborator

@mlamkin7 mlamkin7 commented Oct 3, 2023

Added lazy option to all Genotypes Classes which when false (default) sorts output genotypes in the same order as the list of samples given (if given).

Also updated tr_harmonizer dependency so it won't error when TRTools is installed.

@mlamkin7 mlamkin7 requested a review from aryarm October 3, 2023 02:57
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple ideas:

  1. Now that I'm thinking about it, naming it lazy might be confusing since lazy is also a parameter of cyvcf2 and people might think that this is that parameter. What do you think about calling it something like reorder_samples and defaulting it to True?
  2. Can you add type hints to be consistent? So lazy=False would become lazy: bool = False, for example
  3. The Genotypes classes also have an __iter__() method that is supposed to work similarly to the read() method except it's supposed to read things line by line to reduce memory-use. Unfortunately, we won't be able to use subset() to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of __iter__() to continue to align with read() as much as possible

haptools/data/genotypes.py Outdated Show resolved Hide resolved
haptools/data/genotypes.py Outdated Show resolved Hide resolved
haptools/data/genotypes.py Outdated Show resolved Hide resolved
@aryarm
Copy link
Member

aryarm commented Oct 4, 2023

@mlamkin7 and @gymreklab, I just thought of something that might be potentially super important.

According to the numpy docs, Genotypes.subset() will trigger a copy of the entire genotype matrix to be made in memory because we have to use advanced indexing to reorder the rows of the genotype matrix:

Advanced indexing always returns a copy of the data

https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing

So we should probably also profile the memory of read() before and after this change? I can look into a way to do that, if you'd like. I've been meaning to do it for something else anyway (see here, for ex)

If it's true that the memory gets doubled, then my preference would be to make the samples argument into a set instead of a list. (since I think the default behavior should be as quick and as lightweight as possible.) Or, at least, maybe we default the new parameter to not performing the sorting?

@mlamkin7
Copy link
Collaborator Author

mlamkin7 commented Oct 5, 2023

A couple ideas:

  1. Now that I'm thinking about it, naming it lazy might be confusing since lazy is also a parameter of cyvcf2 and people might think that this is that parameter. What do you think about calling it something like reorder_samples and defaulting it to True?
  2. Can you add type hints to be consistent? So lazy=False would become lazy: bool = False, for example
  3. The Genotypes classes also have an __iter__() method that is supposed to work similarly to the read() method except it's supposed to read things line by line to reduce memory-use. Unfortunately, we won't be able to use subset() to reorder samples for that. What should we do in that case, do you think? I'd prefer the behavior of __iter__() to continue to align with read() as much as possible

Updated 1. and 2., but 3 does pose an issue since I believe it returns one variant at a time? we may have to sort each time.

Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the readthedocs build is failing because the type hints refer to cyvcf2, but it isn't imported

changing cyvcf2.VCF to VCF should fix that

haptools/data/genotypes.py Outdated Show resolved Hide resolved
haptools/data/genotypes.py Outdated Show resolved Hide resolved
haptools/data/genotypes.py Outdated Show resolved Hide resolved
@aryarm
Copy link
Member

aryarm commented Oct 5, 2023

Confirming now that running subset() together with read() does indeed double the peak memory usage :(

results

Running read() and then subset() gives us a peak memory usage of 8.8 MiB. Judging by the width and color intensity of the flamegraph bars for the read() and subset() functions, roughly half of the 8.8 MiB is due to read() while the other half is due to subset().
peak-memory
Running just read() alone gives us a peak memory usage of 5.1 MiB. Most of this is due to the initial allocation of memory for the genotype matrix by read().
peak-memory
For more info on how to interpret the flamegraphs, take a look at the filprofiler docs

reproducing this

in case we need it for later, here are the steps to reproduce this
Using commit 56dddf4, run this command

fil-profile run tests/bench_genotypes_mem.py

with this script

#!/usr/bin/env python

from filprofiler.api import profile

from haptools.data import GenotypesPLINK

gt = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)
gt2 = GenotypesPLINK("../happler/tests/data/19_45401409-46401409_1000G.pgen", chunk_size=1)

def read():
    gt.read()

def read_and_subset():
    gt2.read()
    gt2.subset(samples=gt2.samples, inplace=True)

profile(read, "fil-result/bench_read_GenotypesPLINK")
profile(read_and_subset, "fil-result/bench_read_subset_GenotypesPLINK")

using this file

Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests passed! 🚀

@aryarm aryarm changed the title feat: Added lazy option and fixed tr_harmonizer dependency feat: convert Genotypes.samples argument into a set and fix tr_harmonizer bug arising when TRTools is also installed Dec 29, 2023
@aryarm aryarm changed the title feat: convert Genotypes.samples argument into a set and fix tr_harmonizer bug arising when TRTools is also installed feat: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed Dec 29, 2023
@aryarm aryarm changed the title feat: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed fix: convert samples argument in Genotypes.read into a set and fix tr_harmonizer bug arising when TRTools is also installed Dec 29, 2023
@aryarm aryarm merged commit 06cc273 into main Jan 12, 2024
11 checks passed
@aryarm aryarm deleted the sort_str_samples branch January 12, 2024 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants