Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up downsample_counts #340

Merged
merged 2 commits into from
Nov 4, 2018
Merged

Speed up downsample_counts #340

merged 2 commits into from
Nov 4, 2018

Conversation

ivirshup
Copy link
Member

@ivirshup ivirshup commented Nov 2, 2018

On master (3785143) from the base of the repo, I haven't seen the following code finish running:

import scanpy.api as sc
adata = sc.read("./data/pbmc3k_raw.h5ad")
%time sc.pp.downsample_counts(adata, 1500)

This PR implements an optimized version of the same thing, which gives:

%time sc.pp.downsample_counts(adata, 1500)                                                   
CPU times: user 2.25 s, sys: 44.7 ms, total: 2.29 s
Wall time: 2.32 s

What's changed

  • I've rewritten the function to use numba along with fewer allocations
  • Added a test for the function
  • Added argument replace, which indicates whether subsampling should happen with replacement

Notes

To me, it makes more sense to sample without replacement, since for small changes in total counts you'll have more similar profiles. However, I've set the default for replacement to True to preserve the current behavior.

Neither this or the previous method scale well with sampling depth, and it's maybe worth using a call to sample a multinomial or multivariate hypergeometric distribution instead.

@ivirshup
Copy link
Member Author

ivirshup commented Nov 2, 2018

Um, it wasn't me.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/xorg-libxdmcp-1.1.2-h470a237_7.tar.bz2>

Also, downsampling from 3785143 finished after an hour, but definitely had the wrong answer (all counts in one gene). I'm not sure what to make of this, since it's given reasonable results for smaller tests.

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 2, 2018

Hey!
I wrote this function a while ago... it was definitely not the cleanest or quickest implementation. And it did take a while to run on ~5k cells at the time, but I thought it would be useful to have this functionality in scanpy.

Just wanted to note that the intention was definitely to implement this without resampling. I clearly missed that the default was to use resampling in np.random.choice. Thanks for spotting this.

@falexwolf
Copy link
Member

That's really cool, thank you!

I'll add a logging output about that replace=False is the more natural choice and we'll make it the default in the next major release.

@falexwolf falexwolf merged commit cfa5ee9 into scverse:master Nov 4, 2018
@ivirshup ivirshup mentioned this pull request Feb 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants