Speed up `downsample_counts` #340

ivirshup · 2018-11-02T05:01:19Z

On master (3785143) from the base of the repo, I haven't seen the following code finish running:

import scanpy.api as sc
adata = sc.read("./data/pbmc3k_raw.h5ad")
%time sc.pp.downsample_counts(adata, 1500)

This PR implements an optimized version of the same thing, which gives:

%time sc.pp.downsample_counts(adata, 1500)
CPU times: user 2.25 s, sys: 44.7 ms, total: 2.29 s
Wall time: 2.32 s

What's changed

I've rewritten the function to use numba and to make fewer allocations.
Added a test for the function.
Added a `replace` argument, which controls whether subsampling happens with replacement.

Notes

To me, it makes more sense to sample without replacement, since for small changes in total counts you'll get more similar profiles. However, I've set the default for `replace` to True to preserve the current behavior.
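
To illustrate the difference, here is a minimal NumPy sketch. It is not the code in this PR: the helper name `downsample_cell` and the toy counts are made up for illustration. It expands one cell's counts into one gene label per count and draws `target` of them, with or without replacement.

```python
import numpy as np


def downsample_cell(counts, target, replace, seed=0):
    """Hypothetical helper: downsample one cell's counts to `target` total counts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    # One entry per observed count, labelled with its gene index.
    molecules = np.repeat(np.arange(counts.size), counts)
    # Draw `target` labels, then tally them back into a per-gene count vector.
    drawn = rng.choice(molecules, size=target, replace=replace)
    return np.bincount(drawn, minlength=counts.size)


cell = np.array([5, 0, 3, 2])  # toy counts for four genes, 10 in total
without = downsample_cell(cell, 8, replace=False)
with_repl = downsample_cell(cell, 8, replace=True)
```

Without replacement, no gene can end up with more counts than it started with, so removing only a few counts leaves the profile close to the original. With replacement, that bound no longer holds.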

Neither this nor the previous method scales well with sampling depth, so it may be worth drawing from a multinomial or multivariate hypergeometric distribution instead.
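
As a rough sketch of that alternative (not part of this PR), assuming NumPy's `Generator` API, where `multivariate_hypergeometric` needs NumPy 1.18 or later, a single draw per cell could replace the per-count sampling. The toy count vector is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
cell = np.array([500, 0, 300, 200])  # toy counts for one cell
target = 150  # total counts to keep

# Without replacement: one multivariate hypergeometric draw from the observed counts.
hyper = rng.multivariate_hypergeometric(cell, target)

# With replacement: one multinomial draw with probabilities proportional to the counts.
multi = rng.multinomial(target, cell / cell.sum())
```

Both draws return a per-gene count vector that sums to `target`, and neither needs to materialize one entry per count.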

ivirshup · 2018-11-02T05:16:58Z

Um, it wasn't me.

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/xorg-libxdmcp-1.1.2-h470a237_7.tar.bz2>

Also, downsampling on 3785143 did finish after an hour, but the result was definitely wrong (all counts in one gene). I'm not sure what to make of this, since it gave reasonable results in smaller tests.

LuckyMD · 2018-11-02T09:51:25Z

Hey!
I wrote this function a while ago... it was definitely not the cleanest or quickest implementation. And it did take a while to run on ~5k cells at the time, but I thought it would be useful to have this functionality in scanpy.

Just wanted to note that the intention was definitely to implement this without replacement. I clearly missed that np.random.choice samples with replacement by default. Thanks for spotting this.

falexwolf · 2018-11-04T03:25:00Z

That's really cool, thank you!

I'll add a log message noting that replace=False is the more natural choice, and we'll make it the default in the next major release.

ivirshup added 2 commits November 2, 2018 00:07

Get downsampling to work

a0eb973

Added flag for subsampling with or without replacement

f16e51a

juugii mentioned this pull request Nov 2, 2018

Alevin : Normalization of samples with uneven sequencing depth / Batch Correction. COMBINE-lab/salmon#305

Open

falexwolf merged commit cfa5ee9 into scverse:master Nov 4, 2018

ivirshup mentioned this pull request Feb 11, 2019

Downsample total counts #474

Merged
