
Memory exceeded when running single cell data alignment #189

Closed
Qirongmao97 opened this issue May 16, 2024 · 29 comments
Labels
fixed in release: Issue resolved and the fix is released, waiting for approval
performance: Issues related to computational performance

Comments

@Qirongmao97

Qirongmao97 commented May 16, 2024

[Attached image: plot_zoom_png (1)]

Hi,

I tried running a task with a memory limit of 250GB, but it kept going over the limit.

I'm thinking the problem might be related to how I'm using the --read_group input. Right now, I'm using a file called putative_bc.csv from BLAZE to sort out barcodes. Do you know a better way to align reads using BLAZE results with IsoQuant?

Originally posted by @Qirongmao97 in #165 (comment)

@Qirongmao97 Qirongmao97 changed the title to Memory exceeded when running single cell data alignment May 16, 2024
@lianov

lianov commented May 17, 2024

Agreed, we are also seeing this with the latest version (v3.4.1), using the same approach for single-cell data with --read_group. With version 3.3.1 the same sample used ~194 GB; now it uses 2.08 TB, and it does not respect the memory limit, which we also set to 250 GB.

@andrewprzh
Collaborator

@Qirongmao97 @lianov

How do you set the memory limit?
I think exceeding the memory limit might be related to Python multiprocessing.

It seems like the RAM peak occurs right at the end, when results are being merged.
I will run IsoQuant on some single-cell data I have and check its memory consumption.

Best
Andrey
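The multiprocessing hunch is plausible: each worker runs in its own process, so a per-process cap (e.g. `resource.RLIMIT_AS`) does not bound the workers' combined footprint, and a scheduler that only samples the parent can under-report usage. A minimal, hypothetical sketch (not IsoQuant code) that measures the parent's and the children's peak RSS separately:

```python
import resource
from multiprocessing import Pool

def allocate(mb):
    # Each worker allocates in its own address space, so a per-process
    # limit on the parent does not bound the combined usage.
    buf = bytearray(mb * 1024 * 1024)
    return len(buf)

if __name__ == "__main__":
    with Pool(4) as pool:
        pool.map(allocate, [64] * 4)  # ~256 MB spread across workers
    # ru_maxrss: peak resident set size (KiB on Linux) of this process
    # vs. its waited-for children
    print("parent peak:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    print("children peak:", resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss)
```

Comparing the two numbers shows how much of the footprint lives outside the parent process, which is exactly what a naive per-process limit misses.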

@andrewprzh andrewprzh added the performance Issues related to computational performance label May 20, 2024
@lianov

lianov commented May 20, 2024

@andrewprzh : in my case it was set as part of the Nextflow job from our nf-core/scnanoseq pipeline. Checking the Nextflow logs, they reported much higher usage without the job failing in SLURM (which is itself an issue, since it should have failed). If you need more details, please let me know. Thanks for looking into this!

@andrewprzh
Collaborator

I tested IsoQuant on a very large dataset with 70K barcodes, and it does take a lot of RAM. I'll start investigating the issue.
I think it might be partially caused by Python's multiprocessing mechanisms.

I'll keep you updated.

Best
Andrey

@Qirongmao97
Author

Hi Andrew @andrewprzh ,

I'm running IsoQuant on a Visium dataset with only 5K barcodes. Technically, it should not require much RAM, right? I was wondering whether there might be an issue with the input from the BLAZE demultiplexing step. I am also trying the nf-core/scnanoseq pipeline, but it would be great if you could share your pipeline for processing single-cell data with IsoQuant.

Thanks!

@ljwharbers

ljwharbers commented Jun 3, 2024

Hi @andrewprzh ,

Just commenting to let you know that I was running into the same issue. I have data from a custom spatial transcriptomics assay with closer to a few million 'barcodes'. While I have some nodes with multiple TB of memory available, after reading these comments I'm afraid that won't be enough for my dataset. I have a run scheduled with 2 TB of RAM this evening, so I will update when I know more.

Do you have any potential fix in mind, or could you point me at the (most likely) chunk of code where this occurs so I can have a look as well?

Edit: as expected, it sadly also runs out of memory even with 2 TB allocated.

Thanks,
Luuk

@andrewprzh
Collaborator

@Qirongmao97

Currently I use a barcode calling tool of my own, which will become a part of IsoQuant at some point.
In fact, I don't think it matters how the barcodes are called; it's the number of distinct barcodes that matters.
How many barcodes do you have in total?

Best
Andrey

@andrewprzh
Collaborator

@ljwharbers

A few million barcodes really is a lot, and it's somewhat expected to consume a lot of RAM.
Do all of these barcodes represent real cells, or is there a chance to apply some filtering?

Best
Andrey

@Qirongmao97
Author

@andrewprzh

Hi, in this Visium dataset we have 3700 cells (spots).

@ljwharbers

@andrewprzh

I realize it's a bit of an extreme scenario :')
These are barcodes representing real spatial coordinates (so not really cells, but for analysis purposes it doesn't matter). Each barcode has only very few unique reads (and thus genes/transcripts) associated with it.

I think the main problem here is that, if I understand it correctly, a 'cell' x gene/transcript matrix is always generated and this consumes a huge amount of memory.

Could a solution be to have an option to not generate the output in this 'wide' format, but in a 'long' format instead? I can imagine this would save a lot of memory.
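The wide-vs-long point can be made concrete with a toy example (hypothetical counts, not IsoQuant's actual output): a wide matrix stores cells x features entries no matter how many are non-zero, while a long format stores one row per observation, so memory scales with the data rather than the grid.

```python
# Toy counts: 4 barcodes x 5 features, but only 3 non-zero observations
barcodes = ["bc1", "bc2", "bc3", "bc4"]
features = ["g1", "g2", "g3", "g4", "g5"]
long_rows = [("bc1", "g1", 3), ("bc2", "g4", 1), ("bc4", "g2", 2)]

# "Wide" matrix: one entry for every (barcode, feature) pair, mostly zeros
wide = {bc: {f: 0 for f in features} for bc in barcodes}
for bc, f, c in long_rows:
    wide[bc][f] = c

n_wide = sum(len(row) for row in wide.values())  # 4 * 5 = 20 entries
n_long = len(long_rows)                          # 3 entries
print(n_wide, n_long)  # 20 3
```

With millions of barcodes and tens of thousands of transcripts the wide grid explodes even when each barcode carries only a handful of reads, which matches the behaviour described above.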

@andrewprzh
Collaborator

@ljwharbers

Yes, the matrix is always stored in some way. Previously, IsoQuant output the "long" format, but then we decided to use the "wide" format for everything.
I'll see what I can do to make a workaround.

@andrewprzh
Collaborator

@Qirongmao97

3700 is not really a lot... I also see that the RAM peak occurs at the end, probably when the counts are merged into a single table.

@ljwharbers

> @ljwharbers
>
> Yes, the matrix is always stored in some way. Previously, IsoQuant was outputting the "long" format, but then we decided to use "wide" format for everything. I'll see what I can do to make a workaround.

Thanks, that would be amazing!

@lianov

lianov commented Jul 1, 2024

@andrewprzh : thank you for your work on this once again. Do you foresee a fix for this issue in the near future? We are getting close to final review with nf-core on scnanoseq, and for now we have chosen to downgrade IsoQuant to 3.3.1 as a temporary workaround. If you think there might be a fix in the near future, could you let us know so we can update the pipeline before the first release with a new version of IsoQuant?

If not, no problem - we will aim to release a patch as soon as it is available. Thank you again.

@andrewprzh
Collaborator

@lianov

Unfortunately, I'm quite busy with other projects and am trying to work on IsoQuant in between. I think using 3.3.1 for now is a good solution, since I cannot predict the timeline at the moment...
I will keep you updated anyway.

Best
Andrey

@lianov

lianov commented Jul 1, 2024

@andrewprzh : No problem, totally get it and thank you for the quick reply. We will move forward with this plan in the meantime.

@andrewprzh
Collaborator

Makes sense, good luck and stay tuned :)

@andrewprzh
Collaborator

andrewprzh commented Jul 13, 2024

@lianov

New 3.4.2 consumes significantly less memory compared to 3.4.1.

However, there still might be issues with single-cell data, which I'm still working on.

Best
Andrey

@lianov

lianov commented Jul 15, 2024

@andrewprzh : thank you for the update. @atrull314 and I will be looking into this new release for sure for performance in single-cell data. Thank you for your continued updates etc!

@andrewprzh
Collaborator

@Qirongmao97 @lianov @ljwharbers

The new IsoQuant 3.5 should consume far less RAM when using read groups, for gene, transcript, and exon counts too.

It also outputs grouped counts in both matrix and linear formats.

Best
Andrey

@andrewprzh andrewprzh added the fixed in release Issue resolved and the fix is released, waiting for approval label Aug 3, 2024
@lianov

lianov commented Aug 5, 2024

@andrewprzh : Great, we will be trying this out asap. Thank you again for your updates.

@ljwharbers

@andrewprzh this is amazing, thanks! I'm testing it now and it runs smoothly so far, no memory issues (and this is with ~50 million barcodes!). Amazing work!

@lianov

lianov commented Aug 7, 2024

@andrewprzh : just to follow up on our end. We are also seeing memory improvements with this latest version after some preliminary tests (~80 GB with a PromethION dataset). We will continue to test on other datasets and upgrade the pipeline ASAP to be released with IsoQuant 3.5.

@lianov

lianov commented Aug 23, 2024

Following up here to close the loop on our end at least: we fully tested this version across our datasets and can confirm better performance. Quantification sensitivity in this latest version is also much better than before! Thanks for all the improvements! This latest version is implemented in the scnanoseq pipeline, and we are very close to releasing it on our end.

@andrewprzh
Collaborator

Thanks a lot for getting back, and happy to hear about the positive results!
And thank you for embedding IsoQuant into your pipeline!

@ljwharbers

Also a follow-up from my side.
I've run the latest version with >50 million barcodes and there are no memory issues anymore. The run time is (very) long due to outputting in dense matrix format, typically days for my dataset. After simply commenting out the lines that write in matrix format, everything processed in a couple of hours.

Super impressed with the speed and sensitivity. I will also be including isoquant in my nf-core pipeline (which is still a bit away from being released).

Thanks for your continued work and your quick responses!

@lianov

lianov commented Aug 26, 2024

@ljwharbers : that's good info on tracking down the source of the run time. On most of our datasets with default threads it takes about ~8 hr, but this is helpful to us and maybe an area where we can also contribute in the future.

@ljwharbers

I think that ultimately the best option would be to keep the intermediate files in the 'linear' format during processing, and only in the final merging step transform them into a (sparse) matrix or linear format, depending on the user's requirement. This would save a lot of time even if the user wants the output in matrix format.

@lianov I simply have a small script to change the linear format into a sparse mtx, which is compatible with (almost) all downstream single-cell processing tools.

While writing this, I see that @andrewprzh just released v3.5.1 already with the ability for the user to specify the output format. Amazing work once again!
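The conversion script mentioned above isn't shown, but something like it can be sketched with the stdlib alone. The `(group, feature, count)` row layout here is an assumption, not IsoQuant's documented output; the sketch builds 1-based indices and emits a MatrixMarket coordinate file, which `scipy.io.mmread` and most single-cell tools can load.

```python
# Hypothetical long-format rows (group, feature, count); adapt the
# parsing to the real file's column layout.
rows = [("bc1", "g1", 3), ("bc2", "g2", 1), ("bc1", "g2", 2)]

groups = sorted({r[0] for r in rows})
features = sorted({r[1] for r in rows})
g_idx = {g: i + 1 for i, g in enumerate(groups)}    # MatrixMarket is 1-based
f_idx = {f: i + 1 for i, f in enumerate(features)}

# Minimal MatrixMarket coordinate file: header, dimensions line
# (rows cols nonzeros), then one "row col value" triple per entry
lines = ["%%MatrixMarket matrix coordinate integer general",
         f"{len(groups)} {len(features)} {len(rows)}"]
lines += [f"{g_idx[g]} {f_idx[f]} {c}" for g, f, c in rows]
mtx_text = "\n".join(lines) + "\n"
print(mtx_text)
```

Paired with plain-text barcode and feature lists, this gives the 10x-style triplet that downstream single-cell tools expect, without ever materialising the dense matrix.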

@andrewprzh
Collaborator

@ljwharbers

Thanks for the feedback! For now I implemented a simple option --counts_format, but I'll rework the counts output in a more optimal way to avoid merging large files. Interestingly, the linear format was previously the default for grouped counts with a large number of groups, but somehow we decided to switch to the matrix format.

I'll close this issue for now, feel free to reopen or start a discussion if needed.
