Memory Problems with ONT Data #23

Open
eprdz opened this issue Dec 21, 2023 · 7 comments
@eprdz

eprdz commented Dec 21, 2023

I was using isONclust in parallel as a step prior to defining a transcriptome from ONT data with isONform. I looked at the memory profiling of isONclust and, after a few minutes, when it was almost reaching the memory limit (125 GB), the memory consumption of isONclust dropped to 40-50 GB.
isONclust "seemed to be working": the command prompt had not returned and no error was thrown, but only 1 thread out of all that were running was actually alive.
I realized that there were 2 reads that were very long (>100 kb), while the other reads were 10 kb long at most. I removed those outliers and now it seems to work.
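For what it's worth, the removal amounted to a simple maximum-length cut-off; a minimal sketch with seqkit (the tool, the 100 kb threshold, and the file names are just what I reached for, not part of any pipeline):

# keep only reads of at most 100 kb; threshold and file names are illustrative
seqkit seq -M 100000 reads.fq > reads.filtered.fq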

I was thinking that perhaps an error or warning could be thrown in such cases, to avoid confusion.

Thanks for your time!

@ksahlin
Owner

ksahlin commented Dec 21, 2023

Thank you for reporting.

The preprocessing tools lima (PacBio) or pychopper (ONT) can be used to remove long degenerate reads. I recommend using one of them for any preprocessing of "raw" reads.
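For ONT reads, a typical pychopper invocation might look like the sketch below (flags and primer defaults vary between pychopper versions, so treat it as illustrative rather than a recommended command):

# illustrative: orient/trim full-length cDNA reads; unclassified reads are set aside
pychopper -r report.pdf -u unclassified.fq -w rescued.fq raw_reads.fq full_length.fq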

The peak you saw could have happened in the sorting step (prior to clustering).

Yes, I could add a warning message and flag reads above a certain length threshold (but I think taking care of these reads with a preprocessing tool is the way to go).
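In the meantime, such a check is easy to do outside isONclust; a minimal sketch with awk, where the 50 kb threshold and the file name are just examples:

# count FASTQ reads whose sequence line exceeds 50 kb (every 2nd of 4 lines is sequence)
awk 'NR % 4 == 2 && length($0) > 50000 { n++ } END { print n+0, "reads longer than 50 kb" }' full_length.fq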

Best,
Kristoffer

@ksahlin
Owner

ksahlin commented Jan 1, 2024

Additionally, what parameters do you run isONclust with? Parameters such as --k and --w can affect runtime and memory usage significantly.
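For example, one could experiment with explicit values along these lines (the numbers are purely illustrative; --ont already sets ONT-appropriate defaults):

# larger --w means fewer minimizers are sampled, which lowers memory at some cost in sensitivity
isONclust --t 8 --k 13 --w 25 --fastq full_length.fq --outfolder clustering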

@eprdz
Author

eprdz commented Jan 2, 2024

Hi, first of all, thank you for your feedback; I was not able to reply sooner.

I ran pychopper and those extremely long reads were still in the dataset, so I removed them manually and everything went well.

Moreover, to run isONclust I used the full_pipeline.sh script from the isONform repository with the full option, so I think the following commands were executed:

/usr/bin/time -v isONclust --t $num_cores --ont --fastq $outfolder/full_length.fq \
    --outfolder $outfolder/clustering
/usr/bin/time -v isONclust write_fastq --N $iso_abundance --clusters $outfolder/clustering/final_clusters.tsv \
    --fastq $outfolder/full_length.fq --outfolder $outfolder/clustering/fastq_files

Thanks again for your help.

@ksahlin
Owner

ksahlin commented Jan 2, 2024

/usr/bin/time -v isONclust --t $num_cores --ont

I see; then my first answer stands. My second answer was in reference to this comment in the isONform repo: aljpetri/isONform#16 (comment).

@eprdz
Author

eprdz commented Feb 15, 2024

Hi,
Sorry to open this issue up again; I have a question regarding it.

As I said last time, I implemented an in-house filtering step before isONclust to remove reads longer than 5 kb, as I have seen that datasets with reads longer than that are very time- and memory-consuming. Nevertheless, some of these reads are not artifacts, and I want to use them with isONcorrect and isONform. Do you know if there is a way to "rescue" those reads in isONcorrect and isONform?
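For context, my filtering step is essentially the following sketch (seqkit and the file names are just what I happen to use; the long reads are set aside rather than discarded, which is why I would like to reuse them):

# split at 5 kb: short reads go to clustering, long reads are stashed for later
seqkit seq -M 5000 full_length.fq > short_reads.fq
seqkit seq -m 5001 full_length.fq > long_reads_setaside.fq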

Thank you again!

@ksahlin
Owner

ksahlin commented Feb 15, 2024

Hi again!

Two new answers:

  1. @aljpetri is close to finishing a better algorithm, so I will make it simple for myself and say that replacing isONclust with the new implementation (hopefully ready in a month) is the long-term solution.
  2. If you don't want to wait for 1, you can try using fewer cores (though I don't know how many you used). With half the cores, isONclust's memory usage drops by roughly 2x, but the runtime is not twice as long (due to non-trivial parallelisation), so this could be an option; see the sketch below.
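A minimal sketch of option 2, reusing the pipeline's own command with a smaller --t (the thread count of 4 is just an example):

# halving the thread count roughly halves the memory peak, at a less-than-2x runtime cost
/usr/bin/time -v isONclust --t 4 --ont --fastq $outfolder/full_length.fq \
    --outfolder $outfolder/clustering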

@eprdz
Author

eprdz commented Feb 15, 2024

Understood! Thank you!
