suggestions with big data #16

Open
alexyfyf opened this issue Dec 29, 2023 · 3 comments
Comments

@alexyfyf

Hi Alex,

I found that your tool generates a lot of intermediate files (including those from isONclust and isONcorrect), which quickly uses up my inodes.
Do you have any suggestions on how to alleviate this for big datasets?
Would increasing (or decreasing) --max_seqs or --max_seqs_to_spoa help?

Thank you so much.
Cheers,

@alexyfyf
Author

Also, I noticed that your pipeline runs isONclust with --k 8 --w 9 rather than the default --k 13 --w 20 for ONT data, which slows down the clustering step a lot. Is there a reason for choosing those values?

@aljpetri
Owner

aljpetri commented Jan 9, 2024

Hi, thank you very much again for reporting your findings.

> Also, I noticed that your pipeline runs isONclust with --k 8 --w 9 rather than the default --k 13 --w 20 for ONT data, which slows down the clustering step a lot. Is there a reason for choosing those values?

I have fixed this in commit 2f40387 and also renamed the run_mode from analysis to ont to make it clearer what the mode is used for. We used those k and w values in our analyses to limit any possible impact of isONclust on the final results, but they are not recommended for ONT data sets.

> Any suggestions on how to alleviate this for big datasets?

If you are referring to the number of clusters (isONclust and isONcorrect), one thing you could try is setting a higher value for iso_abundance when running the pipeline. This requires more reads for a cluster to be formed (in isONclust and isONcorrect) and more reads supporting an isoform for it to be called, which should reduce the number of clusters. The trade-off is that some isoforms with very low read support might not be called. If this is not what you meant, could you explain a bit more?
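
To illustrate the trade-off, here is a toy sketch of how a minimum-read threshold shrinks the number of clusters. It is not isONform's actual code: the function name `filter_clusters`, the `>=` comparison, and the cluster sizes are all made up for illustration, and how iso_abundance is applied internally is an assumption.

```python
def filter_clusters(cluster_sizes, min_reads):
    """Keep only clusters that contain at least `min_reads` reads.

    Hypothetical helper for illustration; not part of isONform.
    """
    return {cid: n for cid, n in cluster_sizes.items() if n >= min_reads}


# Made-up cluster sizes (cluster id -> number of reads), for illustration only.
sizes = {"c1": 120, "c2": 7, "c3": 3, "c4": 1, "c5": 2}

# A higher threshold keeps fewer clusters, so fewer per-cluster files are
# written, but low-support clusters (and their isoforms) are lost.
for threshold in (1, 3, 5):
    kept = filter_clusters(sizes, threshold)
    print(f"threshold={threshold}: {len(kept)} clusters kept -> {sorted(kept)}")
```

Since each cluster typically gets its own intermediate files, fewer clusters should also mean fewer files on disk.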
Best,
Alex

@alexyfyf
Author

alexyfyf commented Feb 1, 2024

Hi, sorry for the late reply, and thanks for your suggestions.
What if I already have a lot of clusters? When I run isONform_parallel.py, are there any parameters that can improve speed and I/O?
My issue is that isONform_parallel.py generates too many temporary files, which quickly uses up my inodes. I would like suggestions on how to (1) reduce the number of temporary files generated and (2) speed up isONform_parallel.py.
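
One way to pin down which step creates the most files is to count files per top-level subdirectory of the output folder. The following is a minimal sketch using only the Python standard library; the function name `count_files_per_dir` is made up, and the output directory layout is an assumption, not something the pipeline guarantees.

```python
import os
from collections import Counter


def count_files_per_dir(root):
    """Count files beneath each immediate subdirectory of `root`.

    Files located directly in `root` are counted under the key ".".
    Directories themselves also consume inodes but are not counted here.
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Attribute every file to the first path component below `root`.
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(filenames)
    return counts
```

Calling `count_files_per_dir("path/to/output").most_common(10)` (with the path replaced by the real output folder) lists the ten subdirectories holding the most files, which should show whether the isONclust, isONcorrect, or isONform step dominates the inode usage.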

Cheers,
Alex
