majority clusters after correction was not included in isonform output #18
Comments
Dear Alex,
- How did you simulate the data? Is the simulation tool capable of generating end-to-end (full-length) transcripts?
- How many transcripts (and possibly clusters) would you expect to be in the data set?
- Which parameters did you run the isONpipeline with?
- It is unexpected from our side that the correction step outputs fewer clusters than the clustering step did. Were there any errors when running the data?
- For isONform it may actually happen that empty clusterxxx_merged.fa files are produced: if no transcript in a cluster gets the necessary support, no transcript is called from that cluster.
Thank you so much for your reply. Do you think it might be helpful if I send you the log?
Cheers,
It looks like trans-nanosim and RNAbloom2 (same author group) are using 11% error rates (typical of ONT dRNA). ONT cDNA error rates have typically been around 6-7% in recent years (isONpipeline is designed for, and uses parameters suited to, cDNA at around 7%). If reads have 11% errors on average, isONcorrect may deliver a post-correction error rate of around 2-4% (see fig from isONcorrect paper below, panel B). It is possible that this affects isONform's ability to trace out isoforms with the default parameters. This is just a guess, applicable only if your reads have 10-11% errors. I guess the only way to verify it is a simulation with typical cDNA error rates.
Thank you, Kristoffer, for your insights. Just wondering: is there any parameter I can change to make isonform work with this high error rate? I will also run it on some recent real data soon.
It looks like isONclust is doing a decent job, judging by the number of genes (12,000) and clusters (~11,500).
Thank you @aljpetri.
Yes, some low-abundance clusters (e.g. singleton reads) are typically expected due to the higher error rate. Still, it seems reassuring that 11,496 clusters made the cutoff, given the 12,000 genes in the simulation. While this gives no evidence of clustering quality, it is reassuring that those numbers are in the same ballpark.
I think this is a good first fix. We are guessing the decay is due to the default parameters for isONcorrect, which are tuned for 6-7% error rates. Only in a second attempt would I try changing isONform parameters. Note that isONcorrect will take significantly longer with
Hi @aljpetri,
I'm trying to run your tool on a simulated ONT data set with about 6 million reads. I used the new pipeline script, and it ran without error.
But when I checked the results, I found 29583 clusters after the clustering step and 11498 after correction (I've checked, and I think the smaller number is because low-abundance clusters with fewer than 5 reads are removed). About 3.9 million reads remain after correction.
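One way to check the guess above (that the drop from 29583 to 11498 clusters comes from removing clusters with fewer than 5 reads) is to count cluster sizes directly from the clustering output. This is only a minimal sketch. It assumes the clustering step wrote a two-column, tab-separated file of cluster ID and read ID. The file name `final_clusters.tsv`, the threshold of 5 (taken from the observation above), and the helper name `count_clusters_by_size` are assumptions for illustration, not confirmed pipeline behaviour.

```python
from collections import Counter


def count_clusters_by_size(lines, min_reads=5):
    """Return (total_clusters, clusters_with_at_least_min_reads).

    `lines` is any iterable of "cluster_id<TAB>read_id" strings,
    e.g. an open file handle on the clustering output (assumed format).
    """
    sizes = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cluster_id = line.split("\t", 1)[0]
        sizes[cluster_id] += 1
    kept = sum(1 for n in sizes.values() if n >= min_reads)
    return len(sizes), kept


# Tiny in-memory example: cluster "0" has 5 reads, cluster "1" has 2.
example = [f"0\tread{i}" for i in range(5)] + ["1\tread5", "1\tread6"]
print(count_clusters_by_size(example))  # (2, 1)

# On real output (path is an assumption):
# with open("final_clusters.tsv") as fh:
#     total, kept = count_clusters_by_size(fh)
```

If the second number comes out close to 11498, the size filter would explain the drop in cluster count between the two steps.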
But the final isonform results contain only 2700 transcripts, and the total number of reads supporting these 2700 transcripts is only about 46000.
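To put these figures side by side, here is a short calculation of how much survives each step. It uses only the approximate numbers quoted in this post; the variable names and the `percent` helper are just for illustration.

```python
# Figures quoted in this report (read counts are approximate).
reads_input = 6_000_000        # simulated reads
reads_corrected = 3_900_000    # reads left after correction
reads_in_transcripts = 46_000  # reads supporting the final transcripts

clusters_corrected = 11_498    # clusters after correction
transcripts_final = 2_700      # transcripts in the isonform output


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


print(percent(reads_corrected, reads_input))           # 65.0
print(percent(reads_in_transcripts, reads_corrected))  # 1.2
print(percent(transcripts_final, clusters_corrected))  # 23.5
```

So roughly 65% of the reads pass correction, but only about 1.2% of the corrected reads end up supporting a reported transcript, which is where most of the loss happens.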
I think that is not normal. Do you have any idea where to look for the reason? Any chance there is a bug? I can see the output folder of isonform contains the same number of clusterxxx_merged.fa and clusterxxx_merged.fa as the cluster number in correction output. But some of the files are empty.
Thank you so much.
Alex
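To see how many clusters produced no transcript at all, the empty and non-empty output files can be counted. A minimal sketch follows. The glob pattern `cluster*_merged.fa` is inferred from the file names mentioned above, and the helper name `summarize_cluster_files` and the demo folder contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def summarize_cluster_files(folder, pattern="cluster*_merged.fa"):
    """Count empty and non-empty files matching `pattern` in `folder`."""
    empty, non_empty = 0, 0
    for path in Path(folder).glob(pattern):
        if path.stat().st_size == 0:
            empty += 1  # no transcript was written for this cluster
        else:
            non_empty += 1
    return {"empty": empty, "non_empty": non_empty}


# Demo on a throwaway folder with one populated and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "cluster0_merged.fa").write_text(">t0\nACGT\n")
    (Path(tmp) / "cluster1_merged.fa").write_text("")
    print(summarize_cluster_files(tmp))  # {'empty': 1, 'non_empty': 1}
```

Pointing the function at the real isonform output folder would show what share of the corrected clusters yielded no transcript.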