-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should ambiguity threshold scale with non-singleton group size? #695
Comments
Hi @miwolff, Thanks for pointing this out, indeed the calculation is different between the current code and the paper. Have you noticed a large difference in call rate and accuracy with and without scaling for group size? Before deciding whether to change this I'd like to do some experiments to see how much it effects the frequencies. I'm at nanopore's NCM meeting for the rest of the week so it may take me some time to resolve this issue. Jared |
Hi @jts, I've looked at the log likelihood distribution over all sites of two S. cerevisiae test samples (completely methylated after cleaning the DNA and not methylated at all). Groups show a much broader distribution: If one rescales the log likelihood with 1/group size, effectively setting the threshold to 2.5*groupsize, the distributions become much more similar. I saw the same behaviour for samples with around 30% methylation. Rescaling the threshold will have some impact on the ambiguous fraction, depending on how many groups there are. In our case the average group size is 1.34. I think the real question is, whether the Nanopolish algorithm actually can/should give you lower ambiguity for groups compared to singleton sites. In a randomly methylated sample, I would expect groups to be ambiguous more often, since the CG's inside the group might not all have the same methylation status. Best, |
hi @jts |
Hi, Sorry for letting this sit so long. I had a bit of time to run some tests today using my standard comparison (nanopolish methylation frequencies vs bisulfite for NA12878 chromsome 20). Current version (not scaling by groups): Group scaling: The correlation with bisulfite is a tiny bit lower, but still quite good. The average coverage drops a little from 15.4x to 13.5x, since group scaling is more stringent. I also tested group scaling with a lower log-likelihood ratio threshold of 2.0. In other discussions I've had the default of 2.5 may be too strict so I've been considering lowering this threshold anyway. With group-scaling and LLR2, we have similar coverage to the current version of nanopolish. Right now I am leaning towards lowering the default LLR, and scaling it by group size. Thoughts anyone? Jared |
Hi Jared, |
I have implemented the suggested changes in the above branch, which I will merge after a bit of testing. |
Hi Jared, |
In naturally occurring DNA (for human at least) the methylation states of nearby CGs are correlated, so all-methylated or all-unmethylated are more likely to occur than you'd expect by chance, if the CGs were independent. That said, the ability to train models with mixed states is the right solution here. |
This has been merged into v0.12. Thanks for the report, tests and discussion leading to this change. |
Hi Jared,
our group has been using Nanopolish quite heavily, it really is a great tool.
I'm not sure if this is a bug, but I've found a discrepancy in the python script calculate_methylation_frequency.py compared to the description in the Supplemental Material of Simpson et al. 2017.
The script treats non-singleton sites the same way as singleton sites, i.e. a site will be ambiguous if the absolute value of the log likelihood ratio is below the threshold of 2.5. In section 1.3.3 of the Supplement, the threshold is additionally multiplied with the number of CGs within the group. I'm not sure if a similar discrepancy exists in other scripts in the repo.
I've already wondered some time ago, why the analysis shows that non-singleton sites have less ambiguity (using the method from the script). When looking at the log likelihood distributions for different group sizes, larger groups have broader distributions, but they agree much better after a rescaling with 1/group size.
Can you clarify this for us?
Best regards,
Michael
The text was updated successfully, but these errors were encountered: