Should ambiguity threshold scale with non-singleton group size? #695

miwolff · 2019-12-04T12:14:52Z

Hi Jared,

our group has been using Nanopolish quite heavily, it really is a great tool.

I'm not sure if this is a bug, but I've found a discrepancy in the python script calculate_methylation_frequency.py compared to the description in the Supplemental Material of Simpson et al. 2017.

The script treats non-singleton sites the same way as singleton sites, i.e. a site will be ambiguous if the absolute value of the log likelihood ratio is below the threshold of 2.5. In section 1.3.3 of the Supplement, the threshold is additionally multiplied with the number of CGs within the group. I'm not sure if a similar discrepancy exists in other scripts in the repo.

I've already wondered some time ago, why the analysis shows that non-singleton sites have less ambiguity (using the method from the script). When looking at the log likelihood distributions for different group sizes, larger groups have broader distributions, but they agree much better after a rescaling with 1/group size.

Can you clarify this for us?

Best regards,
Michael

jts · 2019-12-04T13:14:57Z

Hi @miwolff,

Thanks for pointing this out, indeed the calculation is different between the current code and the paper. Have you noticed a large difference in call rate and accuracy with and without scaling for group size?

Before deciding whether to change this I'd like to do some experiments to see how much it effects the frequencies. I'm at nanopore's NCM meeting for the rest of the week so it may take me some time to resolve this issue.

Jared

miwolff · 2019-12-17T11:23:35Z

Hi @jts,

I've looked at the log likelihood distribution over all sites of two S. cerevisiae test samples (completely methylated after cleaning the DNA and not methylated at all).

Groups show a much broader distribution:

If one rescales the log likelihood with 1/group size, effectively setting the threshold to 2.5*groupsize, the distributions become much more similar.

I saw the same behaviour for samples with around 30% methylation.

Rescaling the threshold will have some impact on the ambiguous fraction, depending on how many groups there are. In our case the average group size is 1.34.

I think the real question is, whether the Nanopolish algorithm actually can/should give you lower ambiguity for groups compared to singleton sites. In a randomly methylated sample, I would expect groups to be ambiguous more often, since the CG's inside the group might not all have the same methylation status.

Best,
Michael

daanverhagen · 2020-02-18T14:21:50Z

hi @jts
I am also interested in this issue. would be great if you could look into it.
best,
Daan

jts · 2020-02-18T20:26:19Z

Hi,

Sorry for letting this sit so long. I had a bit of time to run some tests today using my standard comparison (nanopolish methylation frequencies vs bisulfite for NA12878 chromsome 20).

Current version (not scaling by groups):

Group scaling:

The correlation with bisulfite is a tiny bit lower, but still quite good. The average coverage drops a little from 15.4x to 13.5x, since group scaling is more stringent.

I also tested group scaling with a lower log-likelihood ratio threshold of 2.0. In other discussions I've had the default of 2.5 may be too strict so I've been considering lowering this threshold anyway.

With group-scaling and LLR2, we have similar coverage to the current version of nanopolish.

Right now I am leaning towards lowering the default LLR, and scaling it by group size. Thoughts anyone?

Jared

miwolff · 2020-02-19T12:40:05Z

Hi Jared,
thanks your answer. I'm currently looking into the error rates for singleton sites and groups (with and without rescaling) for different thresholds for the two test sets. I'll post again when I'm done...
I think trying to get the error rates of singleton sites and groups similar to each other would be a good goal.
Best,
Michael

jts · 2020-02-19T14:44:06Z

I have implemented the suggested changes in the above branch, which I will merge after a bit of testing.

miwolff · 2020-02-19T15:51:48Z

Hi Jared,
alright, sounds good.
Regarding the error rates of groups vs singleton sites:
Even without rescaling, error rates of groups are lower than for singletons in our test data. This difference will increase with rescaling, but I think now that my idea to try to get the error rates similar is flawed, since error rates for groups are tricky; they only make since in the context of test samples that are not methylated at all or are completely methylated, but in real samples with half-methylated groups it gets more complicated.
I think rescaling makes sense in the way that it leads to a similar fraction of ambiguous sites for each group size in a real sample.
Groups showing less ambiguity in the case without rescaling seems strange, since any half-methylated groups should rather increase the ambiguous fraction.

jts · 2020-02-19T16:05:20Z

In naturally occurring DNA (for human at least) the methylation states of nearby CGs are correlated, so all-methylated or all-unmethylated are more likely to occur than you'd expect by chance, if the CGs were independent. That said, the ability to train models with mixed states is the right solution here.

jts · 2020-03-03T16:08:35Z

This has been merged into v0.12. Thanks for the report, tests and discussion leading to this change.

jts added a commit that referenced this issue Feb 19, 2020

change default LLR to 2, scale threshold by group size (#695)

ba96719

jts closed this as completed Mar 3, 2020

hasindu2008 mentioned this issue Apr 18, 2023

recommended threshold for the log likelihood ratio hasindu2008/f5c#129

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should ambiguity threshold scale with non-singleton group size? #695

Should ambiguity threshold scale with non-singleton group size? #695

miwolff commented Dec 4, 2019 •

edited

Loading

jts commented Dec 4, 2019

miwolff commented Dec 17, 2019

daanverhagen commented Feb 18, 2020

jts commented Feb 18, 2020

miwolff commented Feb 19, 2020

jts commented Feb 19, 2020

miwolff commented Feb 19, 2020

jts commented Feb 19, 2020

jts commented Mar 3, 2020

Should ambiguity threshold scale with non-singleton group size? #695

Should ambiguity threshold scale with non-singleton group size? #695

Comments

miwolff commented Dec 4, 2019 • edited Loading

jts commented Dec 4, 2019

miwolff commented Dec 17, 2019

daanverhagen commented Feb 18, 2020

jts commented Feb 18, 2020

miwolff commented Feb 19, 2020

jts commented Feb 19, 2020

miwolff commented Feb 19, 2020

jts commented Feb 19, 2020

jts commented Mar 3, 2020

miwolff commented Dec 4, 2019 •

edited

Loading