Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should ambiguity threshold scale with non-singleton group size? #695

Closed
miwolff opened this issue Dec 4, 2019 · 9 comments
Closed

Should ambiguity threshold scale with non-singleton group size? #695

miwolff opened this issue Dec 4, 2019 · 9 comments

Comments

@miwolff
Copy link

miwolff commented Dec 4, 2019

Hi Jared,

our group has been using Nanopolish quite heavily, it really is a great tool.

I'm not sure if this is a bug, but I've found a discrepancy in the python script calculate_methylation_frequency.py compared to the description in the Supplemental Material of Simpson et al. 2017.

The script treats non-singleton sites the same way as singleton sites, i.e. a site will be ambiguous if the absolute value of the log likelihood ratio is below the threshold of 2.5. In section 1.3.3 of the Supplement, the threshold is additionally multiplied with the number of CGs within the group. I'm not sure if a similar discrepancy exists in other scripts in the repo.

I've already wondered some time ago, why the analysis shows that non-singleton sites have less ambiguity (using the method from the script). When looking at the log likelihood distributions for different group sizes, larger groups have broader distributions, but they agree much better after a rescaling with 1/group size.

Can you clarify this for us?

Best regards,
Michael

@jts
Copy link
Owner

jts commented Dec 4, 2019

Hi @miwolff,

Thanks for pointing this out, indeed the calculation is different between the current code and the paper. Have you noticed a large difference in call rate and accuracy with and without scaling for group size?

Before deciding whether to change this I'd like to do some experiments to see how much it effects the frequencies. I'm at nanopore's NCM meeting for the rest of the week so it may take me some time to resolve this issue.

Jared

@miwolff
Copy link
Author

miwolff commented Dec 17, 2019

Hi @jts,

I've looked at the log likelihood distribution over all sites of two S. cerevisiae test samples (completely methylated after cleaning the DNA and not methylated at all).

Groups show a much broader distribution:
log_likelihood_ratio_distribution_not_scaled_with_group_size

If one rescales the log likelihood with 1/group size, effectively setting the threshold to 2.5*groupsize, the distributions become much more similar.
log_likelihood_ratio_distribution_scaled_with_group_size

I saw the same behaviour for samples with around 30% methylation.

Rescaling the threshold will have some impact on the ambiguous fraction, depending on how many groups there are. In our case the average group size is 1.34.

I think the real question is, whether the Nanopolish algorithm actually can/should give you lower ambiguity for groups compared to singleton sites. In a randomly methylated sample, I would expect groups to be ambiguous more often, since the CG's inside the group might not all have the same methylation status.

Best,
Michael

@daanverhagen
Copy link

hi @jts
I am also interested in this issue. would be great if you could look into it.
best,
Daan

@jts
Copy link
Owner

jts commented Feb 18, 2020

Hi,

Sorry for letting this sit so long. I had a bit of time to run some tests today using my standard comparison (nanopolish methylation frequencies vs bisulfite for NA12878 chromsome 20).

Current version (not scaling by groups):

baseline

Group scaling:

group_scale

The correlation with bisulfite is a tiny bit lower, but still quite good. The average coverage drops a little from 15.4x to 13.5x, since group scaling is more stringent.

I also tested group scaling with a lower log-likelihood ratio threshold of 2.0. In other discussions I've had the default of 2.5 may be too strict so I've been considering lowering this threshold anyway.

group_scale llr2

With group-scaling and LLR2, we have similar coverage to the current version of nanopolish.

Right now I am leaning towards lowering the default LLR, and scaling it by group size. Thoughts anyone?

Jared

@miwolff
Copy link
Author

miwolff commented Feb 19, 2020

Hi Jared,
thanks your answer. I'm currently looking into the error rates for singleton sites and groups (with and without rescaling) for different thresholds for the two test sets. I'll post again when I'm done...
I think trying to get the error rates of singleton sites and groups similar to each other would be a good goal.
Best,
Michael

@jts
Copy link
Owner

jts commented Feb 19, 2020

I have implemented the suggested changes in the above branch, which I will merge after a bit of testing.

@miwolff
Copy link
Author

miwolff commented Feb 19, 2020

Hi Jared,
alright, sounds good.
Regarding the error rates of groups vs singleton sites:
Even without rescaling, error rates of groups are lower than for singletons in our test data. This difference will increase with rescaling, but I think now that my idea to try to get the error rates similar is flawed, since error rates for groups are tricky; they only make since in the context of test samples that are not methylated at all or are completely methylated, but in real samples with half-methylated groups it gets more complicated.
I think rescaling makes sense in the way that it leads to a similar fraction of ambiguous sites for each group size in a real sample.
Groups showing less ambiguity in the case without rescaling seems strange, since any half-methylated groups should rather increase the ambiguous fraction.

@jts
Copy link
Owner

jts commented Feb 19, 2020

In naturally occurring DNA (for human at least) the methylation states of nearby CGs are correlated, so all-methylated or all-unmethylated are more likely to occur than you'd expect by chance, if the CGs were independent. That said, the ability to train models with mixed states is the right solution here.

@jts
Copy link
Owner

jts commented Mar 3, 2020

This has been merged into v0.12. Thanks for the report, tests and discussion leading to this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants