modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247

benbfly · 2022-02-08T13:42:47Z

Megaldon/2.4.2
modbam2bed/0.4.0 (https://github.com/epi2me-labs/modbam2bed)
samtools/1.14

I am running Megalodon in CG remora mode ("--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 --guppy-config dna_r9.4.1_450bps_hac.cfg").

I wanted to compare the results in the modified_bases.5mC.bed file, with the results of running the ONT modbam2bed utility on the mod_mappings.bam file. In theory, I thought that they should be very similar or identical. However, in my testing so far, they produce very different results.

I did a small test on 140,000 reads, and the following is the number of CpGs in the output:
CpGs covered by modbam2bed and modified_based: 2,255
CpGs covered by modbam2bed only: 1,033
CpGs covered by modified_bases only: 1,799

The bases in only one file were not on the opposite strand or off by 1 position or anything. There were not close to a CpG covered in the other file. I get similar results when doing much larger datasets. The only processing I had to do was a "samtools sort" and "samtools index" on the mod_mappings.bam file in order for it to work with modbam2bed. Also, we are sequencing very short reads (~167bp , cfDNA), which is probably not the typical use case.

This seems like strange behavior. I will look into the differences and see if I can figure out what's going on, but I thought I'd ask the experts. @marcus1487 @cjw85 .

I could always be doing something wrong, but I've checked pretty carefully and I don't think so.

benbfly · 2022-02-08T13:54:16Z

I just read issue #246 and realized this is likely because the mod_mappings.bam output is reference-anchored and the modified_bases.bed file is basecall-anchored.

From my understanding of your comments in issue #246 , the reference-anchored generally works better? In our data (we are using R9.4 / lsk109), the reference-anchored output generally covers less CpGs (as shown above). But if it's more accurate, it's still preferable.

I'll let you confirm before closing issue.

Thanks!
Ben.

benbfly · 2022-02-09T07:27:36Z

I think it would be a good idea to either discontinue or put a big warning for the creation of the basecall-anchored modified_bases.5mC.bed file. It is the only simple bed file that can be output directly from the megalodon command line, and people (like me) will tend to just grab it assuming it is the right output without understanding that it's inferior. I know this may eventually be rolled into guppy, but in the meantime people will be using megalodon.

thanks for the amazing tools!!!!

marcus1487 · 2022-02-09T14:36:34Z

The modified_bases.5mC.bed file is reference anchored and therefore should be this highest quality. These files should not be disagreeing nearly this much and I've not seen this level of disagreement before. I am working on reproducing this behavior internally and will report back.

benbfly · 2022-02-10T04:32:09Z

I just noticed the discussion in #206 which noted that the bed methyl file in megalodon uses default cutoffs of 0.2 and 0.8. I believe modmap2bed uses defaults of 0.333 and 0.667. I don't think this would give the large discrepancy that we are seeing, but I will check that.

benbfly · 2022-02-10T17:11:21Z

Just to follow up on this. It is not due to default cutoffs for canonical vs. modified. modbam2bed outputs every CG covered by a read, regardless of modification status. So it has to do with what each of the two programs think is a reference CG to output. I gave one example of a covered reference CG output by modbam2bed but not output by modified_bases.5mC.bed (chr1:596372), in issue #249. For some reason it does not have an Mm/Ml tag for this position.

I am still investigating the converse case (output in modified_bases.5mC.bed but not modbam2bed).

marcus1487 · 2022-02-10T17:14:01Z

The second case is certainly more confusing to me as well. A browser shot of the mappings.bam, mod_mappings.bam at one of these sites would be very helpful.

benbfly · 2022-02-10T17:42:18Z

Here is an example of the second case (output in modified_bases.5mC.bed but not modbam2bed).

This read from chr16:19113482-19113641 has multiple CpGs in modified_bases.5mC.bed, but not a single one in the output of modbam2bed. One of the CpGs for instance is at position 19113602.

Since this includes multiple CpGs, my best guess is that modbam2bed uses a blacklist or something, and Megalodon bed generator does not.

cjw85 · 2022-02-10T17:50:14Z

@benbfly Are you running modbam2bed with the --cpg option? This will only output lines to the bed where the reference is sequence is CG (or GC to take into account the second strand). Since the sequences above from the BAM represent read sequences, they will not necessarily lead to a line in the BED file if the reference sequence is not also CG at that point. If you are not running with --cpg then the BED will contain a line for every single reference position (regardless of reference base).

benbfly · 2022-02-10T18:07:51Z

I am running with --cpg option. But these are definitely reference CGs in the genome I am using (hg38). I believe the Megalodon modification BAM actually shows the reference sequence rather than the read sequence. But in the screenshot I grab it from a fresh version of hg38.

I don't think there's any read filtering that could cause this, since I think Megalodon sets the mapping quality to 40 for all reads and does not set any SAM flags (as you can see above). So my only thought is a blacklist of some kind, or a bug. I see a large number like this in my data.

One thing that might be different from my data than your test datasets. We are doing cfDNA so these reads are very short. Maybe you have a read length filter or something?

cjw85 · 2022-02-10T18:09:32Z

Can you provide an extra of the BAM file you are processing with modbam2bed and I will take a look.

benbfly · 2022-02-10T18:26:33Z

Attached...
modbam2bed_berman.tar.gz

cjw85 · 2022-02-10T20:17:52Z

I cannot repoduce the lack of reporting sites:

$ ./modbam2bed -a 0.333 -b 0.667 ~/data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.gz --cpg --region=chr16:19113400-19113700 -e modbam2bed_berman/mod_mappings.sorted.bam
Analysing: 5-methylcytosine (5mC, C>m)
Processing: chr16:19113399-19113700
chr16	19113481	19113482	5mC	1000	+	19113481	19113482	0,0,0	1	100.00	0	1	0
chr16	19113493	19113494	5mC	1000	+	19113493	19113494	0,0,0	1	100.00	0	1	0
chr16	19113545	19113546	5mC	1000	+	19113545	19113546	0,0,0	1	100.00	0	1	0
chr16	19113602	19113603	5mC	1000	+	19113602	19113603	0,0,0	1	100.00	0	1	0
chr16	19113625	19113626	5mC	1000	+	19113625	19113626	0,0,0	1	0.00	1	0	0
100 %
Total time: 0.421185s

IGV showing methylated bases

I note in your screenshot above that you have a warning that the index file is older than the fasta file. Is it the correct index corresponding to the fasta?

cjw85 · 2022-02-10T20:19:14Z

And the one site that isn't methylated:

benbfly · 2022-02-10T20:38:33Z

Ok I don't have time now, but I'll look at my reference genome. We use it for everything , but maybe it's corrupted somehow.

benbfly · 2022-02-10T20:43:51Z

could it have to do with lower case vs. upper case letters in reference fasta (repeatmasker)? See my fasta pull above, all the cgs are lower case. I'm not at my computer anymore, but I could look at this later.

I am using v0.4.0 of modbam2bed

cjw85 · 2022-02-10T20:49:48Z

Ah yes, I believe you are correct. The check for a CpG site is simply to check the reference character is 'C' (or 'G' for reverse strand).

I don't know what I would call the "correct" behaviour here as I believe the intention of writing lowercase bases to a reference is so programs can easily perform this sort of filtering. In your case I suppose you haven't masked the reference for the purpose of masking methylation calls. Pragmatically I will change the default to report bother upper and lowercase reference sites, with an option to ignore lowercase.

benbfly · 2022-02-10T21:11:08Z

we just use the fasta created by UCSC , and they do lower case for repeat masked regions. We certainly don't want to filter those out, and I think very few people would want to filter them out of the bed file (especially with long reads where repeats are even less of an issue).

I think this accounts for all of the discrepancy I was seeing with the megalodon created bed file (with the --edge-buffer and --mod-min-prob from issue #249 accounting for the ones in the other direction)

if you make a new release , I can compare the two on my dataset to make sure they agree.

here is where you can download the UCSC fasta for testing, if you want: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

cjw85 · 2022-02-11T18:22:58Z

I've implemented allow reference bases to be lowercase in v0.4.4 of modbam2bed. This is available now.

benbfly · 2022-02-14T11:41:24Z

@cjw85 I downloaded v0.4.4 and it did not fix this problem. I'm checking into it now , maybe I didn't build it correctly. But please keep this issue open until I can check. Thanks - Ben.

cjw85 · 2022-02-14T11:58:19Z

I'll have a look

cjw85 · 2022-02-14T12:10:34Z

Apoligies @benbfly, I see the issue.

cjw85 · 2022-02-14T12:31:50Z

@benbfly Could you try v0.4.5?

benbfly · 2022-02-15T11:09:42Z

v0.4.5 works!

Now, every CpG covered by modified_bases.5mC.bed is covered by modbam2bed -a 0.2 -b 0.8. So the upper case lower case problem is fixed.

However, there are still a lot of CpGs that are covered by modbam2bed -a 0.2 -b 0.8 that not in modified_bases.5mC.bed , even when I run megalodon with --edge-buffer 0 --mod-min-prob 0. So maybe there are still some filters that megalodon bed is filtering out, that are not being filtered out by modbam2bed. I'm not sure yet if these have Mm/Ml tags, I will check and report back @marcus1487 . So I think keep this issue open until I can do that.

benbfly · 2022-02-15T12:00:44Z

I see what this difference in behavior is - it is something I noticed before but forgot. It's the fact that modbam2bed outputs the CpGs even if they have 0 reads with "confident" modification calls (in this case modification prob<0.2 or >0.8). In the nomenclature of modbam2bed, these are CpGs with 1 or more "filtered" reads, but no "canonical" or "modified" reads.

Megalodon modified_bases.5mC.bed does not output these at all, which to me is the more reasonable behavior.

Modbam2bed outputs these with percent modified value of 0 (column 11) . This is not good behavior in my opinion, and I would consider it a bug for modbam2bed. The modbam2bed formula for percent modified is Nmod / (Nmod + Ncanon), and in this case the denominator is 0 so it should be undefined. Outputting these as 0 percent modified is incorrect and can lead to inaccurate analyses. It would be better for these to be omitted or with percent modified set to "." (the bed file convention for NA).

I do see that the ones I'm talking about could be easily removed by the user by filtering on column 5==0. But I just think many people will just use the default behavior and treat these as 0% modification.

I think this accounts for all the differences between modified_bases.5mC.bed and modbam2bed (except for the missing Mm/Ml tag issue from #249 ). Probably this was not a big deal with some of your test data, since it usually only makes a difference when you have a single low quality read covering the CpG. If you have multiple reads covering, it usually won't make a difference. We are doing very low depth sequencing (0.3x), so most of our CpGs are only covered by a single read. Thus we get a very large number of these CpGs where Nmod+Ncanon is 0.

cjw85 · 2022-02-15T13:41:01Z

You are correct, modbam2bed calculates column 11 as (counts.c#L160):

meth = tot == 0 ? 0 : (100.0f * md) / tot;

so the description is not completely accurate. In such cases the column 5 will also be zero representing complete confusion in the value in column 11; in that sense the reported numbers are self-consistent, and the value of column 5 must be taken into account in the evaluation of column 11.

That the interpretation of column 11 is contigent on column 5 is the case in other circumstances also, its just that such facts are often ignored. For example consider the case that 90% of reads contain a substitution to a C>T, is it really correct to conclude from a 100% value in column 11 that there is 100% methylation in the sample? No it is not, however much the casual user wants column 11 to mean that. This is why the definitions in columns 5 and 10 were chosen the way they were; the "specification" referenced in the modbam2bed documentation makes a false dichotomy between modified and unmodified. The modbam2bed avoids making this erroneous assumption, which makes the output messier but rather more truthful. It is also why the extended mode reporting the verbatim counts was added so users can derived whatever summary statistics they wish from the counts (one thing that the extended output misses is that only the sum N_mod + N_del can be derived not the individual components).

Note also there's some connection here to the earlier discussion that within the htslib Mm tag specification where missing data is assume unmodified.

benbfly · 2022-02-15T14:11:59Z

What would be the downside of outputting . for column 11 rather than 0? One could still look at the extended columns if they wanted to get more information. I don't see a good reason to assume that a single read with a modification probability of 0.5 should be given a modification percentage of 0.

I think when the Mm tag is missing, you just have to follow the current spec - if the alphabet is set to '?' you treat missing Mm as N_filtered, if it is '.' , you treat missing as unmodified.

cjw85 · 2022-02-15T14:37:18Z

There's little downside to outputting . save for parsers needing to be aware that its possible and what its meaning is. We can look to implement this in a next version. Practically it should make no difference because as highlighted above column 11 should never be interpreted without reference to columns 5 and 10.

colindaven · 2022-03-18T11:07:42Z

I also support outputting a . instead of a 0 for a missing value, because a missing value or unknown is different to a value of 0, unmethylated. 0 is inherently wrong for positions with no read coverage.

Many users might naively just erroneously evaluate column 11.

I implemented the bedMethyl conversion in a recent pipeline only converting column 11 if by default a coverage of >=5 was reported in column 10.

cjw85 · 2022-03-18T12:26:51Z

I reread the (recent and admittedly post-hoc) specification for BED files: https://samtools.github.io/hts-specs/

It doesn't acknowledge . to mean a missing or NaN (which is really what we'd be meaning here). It only mentions . in the context of the name and strand fields. Any naive parser is going to trip up over a column containing ostensibly Float type data. I quickly surveyed the bedTools code and it seems to just leave everything stringly-typed. Similar concerns were raised in a comment when the specification was being drafted: samtools/hts-specs#570 (comment)

I've opened a issue here samtools/hts-specs#634, to get some clarification. I don't want to make a change here without knowing its going to be the correct one (and have to make another change in the future). There is a sense in which 0 is correct: the bedMethyl description states

Percentage of reads that show methylation at this position in the genome

And it is true that 0% of reads are showing methylation, though admittedly the modbam2bed definition is more precise.

m0nib · 2022-03-24T13:35:27Z

Hi,

@cjw85 I have run into the same issue as @benbfly. The generated .bed file from modbam2bed contains more CpGs as being unmethylated.

This affects our downstream analysis and prediction of sample type from trained model.

I was wondering, while you wait to make it a permanent change or decision, is there anyway I can rectify this.

I mean by specifying a command line option or other way?

Thank you
Best
Monib

benbfly · 2022-03-24T15:20:33Z

I have found that if you specify the meglodon command line parameters --mod-min-prob 0, and set the modbam2bed parameters -a 0.2 -b 0.8, and update to modbam2bed version 0.4.5 or higher, that the results are quite close. By default, modbam2bed uses a lower threshold for methylation (-a 0.333 -b 0.667), so that is part of why it outputs more CpGs. Which of these settings is preferred may be application specific. For our application we have a very small number of reads so we want more CpGs (so we use the 0.333/0.667). But in general, I think I would tend to prefer the more selective 0.2/0.8 setting.

We also use the megalodon parameter --edge-buffer 0, but that's something that is mostly relevant because we are doing short read sequencing (<200bp).

cjw85 · 2022-03-24T15:27:37Z

I'm waiting for a reply on the hts-specs for BED file as currently there is no guidance on how NaNs should be recorded in a BED file. I'll give this a couple of weeks and then make a decision. Currently my thought is that Column 11 in the modbam2bed output will be recorded as nan where the total = mod. + canon = 0 has occurred. This follows the VCF specification for such things and will allow a natural parsing of the file from code; the suggestion of using . would require user code to contain special handling whereas NaN is supported in various data analysis environments.

As it stands filtering the output file based on a the values of columns 5 and 10 is the recommended approach.

A second option is to provide an option to elide such rows from the output. This will necessarily be lossy, and you will lose information about filtered canon or mod calls (or maybe more interesting substitutions).

benbfly · 2022-03-24T16:18:16Z

@cjw85 I don't think there will be any clarity on NaNs in BED format (unfortunately, BED is not a real spec). Bedtools, the most popular BED manipulation tool, will not allow numeric operations on any column that contains a non-number including NaN or . (bedtools::isNumeric). So if you want to be as compliant as possible, you would need to remove those rows (or provide a tool to make a "compliant" version). I suggested . because the original BED spec used . for other fields with missing data, but I'm not actually aware of any tool that would parse that as NaN for a numeric field. So I think you're right that NaN would be a better option if that would allow it to be parsed correctly by other packages.

@arq5x is the BED guru and might have an idea about this.

marcus1487 · 2022-03-25T01:22:28Z

Note that 2.5.0 release sets edge buffer to 0 and min mod prob to 0 to call all covered positions be default and reduce this issue for most users.

michaelmhoffman · 2022-05-23T20:41:20Z

unfortunately, BED is not a real spec

Ouch! Probably deserved though.

I would consider any custom BED fields to have use that "may be determined by private agreement among cooperating users", as Unicode says. Without any such other agreement or specification, you can't expect that NaN will—or won't—be used in any particular way. I would do what you think is best and document it. And if there's other software you use that applies heuristics to figure out data types (like BEDTools), you can always request that they add NaN to what they consider numeric.

benbfly changed the title ~~modified_bases.5mC.bed produced by megalodon is very different from running modbam2bed on mod_mappings.bam~~ modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam Feb 8, 2022

benbfly mentioned this issue Feb 10, 2022

Why do some reference CGs not have Mm/Ml entries in mod_mappings.bam? #249

Closed

cjw85 mentioned this issue Mar 17, 2022

More sequence contexts for modified base calls (feature request) epi2me-labs/modbam2bed#9

Closed

benbfly closed this as completed Mar 24, 2022

benbfly reopened this Mar 24, 2022

marcus1487 closed this as completed Mar 25, 2022

cjw85 mentioned this issue Jan 19, 2023

the coverage we really want is N_mod+N_canon epi2me-labs/modbam2bed#11

Closed

modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247

modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247

Comments

benbfly commented Feb 8, 2022

benbfly commented Feb 8, 2022

benbfly commented Feb 9, 2022

marcus1487 commented Feb 9, 2022

benbfly commented Feb 10, 2022

benbfly commented Feb 10, 2022 • edited Loading

marcus1487 commented Feb 10, 2022

benbfly commented Feb 10, 2022 • edited Loading

cjw85 commented Feb 10, 2022

benbfly commented Feb 10, 2022 • edited Loading

cjw85 commented Feb 10, 2022

benbfly commented Feb 10, 2022

cjw85 commented Feb 10, 2022

cjw85 commented Feb 10, 2022

benbfly commented Feb 10, 2022

benbfly commented Feb 10, 2022 • edited Loading

cjw85 commented Feb 10, 2022

benbfly commented Feb 10, 2022

cjw85 commented Feb 11, 2022

benbfly commented Feb 14, 2022

cjw85 commented Feb 14, 2022

cjw85 commented Feb 14, 2022

cjw85 commented Feb 14, 2022

benbfly commented Feb 15, 2022

benbfly commented Feb 15, 2022 • edited Loading

cjw85 commented Feb 15, 2022

benbfly commented Feb 15, 2022

cjw85 commented Feb 15, 2022

colindaven commented Mar 18, 2022

cjw85 commented Mar 18, 2022

m0nib commented Mar 24, 2022

benbfly commented Mar 24, 2022

cjw85 commented Mar 24, 2022

benbfly commented Mar 24, 2022 • edited Loading

marcus1487 commented Mar 25, 2022

michaelmhoffman commented May 23, 2022

benbfly commented Feb 10, 2022 •

edited

Loading

benbfly commented Feb 10, 2022 •

edited

Loading

benbfly commented Feb 10, 2022 •

edited

Loading

benbfly commented Feb 10, 2022 •

edited

Loading

benbfly commented Feb 15, 2022 •

edited

Loading

benbfly commented Mar 24, 2022 •

edited

Loading