Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247

Closed
benbfly opened this issue Feb 8, 2022 · 35 comments

Comments

@benbfly
Copy link

benbfly commented Feb 8, 2022

Megaldon/2.4.2
modbam2bed/0.4.0 (https://github.com/epi2me-labs/modbam2bed)
samtools/1.14

I am running Megalodon in CG remora mode ("--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 --guppy-config dna_r9.4.1_450bps_hac.cfg").

I wanted to compare the results in the modified_bases.5mC.bed file, with the results of running the ONT modbam2bed utility on the mod_mappings.bam file. In theory, I thought that they should be very similar or identical. However, in my testing so far, they produce very different results.

I did a small test on 140,000 reads, and the following is the number of CpGs in the output:
CpGs covered by modbam2bed and modified_based: 2,255
CpGs covered by modbam2bed only: 1,033
CpGs covered by modified_bases only: 1,799

The bases in only one file were not on the opposite strand or off by 1 position or anything. There were not close to a CpG covered in the other file. I get similar results when doing much larger datasets. The only processing I had to do was a "samtools sort" and "samtools index" on the mod_mappings.bam file in order for it to work with modbam2bed. Also, we are sequencing very short reads (~167bp , cfDNA), which is probably not the typical use case.

This seems like strange behavior. I will look into the differences and see if I can figure out what's going on, but I thought I'd ask the experts. @marcus1487 @cjw85 .

I could always be doing something wrong, but I've checked pretty carefully and I don't think so.

@benbfly benbfly changed the title modified_bases.5mC.bed produced by megalodon is very different from running modbam2bed on mod_mappings.bam modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam Feb 8, 2022
@benbfly
Copy link
Author

benbfly commented Feb 8, 2022

I just read issue #246 and realized this is likely because the mod_mappings.bam output is reference-anchored and the modified_bases.bed file is basecall-anchored.

From my understanding of your comments in issue #246 , the reference-anchored generally works better? In our data (we are using R9.4 / lsk109), the reference-anchored output generally covers less CpGs (as shown above). But if it's more accurate, it's still preferable.

I'll let you confirm before closing issue.

Thanks!
Ben.

@benbfly
Copy link
Author

benbfly commented Feb 9, 2022

I think it would be a good idea to either discontinue or put a big warning for the creation of the basecall-anchored modified_bases.5mC.bed file. It is the only simple bed file that can be output directly from the megalodon command line, and people (like me) will tend to just grab it assuming it is the right output without understanding that it's inferior. I know this may eventually be rolled into guppy, but in the meantime people will be using megalodon.

thanks for the amazing tools!!!!

@marcus1487
Copy link
Collaborator

The modified_bases.5mC.bed file is reference anchored and therefore should be this highest quality. These files should not be disagreeing nearly this much and I've not seen this level of disagreement before. I am working on reproducing this behavior internally and will report back.

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

I just noticed the discussion in #206 which noted that the bed methyl file in megalodon uses default cutoffs of 0.2 and 0.8. I believe modmap2bed uses defaults of 0.333 and 0.667. I don't think this would give the large discrepancy that we are seeing, but I will check that.

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

Just to follow up on this. It is not due to default cutoffs for canonical vs. modified. modbam2bed outputs every CG covered by a read, regardless of modification status. So it has to do with what each of the two programs think is a reference CG to output. I gave one example of a covered reference CG output by modbam2bed but not output by modified_bases.5mC.bed (chr1:596372), in issue #249. For some reason it does not have an Mm/Ml tag for this position.

I am still investigating the converse case (output in modified_bases.5mC.bed but not modbam2bed).

@marcus1487
Copy link
Collaborator

The second case is certainly more confusing to me as well. A browser shot of the mappings.bam, mod_mappings.bam at one of these sites would be very helpful.

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

Here is an example of the second case (output in modified_bases.5mC.bed but not modbam2bed).

This read from chr16:19113482-19113641 has multiple CpGs in modified_bases.5mC.bed, but not a single one in the output of modbam2bed. One of the CpGs for instance is at position 19113602.

Since this includes multiple CpGs, my best guess is that modbam2bed uses a blacklist or something, and Megalodon bed generator does not.

image

@cjw85
Copy link
Member

cjw85 commented Feb 10, 2022

@benbfly Are you running modbam2bed with the --cpg option? This will only output lines to the bed where the reference is sequence is CG (or GC to take into account the second strand). Since the sequences above from the BAM represent read sequences, they will not necessarily lead to a line in the BED file if the reference sequence is not also CG at that point. If you are not running with --cpg then the BED will contain a line for every single reference position (regardless of reference base).

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

I am running with --cpg option. But these are definitely reference CGs in the genome I am using (hg38). I believe the Megalodon modification BAM actually shows the reference sequence rather than the read sequence. But in the screenshot I grab it from a fresh version of hg38.

I don't think there's any read filtering that could cause this, since I think Megalodon sets the mapping quality to 40 for all reads and does not set any SAM flags (as you can see above). So my only thought is a blacklist of some kind, or a bug. I see a large number like this in my data.

One thing that might be different from my data than your test datasets. We are doing cfDNA so these reads are very short. Maybe you have a read length filter or something?

image

@cjw85
Copy link
Member

cjw85 commented Feb 10, 2022

Can you provide an extra of the BAM file you are processing with modbam2bed and I will take a look.

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

Attached...
modbam2bed_berman.tar.gz

@cjw85
Copy link
Member

cjw85 commented Feb 10, 2022

I cannot repoduce the lack of reporting sites:

$ ./modbam2bed -a 0.333 -b 0.667 ~/data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.gz --cpg --region=chr16:19113400-19113700 -e modbam2bed_berman/mod_mappings.sorted.bam
Analysing: 5-methylcytosine (5mC, C>m)
Processing: chr16:19113399-19113700
chr16	19113481	19113482	5mC	1000	+	19113481	19113482	0,0,0	1	100.00	0	1	0
chr16	19113493	19113494	5mC	1000	+	19113493	19113494	0,0,0	1	100.00	0	1	0
chr16	19113545	19113546	5mC	1000	+	19113545	19113546	0,0,0	1	100.00	0	1	0
chr16	19113602	19113603	5mC	1000	+	19113602	19113603	0,0,0	1	100.00	0	1	0
chr16	19113625	19113626	5mC	1000	+	19113625	19113626	0,0,0	1	0.00	1	0	0
100 %
Total time: 0.421185s

IGV showing methylated bases
image

I note in your screenshot above that you have a warning that the index file is older than the fasta file. Is it the correct index corresponding to the fasta?

@cjw85
Copy link
Member

cjw85 commented Feb 10, 2022

And the one site that isn't methylated:
image

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

Ok I don't have time now, but I'll look at my reference genome. We use it for everything , but maybe it's corrupted somehow.

image

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

could it have to do with lower case vs. upper case letters in reference fasta (repeatmasker)? See my fasta pull above, all the cgs are lower case. I'm not at my computer anymore, but I could look at this later.

I am using v0.4.0 of modbam2bed

@cjw85
Copy link
Member

cjw85 commented Feb 10, 2022

Ah yes, I believe you are correct. The check for a CpG site is simply to check the reference character is 'C' (or 'G' for reverse strand).

I don't know what I would call the "correct" behaviour here as I believe the intention of writing lowercase bases to a reference is so programs can easily perform this sort of filtering. In your case I suppose you haven't masked the reference for the purpose of masking methylation calls. Pragmatically I will change the default to report bother upper and lowercase reference sites, with an option to ignore lowercase.

@benbfly
Copy link
Author

benbfly commented Feb 10, 2022

we just use the fasta created by UCSC , and they do lower case for repeat masked regions. We certainly don't want to filter those out, and I think very few people would want to filter them out of the bed file (especially with long reads where repeats are even less of an issue).

I think this accounts for all of the discrepancy I was seeing with the megalodon created bed file (with the --edge-buffer and --mod-min-prob from issue #249 accounting for the ones in the other direction)

if you make a new release , I can compare the two on my dataset to make sure they agree.

here is where you can download the UCSC fasta for testing, if you want: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

@cjw85
Copy link
Member

cjw85 commented Feb 11, 2022

I've implemented allow reference bases to be lowercase in v0.4.4 of modbam2bed. This is available now.

@benbfly
Copy link
Author

benbfly commented Feb 14, 2022

@cjw85 I downloaded v0.4.4 and it did not fix this problem. I'm checking into it now , maybe I didn't build it correctly. But please keep this issue open until I can check. Thanks - Ben.

image

@cjw85
Copy link
Member

cjw85 commented Feb 14, 2022

I'll have a look

@cjw85
Copy link
Member

cjw85 commented Feb 14, 2022

Apoligies @benbfly, I see the issue.

@cjw85
Copy link
Member

cjw85 commented Feb 14, 2022

@benbfly Could you try v0.4.5?

@benbfly
Copy link
Author

benbfly commented Feb 15, 2022

v0.4.5 works!

Now, every CpG covered by modified_bases.5mC.bed is covered by modbam2bed -a 0.2 -b 0.8. So the upper case lower case problem is fixed.

However, there are still a lot of CpGs that are covered by modbam2bed -a 0.2 -b 0.8 that not in modified_bases.5mC.bed , even when I run megalodon with --edge-buffer 0 --mod-min-prob 0. So maybe there are still some filters that megalodon bed is filtering out, that are not being filtered out by modbam2bed. I'm not sure yet if these have Mm/Ml tags, I will check and report back @marcus1487 . So I think keep this issue open until I can do that.

@benbfly
Copy link
Author

benbfly commented Feb 15, 2022

I see what this difference in behavior is - it is something I noticed before but forgot. It's the fact that modbam2bed outputs the CpGs even if they have 0 reads with "confident" modification calls (in this case modification prob<0.2 or >0.8). In the nomenclature of modbam2bed, these are CpGs with 1 or more "filtered" reads, but no "canonical" or "modified" reads.

Megalodon modified_bases.5mC.bed does not output these at all, which to me is the more reasonable behavior.

Modbam2bed outputs these with percent modified value of 0 (column 11) . This is not good behavior in my opinion, and I would consider it a bug for modbam2bed. The modbam2bed formula for percent modified is Nmod / (Nmod + Ncanon), and in this case the denominator is 0 so it should be undefined. Outputting these as 0 percent modified is incorrect and can lead to inaccurate analyses. It would be better for these to be omitted or with percent modified set to "." (the bed file convention for NA).

I do see that the ones I'm talking about could be easily removed by the user by filtering on column 5==0. But I just think many people will just use the default behavior and treat these as 0% modification.

I think this accounts for all the differences between modified_bases.5mC.bed and modbam2bed (except for the missing Mm/Ml tag issue from #249 ). Probably this was not a big deal with some of your test data, since it usually only makes a difference when you have a single low quality read covering the CpG. If you have multiple reads covering, it usually won't make a difference. We are doing very low depth sequencing (0.3x), so most of our CpGs are only covered by a single read. Thus we get a very large number of these CpGs where Nmod+Ncanon is 0.

image

@cjw85
Copy link
Member

cjw85 commented Feb 15, 2022

You are correct, modbam2bed calculates column 11 as (counts.c#L160):

meth = tot == 0 ? 0 : (100.0f * md) / tot;

so the description is not completely accurate. In such cases the column 5 will also be zero representing complete confusion in the value in column 11; in that sense the reported numbers are self-consistent, and the value of column 5 must be taken into account in the evaluation of column 11.

That the interpretation of column 11 is contigent on column 5 is the case in other circumstances also, its just that such facts are often ignored. For example consider the case that 90% of reads contain a substitution to a C>T, is it really correct to conclude from a 100% value in column 11 that there is 100% methylation in the sample? No it is not, however much the casual user wants column 11 to mean that. This is why the definitions in columns 5 and 10 were chosen the way they were; the "specification" referenced in the modbam2bed documentation makes a false dichotomy between modified and unmodified. The modbam2bed avoids making this erroneous assumption, which makes the output messier but rather more truthful. It is also why the extended mode reporting the verbatim counts was added so users can derived whatever summary statistics they wish from the counts (one thing that the extended output misses is that only the sum Nmod + Ndel can be derived not the individual components).

Note also there's some connection here to the earlier discussion that within the htslib Mm tag specification where missing data is assume unmodified.

@benbfly
Copy link
Author

benbfly commented Feb 15, 2022

What would be the downside of outputting . for column 11 rather than 0? One could still look at the extended columns if they wanted to get more information. I don't see a good reason to assume that a single read with a modification probability of 0.5 should be given a modification percentage of 0.

I think when the Mm tag is missing, you just have to follow the current spec - if the alphabet is set to '?' you treat missing Mm as N_filtered, if it is '.' , you treat missing as unmodified.

@cjw85
Copy link
Member

cjw85 commented Feb 15, 2022

There's little downside to outputting . save for parsers needing to be aware that its possible and what its meaning is. We can look to implement this in a next version. Practically it should make no difference because as highlighted above column 11 should never be interpreted without reference to columns 5 and 10.

@colindaven
Copy link
Contributor

I also support outputting a . instead of a 0 for a missing value, because a missing value or unknown is different to a value of 0, unmethylated. 0 is inherently wrong for positions with no read coverage.

Many users might naively just erroneously evaluate column 11.

I implemented the bedMethyl conversion in a recent pipeline only converting column 11 if by default a coverage of >=5 was reported in column 10.

@cjw85
Copy link
Member

cjw85 commented Mar 18, 2022

I reread the (recent and admittedly post-hoc) specification for BED files: https://samtools.github.io/hts-specs/

It doesn't acknowledge . to mean a missing or NaN (which is really what we'd be meaning here). It only mentions . in the context of the name and strand fields. Any naive parser is going to trip up over a column containing ostensibly Float type data. I quickly surveyed the bedTools code and it seems to just leave everything stringly-typed. Similar concerns were raised in a comment when the specification was being drafted: samtools/hts-specs#570 (comment)

I've opened a issue here samtools/hts-specs#634, to get some clarification. I don't want to make a change here without knowing its going to be the correct one (and have to make another change in the future). There is a sense in which 0 is correct: the bedMethyl description states

Percentage of reads that show methylation at this position in the genome

And it is true that 0% of reads are showing methylation, though admittedly the modbam2bed definition is more precise.

@m0nib
Copy link

m0nib commented Mar 24, 2022

Hi,

@cjw85 I have run into the same issue as @benbfly. The generated .bed file from modbam2bed contains more CpGs as being unmethylated.

This affects our downstream analysis and prediction of sample type from trained model.

I was wondering, while you wait to make it a permanent change or decision, is there anyway I can rectify this.

I mean by specifying a command line option or other way?

Thank you
Best
Monib

@benbfly benbfly closed this as completed Mar 24, 2022
@benbfly
Copy link
Author

benbfly commented Mar 24, 2022

I have found that if you specify the meglodon command line parameters --mod-min-prob 0, and set the modbam2bed parameters -a 0.2 -b 0.8, and update to modbam2bed version 0.4.5 or higher, that the results are quite close. By default, modbam2bed uses a lower threshold for methylation (-a 0.333 -b 0.667), so that is part of why it outputs more CpGs. Which of these settings is preferred may be application specific. For our application we have a very small number of reads so we want more CpGs (so we use the 0.333/0.667). But in general, I think I would tend to prefer the more selective 0.2/0.8 setting.

We also use the megalodon parameter --edge-buffer 0, but that's something that is mostly relevant because we are doing short read sequencing (<200bp).

@cjw85
Copy link
Member

cjw85 commented Mar 24, 2022

I'm waiting for a reply on the hts-specs for BED file as currently there is no guidance on how NaNs should be recorded in a BED file. I'll give this a couple of weeks and then make a decision. Currently my thought is that Column 11 in the modbam2bed output will be recorded as nan where the total = mod. + canon = 0 has occurred. This follows the VCF specification for such things and will allow a natural parsing of the file from code; the suggestion of using . would require user code to contain special handling whereas NaN is supported in various data analysis environments.

As it stands filtering the output file based on a the values of columns 5 and 10 is the recommended approach.

A second option is to provide an option to elide such rows from the output. This will necessarily be lossy, and you will lose information about filtered canon or mod calls (or maybe more interesting substitutions).

@benbfly benbfly reopened this Mar 24, 2022
@benbfly
Copy link
Author

benbfly commented Mar 24, 2022

@cjw85 I don't think there will be any clarity on NaNs in BED format (unfortunately, BED is not a real spec). Bedtools, the most popular BED manipulation tool, will not allow numeric operations on any column that contains a non-number including NaN or . (bedtools::isNumeric). So if you want to be as compliant as possible, you would need to remove those rows (or provide a tool to make a "compliant" version). I suggested . because the original BED spec used . for other fields with missing data, but I'm not actually aware of any tool that would parse that as NaN for a numeric field. So I think you're right that NaN would be a better option if that would allow it to be parsed correctly by other packages.

@arq5x is the BED guru and might have an idea about this.

@marcus1487
Copy link
Collaborator

Note that 2.5.0 release sets edge buffer to 0 and min mod prob to 0 to call all covered positions be default and reduce this issue for most users.

@michaelmhoffman
Copy link

unfortunately, BED is not a real spec

Ouch! Probably deserved though.

I would consider any custom BED fields to have use that "may be determined by private agreement among cooperating users", as Unicode says. Without any such other agreement or specification, you can't expect that NaN will—or won't—be used in any particular way. I would do what you think is best and document it. And if there's other software you use that applies heuristics to figure out data types (like BEDTools), you can always request that they add NaN to what they consider numeric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants