modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247
Comments
I just read issue #246 and realized this is likely because the mod_mappings.bam output is reference-anchored and the modified_bases.bed file is basecall-anchored. From my understanding of your comments in issue #246, the reference-anchored output generally works better? In our data (we are using R9.4 / lsk109), the reference-anchored output generally covers fewer CpGs (as shown above). But if it's more accurate, it's still preferable. I'll let you confirm before closing the issue. Thanks!
I think it would be a good idea to either discontinue or put a big warning for the creation of the basecall-anchored modified_bases.5mC.bed file. It is the only simple bed file that can be output directly from the megalodon command line, and people (like me) will tend to just grab it assuming it is the right output without understanding that it's inferior. I know this may eventually be rolled into guppy, but in the meantime people will be using megalodon. thanks for the amazing tools!!!! |
The |
I just noticed the discussion in #206, which noted that the bed methyl file in megalodon uses default cutoffs of 0.2 and 0.8. I believe modbam2bed uses defaults of 0.333 and 0.667. I don't think this would give the large discrepancy that we are seeing, but I will check that.
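To make the cutoff question concrete, here is a minimal sketch of how a pair of cutoffs splits per-read modification probabilities into canonical, modified, and filtered calls. The helper `classify_call` is hypothetical, and the exact comparison operators are an assumption; the cutoff values are the ones quoted in this thread, not verified against either tool.

```python
def classify_call(p_mod, lower, upper):
    """Classify one per-read modification probability.

    Hypothetical helper (not from either tool): below `lower` counts as
    canonical, above `upper` counts as modified, anything in between is
    treated as filtered (ambiguous).
    """
    if p_mod < lower:
        return "canonical"
    if p_mod > upper:
        return "modified"
    return "filtered"


# Cutoff pairs as quoted in this thread (assumed, not checked against the tools).
MEGALODON_CUTOFFS = (0.2, 0.8)
MODBAM2BED_CUTOFFS = (0.333, 0.667)

for p in (0.1, 0.25, 0.5, 0.7, 0.9):
    print(p, classify_call(p, *MEGALODON_CUTOFFS), classify_call(p, *MODBAM2BED_CUTOFFS))
```

Calls between the two pairs of cutoffs (for example 0.25 or 0.7) are filtered under one setting and counted under the other, so the cutoffs can shift the reported percentages at a site but would not by themselves change which reference positions a read covers.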
Just to follow up on this. It is not due to default cutoffs for canonical vs. modified. modbam2bed outputs every CG covered by a read, regardless of modification status. So it has to do with what each of the two programs thinks is a reference CG to output. I gave one example of a covered reference CG output by modbam2bed but not output by modified_bases.5mC.bed (chr1:596372), in issue #249. For some reason it does not have an Mm/Ml tag for this position. I am still investigating the converse case (output in modified_bases.5mC.bed but not modbam2bed).
The second case is certainly more confusing to me as well. A browser shot of the mappings.bam, mod_mappings.bam at one of these sites would be very helpful. |
Here is an example of the second case (output in modified_bases.5mC.bed but not modbam2bed). This read from chr16:19113482-19113641 has multiple CpGs in modified_bases.5mC.bed, but not a single one in the output of modbam2bed. One of the CpGs for instance is at position 19113602. Since this includes multiple CpGs, my best guess is that modbam2bed uses a blacklist or something, and Megalodon bed generator does not. |
@benbfly Are you running modbam2bed with the |
I am running with the --cpg option. But these are definitely reference CGs in the genome I am using (hg38). I believe the Megalodon modification BAM actually shows the reference sequence rather than the read sequence. But in the screenshot I grabbed it from a fresh version of hg38. I don't think there's any read filtering that could cause this, since I think Megalodon sets the mapping quality to 40 for all reads and does not set any SAM flags (as you can see above). So my only thought is a blacklist of some kind, or a bug. I see a large number like this in my data. One thing that might differ between my data and your test datasets: we are sequencing cfDNA, so these reads are very short. Maybe you have a read length filter or something?
Can you provide an extract of the BAM file you are processing with modbam2bed and I will take a look.
Attached... |
I cannot reproduce the lack of reporting sites:
I note in your screenshot above that you have a warning that the index file is older than the fasta file. Is it the correct index corresponding to the fasta? |
could it have to do with lower case vs. upper case letters in reference fasta (repeatmasker)? See my fasta pull above, all the cgs are lower case. I'm not at my computer anymore, but I could look at this later. I am using v0.4.0 of modbam2bed |
Ah yes, I believe you are correct. The check for a CpG site is simply to check the reference character is I don't know what I would call the "correct" behaviour here as I believe the intention of writing lowercase bases to a reference is so programs can easily perform this sort of filtering. In your case I suppose you haven't masked the reference for the purpose of masking methylation calls. Pragmatically I will change the default to report both upper and lowercase reference sites, with an option to ignore lowercase.
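The effect of a case-sensitive reference check on a soft-masked genome can be illustrated with a small sketch. The function `cpg_positions` and its `allow_lowercase` parameter are hypothetical names for illustration only; this is not the modbam2bed implementation.

```python
def cpg_positions(ref_seq, allow_lowercase=True):
    """Return 0-based positions of the C in each forward-strand CG dinucleotide.

    Hypothetical helper: with allow_lowercase=False only uppercase "CG" is
    recognised, so CpGs inside soft-masked (lowercase) repeats are skipped.
    """
    seq = ref_seq.upper() if allow_lowercase else ref_seq
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


# Toy soft-masked sequence: lowercase marks a repeat-masked stretch.
soft_masked = "ACGTacgtCGcg"

print(cpg_positions(soft_masked, allow_lowercase=False))
print(cpg_positions(soft_masked, allow_lowercase=True))
```

With the case-sensitive check, the two CpGs in the lowercase stretch are silently dropped, which matches the kind of missing sites described above for a UCSC-style soft-masked FASTA.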
We just use the fasta created by UCSC, and they use lower case for repeat-masked regions. We certainly don't want to filter those out, and I think very few people would want to filter them out of the bed file (especially with long reads where repeats are even less of an issue). I think this accounts for all of the discrepancy I was seeing with the megalodon-created bed file (with the --edge-buffer and --mod-min-prob from issue #249 accounting for the ones in the other direction). If you make a new release, I can compare the two on my dataset to make sure they agree. Here is where you can download the UCSC fasta for testing, if you want: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
I've implemented allowing reference bases to be lowercase in v0.4.4 of modbam2bed. This is available now.
@cjw85 I downloaded v0.4.4 and it did not fix this problem. I'm checking into it now , maybe I didn't build it correctly. But please keep this issue open until I can check. Thanks - Ben. |
I'll have a look |
Apologies @benbfly, I see the issue.
@benbfly Could you try v0.4.5? |
v0.4.5 works! Now, every CpG covered by However, there are still a lot of CpGs that are covered by |
I see what this difference in behavior is - it is something I noticed before but forgot. It's the fact that
I do see that the ones I'm talking about could be easily removed by the user by filtering on column 5==0. But I just think many people will just use the default behavior and treat these as 0% modification. I think this accounts for all the differences between |
You are correct, modbam2bed calculates column 11 as (counts.c#L160):
so the description is not completely accurate. In such cases column 5 will also be zero, representing complete confusion in the value in column 11; in that sense the reported numbers are self-consistent, and the value of column 5 must be taken into account in the evaluation of column 11. That the interpretation of column 11 is contingent on column 5 is the case in other circumstances also; it's just that such facts are often ignored. For example, consider the case where 90% of reads contain a C>T substitution: is it really correct to conclude from a 100% value in column 11 that there is 100% methylation in the sample? No it is not, however much the casual user wants column 11 to mean that. This is why the definitions in columns 5 and 10 were chosen the way they were; the "specification" referenced in the modbam2bed documentation makes a false dichotomy between modified and unmodified. modbam2bed avoids making this erroneous assumption, which makes the output messier but rather more truthful. It is also why the extended mode reporting the verbatim counts was added, so users can derive whatever summary statistics they wish from the counts (one thing that the extended output misses is that only the sum Nmod + Ndel can be derived, not the individual components). Note also there's some connection here to the earlier discussion of the htslib Mm tag specification, where missing data is assumed unmodified.
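A small worked example may help show why column 11 has to be read together with columns 5 and 10. The formulas below are an assumption based on a reading of the modbam2bed documentation (column 5 as 1000 × (Nmod + Ncanon) / total, column 10 as the total, column 11 as 100 × Nmod / (Nmod + Ncanon)); they are not the code referenced at counts.c#L160, and `summarise_site` is a hypothetical helper. The zero fallback for column 11 mirrors the behaviour described in this thread.

```python
def summarise_site(n_mod, n_canon, n_filt=0, n_sub=0, n_del=0):
    """Return (column 5 score, column 10 coverage, column 11 percent).

    Assumed formulas, for illustration only:
      score    = 1000 * (n_mod + n_canon) / total
      coverage = total of all five counts
      percent  = 100 * n_mod / (n_mod + n_canon), reported as 0 when undefined
    """
    total = n_mod + n_canon + n_filt + n_sub + n_del
    called = n_mod + n_canon
    score = round(1000 * called / total) if total else 0
    pct_mod = 100.0 * n_mod / called if called else 0.0
    return score, total, pct_mod


# 10 reads: 9 carry a substitution, 1 is called modified.
print(summarise_site(n_mod=1, n_canon=0, n_sub=9))

# 3 reads, all filtered: no canonical or modified calls at all.
print(summarise_site(n_mod=0, n_canon=0, n_filt=3))
```

Under these assumptions the first site reports 100% in column 11 with a score of only 100 out of 1000, and the second reports 0% with a score of 0, so neither percentage is trustworthy without looking at the score.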
What would be the downside of outputting I think when the Mm tag is missing, you just have to follow the current spec - if the alphabet is set to '?' you treat missing Mm as N_filtered, if it is '.' , you treat missing as unmodified. |
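The rule described here can be sketched as follows, assuming the SAM tags convention that a trailing `?` on an Mm entry means unlisted bases carry no information, while `.` (or no flag) means unlisted bases are taken as unmodified. The function name and the "filtered"/"unmodified" labels are hypothetical.

```python
import re

# Header of one Mm entry, e.g. "C+m?" : base, strand, modification code, optional flag.
_MM_HEADER = re.compile(r"^([ACGTUN])([-+])([a-z]+|[0-9]+)([.?]?)$")


def unlisted_base_status(mm_entry):
    """Decide how to count a base that the Mm entry does not list.

    Hypothetical helper: '?' -> "filtered" (no information),
    '.' or no flag -> "unmodified".
    """
    header = mm_entry.rstrip(";").split(",")[0]
    match = _MM_HEADER.match(header)
    if match is None:
        raise ValueError(f"unrecognised Mm entry: {mm_entry!r}")
    return "filtered" if match.group(4) == "?" else "unmodified"


print(unlisted_base_status("C+m?,5,12,0;"))
print(unlisted_base_status("C+m.,5,12,0;"))
print(unlisted_base_status("C+m,5,12,0;"))
```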
There's little downside to outputting |
I also support outputting a Many users might naively just erroneously evaluate column 11. I implemented the bedMethyl conversion in a recent pipeline only converting column 11 if by default a coverage of |
I reread the (recent and admittedly post-hoc) specification for BED files: https://samtools.github.io/hts-specs/ It doesn't acknowledge I've opened an issue here, samtools/hts-specs#634, to get some clarification. I don't want to make a change here without knowing it's going to be the correct one (and have to make another change in the future). There is a sense in which 0 is correct: the bedMethyl description states
And it is true that 0% of reads are showing methylation, though admittedly the modbam2bed definition is more precise. |
Hi @cjw85, I have run into the same issue as @benbfly. The .bed file generated by modbam2bed reports more CpGs as unmethylated. This affects our downstream analysis and the prediction of sample type from a trained model. While you wait to make a permanent change or decision, is there any way I can rectify this, for example by specifying a command line option? Thank you
I have found that if you specify the megalodon command line parameters We also use the megalodon parameter
I'm waiting for a reply on the hts-specs for BED file as currently there is no guidance on how NaNs should be recorded in a BED file. I'll give this a couple of weeks and then make a decision. Currently my thought is that Column 11 in the modbam2bed output will be recorded as As it stands, filtering the output file based on the values of columns 5 and 10 is the recommended approach. A second option is to provide an option to elide such rows from the output. This will necessarily be lossy, and you will lose information about filtered canon or mod calls (or maybe more interesting substitutions).
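As an interim workaround, the filtering on columns 5 and 10 mentioned here could look something like the sketch below. It assumes an 11-column bedMethyl-style layout with whitespace-separated fields; `keep_informative` and its thresholds are hypothetical, and the two example rows are made up for illustration.

```python
def keep_informative(lines, min_score=1, min_coverage=1):
    """Yield only rows whose column 5 (score) and column 10 (coverage) pass.

    Hypothetical helper: rows with score 0 have no canonical or modified
    calls, so their column 11 value carries no information.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        score = int(fields[4])      # column 5
        coverage = int(fields[9])   # column 10
        if score >= min_score and coverage >= min_coverage:
            yield line


# Made-up example rows: the second has score 0 and a meaningless 0.00 in column 11.
rows = [
    "chr1\t100\t101\t5mC\t1000\t+\t100\t101\t0,0,0\t4\t75.00",
    "chr1\t200\t201\t5mC\t0\t+\t200\t201\t0,0,0\t3\t0.00",
]

for row in keep_informative(rows):
    print(row)
```

Raising `min_coverage` gives a stricter filter for downstream models that are sensitive to low-coverage sites; dropping rows this way loses the same information about filtered calls and substitutions that is noted above.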
@cjw85 I don't think there will be any clarity on NaNs in BED format (unfortunately, BED is not a real spec). Bedtools, the most popular BED manipulation tool, will not allow numeric operations on any column that contains a non-number including @arq5x is the BED guru and might have an idea about this. |
Note that the 2.5.0 release sets edge buffer to 0 and min mod prob to 0 to call all covered positions by default and reduce this issue for most users.
Ouch! Probably deserved though. I would consider any custom BED fields to have use that "may be determined by private agreement among cooperating users", as Unicode says. Without any such other agreement or specification, you can't expect that |
Megalodon/2.4.2
modbam2bed/0.4.0 (https://github.com/epi2me-labs/modbam2bed)
samtools/1.14
I am running Megalodon in CG remora mode ("--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 --guppy-config dna_r9.4.1_450bps_hac.cfg").
I wanted to compare the results in the modified_bases.5mC.bed file, with the results of running the ONT modbam2bed utility on the mod_mappings.bam file. In theory, I thought that they should be very similar or identical. However, in my testing so far, they produce very different results.
I did a small test on 140,000 reads, and the following is the number of CpGs in the output:
CpGs covered by modbam2bed and modified_bases: 2,255
CpGs covered by modbam2bed only: 1,033
CpGs covered by modified_bases only: 1,799
The bases in only one file were not on the opposite strand or off by 1 position or anything. They were not close to a CpG covered in the other file. I get similar results with much larger datasets. The only processing I had to do was a "samtools sort" and "samtools index" on the mod_mappings.bam file in order for it to work with modbam2bed. Also, we are sequencing very short reads (~167bp, cfDNA), which is probably not the typical use case.
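For anyone wanting to repeat this kind of comparison, a minimal sketch of the overlap count is shown below. It keys each site on chromosome, start, and strand from BED-like lines; the helper names are hypothetical and the toy rows are made up, so this is only an outline of the comparison, not the exact procedure used for the numbers above.

```python
def site_keys(bed_lines):
    """Collect (chrom, start, strand) keys from BED-like lines."""
    keys = set()
    for line in bed_lines:
        if line.startswith(("#", "track", "browser")):
            continue  # skip header lines
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or short lines
        keys.add((fields[0], int(fields[1]), fields[5]))
    return keys


def compare_sites(lines_a, lines_b):
    """Count sites present in both inputs and in only one of them."""
    a, b = site_keys(lines_a), site_keys(lines_b)
    return {"both": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Made-up toy inputs standing in for the two BED files.
bed_a = ["chr1 100 101 . 1000 +", "chr1 200 201 . 1000 +"]
bed_b = ["chr1 200 201 . 1000 +", "chr1 300 301 . 0 -"]

print(compare_sites(bed_a, bed_b))
```

With real files, the same functions accept open file handles, for example `compare_sites(open(path_a), open(path_b))` with paths of your choosing.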
This seems like strange behavior. I will look into the differences and see if I can figure out what's going on, but I thought I'd ask the experts. @marcus1487 @cjw85 .
I could always be doing something wrong, but I've checked pretty carefully and I don't think so.