Base modifications section uses undefined term “top strand” #639

jmarshall · 2022-03-30T10:37:42Z

SAMtags.pdf §1.7 describes MM:Z as

MM:Z:…
The first character is the unmodified “fundamental” base as reported by the sequencing instrument for the top strand. […two further occurrences in later paragraphs…]

However “top strand” is a relatively uncommon term, and it does not appear in SAMv1.pdf's glossary. IMHO it should be briefly explained where it appears and/or in the glossary.

jrobinso · 2022-08-25T20:21:06Z

I was just searching for a definition of "top strand" myself, to interpret the samtags spec. I have no idea what it means.

I also am slightly confused by the strand indication in the modification string, even after reading this #362. It indicates that "-" is needed to indicate a modification was observed on the "opposite strand". Opposite to what? Is there any difference in C+m and G-m?

Apologies if I've added an extraneous question here, but I was hoping an understanding of "top strand" will clarify the use of modification strand.

jkbonfield · 2022-08-26T08:48:25Z

We are open to better wording if you can think of it.

The strand does need to be defined. From memory (I need to reread it all and get my head back in the right mode), the purpose is that the primary strand is whatever the sequencing instrument typically sequences. Ie the bases emitted by all Illumina platforms are all on the top strand - see "relative to the original sequenced strand of SEQ"). Some platforms however, such as the old ONT "2D" methods and some of their modern techniques, sequence forward along the template and then backwards again on the other (ie "opposite") strand, in order to reduce sequence-specific artifacts and to improve accuracy through consensus techniques. These methods have the potential to observe base modifications on both strands of the DNA template. Additionally in theory it may be possible for methods that prime from both ends of the template with an insert size that is short enough for the sequence pair to overlap, to produce a single double-stranded sequence record. (In practice we generally create two alignment records.)

The key thing to observe here is that these tags are additional and have to cope with standard manipulations of sequences by realignment. Such as converting back to fastq, aligning, and potentially mapping in the opposite orientation (reverse complementing). These tools may not be aware of the existing of the modification tags, hence the tags need to be grounded in terms of the original orientation as produced by the sequencing instrument. That unfortunately adds a second layer. Finally, the templates themselves may be randomly oriented with respect to the original chromosome, eg when replicating and fragmenting the DNA and then added adapters to the fragments with no regard to their orientation. Ie when paired seq we typically get ---read1---> <---read2--- or ---read2---> <---read1---.

Ie we have

Which strand(s) of the original DNA template were sequence (affects MM)
Whether or not the sequence has been reverse complemented during mapping (no affect on MM)
The orientation of the sequence insert/template prior to sequencing (no affect on MM, and it's also typically the cause of the point above).

All very confusing, but only one of these is modelled in MM.

jrobinso · 2022-08-26T14:45:57Z

Thanks for the explanation @jkbonfield .

I'm trying to insure IGV correctly handles the "-" fields correctly, as in "G-m", and ham hampered due to lack of real data that uses the "-" tag. Biologically it seems that "G-m" is equivalent to "C+m", , that is to say a cytosine was modified by addition of 5-methylcytosine. The bookkeeping differs of course. It appears the spec would allow an alignment to be annotated with both "G-m" and "C+m", and they could even conflict.

jkbonfield · 2022-08-26T15:35:50Z

As some sequencing instruments can take double-stranded data and internally at least are producing signals for both strands, it is indeed possible that there can be modifications detected on both strands at the same coordinate, and even potentially differing ones. This is one of the questions I had when designing this (not being a biologist). Hence the format permits it, even if right now I don't think any tools are generating such data.

The lack of real data really hampered development, but we had a catch-22 problem for a while as you can imagine. You're presumably now hitting the same problem in viewers - support for something that doesn't yet exist in the wild! All I can suggest is manually faking some data.

jrobinso · 2022-08-26T16:30:33Z

@jkbonfield Thanks for responding, really helpful, I had a feeling I was "missing something" but I don't think I was. BTW the unit test data, which I think you created has been key, almost essential, for me in understanding the spec.

For now I'm not going to worry excessively about data that doesn't exist anywhere.

cviner · 2022-08-26T16:57:06Z

In case it might be useful, please find below links to two genome-wide modified sequencing datasets, from BLUEPRINT. These data pertain to conventional and oxidative WGBS data generated for naive CD4+ T cells, extracted from the spleens of C57BL/6J mice, aged 6 weeks–8 weeks.

Conventional WGBS; annotates 5mC and 5hmC bases: GSE94674

Oxidative WGBS; annotates 5hmC only (ignoring a potentially small number of undetected 5fC nucleobases): GSE94675

These could be used to annotate existing SAM files, with modifications at specific positions. As @jkbonfield notes, it definitely is possible for differing and conflicting modifications to be detected at the same locus. Ultimately, some approach is needed to determine which takes precedence, unless all possibilities get annotated. We describe one such means of calling modified nucleobases in our TFBS work, which uses these datasets.

jrobinso · 2022-08-26T17:06:51Z

Thanks @cviner this is helpful.

I understand that it is definitely possible for multiple modifications to be specified for a single position, each with a probability. This is common now in data I have (5mC and 5hmC called at the same position). The probabilities cannot sum to > 1 however. The G- / C+ combination example is different, however, as you can potentially specify the same biological modification (e.g. 5mC) at the same position with different probabilities. That is all that I was pointing out.

jrobinso · 2022-08-26T22:24:54Z

@jkbonfield I'm making a lot of noise I guess, but maybe this is will be helpful. If I struggle with this I suspect someone else will. So after studying the unit test again, here is my understanding of "top strand" and the related "sequence as reported by the instrument". I don't know if this is an improvement or not, but it appears the latter refers to the sequence in the SEQ field, or its reverse complement if SAM flag bit 0x10 is set.

It's clear what I need to do from the examples and unit tests, but for me at least not clear from the spec. Terms such as "top strand" and "as reported by the instrument" are not precise, I have to make guesses as to how these relate to the fields in the SAM spec as neither "top strand" nor "instrument" are defined in that spec, or even mentioned.

So my suggestion would be to consider defining these terms explicitly in terms of fields in the SAM spec. This is after all what's needed to parse them. Possibly SEQ and the reverse-complement bit (0x10) are all that is needed, it appears to be all I need in order to parse the unit test data.

Hope this is helpful, if not ignore.

jkbonfield · 2022-08-30T08:04:20Z

Don't worry about the noise. I wish there was more while drafting this spec really. It took ages because the lack of data (or indeed people with an actual understanding of the biology) meant drafting the spec took months (if not years). All queries are an indication that the spec is unclear and needs improvement, but I'm unsure on how best to rephrase things so am always open to suggestions. (I'll have a think about your comments above.)

There were still things that got missed or I didn't understand though when doing this. For example I didn't twig that some callers would only attempt to call CPG islands and leave every other C as unanalysed.

This is better phrased in terms of original strand/orientation produced by sequencing instrument, and "other strand" meaning the opposite to the one it called. This avoids trying to define what "top" means and any ambiguity that causes in relation to the template orientation (for example). Fixes samtools#639

jmarshall added the sam-tags label Mar 30, 2022

This was referenced Mar 30, 2022

Describe MM and MP methylation tags. #418

Merged

Clarify MI for identifying source molecule strand #633

Open

jkbonfield linked a pull request Aug 31, 2022 that will close this issue

Remove "top strand" phrase from MM tag description. #668

Open

jkbonfield added the sam label Sep 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base modifications section uses undefined term “top strand” #639

Base modifications section uses undefined term “top strand” #639

jmarshall commented Mar 30, 2022

jrobinso commented Aug 25, 2022 •

edited

Loading

jkbonfield commented Aug 26, 2022

jrobinso commented Aug 26, 2022

jkbonfield commented Aug 26, 2022

jrobinso commented Aug 26, 2022

cviner commented Aug 26, 2022

jrobinso commented Aug 26, 2022 •

edited

Loading

jrobinso commented Aug 26, 2022 •

edited

Loading

jkbonfield commented Aug 30, 2022 •

edited

Loading

Base modifications section uses undefined term “top strand” #639

Base modifications section uses undefined term “top strand” #639

Comments

jmarshall commented Mar 30, 2022

jrobinso commented Aug 25, 2022 • edited Loading

jkbonfield commented Aug 26, 2022

jrobinso commented Aug 26, 2022

jkbonfield commented Aug 26, 2022

jrobinso commented Aug 26, 2022

cviner commented Aug 26, 2022

jrobinso commented Aug 26, 2022 • edited Loading

jrobinso commented Aug 26, 2022 • edited Loading

jkbonfield commented Aug 30, 2022 • edited Loading

jrobinso commented Aug 25, 2022 •

edited

Loading

jrobinso commented Aug 26, 2022 •

edited

Loading

jrobinso commented Aug 26, 2022 •

edited

Loading

jkbonfield commented Aug 30, 2022 •

edited

Loading