Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base modifications section uses undefined term “top strand” #639

Open
jmarshall opened this issue Mar 30, 2022 · 9 comments · May be fixed by #668
Open

Base modifications section uses undefined term “top strand” #639

jmarshall opened this issue Mar 30, 2022 · 9 comments · May be fixed by #668

Comments

@jmarshall
Copy link
Member

SAMtags.pdf §1.7 describes MM:Z as

MM:Z:…
The first character is the unmodified “fundamental” base as reported by the sequencing instrument for the top strand. […two further occurrences in later paragraphs…]

However “top strand” is a relatively uncommon term, and it does not appear in SAMv1.pdf's glossary. IMHO it should be briefly explained where it appears and/or in the glossary.

@jrobinso
Copy link

jrobinso commented Aug 25, 2022

I was just searching for a definition of "top strand" myself, to interpret the samtags spec. I have no idea what it means.

I also am slightly confused by the strand indication in the modification string, even after reading this #362. It indicates that "-" is needed to indicate a modification was observed on the "opposite strand". Opposite to what? Is there any difference in C+m and G-m?

Apologies if I've added an extraneous question here, but I was hoping an understanding of "top strand" will clarify the use of modification strand.

@jkbonfield
Copy link
Contributor

We are open to better wording if you can think of it.

The strand does need to be defined. From memory (I need to reread it all and get my head back in the right mode), the purpose is that the primary strand is whatever the sequencing instrument typically sequences. Ie the bases emitted by all Illumina platforms are all on the top strand - see "relative to the original sequenced strand of SEQ"). Some platforms however, such as the old ONT "2D" methods and some of their modern techniques, sequence forward along the template and then backwards again on the other (ie "opposite") strand, in order to reduce sequence-specific artifacts and to improve accuracy through consensus techniques. These methods have the potential to observe base modifications on both strands of the DNA template. Additionally in theory it may be possible for methods that prime from both ends of the template with an insert size that is short enough for the sequence pair to overlap, to produce a single double-stranded sequence record. (In practice we generally create two alignment records.)

The key thing to observe here is that these tags are additional and have to cope with standard manipulations of sequences by realignment. Such as converting back to fastq, aligning, and potentially mapping in the opposite orientation (reverse complementing). These tools may not be aware of the existing of the modification tags, hence the tags need to be grounded in terms of the original orientation as produced by the sequencing instrument. That unfortunately adds a second layer. Finally, the templates themselves may be randomly oriented with respect to the original chromosome, eg when replicating and fragmenting the DNA and then added adapters to the fragments with no regard to their orientation. Ie when paired seq we typically get ---read1---> <---read2--- or ---read2---> <---read1---.

Ie we have

  • Which strand(s) of the original DNA template were sequence (affects MM)
  • Whether or not the sequence has been reverse complemented during mapping (no affect on MM)
  • The orientation of the sequence insert/template prior to sequencing (no affect on MM, and it's also typically the cause of the point above).

All very confusing, but only one of these is modelled in MM.

@jrobinso
Copy link

Thanks for the explanation @jkbonfield .

I'm trying to insure IGV correctly handles the "-" fields correctly, as in "G-m", and ham hampered due to lack of real data that uses the "-" tag. Biologically it seems that "G-m" is equivalent to "C+m", , that is to say a cytosine was modified by addition of 5-methylcytosine. The bookkeeping differs of course. It appears the spec would allow an alignment to be annotated with both "G-m" and "C+m", and they could even conflict.

@jkbonfield
Copy link
Contributor

As some sequencing instruments can take double-stranded data and internally at least are producing signals for both strands, it is indeed possible that there can be modifications detected on both strands at the same coordinate, and even potentially differing ones. This is one of the questions I had when designing this (not being a biologist). Hence the format permits it, even if right now I don't think any tools are generating such data.

The lack of real data really hampered development, but we had a catch-22 problem for a while as you can imagine. You're presumably now hitting the same problem in viewers - support for something that doesn't yet exist in the wild! All I can suggest is manually faking some data.

@jrobinso
Copy link

@jkbonfield Thanks for responding, really helpful, I had a feeling I was "missing something" but I don't think I was. BTW the unit test data, which I think you created has been key, almost essential, for me in understanding the spec.

For now I'm not going to worry excessively about data that doesn't exist anywhere.

@cviner
Copy link

cviner commented Aug 26, 2022

In case it might be useful, please find below links to two genome-wide modified sequencing datasets, from BLUEPRINT. These data pertain to conventional and oxidative WGBS data generated for naive CD4+ T cells, extracted from the spleens of C57BL/6J mice, aged 6 weeks–8 weeks.

Conventional WGBS; annotates 5mC and 5hmC bases: GSE94674

Oxidative WGBS; annotates 5hmC only (ignoring a potentially small number of undetected 5fC nucleobases): GSE94675

These could be used to annotate existing SAM files, with modifications at specific positions. As @jkbonfield notes, it definitely is possible for differing and conflicting modifications to be detected at the same locus. Ultimately, some approach is needed to determine which takes precedence, unless all possibilities get annotated. We describe one such means of calling modified nucleobases in our TFBS work, which uses these datasets.

@jrobinso
Copy link

jrobinso commented Aug 26, 2022

Thanks @cviner this is helpful.

I understand that it is definitely possible for multiple modifications to be specified for a single position, each with a probability. This is common now in data I have (5mC and 5hmC called at the same position). The probabilities cannot sum to > 1 however. The G- / C+ combination example is different, however, as you can potentially specify the same biological modification (e.g. 5mC) at the same position with different probabilities. That is all that I was pointing out.

@jrobinso
Copy link

jrobinso commented Aug 26, 2022

@jkbonfield I'm making a lot of noise I guess, but maybe this is will be helpful. If I struggle with this I suspect someone else will. So after studying the unit test again, here is my understanding of "top strand" and the related "sequence as reported by the instrument". I don't know if this is an improvement or not, but it appears the latter refers to the sequence in the SEQ field, or its reverse complement if SAM flag bit 0x10 is set.

It's clear what I need to do from the examples and unit tests, but for me at least not clear from the spec. Terms such as "top strand" and "as reported by the instrument" are not precise, I have to make guesses as to how these relate to the fields in the SAM spec as neither "top strand" nor "instrument" are defined in that spec, or even mentioned.

So my suggestion would be to consider defining these terms explicitly in terms of fields in the SAM spec. This is after all what's needed to parse them. Possibly SEQ and the reverse-complement bit (0x10) are all that is needed, it appears to be all I need in order to parse the unit test data.

Hope this is helpful, if not ignore.

@jkbonfield
Copy link
Contributor

jkbonfield commented Aug 30, 2022

Don't worry about the noise. I wish there was more while drafting this spec really. It took ages because the lack of data (or indeed people with an actual understanding of the biology) meant drafting the spec took months (if not years). All queries are an indication that the spec is unclear and needs improvement, but I'm unsure on how best to rephrase things so am always open to suggestions. (I'll have a think about your comments above.)

There were still things that got missed or I didn't understand though when doing this. For example I didn't twig that some callers would only attempt to call CPG islands and leave every other C as unanalysed.

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Aug 31, 2022
This is better phrased in terms of original strand/orientation
produced by sequencing instrument, and "other strand" meaning the
opposite to the one it called.  This avoids trying to define what
"top" means and any ambiguity that causes in relation to the template
orientation (for example).

Fixes samtools#639
@jkbonfield jkbonfield added the sam label Sep 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Progressing
Development

Successfully merging a pull request may close this issue.

4 participants