Clarify MI for identifying source molecule strand #633

Open
wants to merge 4 commits into
base: master

nh13
Member

@nh13 nh13 commented Mar 2, 2022

For duplex sequencing technology, we can identify the strand (top or bottom) from which the read was derived relative to the duplex source molecule. This convention or recommendation would allow the strand information to be appended to the MI tag. This is already being used in the wild by fgbio and their users.

Related to samtools/samtools#1605.

@hts-specs-bot

Changed PDFs as of 3727bec: SAMtags (diff). This link will expire in 30 days

@jkbonfield
Contributor

Sorry for the slow reply.

Why /A and /B? That seems both redundant given we have /1 and /2 routinely used, and also rather sailing against the norm.
I do agree with /1 and /2 though. I'm not sure "the convention" is correct given this is basically done by one piece of software only. I'd maybe prefer something like "... the recommended way ...", or if we wish to be more forceful, "the IDs should end in /1 or /2". We'd probably also want some language here that states for purposes of comparing two identifiers as being equal, characters after and including the last "/" should be ignored.
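
The comparison rule suggested here (ignore everything from the last "/" onward when testing two identifiers for equality) could be sketched as follows. This is only an illustration of that sentence, not text from the spec or code from any tool; the function names `mi_stem` and `same_molecule` are hypothetical.

```python
def mi_stem(mi: str) -> str:
    """Return the MI value with the last '/' and everything after it removed."""
    stem, sep, _suffix = mi.rpartition("/")
    # rpartition yields ("", "", mi) when there is no "/", so keep mi unchanged.
    return stem if sep else mi


def same_molecule(mi_a: str, mi_b: str) -> bool:
    """Compare two MI values while ignoring any trailing '/suffix'."""
    return mi_stem(mi_a) == mi_stem(mi_b)


print(same_molecule("100005/A", "100005/B"))
print(same_molecule("100005/A", "100006/A"))
```

Note that an identifier which legitimately contains a slash would also be trimmed under this rule.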

@tfenne
Member

tfenne commented Mar 28, 2022

A couple of thoughts @jkbonfield. I believe the /A and /B derive from the original duplex-seq paper out of UW, which refers to the two strands of the duplex as the A and B strands. I don't love using /1 and /2 personally - folks are very used to that meaning read number in FASTQs, and here we're talking about differentiating two DNA strands from a duplex, where there is no correlation to sequencing read, nor any real correlation to genomic strand. I would worry that using /1 and /2 could introduce confusion and make people think they should be linked to read numbers.

Understood that this is "one piece of software", though it should be pointed out that the one piece of software is recommended by both IDT and Twinstrand Bio, the two largest players in the duplex UMI space. I personally don't care if it's called convention, recommendation, common practise or something else - just trying to give context.

@jkbonfield
Contributor

jkbonfield commented Mar 28, 2022

Thanks Tim. That's useful, and indeed I was confusing /1 and /2 with the read numbers, so I had already fallen into that trap.

Tbh I don't really know enough about this stuff to work out what's appropriate for the spec vs what's appropriate for a specific bit of software. However clearly it's helpful for the spec to say something as it's necessary to flag this up somehow, and if you think there's realistically only the one implementation out there right now then it's a good time to codify how things should look. (This is a mistake we made before by taking too long to define barcodes!)

@nh13
Member Author

nh13 commented Mar 30, 2022

I am happy to remove /1 and /2. What else is needed for this to be approved?

@jkbonfield
Contributor

We discussed it briefly at the File Formats conference call yesterday (you're welcome to join us for such things if you wish btw, as is anyone else with a general interest). The general consensus is it perhaps needs approaching from the other direction.

For example, instead of defining MI and then an additional note that it may end in /A, /B, tackle it more head on by defining MI in a structured manner of identifier plus status codes (eg ([^/]*)(/.*)?, but I'm not sure that's the best regexp, and maybe a regexp isn't the right description anyway). It then needs to explain that the first element is the part you use when comparing for equality, and the second element may be used for strand identification. It may be worth considering if we could envisage other status information appearing at some point, and to leave room for expansion if you think it's a possibility.
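
As a rough illustration of the structured form floated above, the quoted regexp can be applied with Python's `re` module. This is a sketch only: the names `parse_mi` and `ParsedMI` are hypothetical, and the regexp is the tentative one from the comment, not agreed spec text.

```python
import re
from typing import NamedTuple, Optional

# The tentative pattern from the comment: identifier, then an optional "/..." part.
MI_PATTERN = re.compile(r"([^/]*)(/.*)?")


class ParsedMI(NamedTuple):
    identifier: str        # the part used when comparing for equality
    status: Optional[str]  # e.g. strand information, without the leading "/"


def parse_mi(value: str) -> ParsedMI:
    match = MI_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unparseable MI value: {value!r}")
    status = match.group(2)
    return ParsedMI(match.group(1), status[1:] if status is not None else None)


print(parse_mi("100005/A"))
print(parse_mi("100005"))
```

Because `[^/]*` stops at the first slash, this pattern treats everything after the first "/" as status information, which differs from a rule that trims only after the last "/". The two agree whenever the value contains at most one slash.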

I think @jmarshall has some comments to add too, hopefully along similar lines.

@jmarshall
Member

jmarshall commented Mar 30, 2022

As alluded to in samtools/samtools#1605 (comment) (which inspired this proposal by asking for an issue to be raised here), SAMtags.pdf currently describes MI as

A unique ID within the SAM file for the source molecule from which this read is derived. All reads with the same MI tag represent the group of reads derived from the same source molecule.

You seem to be saying that in fact fgbio is using the union of e.g. MI:Z:foo/A and MI:Z:foo/B to represent the group of reads derived from the same source molecule. IMHO this is an abuse of the MI field and it would have been better to add an additional (e.g. MS) tag to contain the strand information.

(Thus MI:Z:foo and e.g. MI:Z:foo MS:Z:B would give the strand information while preserving the specification's invariant that MI:Z:foo is the group of reads derived from the foo molecule.)

The problem is that there is no defined character set for MI values, so other tools may have been using slash characters as ordinary parts of identifiers. Hence downstream tools seeing a slash in these values would not know in advance whether or not they need to strip a /suffix from the MI value in order to compute the molecule equivalence classes as intended.

Is this something that only fgbio does, or do other workflows heavily using MI do something similar? What do MI values typically look like, both as generated by fgbio or by other workflows?
(Hence: Is the slash character still mostly available to be redefined as a suffix sigil?)

I am not necessarily against redefining MI in this way, provided MI values found in the wild do not preclude this redefinition of slashes. But the appropriate way for the spec to phrase it would be rather different (which is why I asked for an issue rather than proposed text 😄), something like

The MI tag value may end with a /[^/]+ suffix indicating that it is one of several related barcodes. Where appropriate, tools may wish to omit these suffixes when determining a read's source molecule.

[Footnote] For example, MI:Z:mol1/A and MI:Z:mol1/B could be used to identify reads from the top and bottom strands of a source molecule. Then tools can find either the group of reads derived from that source molecule (those with the trimmed MI value mol1) or the groups of reads derived from each strand of that source molecule (those with the full MI value mol1/A, or mol1/B respectively).
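
The two groupings described in the footnote can be sketched in a few lines. This is a minimal illustration under the assumption that reads are available as (read name, MI value) pairs; `trim_suffix` and `group_reads` are hypothetical names, and the suffix pattern is the `/[^/]+` form from the suggested wording.

```python
import re
from collections import defaultdict

# A final "/" followed by one or more non-"/" characters at the end of the value.
SUFFIX = re.compile(r"/[^/]+$")


def trim_suffix(mi: str) -> str:
    """Drop a trailing /suffix, if present."""
    return SUFFIX.sub("", mi)


def group_reads(reads):
    """Group (read_name, mi_value) pairs by source molecule and by strand."""
    by_molecule = defaultdict(list)
    by_strand = defaultdict(list)
    for name, mi in reads:
        by_molecule[trim_suffix(mi)].append(name)  # trimmed value: whole molecule
        by_strand[mi].append(name)                 # full value: one strand
    return dict(by_molecule), dict(by_strand)


reads = [("r1", "mol1/A"), ("r2", "mol1/A"), ("r3", "mol1/B"), ("r4", "mol2/A")]
by_molecule, by_strand = group_reads(reads)
print(by_molecule)
print(by_strand)
```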

Additional notes:

  • It should also explain what it means by “top” and “bottom”; see #639, “Base modifications section uses undefined term ‘top strand’”.
  • If it's going to talk about /1 and /2 then it needs to explicitly say that this is a confusing orthogonal usage to QNAMEs as seen in FASTQ files. (Probably in a footnote.)
  • You need to decide whether fgbio wants an arbitrarily long suffix (as in your initial samtools PR) or a single character (as in later versions).
  • Possibly this sort of suffix is what “In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes” in the §1.3 introduction is intended to refer to, in which case the new text and that sentence should be altered so the link is clear.

@nh13
Member Author

nh13 commented Apr 15, 2022

Is this something that only fgbio does, or do other workflows heavily using MI do something similar? What do MI values typically look like, both as generated by fgbio or by other workflows?
(Hence: Is the slash character still mostly available to be redefined as a suffix sigil?)

In fgbio, they look something like MI:Z:100005 post consensus calling (when the top and bottom strands have been used to create a consensus read), and MI:Z:100005/A and MI:Z:100005/B for top and bottom strands for the raw reads prior to consensus calling.
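
To make those value shapes concrete, here is a small sketch that pulls the MI tag out of a tab-separated SAM record and reports the molecule and, where present, the /A or /B strand suffix. It assumes only the standard SAM layout (optional fields start at column 12) and the value forms described in this comment; the function name `mi_from_sam_line` and the example record are made up for illustration, and this is not how fgbio itself is implemented.

```python
def mi_from_sam_line(line: str):
    """Return (molecule, strand) from the MI:Z tag, or None if there is no MI tag.

    strand is "A" or "B" for raw reads and None when no such suffix is present
    (for example, after consensus calling).
    """
    fields = line.rstrip("\n").split("\t")
    for field in fields[11:]:  # the 11 mandatory SAM columns come first
        if field.startswith("MI:Z:"):
            value = field[len("MI:Z:"):]
            stem, sep, suffix = value.rpartition("/")
            if sep and suffix in ("A", "B"):
                return stem, suffix
            return value, None
    return None


# Hypothetical record, built field by field so the tabs are explicit.
mandatory = ["q1", "0", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"]
raw_read = "\t".join(mandatory + ["MI:Z:100005/A"])
consensus_read = "\t".join(mandatory + ["MI:Z:100005"])
print(mi_from_sam_line(raw_read))
print(mi_from_sam_line(consensus_read))
```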

As for your suggestions, I've commented below:

Removed /1 and /2 and explained "top" and "bottom" incorporating your suggested text and footnote.

You need to decide whether fgbio wants an arbitrarily long suffix (as in your initial samtools PR) or a single character (as in later versions).

Changing it to a two-character suffix: the slash and a single character.

Possibly this sort of suffix is what “In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes” in the §1.3 introduction is intended to refer to, in which case the new text and that sentence should be altered so the link is clear.

Done.

@hts-specs-bot

hts-specs-bot commented Apr 15, 2022

Changed PDFs as of 3aafe16: SAMtags (diff). This link will expire in 30 days

@nh13
Member Author

nh13 commented Apr 15, 2022

Looks like the link to the PDF from the bot is broken :/
Fixed by a preview of the upcoming GitHub Actions-based robot (see jmarshall#3).

@hts-specs-bot

hts-specs-bot commented Jun 21, 2022

Changed PDFs as of f1fcd12: SAMtags (diff).

@nh13
Member Author

nh13 commented Jun 28, 2022

@jkbonfield @jmarshall do you have any more guidance on this? We merged samtools/samtools#1605 which supports the strand of the source molecule in the template:coordinate sort.

@jkbonfield
Contributor

jkbonfield commented Jul 4, 2022

@jmarshall makes a good point about modifying MI vs adding a new e.g. MS tag. I simply don't have sufficient knowledge though to know how these are used in practice and to know whether others are already using slashes in their tag names.

The SAMtags doc at the moment clearly states:

The UMI is intended to identify the (single- or double-stranded) molecule at the time the barcode was introduced.

This is somewhat ambiguous, as a compound X/Y tag does still mean the UMI can identify the molecule X despite having different tags, but only by using additional formatting knowledge. However it then goes on to say

In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes. These templates can also be considered duplicates even though technically they may have different UMIs.

So this is explicit in permitting UMI to imply duplication through something other than a naive strict string comparison. This PR does appear to be in agreement with that original intention, even though it is at odds with the wording in the MI tag itself. This isn't an ideal starting point obviously.

I see your PR modifies both the introductory text I quoted above, as well as the MI tag definition itself. That seems reasonable. Specifically, "related" is totally woolly, and this PR now clarifies how this relation is codified.

@jmarshall
Member

Especially as the related samtools PR has since been merged, the ship has long since sailed on weighing up altering MI vs adding e.g. MS.

I fixed up the formatting previously so this can be reviewed more easily. IMHO the PR's direction is acceptable; I have some clarifications to the proposed text that I would like to make.

@jmarshall jmarshall self-assigned this Jul 19, 2022
Status: Stalled
5 participants