Clarify MI for identifying source molecule strand #633

Open
wants to merge 4 commits into
base: master
Changes from 2 commits
11 changes: 7 additions & 4 deletions SAMtags.tex
@@ -299,10 +299,11 @@ \subsection{Barcodes}
\item
The \emph{UMI} is intended to identify the (single- or double-stranded) molecule at the time that the barcode was introduced.
This can be used to inform duplicate marking and make consensus calling in ultra-deep sequencing.
-Additionally, the UMI can be used to (informatically) link reads that were generated from the same long molecule, enabling long-range phasing and better informed mapping.
-In some experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes.
+In some experimental setups, opposite strands of the same double-stranded DNA molecule receive related barcodes that distinguish which strand each read was derived from.
+In this case, the {\tt MI} tag can not only store the unique molecular identifier but also group the reads that observe the top and bottom genomic strands, respectively.
These templates can also be considered duplicates even though technically they may have different UMIs.
-Multiple UMIs can be added by a protocol, possibly at different time-points, which means that specific knowledge of the protocol may be needed in order to analyze the resulting data correctly.
+Additionally, the UMI can be used to (informatically) link reads that were generated from the same long molecule, enabling long-range phasing and better informed mapping.
+Finally, multiple UMIs can be added by a protocol, possibly at different time-points, which means that specific knowledge of the protocol may be needed in order to analyze the resulting data correctly.
\end{itemize}

\begin{description}
@@ -337,7 +338,9 @@ \subsection{Barcodes}
\item[MI:Z:\tagvalue{str}]
Molecular Identifier.
A unique ID within the SAM file for the source molecule from which this read is derived.
-All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule.
+All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule.
+
+The {\tt MI} tag value may end with a {\tt /[\^{}/]} suffix indicating that it is one of several related barcodes\footnote{For example, {\tt MI:Z:mol1/A} and {\tt MI:Z:mol1/B} could be used to identify read pairs from the opposite strands of a duplex source molecule, where the {\tt MI:Z:mol1/A} reads are by convention the ``top (genomic) strand'' reads, for which the 5' unclipped position of read one (of the pair) is less than or equal to the 5' unclipped position of read two (of the pair). Tools can then find either the group of reads derived from that source molecule (those with the trimmed {\tt MI} value {\tt mol1}) or the groups of reads derived from each strand of that duplex source molecule (those with the full {\tt MI} value {\tt mol1/A} or {\tt mol1/B}, respectively).}. Where appropriate, tools may wish to omit these suffixes when determining a read's source molecule.
nh13 marked this conversation as resolved.

\item[OX:Z:\tagvalue{sequence+}]
Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the {\tt BZ} tag.
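To illustrate how a tool might consume the proposed suffix convention, here is a minimal Python sketch, not part of the PR or of any existing library. It assumes the suffix is exactly one non-slash character after a final `/`, as in the `/[^/]` pattern in the proposed text. The helper names `split_mi`, `mi_from_sam_line` and `group_reads` are hypothetical. Reads are grouped once by trimmed MI value (source molecule) and once by full MI value (strand of the source molecule).

```python
import re
from collections import defaultdict

# Assumption: the optional suffix is "/" followed by exactly one non-"/"
# character at the end of the value, mirroring the proposed "/[^/]" pattern.
_MI_WITH_SUFFIX = re.compile(r"(.*)/([^/])")


def split_mi(value):
    """Split an MI value into (molecule_id, suffix); suffix is None if absent."""
    match = _MI_WITH_SUFFIX.fullmatch(value)
    if match:
        return match.group(1), match.group(2)
    return value, None


def mi_from_sam_line(line):
    """Return the MI:Z value of one SAM alignment line, or None if it has none."""
    fields = line.rstrip("\n").split("\t")
    # The first 11 SAM fields are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("MI:Z:"):
            return field[len("MI:Z:"):]
    return None


def group_reads(sam_lines):
    """Group read names by trimmed MI (molecule) and by full MI (strand)."""
    by_molecule = defaultdict(list)
    by_strand = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        mi = mi_from_sam_line(line)
        if mi is None:  # reads without an MI tag are not grouped
            continue
        qname = line.split("\t", 1)[0]
        molecule_id, _suffix = split_mi(mi)
        by_molecule[molecule_id].append(qname)
        by_strand[mi].append(qname)
    return dict(by_molecule), dict(by_strand)
```

Under these assumptions, a value such as `mol1/AB` is treated as having no suffix, because the text describes a single trailing character. A tool that allowed longer suffixes would need a different pattern.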