Merge branch 'master' into cigar-64k

dkj · Oct 3, 2017 · e4ef421
2 parents 95b4c75 + 3f2e480
commit e4ef421
Showing 12 changed files with 229 additions and 23 deletions.
diff --git a/CRAMv3.pdf b/CRAMv3.pdf
diff --git a/SAMtags.pdf b/SAMtags.pdf
diff --git a/SAMtags.tex b/SAMtags.tex
@@ -2,6 +2,7 @@
 \usepackage[margin=1in]{geometry}
 \usepackage{longtable}
 \usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref}
+\usepackage[title]{appendix}
 
 \newcommand{\mailtourl}[1]{\href{mailto:#1}{\tt #1}}
 \newcommand{\tagvalue}[1]{\tt #1}
@@ -55,16 +56,17 @@ \section{Standard tags}
   \hline
   {\tt AM} & i & The smallest template-independent mapping quality of segments in the rest \\
   {\tt AS} & i & Alignment score generated by aligner \\
-  {\tt BC} & Z & Barcode sequence \\
+  {\tt BC} & Z & Barcode sequence identifying the sample \\
   {\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
+  {\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\
   {\tt CC} & Z & Reference name of the next hit \\
   {\tt CG} & B,I & Intended to store the real {\sf CIGAR} if it contains $>$65535 operations\\
   {\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM})\\
   {\tt CO} & Z & Free-text comments \\
   {\tt CP} & i & Leftmost coordinate of the next hit \\
   {\tt CQ} & Z & Color read base qualities \\
   {\tt CS} & Z & Color read sequence \\
-  {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features.\\
+  {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features \\
   {\tt E2} & Z & The 2nd most likely base calls \\
   {\tt FI} & i & The index of segment in the template \\
   {\tt FS} & Z & Segment suffix \\
@@ -76,32 +78,36 @@ \section{Standard tags}
   {\tt H1} & i & Number of 1-difference hits (see also {\tt NM}) \\
   {\tt H2} & i & Number of 2-difference hits \\
   {\tt HI} & i & Query hit index \\
-  {\tt IH} & i & Number of stored alignments in SAM that contains the query in the current record\\
+  {\tt IH} & i & Number of stored alignments in SAM that contain the query in the current record \\
   {\tt LB} & Z & Library \\
-  {\tt MC} & Z & CIGAR string for mate/next segment\\
+  {\tt MC} & Z & CIGAR string for mate/next segment \\
   {\tt MD} & Z & String for mismatching positions \\
   {\tt MF} & ? & Reserved for backwards compatibility reasons \\
+  {\tt MI} & Z & Molecular identifier; a string that uniquely identifies the molecule from which the record was derived \\
   {\tt MQ} & i & Mapping quality of the mate/next segment \\
-  {\tt NH} & i & Number of reported alignments that contains the query in the current record\\
+  {\tt NH} & i & Number of reported alignments that contain the query in the current record \\
   {\tt NM} & i & Edit distance to the reference \\
   {\tt OC} & Z & Original CIGAR \\
   {\tt OP} & i & Original mapping position \\
   {\tt OQ} & Z & Original base quality \\
+  {\tt OX} & Z & Original unique molecular barcode bases \\
   {\tt PG} & Z & Program \\
   {\tt PQ} & i & Phred likelihood of the template \\
   {\tt PT} & Z & Read annotations for parts of the padded read sequence \\
   {\tt PU} & Z & Platform unit \\
-  {\tt QT} & Z & Barcode ({\tt BC} or {\tt RT}) phred-scaled base qualities \\
   {\tt Q2} & Z & Phred quality of the mate/next segment sequence in the {\tt R2} tag \\
+  {\tt QT} & Z & Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag \\
+  {\tt QX} & Z & Quality score of the unique molecular identifier in the {\tt RX} tag \\
   {\tt R2} & Z & Sequence of the mate/next segment in the template \\
   {\tt RG} & Z & Read group \\
   {\tt RT} & Z & Barcode sequence (deprecated; use {\tt BC} instead) \\
+  {\tt RX} & Z & Sequence bases of the (possibly corrected) unique molecular identifier \\
   {\tt SA} & Z & Other canonical alignments in a chimeric alignment \\
   {\tt SM} & i & Template-independent mapping quality \\
   {\tt SQ} & ? & Reserved for backwards compatibility reasons \\
   {\tt S2} & ? & Reserved for backwards compatibility reasons \\
   {\tt TC} & i & The number of segments in the template \\
-  {\tt U2} & Z & Phred probility of the 2nd call being wrong conditional on the best being wrong \\
+  {\tt U2} & Z & Phred probability of the 2nd call being wrong conditional on the best being wrong \\
   {\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\
   {\tt X?} & ? & Reserved for end users \\
   {\tt Y?} & ? & Reserved for end users \\
@@ -244,10 +250,44 @@ \subsection{Barcodes}
 
 \begin{description}
 \item[BC:Z:\tagvalue{sequence}]
-Barcode sequence, with any quality scores stored in the {\tt QT} tag.
-
-\item[QT:Z:\tagvalue{qualities}]
-Phred quality of the barcode sequence in the {\tt BC} (or {\tt RT}) tag. Same encoding as {\sf QUAL}.
+Barcode sequence (identifying the sample/library), with any quality scores (optionally) stored in the {\tt QT} tag.
+The {\tt BC} tag should match the {\tt QT} tag in length. 
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (`{\tt -}') between the barcodes from the same template. 
+
+\item[QT:Z:\tagvalue{qualities}] 
+Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag. 
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (`{\tt \textvisiblespace}') between the different strings from the same template. 
+
+\item[RX:Z:\tagvalue{sequence+}]
+Sequence bases from the unique molecular identifier. 
+These could be either corrected or uncorrected. Unlike {\tt MI}, the value may be non-unique in the file. 
+Should consist of a sequence of bases.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.
+
+If the bases represent corrected bases, the original sequence can be stored in {\tt OX} (similar to {\tt OQ} storing the original base qualities).
+
+\item[QX:Z:\tagvalue{qualities+}] 
+Phred quality of the unique molecular identifier sequence in the {\tt RX} tag. 
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+The qualities here may have been corrected (raw bases and qualities can be stored in {\tt OX} and {\tt BZ} respectively).
+The lengths of the {\tt QX} and the {\tt RX} tags must match. 
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings.
+
+\item[MI:Z:\tagvalue{str}]
+Molecular Identifier. 
+A unique ID within the SAM file for the source molecule from which this read is derived. 
+All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule. 
+
+\item[OX:Z:\tagvalue{sequence+}] 
+Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the {\tt BZ} tag. 
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes.
+
+\item[BZ:Z:\tagvalue{qualities+}] 
+Phred quality of the (uncorrected) unique molecular identifier sequence in the {\tt OX} tag.
+Same encoding as {\sf QUAL}, i.e., Phred score + 33.
+The {\tt OX} tag should match the {\tt BZ} tag in length.
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings.
 
 \item[RT:Z:\tagvalue{sequence}]
 Deprecated alternative to {\tt BC} tag originally used at Sanger.
@@ -345,4 +385,19 @@ \section{Locally-defined tags}
 \url{https://github.com/samtools/hts-specs/issues} and/or by sending email
 to \mailtourl{[email protected]}.
 
+\begin{appendices}
+\appendix
+\section{SAM Tags History}\label{sec:history}
+
+This lists the date of each tagged SAM version along with changes that
+have been made while that version was current.  
+
+\subsection*{1.5: 23 May 2013 to current}
+\begin{itemize}
+\item Add UMI-related tags (RX, QX, OX, BZ, MI) and clarify usage of sample barcode tag BC. (August 2017)
+\item SAMtags.txt (this file) created with tags from SAMv1 
+\end{itemize}
+
+\end{appendices}
+
 \end{document}
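
The UMI conventions added to SAMtags.tex above (hyphen-separated barcodes in {\tt RX}, space-separated quality strings in {\tt QX}, Phred+33 encoding, matching lengths) can be illustrated with a short Python sketch. The function name `split_umi` and its return shape are hypothetical and chosen only for this example; the sketch is not taken from any existing tool.

```python
def split_umi(rx, qx=None):
    """Pair each UMI in an RX value with its decoded qualities from QX.

    Assumes the recommended layout from the hunk above: barcodes in RX
    are joined with '-', quality strings in QX are joined with ' ',
    and qualities use the QUAL encoding (Phred score + 33).
    """
    barcodes = rx.split("-")
    if qx is None:
        # Qualities are optional; report the bases alone.
        return [(bases, None) for bases in barcodes]

    quals = qx.split(" ")
    # RX and QX must match in length, barcode by barcode.
    if len(quals) != len(barcodes) or any(
        len(b) != len(q) for b, q in zip(barcodes, quals)
    ):
        raise ValueError("RX and QX lengths do not match")

    # Convert each quality character back to its Phred score.
    return [(b, [ord(c) - 33 for c in q]) for b, q in zip(barcodes, quals)]


if __name__ == "__main__":
    for bases, scores in split_umi("ACGT-TTAG", "IIII ####"):
        print(bases, scores)
```

The same splitting would apply to the raw {\tt OX}/{\tt BZ} pair, since the hunk describes the same separators for those tags.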
diff --git a/SAMv1.pdf b/SAMv1.pdf
diff --git a/SAMv1.tex b/SAMv1.tex
@@ -7,6 +7,7 @@
 \usepackage{longtable}
 \usepackage{makecell}
 \usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref}
+\usepackage[title]{appendix}
 
 \makeindex
 
@@ -35,6 +36,10 @@ \section{The SAM Format Specification}
 information such as mapping position, and variable number of optional
 fields for flexible or aligner specific information.
 
+This specification is for version 1.5 of the SAM and BAM formats.  Each SAM and
+BAM file may optionally specify the version being used via the
+{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}. 
+
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lower cases
 clipped from the alignment. Read {\tt r001/1} and {\tt r001/2}
@@ -194,14 +199,28 @@ \subsection{The header section}
     grouped by {\sf QNAME}), and {\tt reference} (alignments are grouped by
     {\sf RNAME}/{\sf POS}).\\\cline{1-3}
   \multicolumn{2}{|l}{\tt @SQ} & Reference sequence dictionary. The order of {\tt @SQ} lines defines the alignment sorting order.\\\cline{2-3}
-  & {\tt SN}* & Reference sequence name. Each {\tt @SQ} line must have a unique {\tt SN} tag. The value of this
-  field is used in the
+  & {\tt SN}* & Reference sequence name.
+The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
+must be distinct.
+  The value of this field is used in the
   alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
   & {\tt LN}* & Reference sequence length. \emph{Range}: {\tt [1,2$^{31}$-1]}\\\cline{2-3}
   & {\tt AH} & Indicates that this sequence is an alternate locus.%
 \footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
   The value is the locus in the primary assembly for which this sequence is an alternative, in the format `\emph{chr}{\tt :}\emph{start}{\tt -}\emph{end}', `\emph{chr}' (if known), or `{\tt *}' (if unknown), where `\emph{chr}' is a sequence in the primary assembly.
   Must not be present on sequences in the primary assembly.\\\cline{2-3}
+  & {\tt AN} & Alternative reference sequence names.
+A comma-separated list of alternative names that tools may use when referring
+to this reference sequence.%
+\footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}',
+tools can ensure that a user's request for any of `MT', `chrMT', `M',
+or~`chrM' succeeds and refers to the same sequence.
+Note the restricted set of characters allowed in an alternative name.}
+These alternative names are not used elsewhere within the SAM file;
+in particular, they must not appear in alignment records' {\sf RNAME}
+or~{\sf RNEXT} fields.
+\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
+where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
   & {\tt AS} & Genome assembly identifier. \\\cline{2-3}
   & {\tt M5} & MD5 checksum of the sequence in the uppercase, excluding spaces but including pads (as `*'s).\\\cline{2-3}
   & {\tt SP} & Species.\\\cline{2-3}
@@ -1057,4 +1076,75 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s
 \end{verbatim}
 }
 
+\pagebreak
+
+\begin{appendices}
+\appendix
+\section{SAM Version History}\label{sec:history}
+
+This lists the date of each tagged SAM version along with changes that
+have been made while that version was current.  The key changes
+that caused the version number to change are shown in bold.
+
+Note the auxiliary tags have now moved to their own
+specification with its own version numbering.\footnote{
+\href{http://samtools.github.io/hts-specs/SAMtags.pdf}{http://samtools.github.io/hts-specs/SAMtags.pdf}}
+
+\subsection*{1.5: 23 May 2013 to current}
+
+\begin{itemize}
+\item Add {\tt @SQ AH} header tag. (Mar 2017)
+\item Auxiliary tags migrated to SAMtags document. (Sep 2016)
+\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)
+\item QNAME limited to 254 bytes (was 255). (Aug 2015)
+\item Generalise 0x200 flag bit as filtered-out bit. (Aug 2015)
+\item Add {\tt @HD GO} for group order. (Mar 2015)
+\item Add {\tt ONT} to the {\tt @RG PL} and {\tt @RG PM} header tags. (Mar 2015)
+\item Add meaning to reverse FLAG on unmapped reads. (Mar 2015)
+\item Document the {\tt idxstats} .bai elements. (Nov 2014)
+\item Addition of CSI index. (Sep 2014)
+\item Add {\tt MC} auxiliary tag. (Dec 2013)
+\item Add {\tt @PG DS} header field. (Dec 2013)
+\item Document the BAM EOF byte values. (Dec 2013)
+\item Glossary of alignment types. (May 2013)
+\item Add {\tt SA:Z} tag; PNEXT/RNEXT points to next read, not
+  segment.  (May 2013)
+\item \textbf{Add SUPPLEMENTARY flag bit}. (May 2013)
+\end{itemize}
+
+\subsection*{1.4: 21 April 2011 to May 2013}
+
+\begin{itemize}
+\item Add guide to using sequence annotations ({\tt CT/PT tags}). (Mar 2012)
+\item Increase max reference length from $2^{29}$ to $2^{31}$. (Sep
+  2011)
+\item Add {\tt CO} and {\tt RT} auxiliary tags. (Sep 2011)
+\item Clarify {\tt @SQ M5} header tag generation. (Sep 2011)
+\item Describe padded alignments and add {\tt CT/PT tags}. (Sep 2011)
+\item Add {\tt BC} barcode auxiliary tag. (Sep 2011)
+\item Change {\tt FZ} tag from type {\tt H} to type {\tt B,S}. (Aug 2011)
+\item Add {\tt @RG FO}, {\tt KS} header fields. (Apr 2011)
+\item Add {\tt FZ} auxiliary tag. (Apr 2011)
+\item Clarify chaining of PG records. (Apr 2011)
+\item \textbf{Add {\tt B} array auxiliary tag type.} (Apr 2011)
+\item \textbf{Permit IUPAC in SEQ and {\tt MD} auxiliary tag.} (Apr 2011)
+\item \textbf{Permit QNAME ``{\tt *}''.} (Apr 2011)
+\end{itemize}
+
+\subsection*{1.3: July 2010 to April 2011}
+
+\begin{itemize}
+\item Re-add {\tt CC} and {\tt CP} auxiliary tags. (Mar 2011)
+\item Add CIGAR N intron/skip operator. (Dec 2010)
+\item Add {\tt BQ} BAQ tag. (Nov 2010)
+\item Add {\tt RG PG} header field. (Nov 2010)
+\item Add BAM description and index sections. (Nov 2010)
+\item \textbf{Removal of FLAG letters.} (July 2010)
+\end{itemize}
+
+\subsection*{1.0: 2009 to July 2010}
+
+Initial edition.
+
+\end{appendices}
 \end{document}
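
The {\tt @SQ AN} rules introduced in the SAMv1.tex header hunk above (a comma-separated list of alternative names, each matching the stated regular expression, with all {\tt SN} and {\tt AN} names distinct) could be checked along the following lines. This is a minimal Python sketch; `build_name_index` is a hypothetical helper, and the input is assumed to be tab-separated {\tt @SQ} header lines.

```python
import re

# Alternative-name pattern from the AN description; the LaTeX "\_" is a plain underscore.
AN_NAME = re.compile(r"[0-9A-Za-z][0-9A-Za-z*+.@_|-]*")


def build_name_index(sq_lines):
    """Map every SN and AN name to the SN of its @SQ line.

    Raises ValueError if an alternative name is malformed or if any
    name (SN or AN) appears more than once across the @SQ lines.
    """
    index = {}
    for line in sq_lines:
        # Skip the "@SQ" record type, then split each "TAG:value" field.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        sn = fields["SN"]
        names = [sn]
        if "AN" in fields:
            alts = fields["AN"].split(",")
            for alt in alts:
                if not AN_NAME.fullmatch(alt):
                    raise ValueError(f"invalid alternative name: {alt!r}")
            names.extend(alts)
        for name in names:
            if name in index:
                raise ValueError(f"duplicate reference name: {name!r}")
            index[name] = sn
    return index


if __name__ == "__main__":
    header = ["@SQ\tSN:MT\tAN:chrMT,M,chrM\tLN:16569"]
    print(build_name_index(header))
```

With the footnote's example line, a request for any of `MT`, `chrMT`, `M`, or `chrM` resolves to the sequence named `MT`, while alignment records continue to use only the {\tt SN} value.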
diff --git a/VCFv4.1.pdf b/VCFv4.1.pdf
diff --git a/VCFv4.1.tex b/VCFv4.1.tex
@@ -1155,7 +1155,7 @@ \subsubsection{Type encoding}
 
 \vspace{0.3cm}
 
-\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order.  It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file.  For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value.
+\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order.  It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file.  For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values.
 
 \vspace{0.3cm}
 \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard.  This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example).  BCF2 supports the full range of values from -Infinity to +Infinity, including NaN.  BCF2 needs to represent missing values for single precision floating point numbers.  This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN.  From the NaN wikipedia entry, we have:
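
The reworded Integers sentence in the hunk above drops the "all bits set" description, which did not fit the listed values: 0x80, 0x8000 and 0x80000000 each have only the top bit set. A small Python sketch of decoding under that rule follows. It assumes the integers are signed two's-complement values, so each sentinel is the most negative value of its width; `decode_int` is a hypothetical name used only here.

```python
import struct

# Little-endian signed formats for 8-, 16- and 32-bit integers.
FORMATS = {1: "<b", 2: "<h", 4: "<i"}

# 0x80, 0x8000 and 0x80000000 read as signed values of each width.
MISSING = {1: -0x80, 2: -0x8000, 4: -0x80000000}


def decode_int(raw):
    """Decode a 1-, 2- or 4-byte little-endian integer; None means missing."""
    size = len(raw)
    if size not in FORMATS:
        raise ValueError(f"unsupported integer width: {size} bytes")
    (value,) = struct.unpack(FORMATS[size], raw)
    return None if value == MISSING[size] else value


if __name__ == "__main__":
    print(decode_int(b"\x7f"))       # ordinary 8-bit value
    print(decode_int(b"\x00\x80"))   # 0x8000 in little-endian byte order
```

An encoder choosing the narrowest width for a value has to keep these sentinels out of the usable range of each width.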

diff --git a/VCFv4.2.pdf b/VCFv4.2.pdf
diff --git a/VCFv4.2.tex b/VCFv4.2.tex
@@ -1172,7 +1172,7 @@ \subsubsection{Type encoding}
 
 \vspace{0.3cm}
 
-\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order.  It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file.  For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value.
+\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order.  It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file.  For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values.
 
 \vspace{0.3cm}
 \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard.  This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example).  BCF2 supports the full range of values from -Infinity to +Infinity, including NaN.  BCF2 needs to represent missing values for single precision floating point numbers.  This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN.  From the NaN wikipedia entry, we have:

diff --git a/VCFv4.3.pdf b/VCFv4.3.pdf