Support CIGARs with >65535 operations in BAM files
This commit addresses samtools#40. It adds the optional tag `CG` and
explains the workaround for storing alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.
lh3 committed Jul 14, 2017
1 parent 084587e commit f49a2e6
Showing 2 changed files with 22 additions and 2 deletions.
8 changes: 8 additions & 0 deletions SAMtags.tex
@@ -58,6 +58,7 @@ \section{Standard tags}
{\tt BC} & Z & Barcode sequence \\
{\tt BQ} & Z & Offset to base alignment quality (BAQ) \\
{\tt CC} & Z & Reference name of the next hit \\
{\tt CG} & B,I & BAM-only tag to store the real {\sf CIGAR} if it contains $>$65535 operations\\
{\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM})\\
{\tt CO} & Z & Free-text comments \\
{\tt CP} & i & Leftmost coordinate of the next hit \\
@@ -125,6 +126,13 @@ \subsection{Additional Template and Mapping data}
\item[CC:Z:\tagvalue{rname}]
Reference name of the next hit; `{\tt =}' for the same chromosome.

\item[CG:B:I,\tagvalue{encodedCigar}]
Real CIGAR in its binary form if it contains $>$65535 operations. This is
intended to be a BAM file only tag as a workaround of BAM's incapability to
store long CIGARs in the standard way. SAM and CRAM files created with updated
tools aware of the workaround are not expected to contain this tag. See also
the footnote in Section 4.2 of the SAM spec for details.

\item[CP:i:\tagvalue{pos}]
Leftmost coordinate of the next hit.

16 changes: 14 additions & 2 deletions SAMv1.tex
@@ -808,7 +808,7 @@ \subsection{The BAM format}
& \multicolumn{2}{l|}{\sf refID} & Reference sequence ID, $-1\leq{\sf refID}<{\sf n\_ref}$; -1 for a read without a mapping position. & {\tt int32\_t} & [-1] \\\cline{2-6}
& \multicolumn{2}{l|}{\sf pos} & 0-based leftmost coordinate ($=\underline{\sf POS}-1$)& {\tt int32\_t} & [-1]\\\cline{2-6}
& \multicolumn{2}{l|}{\sf bin\_mq\_nl} & {\tt{\sf bin}\char60\char60 16\char124\underline{\sf MAPQ}\char60\char60 8\char124{\sf l\_read\_name}}; {\sf bin} is computed from the mapping position;\footnotemark\ {\sf l\_read\_name} is the length of {\sf read\_name} below ($={\sf length}(\underline{\sf QNAME})+1$). & {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf flag\_nc} & {\tt \underline{\sf FLAG}\char60\char60 16\char124{\sf n\_cigar\_op}};\footnotemark\ {\sf n\_cigar\_op} is the number of operations in \underline{\sf CIGAR}. & {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf flag\_nc} & {\tt \underline{\sf FLAG}\char60\char60 16\char124{\sf n\_cigar\_op}};\footnotemark\ {\sf n\_cigar\_op} is the number of operations in \underline{\sf CIGAR}\footnotemark. & {\tt uint32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf l\_seq} & Length of \underline{\sf SEQ} & {\tt int32\_t} & \\\cline{2-6}
& \multicolumn{2}{l|}{\sf next\_refID} & Ref-ID of the next segment ($-1\le{\sf mate\_refID}<{\sf n\_ref}$) & {\tt int32\_t} & [-1] \\\cline{2-6}
& \multicolumn{2}{l|}{\sf next\_pos} & 0-based leftmost pos of the next segment ($=\underline{\sf PNEXT}-1$) & {\tt int32\_t} & [-1] \\\cline{2-6}
@@ -824,7 +824,7 @@ \subsection{The BAM format}
\cline{1-6}
\end{tabular}}
\end{table}
\addtocounter{footnote}{-4}
\addtocounter{footnote}{-5}
\footnotetext{{\sf BIN} is calculated using the {\sf reg2bin()} function
in Section~\ref{sec:code}. For mapped reads this uses {\sf POS-1}
(i.e.,~0-based left position) and the alignment end point using the
@@ -838,6 +838,18 @@ \subsection{The BAM format}
\footnotetext{As noted in Section~\ref{sec:alnrecord}, reserved {\sf FLAG} bits
should be written as zero and ignored on reading by current software.}
\stepcounter{footnote}
\footnotetext{With 16 bits, {\sf n\_cigar\_op} can keep at most 65535 {\sf
CIGAR} operations in BAM files. For an alignment with more {\sf CIGAR}
operations, BAM stores the real {\sf CIGAR}, in its binary form, to the {\tt
CG} optional tag of type `{\tt B,I}', and sets the {\sf CIGAR} to `{\tt kS}' as
a placeholder, where `{\tt k}' is the length of {\sf SEQ} and `{\tt S}' the
soft-clipping {\sf CIGAR} operator (i.e. in the binary form, {\sf
n\_cigar\_op}=1 and {\sf cigar}={\tt [k\char60\char60 4\char124{4}]}).
This workaround is applied to BAM files \emph{only}. SAM and CRAM files are not
affected. If tag {\tt CG} is present, BAM parsing libraries are expected to
seamlessly update {\sf n\_cigar\_op} and {\sf cigar} with the real {\sf CIGAR}
stored in the {\tt CG} tag.}
\stepcounter{footnote}
\footnotetext{For backward compatibility, a {\sf QNAME} `{\tt *}' is stored as a C string {\tt "*\char92 0"}.}
\stepcounter{footnote}
\footnotetext{An integer may be stored as one of `{\tt cCsSiI}' in BAM, representing {\tt int8\_t}, {\tt uint8\_t},
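The workaround described in the new footnote can be illustrated with a short Python sketch. It is not taken from htslib or any other library: the function names `encode_cigar`, `to_bam_fields` and `real_cigar` are hypothetical, and the byte layout of the `CG` tag assumes the usual BAM encoding of a `B`-type array (two-character tag, type `B`, subtype `I`, a 32-bit count, then little-endian `uint32_t` values). The placeholder follows the `kS` form exactly as written in this commit.

```python
import struct

CIGAR_OPS = "MIDNSHP=X"      # BAM operator codes 0..8, in spec order
MAX_N_CIGAR_OP = 0xFFFF      # n_cigar_op occupies 16 bits of flag_nc


def encode_cigar(ops):
    """Pack (length, operator) pairs into BAM's binary form: len<<4 | op."""
    return [(length << 4) | CIGAR_OPS.index(op) for length, op in ops]


def to_bam_fields(ops, l_seq):
    """Return (cigar, cg_tag) as a BAM writer might store them.

    cg_tag is None when the CIGAR fits in the 16-bit n_cigar_op field.
    """
    encoded = encode_cigar(ops)
    if len(encoded) <= MAX_N_CIGAR_OP:
        return encoded, None
    # Placeholder `kS`: a single soft clip spanning the whole of SEQ.
    placeholder = [(l_seq << 4) | CIGAR_OPS.index("S")]
    n = len(encoded)
    # Assumed aux layout: tag "CG", type 'B', subtype 'I', count, values.
    cg_tag = b"CGBI" + struct.pack("<I", n) + struct.pack("<%dI" % n, *encoded)
    return placeholder, cg_tag


def real_cigar(cigar, cg_tag):
    """Reader side: prefer the CIGAR stored in CG when the tag is present."""
    if cg_tag is None:
        return cigar
    if cg_tag[:4] != b"CGBI":
        raise ValueError("expected a CG:B:I tag")
    (n,) = struct.unpack_from("<I", cg_tag, 4)
    return list(struct.unpack_from("<%dI" % n, cg_tag, 8))
```

For example, an alignment with 80000 operations and a 80000-base `SEQ` would be written with `n_cigar_op`=1 and the full CIGAR in `CG`, while a reader calling `real_cigar` would get the original 80000 operations back. A real parser would also need to remove the `CG` tag from the record's auxiliary data after restoring the CIGAR, which this sketch omits.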
