Suggest {foo}:1-8 notation; add HD-VN notes around validating RNAMEs
Also small wording tweaks.
jmarshall committed Dec 19, 2018
1 parent 7505297 commit e38e9b0
Showing 1 changed file with 13 additions and 7 deletions.
20 changes: 13 additions & 7 deletions SAMv1.tex
@@ -190,8 +190,8 @@ \subsubsection{Character set restrictions}\label{sec:charset}
(They are also limited in length.)

Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.%
\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appear in HLA allele names.
Appendix~\ref{sec:parse-region} describes an approach for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appears in HLA allele names.
Appendix~\ref{sec:parse-region} describes approaches for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
Thus they match the following regular expression:
\begin{center}
@@ -203,7 +203,7 @@ \subsubsection{Character set restrictions}\label{sec:charset}
\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
\noindent
For clarity, elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
For clarity, elsewhere in this specification we write this set of allowed characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
\subsection{The header section}
@@ -1271,13 +1271,17 @@ \section{Parsing region notation}\label{sec:parse-region}
else\qquad{\sf\ldots either {\sl str} does not contain a colon or the suffix is not plausibly numeric}
\\
\> if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
\> else\quad{\sf\ldots error: unknown reference sequence name}
\> else\quad{\sf\ldots error: unknown reference sequence name or invalid interval syntax}
\end{tabbing}
\noindent
The check leading to ``{\sf error: ambiguous representation}'' is important as it prevents confusing interpretations of actually ambiguous input.
Typically the set of valid reference sequence names will not contain names that are prefixes of other names in the set, so in practice this error will not usually be encountered in non-malicious data.
Either in addition to this algorithm or as an alternative to it, tools can use additional delimiter characters to make an unambiguously parsable notation.
We recommend a convention using curly brackets around the reference sequence name--- \verb"{"\emph{name}\verb"}"{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]} ---as being memorable, easily typed, unambiguous, and not expanded by most shells.
% (RNAME cannot contain commas, so Bash's {a,b} brace expansion won't occur.)
\section{SAM Version History}\label{sec:history}
This lists the date of each tagged SAM version along with changes that
@@ -1293,9 +1297,11 @@ \section{SAM Version History}\label{sec:history}
\subsection*{1.6: 28 November 2017 to current}
\begin{itemize}
\item Restricted the allowable punctuation characters in RNAME and similar fields.
The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set.
(Sep 2018)
\item\textbf{Restricted the allowable punctuation characters in reference sequence names} (in {\tt @SQ SN}, {\sf RNAME}, etc).
The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set. (Dec 2018)
We recommend that implementations validating reference sequence names do so using the rules in Section~\ref{sec:charset}; are more lenient for files declaring $\mbox{\tt @HD VN} \leq 1.5$; and validate {\tt AN} only against these rules, not the previous more restrictive {\tt AN} rules.
\item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
\item Add {\tt @SQ DS} header tag. (Jul 2018)
\item Add {\tt @RG BC} header tag. (Apr 2018)
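
Illustrative note (not part of the commit): the curly-bracket notation recommended in the diff above, \verb"{"\emph{name}\verb"}"{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]}, can be parsed without consulting the set of known reference names, because curly brackets are themselves excluded from reference sequence names. The Python sketch below is one possible reading of that convention, not code from the specification. The names `is_valid_rname` and `parse_braced_region` are hypothetical, the character class is built from the prose description in the character set restrictions section rather than copied from the specification's regular expression, and accepting only plain decimal digits for \emph{begin} and \emph{end} is an assumption.

```python
import re

# Characters excluded from reference sequence names, taken from the prose
# description in the diff: backslash, comma, quotation marks, and brackets.
_DISALLOWED = set("\\,\"`'()[]{}<>")

# Printable ASCII '!'..'~' minus the excluded characters.
_ALLOWED = [chr(c) for c in range(0x21, 0x7F) if chr(c) not in _DISALLOWED]

# The first character additionally may not be '*' or '='.
_FIRST = [c for c in _ALLOWED if c not in "*="]


def _char_class(chars):
    """Build a regex character class matching exactly the given characters."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"


RNAME_RE = re.compile(_char_class(_FIRST) + _char_class(_ALLOWED) + "*")

# {name}[:begin[-end]] ; coordinates as plain decimal digits (an assumption).
_BRACED_RE = re.compile(r"\{([^{}]+)\}(?::([0-9]+)(?:-([0-9]+))?)?")


def is_valid_rname(name):
    """Return True if name satisfies the character set restrictions."""
    return RNAME_RE.fullmatch(name) is not None


def parse_braced_region(text):
    """Parse '{name}', '{name}:begin' or '{name}:begin-end'.

    Returns (name, begin, end); begin and end are None when omitted.
    Raises ValueError if the syntax or the name is invalid.
    """
    m = _BRACED_RE.fullmatch(text)
    if m is None:
        raise ValueError("not in {name}[:begin[-end]] form: %r" % text)
    name, begin, end = m.groups()
    if not is_valid_rname(name):
        raise ValueError("invalid reference sequence name: %r" % name)
    return (name,
            int(begin) if begin is not None else None,
            int(end) if end is not None else None)
```

For example, `parse_braced_region("{HLA-A*01:01:01:01}:1-8")` separates the name from the interval even though the name contains colons. This sketch only checks syntax; whether the name is actually present in a given file's header is a separate check.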