Suggest {foo}:1-8 notation; add HD-VN notes around validating RNAMEs

Also small wording tweaks.
samtools · Dec 19, 2018 · e38e9b0
1 parent 7505297
Showing 1 changed file with 13 additions and 7 deletions.
diff --git a/SAMv1.tex b/SAMv1.tex
@@ -190,8 +190,8 @@ \subsubsection{Character set restrictions}\label{sec:charset}
 (They are also limited in length.)
 
 Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.%
-\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appear in HLA allele names.
-Appendix~\ref{sec:parse-region} describes an approach for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
+\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appears in HLA allele names.
+Appendix~\ref{sec:parse-region} describes approaches for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
 
 Thus they match the following regular expression:
 \begin{center}
@@ -203,7 +203,7 @@ \subsubsection{Character set restrictions}\label{sec:charset}
 \newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
 
 \noindent
-For clarity, elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
+For clarity, elsewhere in this specification we write this set of allowed characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
 Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
 
 \subsection{The header section}
@@ -1271,13 +1271,17 @@ \section{Parsing region notation}\label{sec:parse-region}
 else\qquad{\sf\ldots either {\sl str} does not contain a colon or the suffix is not plausibly numeric}
 \\
 \> if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
-\> else\quad{\sf\ldots error: unknown reference sequence name}
+\> else\quad{\sf\ldots error: unknown reference sequence name or invalid interval syntax}
 \end{tabbing}
 
 \noindent
 The check leading to ``{\sf error: ambiguous representation}'' is important as it prevents confusing interpretations of actually ambiguous input.
 Typically the set of valid reference sequence names will not contain names that are prefixes of other names in the set, so in practice this error will not usually be encountered in non-malicious data.
 
+Either in addition to this algorithm or as an alternative to it, tools can use additional delimiter characters to make an unambiguously parsable notation.
+We recommend a convention using curly brackets around the reference sequence name--- \verb"{"\emph{name}\verb"}"{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]} ---as being memorable, easily typed, unambiguous, and not expanded by most shells.
+% (RNAME cannot contain commas, so Bash's {a,b} brace expansion won't occur.)
+
 \section{SAM Version History}\label{sec:history}
 
 This lists the date of each tagged SAM version along with changes that
@@ -1293,9 +1297,11 @@ \section{SAM Version History}\label{sec:history}
 \subsection*{1.6: 28 November 2017 to current}
 
 \begin{itemize}
-\item Restricted the allowable punctuation characters in RNAME and similar fields.
-The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set.
-(Sep 2018)
+\item\textbf{Restricted the allowable punctuation characters in reference sequence names} (in {\tt @SQ SN}, {\sf RNAME}, etc).
+The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set. (Dec 2018)
+
+We recommend that implementations validating reference sequence names do so using the rules in Section~\ref{sec:charset}; are more lenient for files declaring $\mbox{\tt @HD VN} \leq 1.5$; and validate {\tt AN} only against these rules, not the previous more restrictive {\tt AN} rules.
+
 \item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
 \item Add {\tt @SQ DS} header tag. (Jul 2018)
 \item Add {\tt @RG BC} header tag. (Apr 2018)
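
For illustration only, here is a minimal Python sketch of how a tool might parse the curly-bracket notation recommended in the diff above, {name}[:begin[-end]]. It is not taken from the specification or from any existing implementation. The names RNAME_RE, REGION_RE and parse_braced_region are hypothetical, and two simplifying assumptions are made: coordinates are plain decimal digits, and an omitted begin or end is reported as None. The name check is built from the allowed character set described in the Character set restrictions hunk.

```python
import re

# Characters that the "Character set restrictions" text lists as disallowed
# in reference sequence names: backslash, comma, quotation marks, brackets.
_DISALLOWED = set('\\,"`\'()[]{}<>')

# Allowed characters: printable ASCII '!' through '~' minus the set above.
_ALLOWED = "".join(
    chr(c) for c in range(ord("!"), ord("~") + 1) if chr(c) not in _DISALLOWED
)

# A name additionally may not start with '*' or '='.
_FIRST = _ALLOWED.replace("*", "").replace("=", "")

# Hypothetical helper pattern for validating a reference sequence name.
RNAME_RE = re.compile("[" + re.escape(_FIRST) + "][" + re.escape(_ALLOWED) + "]*")

# {name}[:begin[-end]] -- names cannot contain braces, so the first '}'
# always terminates the name, even when the name itself contains colons.
REGION_RE = re.compile(r"\{([^{}]+)\}(?::(\d+)(?:-(\d+))?)?")


def parse_braced_region(text):
    """Return (name, begin, end); begin and end are None when omitted.

    Assumption of this sketch: coordinates are unsigned decimal integers.
    """
    match = REGION_RE.fullmatch(text)
    if match is None:
        raise ValueError("not in {name}[:begin[-end]] form: %r" % text)
    name, begin, end = match.groups()
    if RNAME_RE.fullmatch(name) is None:
        raise ValueError("invalid reference sequence name: %r" % name)
    return (
        name,
        int(begin) if begin is not None else None,
        int(end) if end is not None else None,
    )
```

Under these assumptions, parse_braced_region("{HLA-A*01:01:01:01}:1-8") returns ("HLA-A*01:01:01:01", 1, 8) with no lookup in the set of known names, whereas the unbraced form still needs the disambiguation algorithm from the Parsing region notation section.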