From 0ef79c01cd255fb7f5c3a3d6c36eee8cdaa9be3c Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Tue, 3 Mar 2020 20:31:00 +0500
Subject: [PATCH] Restrict allowed VCF Contig ID chars the same way as SAM RNAME (and allow colons) (#379)

* Allow colons in VCF Contig IDs: breakend notation is unambiguous

Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern about
altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of #124. Fixes #258. Closes #291.

* Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs

Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes #124 and fixes #167 for VCF; their SAM aspects were previously
fixed by PR #333.
---
 VCFv4.1.tex |  2 +-
 VCFv4.2.tex |  2 +-
 VCFv4.3.tex | 18 ++++++++++++++++--
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/VCFv4.1.tex b/VCFv4.1.tex
index 2a76a9fe2..108166553 100644
--- a/VCFv4.1.tex
+++ b/VCFv4.1.tex
@@ -157,7 +157,7 @@ \subsubsection{Fixed fields}
 There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (`.'). Fixed fields are:
 
 \begin{enumerate}
- \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
+ \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. (String, no white-space permitted, Required).
  \item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
  \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)
  \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g.\ complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
diff --git a/VCFv4.2.tex b/VCFv4.2.tex
index c9b2efe39..52e1de546 100644
--- a/VCFv4.2.tex
+++ b/VCFv4.2.tex
@@ -174,7 +174,7 @@ \subsubsection{Fixed fields}
 There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (`.'). Fixed fields are:
 
 \begin{enumerate}
- \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
+ \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. (String, no white-space permitted, Required).
  \item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
  \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)
  \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g.\ complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
diff --git a/VCFv4.3.tex b/VCFv4.3.tex
index 34fc3985e..b46956469 100644
--- a/VCFv4.3.tex
+++ b/VCFv4.3.tex
@@ -226,7 +226,14 @@ \subsubsection{Contig field format}
 \end{verbatim}
 
 \noindent
-Valid contig names must follow the reference sequence names allowed by the SAM format ("{\tt [!-)+-\char60\char62-\char126][!-\char126]*}") excluding the characters "\texttt{\textless\textgreater[]:*}" to avoid clashes with symbolic alleles and breakend notation.
+Contig names follow the same rules as the SAM format's reference sequence names:
+they may contain any printable ASCII characters in the range \verb|[!-~]| apart from `{\tt\verb|\|\,,\,"`'\,()\,[]\,\verb|{}|\,<>}' and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{verbatim}
+ [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
+\end{verbatim}
+\noindent
+In particular, excluding commas facilitates parsing \verb|##contig| lines, and excluding the characters `\verb|<>[]|' and initial~`{\tt *}' avoids clashes with symbolic alleles.
 
 The contig names must not use a reserved symbolic allele name.
 
@@ -288,7 +295,6 @@ \subsubsection{Fixed fields}
 \begin{enumerate}
  \item CHROM --- chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf.\ the \#\#assembly line in the header).
  All entries for a specific CHROM must form a contiguous block within the VCF file.
- The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends.
  (String, no white-space permitted, Required).
  \item POS --- position: The reference position, with the 1st base having position 1.
  Positions are sorted numerically, in increasing order, within each reference sequence CHROM.
@@ -2046,6 +2052,14 @@ \subsection{Changes to VCFv4.3}
 \begin{itemize}
 \item More strict language: ``should'' replaced with ``must'' where appropriate
 \item Tables with Type and Number definitions for INFO and FORMAT reserved keys
+
+\item
+The set of characters allowed in VCF contig names is now the same as that allowed in SAM reference sequence names, which was restricted in January 2019.
+The characters `{\tt\verb|\|\,,\,"`'\,()\,\verb|{}|}' are now invalid in VCF contig names, while `{\tt *}' is now valid when not the first character.
+(The characters `{\tt []\,<>}' and initial~`{\tt *}'/`{\tt =}' were already invalid and remain so.)
+
+The VCF specification previously disallowed colons (`{\tt :}') in contig names to avoid confusion when parsing breakends, but this was unnecessary.
+Even with contig names containing colons, the breakend mate position notation can be unambiguously parsed because the ``{\tt :}\emph{pos}'' part is \textbf{always} present.
 \end{itemize}
 
 \subsection{Changes between VCFv4.2 and VCFv4.3}
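
Illustrative note (not part of the patch): the two rules this change introduces can be checked with a few lines of Python. The sketch below is only a rough illustration under stated assumptions. The regular expression is copied from the VCFv4.3 text added above; the function names `is_valid_contig_name` and `split_breakend_mate` are hypothetical and do not come from any existing library. The second function assumes the mate position has already been extracted from the breakend ALT string as "contig:pos" and splits it on the last colon, which is why colons inside the contig name are harmless.

```python
import re
from typing import Tuple

# Regular expression copied from the VCFv4.3 text added by this patch.
CONTIG_NAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_contig_name(name: str) -> bool:
    """Hypothetical helper: True if the whole name matches the regex."""
    return CONTIG_NAME_RE.fullmatch(name) is not None


def split_breakend_mate(mate: str) -> Tuple[str, int]:
    """Hypothetical helper: split 'contig:pos' on the LAST colon.

    The ':pos' part is always present, so everything before the last
    colon is the contig name, even if that name contains colons itself.
    """
    contig, sep, pos = mate.rpartition(":")
    if not sep or not contig or not (pos.isascii() and pos.isdigit()):
        raise ValueError(f"not of the form contig:pos: {mate!r}")
    return contig, int(pos)
```

For example, a name such as `HLA-A*01:01:01:01` passes the check because `*` is only rejected as the first character, whereas `chr,1` and `*chr1` are rejected. A mate position written as `HLA-A*01:01:123` would split into the contig `HLA-A*01:01` and the position 123.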