-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify description of structured vs unstructured meta-information lines #620
Conversation
Introduce the term "unstructured meta-information line", and reword this section so it describes the two flavours of meta-information line clearly. Specify that an unstructured value must not start with `<` (so that structured/unstructured are easily distinguished) and must be non-empty. Remove `<>` from unstructured `##pedigreeDB` example. PR samtools#88 removed the `<>` from one instance of `##pedigreeDB=<url>` presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance.
The specification now consistently reflects that `##pedigreeDB`'s value should not be delimited by angle brackets (despite being an URL!). Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause of failure is the invalid URL hostname rather than the angle brackets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good improvement to the meta-data section. I assume we need to apply this to VCF4.4.draft.tex too though.
I also noted in my comments that we have a number of other issues related to quoting and "default" vs "extra" fields. I understand what it's trying to say, but it only really works if we actually define the default fields, which we don't! That's a topic for another PR though probably.
For example: | ||
\begin{verbatim} | ||
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128"> | ||
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="128"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate this hasn't changed and has come from the earlier version, but given it's an example I think it would be clearer with actual examples. Especially as "number" isn't a valid number, but we're using the verbatim style text rather than italic to indicate it's a placeholder term.
Eg:
##INFO=<ID=varType,Number=1,Type=String,Description="Variant type",Source="example",Version="128">
VCFv4.3.tex
Outdated
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>| | ||
\end{quote} | ||
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix. | ||
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.), extra fields can be included after the default fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the clarification, and later it explains (partially) something that's always been a total mystery to me: why some strings are quoted in VCF and others are not.
However I'm unclear what "default fields" means. Can we assume for the examples shown that everything in the example is classified as "default". For example the definitions of ##INFO
always have ID, Number, Type and Description so it's clear.
However this isn't univerally true for all structured lines. For example, the definition of ##contig
in section 1.4.7 is:
The structured \texttt{contig} field must include the ID attribute and typically includes also sequence length, MD5 checksum, URL tag to indicate where the sequence can be found, etc.
For example:
\begin{verbatim}
##contig=<ID=ctg1,length=81195210,URL=ftp://somewhere.org/assembly.fa,...>
\end{verbatim}
"Typically includes" isn't exactly a tight term for describing what the default fields are. Why do our contig lines not quote all those values, given they're not explicitly defined and are simply examples of things that may occur. Is the fact that our examples always quote "species" but not "md5" is because we view "species" as a non-default field and "md5" as a default field? Or is it because in our examples the species is Homo sapiens and we decided to quote because of the space?
If I had to interpret this with no knowledge of what happens in practice, I'd say the ID, length and URL are default fields and the "..." indicates it can be followed by extra fields, but that's simply not true.
We later have an example here:
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
This ends with taxonomy=x
which isn't quoted, and therefore is considered to be one of the default fields, and by dint of being after species we can determine that is also a default field (using the rule than extra fields always follow the default ones).
Put simply, I appreciate your improvements and are minded to merge as-is, but the entire "default" vs "extra" field is clear as mud. Maybe one of the original authors would care to explain this in a better way?
I think the error here though isn't with this PR, but with elsewhere in the document. Specifically the lack of a proper definition for ##contig
. (Or PEDIGREE
with the "Name_1" .. "Name_N" being used in the example, without quoting.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put simply, I appreciate your improvements and are minded to merge as-is
That is surely a determination to be made by the VCF maintainers…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put simply, I appreciate your improvements and are minded to merge as-is
That is surely a determination to be made by the VCF maintainers…
Agreed, I wasn't actually going to merge, but rather poorly adding my personal view.
As for the other points I made, I note that ##INFO
uses the terms "required" and "recommended" to define "default" and "extra" fields.
However they're not really the same thing. "Default" may mean a field whose key is defined in the VCF spec, but it could be optional. So probably ##INFO
needs to be more explicit too (along with every other structured definition).
As noted in the introductory PR comment, “The maintainers will likely wish to apply similar changes to VCFv4.4.draft.tex”. Most of your comments are IMHO issues for followup clarifications. I hope the VCF maintainers will consider this PR as is and look to merge it within a finite timespan, and consider questions around which fields are default fields etc to be issues they may wish to take up for separate future improvement. |
PR samtools#88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex, presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally. As some readers still refer to these older documents, remove the misleading notation here too.
One other thought here, we know we can add our own additional meta-information lines. However as these are not formally defined, I assume all keys are therefore mandatory quoted (even if numbers) as they are all to be considered as "extra" and hence string only rather than the "default" keys with differing types? Does that need to be made more explicit somewhere? An example would be (Bcftools only seems to add unstructured lines.) |
Suggestion from the meeting: also mention that there can be structured lines not specified by the specification, and say something about default fields (only |
As per the suggestion in #620 (comment) and at last month's meeting, I've added a sentence mentioning the possibility of other structured meta-information lines beyond those defined in the spec and noting that their only default field is I did not add anything about quoting on these lines, because IMHO the existing text makes an unclear distinction between “extra fields” (which for these lines would be all fields other than ID) and “optional fields“ (which are the ones that require quoting). Going further would require a VCF maintainer to determine what is intended here. |
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>| | ||
\end{quote} | ||
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix. | ||
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields. | |
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included and are in some case required, after the default fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which ones are you suggesting are required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of Number, Type, and Description for INFO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my mind, ID+Number/Type/Description are the “default fields” for ##INFO
.
IMHO the problem here is that §1.4 starts talking about details of INFO etc that should be in §1.4.2 etc. Disentangling that (and revisiting the mysterious default/extra/optional fields terminology) I think should be a followup to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
PR #88 introduced the term “structured meta-information line”, but was unclear about the relationship between those and other meta-information lines, and §1.4 remains unclear overall — see e.g. #608.
This PR introduces the term “unstructured meta-information line” (an alternative might be “simple meta-information line”), and rewords this section so it describes the two flavours of meta-information line clearly.
This PR does not attempt to tighten up the syntax around what should be quoted etc as advocated for in #608. OTOH it does newly specify that:
<
(so that structured/unstructured are easily distinguished; this is already de facto enforced by the HTSlib implementation)Hence it also removes
<>
from §5.4.11's unstructured##pedigreeDB
example. As discussed in #608 (comment), PR #88 removed the<>
from §1.4.9's instance of##pedigreeDB=<url>
presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance. Adjust the test files accordingly.The test/vcf/4.[123]/passed/complexfile_passed_000.vcf test files still contain another unstructured meta-information line with angle bracket delimiters:
The maintainers should also update the
##AnalysisTitleBrackets
lines, either just removing them or perhaps rewriting them as##AnalysisTitleStructured=<ID="FINRISK: … (DILGOM)">
.The maintainers will likely wish to apply similar changes to VCFv4.4.draft.tex.