[bcf spec] Empty variable length vector in Genotype encoding #593
Comments
The VCF specification allows a sample's trailing FORMAT fields to be dropped (case 2), but it also allows a single MISSING_VALUE to be used as an abbreviated way of expressing how many missing values there are in that field. HTSlib puts in the single missing value.
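To make the VCF-side equivalence concrete, here is a minimal sketch of how a reader might normalise a sample column whose trailing FORMAT fields were dropped. The helper name `pad_sample_fields` and the sample strings are made up for illustration, and this is not how HTSlib is implemented; it only shows that a dropped trailing field and an explicit `.` end up as the same value after parsing.

```python
MISSING = "."  # VCF text representation of a missing value


def pad_sample_fields(format_keys, sample):
    """Split one sample column and pad dropped trailing fields with '.'.

    Hypothetical helper: it treats a dropped trailing field exactly like
    a field that was written out as a single missing value.
    """
    values = sample.split(":")
    if len(values) > len(format_keys):
        raise ValueError("sample has more fields than FORMAT declares")
    # Every trailing field that was dropped becomes a single '.'.
    values += [MISSING] * (len(format_keys) - len(values))
    return dict(zip(format_keys, values))


format_keys = "GT:GQ:DP:HQ".split(":")
dropped = pad_sample_fields(format_keys, "0/0:41:3")     # HQ dropped
explicit = pad_sample_fields(format_keys, "0/0:41:3:.")  # HQ written as '.'
```

After parsing, `dropped` and `explicit` compare equal, which is why a converter that works on the parsed form cannot tell the two spellings apart.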
Yes, this is correct for VCF, but BCF clearly mandates a fixed length vector and the spec also literally says
There are no values present in the original vector, so shouldn't that result in [...]? I think this behaviour would be important to preserve the distinction in VCF when converting back and forth, so that [...]
Maybe it should, but it does not have to. HTSlib reads in VCF, internally converts the trailing missing fields to
Ok, I don't want to drag this on longer than necessary. Since all of these representations in the file are equivalent in what they represent, it might not make a big difference. I would still argue that the current wording of the spec suggests a specific representation (
In general, I think that the description of vectors in the spec is not very precise. The words "typed vector" and "vector" are used interchangeably in some sections and not in others. There is also the term "array of values" ...
The VCF spec has the following genotype fields in the example records
For the last sample, HQ is missing from all three records, but this is encoded differently:
It is perfectly clear to me how BCF represents cases 1 and 3, but what about case 2?
I had assumed that "empty vector" would basically be encoded as a vector consisting of two END_OF_VECTOR values as the spec says:
However, bcftools convert creates a vector of [MISSING_VALUE, END_OF_VECTOR]. Which is correct? Does my implementation need to emulate bcftools to be compatible?
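For reference, the two candidate encodings can be sketched as follows for an int8 field with a per-sample vector width of 2. The sentinel bytes (0x80 for missing, 0x81 for END_OF_VECTOR) are the int8 values from the BCF2 type table; the function names and the `empty_as_missing` flag are hypothetical and only mirror the behaviour described in this thread, not any real library API.

```python
INT8_MISSING = 0x80        # BCF2 int8 "missing" sentinel
INT8_END_OF_VECTOR = 0x81  # BCF2 int8 "end of vector" sentinel


def encode_int8(values, width, empty_as_missing=True):
    """Encode one sample's values into exactly `width` bytes.

    `None` stands for a missing value. With empty_as_missing=True an
    empty vector is written as [MISSING, END_OF_VECTOR, ...] (the output
    reported for bcftools above); with False it is written as
    END_OF_VECTOR padding only.
    """
    if len(values) > width:
        raise ValueError("more values than the fixed vector width")
    cells = []
    for v in values:
        if v is None:
            cells.append(INT8_MISSING)
        elif -120 <= v <= 127:  # 0x80..0x87 are reserved in BCF2
            cells.append(v & 0xFF)
        else:
            raise ValueError("value does not fit into a BCF2 int8")
    if not cells and empty_as_missing and width > 0:
        cells.append(INT8_MISSING)
    # Pad the remainder of the fixed-width slot with END_OF_VECTOR.
    cells += [INT8_END_OF_VECTOR] * (width - len(cells))
    return bytes(cells)


def decode_int8(raw):
    """Decode one sample's slot, stopping at the first END_OF_VECTOR."""
    out = []
    for b in raw:
        if b == INT8_END_OF_VECTOR:
            break
        if b == INT8_MISSING:
            out.append(None)
        else:
            out.append(b - 256 if b > 127 else b)
    return out


bcftools_style = encode_int8([], 2, empty_as_missing=True)
padding_only = encode_int8([], 2, empty_as_missing=False)
```

Under these assumptions, `decode_int8(bcftools_style)` yields one missing value while `decode_int8(padding_only)` yields an empty list, so a strict round trip back to VCF can only distinguish the dropped field from an explicit `.` if the writer uses the padding-only form.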
And could this maybe be made clearer in the spec? I think it is already confusing enough that there are so many different ways to encode the absence of something 🙄
Thank you!