[bcf spec] Empty variable length vector in Genotype encoding #593

Open
h-2 opened this issue Aug 31, 2021 · 4 comments

h-2 commented Aug 31, 2021

The VCF spec has the following genotype fields in the example records

GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
[...]
GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

For the last sample, HQ is missing in all three records, but each record encodes this differently:

  1. Two missing values
  2. HQ is absent for this sample
  3. HQ is absent for all samples

It is perfectly clear to me how BCF represents cases 1 and 3, but what about case 2?
I had assumed that "empty vector" would basically be encoded as a vector consisting of two END_OF_VECTOR values as the spec says:

In the situation when a genotype field contain vector values of different lengths, these are represented in BCF2 by a vector of the maximum length per sample, with all values in the each vector aligned to the left, and END OF VECTOR values assigned to all values not present in the original vector.

However, bcftools convert creates a vector of [MISSING_VALUE, END_OF_VECTOR].
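
To make the two candidate encodings concrete, here is a minimal sketch of how an empty HQ vector could be laid out as int8 values. The sentinel bytes 0x80 (missing) and 0x81 (end of vector) are the int8 values from the BCF2 type table; the helper `pad_int8` and its `fill_first_with_missing` flag are hypothetical names used only for illustration and are not part of bcftools or HTSlib.

```python
# int8 sentinel values from the BCF2 type table
INT8_MISSING = 0x80
INT8_END_OF_VECTOR = 0x81


def pad_int8(values, width, fill_first_with_missing=False):
    """Left-align `values` in `width` int8 slots (hypothetical helper).

    `values` holds ints, or None for a '.' entry in the VCF text.
    Unused slots are filled with END_OF_VECTOR.
    """
    out = [INT8_MISSING if v is None else v & 0xFF for v in values]
    if not out and fill_first_with_missing:
        # Variant matching the observed bcftools output for an absent field
        out.append(INT8_MISSING)
    out += [INT8_END_OF_VECTOR] * (width - len(out))
    return bytes(out)


literal_reading = pad_int8([], 2)  # literal reading of the spec wording
bcftools_style = pad_int8([], 2, fill_first_with_missing=True)
print(literal_reading.hex(), bcftools_style.hex())  # prints: 8181 8081
```

The two results differ only in the first byte, which is the whole point of the question.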

Which is correct? Does my implementation need to emulate bcftools to be compatible?

And could this maybe be made clearer in the spec? I think it is already confusing enough that there are so many different ways to encode the absence of something 🙄

Thank you!

Member

pd3 commented Sep 1, 2021

The VCF specification allows a sample's trailing FORMAT fields to be dropped (case 2), but it also allows a single MISSING_VALUE to be used as an abbreviation, regardless of how many missing values the field would otherwise contain. HTSlib writes the single missing value.

Author

h-2 commented Sep 1, 2021

The VCF specification allows a sample's trailing FORMAT fields to be dropped (case 2), but it also allows a single MISSING_VALUE to be used as an abbreviation, regardless of how many missing values the field would otherwise contain.

Yes, this is correct for VCF, but BCF clearly mandates a fixed-length vector, and the spec also literally says:

END OF VECTOR values assigned to all values not present in the original vector.

There are no values present in the original vector, so shouldn't that result in [END_OF_VECTOR, END_OF_VECTOR]?

I think this behaviour would be important to preserve the distinction in VCF when converting back and forth, so that

  • VCF .,. -> BCF [MISSING_VALUE, MISSING_VALUE] -> VCF .,. (all values explicitly marked as missing)
  • VCF . -> BCF [MISSING_VALUE, END_OF_VECTOR] -> VCF . (one missing value for whole vector)
  • VCF -> BCF [END_OF_VECTOR, END_OF_VECTOR] -> VCF (trailing missing values dropped)
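
The round trip in the list above can be sketched as follows. This is only an illustration of the proposed mapping, not the behaviour of any existing tool: `vcf_to_bcf`, `bcf_to_vcf`, and the string sentinels are hypothetical, and a dropped trailing field is modelled as `None`.

```python
from itertools import takewhile

# Symbolic stand-ins for the typed sentinel values (illustrative only)
MISSING = "MISSING_VALUE"
EOV = "END_OF_VECTOR"


def vcf_to_bcf(field, width):
    """Convert one sample's VCF text for a field into a padded vector.

    `field` is None when the trailing field was dropped in the VCF record.
    """
    if field is None:
        values = []
    else:
        values = [MISSING if tok == "." else int(tok) for tok in field.split(",")]
    # Left-align and pad with END_OF_VECTOR up to the per-record maximum
    return values + [EOV] * (width - len(values))


def bcf_to_vcf(vector):
    """Inverse mapping: an all-END_OF_VECTOR vector becomes a dropped field."""
    values = list(takewhile(lambda v: v != EOV, vector))
    if not values:
        return None  # only valid to drop if the field is trailing
    return ",".join("." if v == MISSING else str(v) for v in values)


for text in (".,.", ".", None, "51,51"):
    vector = vcf_to_bcf(text, 2)
    print(repr(text), "->", vector, "->", repr(bcf_to_vcf(vector)))
```

Under this mapping each of the three VCF spellings gets its own BCF vector, so converting back and forth leaves the text unchanged.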

jmarshall added the vcf label Sep 1, 2021
Member

pd3 commented Sep 1, 2021

Maybe it should, but it does not have to. HTSlib reads the VCF, internally converts the dropped trailing fields to ., and then outputs a single missing value (MISSING_VALUE,END_OF_VECTOR in BCF) for both VCF and BCF outputs. In other words, it does not preserve the dropped trailing fields but explicitly adds a missing value. This behavior is allowed by VCF. Preserving END_OF_VECTOR,END_OF_VECTOR is not important; I haven't encountered a practical use case where this would matter. However, in your implementation you are certainly free to preserve it, and if HTSlib chokes on it, then that must be considered a bug.
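
The normalisation described here can be pictured with a short sketch. It is not HTSlib code; `normalise_sample` is a hypothetical helper that only mimics the described effect on the VCF text, namely filling every dropped trailing FORMAT field with an explicit `.`.

```python
def normalise_sample(format_keys, sample_text):
    """Fill dropped trailing FORMAT fields with an explicit '.' (sketch).

    `format_keys` is the split FORMAT column, e.g. ["GT", "GQ", "DP", "HQ"].
    """
    fields = sample_text.split(":")
    # Any FORMAT key without a matching sample field was dropped at the end
    fields += ["."] * (len(format_keys) - len(fields))
    return ":".join(fields)


keys = "GT:GQ:DP:HQ".split(":")
print(normalise_sample(keys, "0/0:41:3"))  # prints: 0/0:41:3:.
```

After this step the information that the field was dropped rather than written as `.` is gone, which is why the two cases end up with the same BCF vector.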

Author

h-2 commented Sep 2, 2021

Ok, I don't want to drag this on longer than necessary. Since all of these representations in the file are equivalent in what they represent, it might not make a big difference.

I would still argue that the current wording of the spec suggests a specific representation (END OF VECTOR values assigned to all values not present in the original vector.), and I would prefer that the spec be clarified in this regard, e.g.

Vectors that contain zero or more MISSING values followed by zero or more END_OF_VECTOR values are all treated as an "empty vector". Implementations are not required to preserve the exact representation of "empty vector".
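
A reader following this proposed wording could test for an "empty vector" with a predicate like the one below. This is a sketch under the assumption that the wording is adopted as written; `is_empty_vector` and the string sentinels are hypothetical names.

```python
# Symbolic stand-ins for the typed sentinel values (illustrative only)
MISSING = "MISSING_VALUE"
EOV = "END_OF_VECTOR"


def is_empty_vector(vector):
    """True if `vector` is zero or more MISSING followed by zero or more EOV."""
    i = 0
    while i < len(vector) and vector[i] == MISSING:
        i += 1
    while i < len(vector) and vector[i] == EOV:
        i += 1
    # Anything left over is a real value or a sentinel in the wrong order
    return i == len(vector)


print(is_empty_vector([MISSING, EOV]))  # prints: True
print(is_empty_vector([51, EOV]))       # prints: False
```

With such a predicate, [MISSING_VALUE, MISSING_VALUE], [MISSING_VALUE, END_OF_VECTOR] and [END_OF_VECTOR, END_OF_VECTOR] all compare as equivalent, so a reader does not depend on which form a writer chose.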

In general, I think that the description of vectors in the spec is not very precise. The words "typed vector" and "vector" are used interchangeably in some sections and not in others. There is also the term "array of values" ...
Important bits are described in "Genotype encoding" (before the sections on "vectors" and "vectors of mixed length" are even introduced) -- and what actually is a "vector of mixed length"? What it means is a vector of vectors where the inner vectors are not all the same length. IMHO it would make much more sense to call the paragraph "vector of vectors" and to move things from "Genotype encoding" there, so that it describes in general how "vectors of vectors" are implemented.
