[bcf spec] Empty variable length vector in Genotype encoding #593

h-2 · 2021-08-31T15:04:08Z

The VCF spec has the following genotype fields in its example records:

GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
[...]
GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

For the last sample, HQ is missing from all three records, but this is encoded differently in each:

1. Two missing values
2. HQ is absent for this sample
3. HQ is absent for all samples

It is perfectly clear to me how BCF represents cases 1. and 3., but what about number 2?
I had assumed that "empty vector" would basically be encoded as a vector consisting of two END_OF_VECTOR values as the spec says:

> In the situation when a genotype field contain vector values of different lengths, these are represented in BCF2 by a vector of the maximum length per sample, with all values in the each vector aligned to the left, and END OF VECTOR values assigned to all values not present in the original vector.

However, bcftools convert creates a vector of [MISSING_VALUE, END_OF_VECTOR].
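To make the two candidate encodings concrete, here is a minimal sketch of both padding rules applied to the HQ values of the second record. The sentinel names are symbolic stand-ins (real BCF uses type-specific bit patterns), and both helper functions are hypothetical, written only to illustrate the question.

```python
# Symbolic stand-ins for the BCF sentinel values (assumption: real BCF
# encodes these as type-specific bit patterns, not strings).
MISSING = "MISSING_VALUE"
EOV = "END_OF_VECTOR"


def pad_literal(values, width):
    """Left-align values and fill the rest with END_OF_VECTOR,
    reading the quoted spec sentence literally."""
    return list(values) + [EOV] * (width - len(values))


def pad_like_bcftools(values, width):
    """Same, except an empty vector first receives one MISSING_VALUE,
    matching the bcftools convert output reported above."""
    values = list(values) or [MISSING]
    return values + [EOV] * (width - len(values))


# HQ for the three samples of the second record; the last sample has none.
samples = [[58, 50], [65, 3], []]
width = max(len(s) for s in samples)

literal = [pad_literal(s, width) for s in samples]
observed = [pad_like_bcftools(s, width) for s in samples]
print(literal[-1])   # ['END_OF_VECTOR', 'END_OF_VECTOR']
print(observed[-1])  # ['MISSING_VALUE', 'END_OF_VECTOR']
```

The two rules differ only for a sample with no values at all; every non-empty vector is padded identically.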

Which is correct? Does my implementation need to emulate bcftools to be compatible?

And could this maybe be made clearer in the spec? I think it is already confusing enough that there are so many different ways to encode the absence of something 🙄

Thank you!

pd3 · 2021-09-01T08:52:12Z

The VCF specification allows a sample's trailing FORMAT fields to be dropped (case 2), but it also allows a single MISSING_VALUE to stand in for the whole field, however many missing values it would otherwise contain. HTSlib writes the single missing value.

h-2 · 2021-09-01T11:22:55Z

> The VCF specification allows a sample's trailing FORMAT fields to be dropped (case 2), but it also allows a single MISSING_VALUE to stand in for the whole field, however many missing values it would otherwise contain.

Yes, this is correct for VCF, but BCF clearly mandates a fixed-length vector, and the spec also literally says:

> END OF VECTOR values assigned to all values not present in the original vector.

There are no values present in the original vector, so shouldn't that result in [END_OF_VECTOR, END_OF_VECTOR]?

I think this behaviour would be important to preserve the distinction in VCF when converting back and forth, so that

VCF .,. -> BCF [MISSING_VALUE, MISSING_VALUE] -> VCF .,. (all values explicitly marked as missing)
VCF . -> BCF [MISSING_VALUE, END_OF_VECTOR] -> VCF . (one missing value for whole vector)
VCF (empty) -> BCF [END_OF_VECTOR, END_OF_VECTOR] -> VCF (empty) (trailing missing values dropped)
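The mapping above can be sketched as a pair of conversion functions. This is only an illustration of the proposed round trip under stated assumptions: the sentinels are symbolic strings, values are integers, a dropped field is modelled as `None`, and `vcf_to_bcf` / `bcf_to_vcf` are hypothetical names, not part of any library.

```python
MISSING = "MISSING_VALUE"
EOV = "END_OF_VECTOR"


def vcf_to_bcf(text, width):
    """Encode one sample's field as a fixed-width vector.
    text is None when the field was dropped from the VCF sample column."""
    if text is None:
        values = []
    else:
        values = [MISSING if tok == "." else int(tok) for tok in text.split(",")]
    return values + [EOV] * (width - len(values))


def bcf_to_vcf(vector):
    """Decode a vector back to VCF text; None means the field is dropped."""
    values = [v for v in vector if v != EOV]
    if not values:
        return None
    return ",".join("." if v == MISSING else str(v) for v in values)


# Each of the three VCF spellings survives the round trip unchanged.
for text in (".,.", ".", None):
    vector = vcf_to_bcf(text, 2)
    assert bcf_to_vcf(vector) == text
    print(repr(text), "->", vector)
```

Note that VCF only permits dropping trailing fields, so a writer following this scheme would still have to emit `.` when a field decoded as `None` is followed by a field that has a value.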

pd3 · 2021-09-01T13:06:30Z

Maybe it should, but it does not have to. HTSlib reads in the VCF, internally converts the dropped trailing fields to ., and then outputs them as MISSING_VALUE,END_OF_VECTOR, for both VCF and BCF output. In other words, it does not preserve the dropped trailing fields but explicitly adds a missing value. This behavior is allowed by VCF. Preserving END_OF_VECTOR,END_OF_VECTOR is not important; I haven't encountered a practical use case where it would matter. However, you are certainly free to preserve it in your implementation, and if HTSlib chokes on it, that must be considered a bug.
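The normalisation described here can be modelled at the text level. This is only a sketch of the behaviour as stated in this comment, not HTSlib code; `fill_trailing` is a hypothetical helper.

```python
def fill_trailing(format_keys, sample_text):
    """Pad a VCF sample column so that every FORMAT key has a value,
    writing '.' for each trailing field that was dropped."""
    fields = sample_text.split(":")
    fields += ["."] * (len(format_keys) - len(fields))
    return ":".join(fields)


keys = ["GT", "GQ", "DP", "HQ"]
print(fill_trailing(keys, "0/0:41:3"))       # 0/0:41:3:.
print(fill_trailing(keys, "0|1:3:5:65,3"))   # 0|1:3:5:65,3
```

After this step the dropped field and an explicit `.` are indistinguishable, which is why the original spelling cannot be recovered on output.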

h-2 · 2021-09-02T11:04:44Z

Ok, I don't want to drag this on longer than necessary. Since all of these representations in the file are equivalent in what they represent, it might not make a big difference.

I would still argue that the current wording of the spec suggests a specific representation ("END OF VECTOR values assigned to all values not present in the original vector"), and I would prefer that the spec be clarified in this regard, e.g.:

> Vectors that contain zero or more MISSING values followed by zero or more END_OF_VECTOR values are all treated as an "empty vector". Implementations are not required to preserve the exact representation of "empty vector".
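As a rough illustration of how a reader could apply the proposed wording, the check below accepts any run of MISSING values followed by any run of END_OF_VECTOR values. The sentinels are symbolic strings and `is_empty_vector` is a hypothetical name; this is a sketch of the suggested rule, not of any existing implementation.

```python
from itertools import dropwhile

MISSING = "MISSING_VALUE"
EOV = "END_OF_VECTOR"


def is_empty_vector(vector):
    """True if vector is zero or more MISSING followed by zero or more EOV."""
    # Skip the leading MISSING run, then require that only EOV remains.
    rest = dropwhile(lambda v: v == MISSING, vector)
    return all(v == EOV for v in rest)


print(is_empty_vector([MISSING, MISSING]))  # True
print(is_empty_vector([MISSING, EOV]))      # True
print(is_empty_vector([EOV, EOV]))          # True
print(is_empty_vector([51, EOV]))           # False
```

Under this rule all three encodings discussed in this thread fall into the same class, so a writer may pick whichever one it prefers.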

In general, I think that the description of vectors in the spec is not very precise. The words "typed vector" and "vector" are used interchangeably in some sections and not in others. There is also the term "array of values" ...
Important bits are described in "Genotype encoding" (before the sections on "vectors" and "vectors of mixed length" are even introduced) -- and what actually is a "vector of mixed length"? What it means is a vector of vectors where the inner vectors are not the same length. IMHO it would make much more sense to call the paragraph "vector of vectors" and move things from "genotype encoding" here to describe in general how "vectors of vectors" are implemented.

jmarshall added the vcf label Sep 1, 2021

andersleung mentioned this issue Nov 30, 2021

bcftools view outputs empty string for non-trailing missing genotype field represented in BCF as [EOV, EOV] samtools/bcftools#1622

Closed

pd3 added a commit to pd3/hts-specs that referenced this issue Dec 16, 2021

Clarification of empty vector representation

9a5fdb0

as triggered by samtools/bcftools#1622 samtools#593

pd3 mentioned this issue Dec 16, 2021

Clarification of empty vector representation #617

Open

h-2 mentioned this issue Jan 25, 2022

BCF Character/String type MISSING/EOV encoding #618

Open
