{tabix,csi}: specs are incomplete #70

kortschak · 2015-03-04T09:08:39Z

The specifications for tabix and CSI are incomplete: each is currently little more than a struct layout, with no explanation of semantics or restrictions and no description of the underlying storage format. CSI and tabix files in the wild (produced by samtools) include data that is not described in the specifications at all.

It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.
Both formats include a stats dummy bin, though this is not mentioned in either spec. In CSI indexes it is possible for valid bins to be at or above 0x924a. The SAM spec explains this value as "bin number 37450 (which is beyond the normal range)", but it does not explain how an index stats dummy bin should be encoded in CSI with values of min_shift/depth that, currently legally, break this invariant of BAI (e.g. indexes with a depth >= 6; if this is not allowed, it should be specified).
TBI has potentially conflicting fields. The spec does not describe the precedence of these fields (the format field may conflict with col_{seq,beg,end}, meta and skip).
The format field of tabix has a 0-based half-open flag, but the semantics of this flag are not explained: is it an indication to the client or to the tabix handling code?
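
The BGZF observation in the first point above can be checked mechanically. What follows is a minimal sketch, assuming the BGZF block header layout given in the SAM specification (gzip magic 1f 8b, CM = 8, FEXTRA set, and a 'B','C' extra subfield with SLEN = 2). The function name `looks_like_bgzf` is hypothetical, and only the first extra subfield is examined:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if buf starts with a BGZF block header.
 * ASSUMPTION: the 'BC' subfield is the first extra subfield; the gzip
 * format permits other subfields to precede it, which this ignores. */
int looks_like_bgzf(const uint8_t *buf, size_t len)
{
    if (len < 18)
        return 0;
    /* gzip magic, deflate method, FEXTRA flag set */
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 || !(buf[3] & 4))
        return 0;
    /* XLEN is little-endian at offset 10; a 'BC' subfield needs 6 bytes */
    unsigned xlen = (unsigned)buf[10] | ((unsigned)buf[11] << 8);
    if (xlen < 6)
        return 0;
    /* SI1 = 'B', SI2 = 'C', SLEN = 2 (little-endian) */
    return buf[12] == 'B' && buf[13] == 'C' && buf[14] == 2 && buf[15] == 0;
}
```

Running this over the first 18 bytes of a .csi file and of a .bai file would make the difference in storage visible without decompressing anything.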

kortschak · 2015-05-16T05:13:36Z

ping @lh3?

bicycle1885 · 2016-07-14T10:07:19Z

No updates here? I often encounter the same problem due to ambiguous descriptions in the specs.

brentp · 2016-11-07T15:33:18Z

Is there any update on this?

The indexer in htslib for CSI stores a tbx_conf_t and all the sequence names in the free-form meta/auxiliary field, but that behavior is completely undocumented.
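
To make the shape of that auxiliary data concrete, here is a rough sketch of a reader. The layout is an assumption, not something stated in the spec or confirmed in this thread: it supposes the aux block mirrors the tabix header fields that follow n_ref (format, col_seq, col_beg, col_end, meta, skip, l_nm, each a little-endian int32), followed by l_nm bytes of NUL-terminated names. The struct `tbx_aux` and the function `parse_tbx_aux` are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* Field names follow the tabix header; this struct is illustrative only. */
struct tbx_aux {
    int32_t format, col_seq, col_beg, col_end, meta, skip, l_nm;
    const char *names;   /* points into aux; l_nm bytes, not copied */
};

/* Read a little-endian int32 without relying on host byte order. */
static int32_t le32(const uint8_t *p)
{
    return (int32_t)((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                     (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
}

/* ASSUMPTION: aux = 7 x int32 followed by l_nm bytes of names.
 * Returns 0 on success, -1 if the block is too short or inconsistent. */
int parse_tbx_aux(const uint8_t *aux, size_t l_aux, struct tbx_aux *out)
{
    if (l_aux < 28)
        return -1;
    out->format  = le32(aux);
    out->col_seq = le32(aux + 4);
    out->col_beg = le32(aux + 8);
    out->col_end = le32(aux + 12);
    out->meta    = le32(aux + 16);
    out->skip    = le32(aux + 20);
    out->l_nm    = le32(aux + 24);
    if (out->l_nm < 0 || (size_t)out->l_nm > l_aux - 28)
        return -1;
    out->names = (const char *)(aux + 28);
    return 0;
}
```

If the assumed layout is right, a reader that ignores the aux block loses the sequence names entirely, which is why the omission from the spec matters.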

kortschak · 2017-12-21T03:12:23Z

An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment here indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree). This means that BAI and CSI do not agree on the meaning of depth. This is sort of why this kind of documentation is important, for actually doing science.

lbergelson · 2018-06-13T17:58:34Z

@jkbonfield @yfarjoun @jmarshall I've run into this same problem of underspecified CSI when reviewing an htsjdk pull request to add support for CSI indexes.
(See samtools/htsjdk#1040. )

Particularly, the details of the dummy bin in CSI are completely missing from the spec.

jkbonfield · 2018-06-14T07:42:25Z

I'll have to dig into the code to figure out what goes in the dummy bin, and also to see how the levels compare. CSI isn't something I've dealt with, although I'll be exploring it soon in other work.

This is really the realm of @pd3 though as I think he worked more on CSI.

jmarshall · 2018-06-14T08:34:08Z

The dummy bin contents are the same as in BAI and calculating its bin number is the natural generalisation from the BAI calculation, and came up on samtools-help last December.

Over the last few weeks I've been dusting off my old draft of adding CSI and tabix index documentation alongside the BAI documentation. That should be a PR soon…
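
The generalisation of the dummy bin number mentioned above can be sketched as follows. This is an illustration under an assumption, namely that the dummy bin sits one past the bin limit for the given depth, which reproduces BAI's 37450 when depth is 5. The function names are hypothetical and `depth` is zero-based:

```c
#include <stdint.h>

/* Number of bins in a scheme whose deepest level is `depth` (zero-based,
 * so BAI's six levels correspond to depth 5): the sum of 8^l, l = 0..depth. */
int64_t csi_bin_limit(int depth)
{
    return (((int64_t)1 << ((depth + 1) * 3)) - 1) / 7;
}

/* ASSUMPTION: the stats dummy bin is numbered bin limit + 1, which gives
 * 37450 (0x924a) at depth 5, matching the value in the SAM spec. */
int64_t csi_dummy_bin(int depth)
{
    return csi_bin_limit(depth) + 1;
}
```

Under this assumption the dummy bin number moves with depth, so a reader that hard-codes 0x924a would misread any CSI index whose depth is not 5.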

jmarshall · 2018-06-14T10:44:45Z

At present the page counts of our main specification documents are:

Format	#pages
SAM	20
SAMtags	7
CRAM	26
VCF	36

As BGZF compression is used by other formats in addition to BAM, I've been thinking of pulling the descriptions of BGZF and indexing out of the SAM spec into a separate document. This would remove ~5 pages from SAMv1.pdf into a new ~8 page BGZF/indexing spec. Ideologically-nice to separate them, but it's not like SAMv1.pdf is one of our bigger documents to start with…

Thoughts? 👍 for a separate BGZF.pdf document, 👎 for leaving it where it is within SAMv1.pdf.

magicDGS · 2018-06-14T11:13:46Z

:+1: for separating. For me it will be clearer that way, because the formats using bgzip are not SAM-specific. In addition, the BGZF.pdf document might be a good place to include the tabix index instead of in its own document.

Another option might be to have an "Indexing.pdf" document for all indexes (.bai, .crai, .csi, tabix...).

jkbonfield · 2018-06-14T11:17:30Z

I'd agree for making it a separate spec too. It always felt like an odd bolt-on.

That reminds me - the CRAM spec wants a major revamp. What we have right now is the scary child of Microsoft, EBI and an xslt config which converted the XML embedded in the original EBI cram.docx file to LaTeX. Horrific! :-)

jmarshall · 2018-06-14T11:30:40Z

@magicDGS: Yes, my thought process moved in that same direction — improve CSI.pdf, tabix.pdf so they have actual descriptive text ⇒ put that CSI, tabix descriptive text alongside the BAI text they mostly duplicate & nuke the CSI*.pdf, tabix.pdf micro-documents ⇒ split out BAI/CSI/tabix indexing into a single separate document ⇒ move BGZF into that separate document too.

(.crai remains CRAM-specific.)

Thanks both for using the vote button on the previous comment 😛

jmarshall · 2018-06-28T14:10:03Z

> An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment <here> indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree).

As for levels and depth: there are indeed six different levels for the bins in a BAI index, and the htslib code tends to think about them as level0 … level5 — and the depth parameter in the functions in the ancient embryonic CSIv1.pdf document is zero-based accordingly (which is unhelpfully unnoted).

So there's no actual problem or inconsistency here, just a lack of clarity.
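
The relationship between levels and the zero-based depth can be written out as a short sketch. The function names are hypothetical. The per-level offsets are the familiar BAI constants (0, 1, 9, 73, 585, 4681), and the interval width per level assumes each level is eight times finer than the one above it, as in BAI:

```c
#include <stdint.h>

/* First bin number on level l (level 0 is the single root bin):
 * the sum of 8^k for k = 0..l-1. */
int64_t level_first_bin(int l)
{
    return (((int64_t)1 << (3 * l)) - 1) / 7;
}

/* Number of levels for a zero-based depth parameter. */
int level_count(int depth)
{
    return depth + 1;
}

/* log2 of the interval width covered by one bin on level l.
 * For BAI (min_shift = 14, depth = 5) this runs from 29 down to 14. */
int level_shift(int min_shift, int depth, int l)
{
    return min_shift + 3 * (depth - l);
}
```

With min_shift = 14 and depth = 5 this gives six levels, numbered 0 to 5, with bins spanning 2^29 bases at the root and 2^14 bases at the deepest level.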

kortschak · 2020-01-08T04:29:24Z

The code quoted here:

/* calculate maximum bin number -- valid bin numbers range within [0,bin_limit) */
int bin_limit(int min_shift, int depth)
{
    return ((1 << (depth+1)*3) - 1) / 7;
}

gives 299593 when the SAM spec-specified depth of 6 is used, instead of the SAM spec-specified value of 37449, which is what a depth of 5 gives.

This is beyond absence of clarity.
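
For reference, the arithmetic behind the two values, writing d for the depth argument of the quoted bin_limit. The closed form is the geometric sum of 8^l over levels 0 to d, and both divisions are exact, so the integer division in the C code loses nothing:

```latex
\[
\mathrm{bin\_limit}(d) = \sum_{l=0}^{d} 8^{l} = \frac{8^{d+1}-1}{7},
\qquad
\mathrm{bin\_limit}(5) = \frac{262143}{7} = 37449,
\qquad
\mathrm{bin\_limit}(6) = \frac{2097151}{7} = 299593.
\]
```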

kortschak · 2020-01-08T04:31:08Z

Is there any chance that the substantive issues that are the OP will be addressed? This is nearly 5 years old now.

This isn't actually in the formal spec (The Tabix index file format (2019-04-09)), and it's likely because the specs for BAI, CSI, and tabix are fragmented (see samtools/hts-specs#70). However, upon inspecting a tabix file generated by tabix (htslib) 1.11, it does include the metadata pseudo-bin.

kortschak mentioned this issue Sep 15, 2015

tabix query missing bytes. biogo/hts#10

Closed

kortschak mentioned this issue Nov 7, 2016

tabix: change Chunks signature to resemble csi and bam biogo/hts#46

Merged

lomereiter mentioned this issue Mar 7, 2017

Support for CSI indexes biod/sambamba#284

Closed

kortschak mentioned this issue Dec 13, 2017

csi: malformed dummy bin header biogo/hts#115

Closed

kortschak mentioned this issue Dec 21, 2017

csi,internal: fix stats dummy number handling biogo/hts#118

Merged

jmarshall mentioned this issue Dec 2, 2021

Gathering contig information from CSI files #611

Closed

zaeleus mentioned this issue Apr 15, 2024

CSI file is BGZF compressed but this is not mentioned in the CSV1 spec #765

Open
