Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

{tabix,csi}: specs are incomplete #70

Open
kortschak opened this issue Mar 4, 2015 · 14 comments
Open

{tabix,csi}: specs are incomplete #70

kortschak opened this issue Mar 4, 2015 · 14 comments

Comments

@kortschak
Copy link
Contributor

The specifications for tabix and csi are incomplete, currently being variously only essentially a struct layout with no semantic explanation or restrictions, or a description of the underlying storage format. Files in the wild in CSI and tabix (produced by samtools) include data that is not described in the specification at all.

  • It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.
  • Both formats include a stats dummy bin, though this is not mentioned in either spec. In CSI indexes it is possible for valid bins to be at or above 0x924a (in the SAM spec this is explained as: "bin number 37450 (which is beyond the normal range)", but it is not explained how an index stats dummy bin should be encoded in CSI with values for min_shift/depth that, currently legally, break this invariant of BAI (e.g. indexes with a depth >=6 - if this is not allowed it should be specified).
  • TBI has potentially conflicting fields. The spec does not describe the precedence of these fields (the format field may conflict with col_{seq,beg,end}, meta and skip).
  • The format field of tabix has a 0-based half-open flag, but the semantics of this flag are not explained: is it an indication to the client or to the tabix handling code?
@kortschak
Copy link
Contributor Author

ping @lh3?

@bicycle1885
Copy link
Contributor

No updates here? I often encounter the same problem due to ambiguous descriptions in the specs.

@brentp
Copy link

brentp commented Nov 7, 2016

Is there any update on this?

The indexer in htslib for CSI stores a tbx_conf_t and all the sequence names in the free-form meta/auxiliary field, but that behavior is completely undocumented.

@kortschak
Copy link
Contributor Author

An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment here indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree). This means that BAI and CSI do not agree on the meaning of depth. This is sort of why this kind of documentation is important, for actually doing science.

@lbergelson
Copy link
Member

@jkbonfield @yfarjoun @jmarshall I've run into this same problem of underspecified CSI when reviewing an htsjdk pull request to add support for CSI indexes.
(See samtools/htsjdk#1040. )

Particularly, the details of the dummy bin in CSI are completely missing from the spec.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 14, 2018

I'll have to dig into the code to figure out what goes in the dummy bin, and also to see how the levels compare. CSI isn't something I've dealt with, although I'll be exploring it soon in other work.

This is really the realm of @pd3 though as I think he worked more on CSI.

@jmarshall
Copy link
Member

The dummy bin contents are the same as in BAI and calculating its bin number is the natural generalisation from the BAI calculation, and came up on samtools-help last December.

Over the last few weeks I've been dusting off my old draft of adding CSI and tabix index documentation alongside the BAI documentation. That should be a PR soon…

@jmarshall
Copy link
Member

At present the page counts of our main specification documents are:

Format #pages
SAM 20
SAMtags 7
CRAM 26
VCF 36

As BGZF compression is used by other formats in addition to BAM, I've been thinking of pulling the descriptions of BGZF and indexing out of the SAM spec into a separate document. This would remove ~5 pages from SAMv1.pdf into a new ~8 page BGZF/indexing spec. Ideologically-nice to separate them, but it's not like SAMv1.pdf is one of our bigger documents to start with…

Thoughts? 👍 for a separate BGZF.pdf document, 👎 for leaving it where it is within SAMv1.pdf.

@magicDGS
Copy link
Member

:+1 for separating. For me it will be clear that way, because the formats using bgzip are not SAM-specific. In addition, the BGZF.pdf document might be a good place to include tabix index instead of in it's own document.

Another option might be to have a "Indexing.pdf" document, for all indexes (.bai, .crai, .csi, tabix...).

@jkbonfield
Copy link
Contributor

I'd agree for making it a separate spec too. It always felt like an odd bolt-on.

That reminds me - the CRAM spec wants a major revamp. What we have right now is the scary child of Microsoft, EBI and an xslt config which converted the XML embedded in the original EBI cram.docx file to LaTeX. Horrific! :-)

@jmarshall
Copy link
Member

@magicDGS: Yes, my thought process moved in that same direction — improve CSI.pdf, tabix.pdf so they have actual descriptive text ⇒ put that CSI, tabix descriptive text alongside the BAI text they mostly duplicate & nuke the CSI*.pdf, tabix.pdf micro-documents ⇒ split out BAI/CSI/tabix indexing into a single separate document ⇒ move BGZF into that separate document too.

(.crai remains CRAM-specific.)

Thanks both for using the vote button on the previous comment 😛

@jmarshall
Copy link
Member

An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment <here> indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree).

As for levels and depth: there are indeed six different levels for the bins in a BAI index, and the htslib code tends to think about them as level0level5 — and the depth parameter in the functions in the ancient embryonic CSIv1.pdf document is zero-based accordingly (which is unhelpfully unnoted).

So there's no actual problem or inconsistency here, just a lack of clarity.

@kortschak
Copy link
Contributor Author

The code quoted here:

/* calculate maximum bin number -- valid bin numbers range within [0,bin_limit) */
int bin_limit(int min_shift, int depth)
{
    return ((1 << (depth+1)*3) - 1) / 7;
}

gives 299593 when the SAM spec-specified depth of 6 is used, instead of SAM spec-specified value of 37449 which is given with a depth of 5.

This is beyond absence of clarity.

@kortschak
Copy link
Contributor Author

Is there any chance that the substantive issues that are the OP will be addressed? This is nearly 5 years old now.

zaeleus added a commit to zaeleus/noodles that referenced this issue Jan 30, 2021
This isn't actually in the formal spec (The Tabix index file format
(2019-04-09)), and it's likely because the specs for BAI, CSI, and tabix
are fragmented (see samtools/hts-specs#70). However, upon inspecting a
tabix file generated by tabix (htslib) 1.11, it does include the
metadata pseudo-bin.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants