-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
{tabix,csi}: specs are incomplete #70
Comments
ping @lh3? |
No updates here? I often encounter the same problem due to ambiguous descriptions in the specs. |
Is there any update on this? The indexer in htslib for CSI stores a tbx_conf_t and all the sequence names in the free-form meta/auxiliary field, but that behavior is completely undocumented. |
An interesting issue arises here; the SAM spec states that the BAI is 6 levels deep, but the code in htslib and a comment here indicates that internally it is considered 5 levels deep (and the default depth for CSI is 5 - presumably to match BAI's tree). This means that BAI and CSI do not agree on the meaning of depth. This is sort of why this kind of documentation is important, for actually doing science. |
@jkbonfield @yfarjoun @jmarshall I've run into this same problem of underspecified CSI when reviewing an htsjdk pull request to add support for CSI indexes. Particularly, the details of the dummy bin in CSI are completely missing from the spec. |
I'll have to dig into the code to figure out what goes in the dummy bin, and also to see how the levels compare. CSI isn't something I've dealt with, although I'll be exploring it soon in other work. This is really the realm of @pd3 though as I think he worked more on CSI. |
The dummy bin contents are the same as in BAI and calculating its bin number is the natural generalisation from the BAI calculation, and came up on samtools-help last December. Over the last few weeks I've been dusting off my old draft of adding CSI and tabix index documentation alongside the BAI documentation. That should be a PR soon… |
At present the page counts of our main specification documents are:
As BGZF compression is used by other formats in addition to BAM, I've been thinking of pulling the descriptions of BGZF and indexing out of the SAM spec into a separate document. This would remove ~5 pages from SAMv1.pdf into a new ~8 page BGZF/indexing spec. Ideologically-nice to separate them, but it's not like SAMv1.pdf is one of our bigger documents to start with… Thoughts? 👍 for a separate BGZF.pdf document, 👎 for leaving it where it is within SAMv1.pdf. |
:+1 for separating. For me it will be clear that way, because the formats using bgzip are not SAM-specific. In addition, the Another option might be to have a "Indexing.pdf" document, for all indexes (.bai, .crai, .csi, tabix...). |
I'd agree for making it a separate spec too. It always felt like an odd bolt-on. That reminds me - the CRAM spec wants a major revamp. What we have right now is the scary child of Microsoft, EBI and an xslt config which converted the XML embedded in the original EBI cram.docx file to LaTeX. Horrific! :-) |
@magicDGS: Yes, my thought process moved in that same direction — improve CSI.pdf, tabix.pdf so they have actual descriptive text ⇒ put that CSI, tabix descriptive text alongside the BAI text they mostly duplicate & nuke the CSI*.pdf, tabix.pdf micro-documents ⇒ split out BAI/CSI/tabix indexing into a single separate document ⇒ move BGZF into that separate document too. (.crai remains CRAM-specific.) Thanks both for using the vote button on the previous comment 😛 |
As for levels and depth: there are indeed six different levels for the bins in a BAI index, and the htslib code tends to think about them as level0 … level5 — and the So there's no actual problem or inconsistency here, just a lack of clarity. |
The code quoted here:
gives 299593 when the SAM spec-specified depth of 6 is used, instead of SAM spec-specified value of 37449 which is given with a depth of 5. This is beyond absence of clarity. |
Is there any chance that the substantive issues that are the OP will be addressed? This is nearly 5 years old now. |
This isn't actually in the formal spec (The Tabix index file format (2019-04-09)), and it's likely because the specs for BAI, CSI, and tabix are fragmented (see samtools/hts-specs#70). However, upon inspecting a tabix file generated by tabix (htslib) 1.11, it does include the metadata pseudo-bin.
The specifications for tabix and csi are incomplete, currently being variously only essentially a struct layout with no semantic explanation or restrictions, or a description of the underlying storage format. Files in the wild in CSI and tabix (produced by samtools) include data that is not described in the specification at all.
The text was updated successfully, but these errors were encountered: