-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify use of "blocks" in the Container Header fields #396
Comments
This is all related to the complex series of pictures. It's always felt a bit strange that we have the BAM header inside a container as it gives two purposes for containers (reads and headers), but I guess it is what it is. It's hard to describe these in a logical order. Maybe we should describe container structure (7.0), block structure (8.0 and 8.1). Then we can have a sections on "Header container" (sections 7.1 and 8.3) and "Sequence containers" (compression header block, 8.4), then slices, etc. It's a major rejuggling and I'm still not sure it's any clearer. Fundamentally our structures don't map too well to the order of data. Earlier CRAM descriptions matched what actually happens closer, but were clear as mud. (Ie a container is a series of blocks, and then some of the blocks are lumped together into slices; rather than an easier to understand view of containers -> slices -> blocks.) We do need to be more explicit about container and block fields that have no meaning in certain cases too. In answer to your questions, I think length is summation of all blocks in the container, regardless of whether it is bam header, compression header, slice header, CORE / EXTERNAL blocks, etc. Similarly for number of blocks. The landmarks are offsets after the end of the container header struct. I've always found the term a bit confusing personally. I assume it meant something subtly different at some stage. Basically it means we can read the container header struct, and then seek L bytes into it to get to the Nth slice. Eg a file with 1000 reads per slice, 3 slices per container, has this in
So the container struct is 30 bytes (compression header at 9549, struct at 9519). The slices have 36, 33 and 35 blocks each (104), plus the slice header block (1 each x 3) and the compression header (1) => 108 blocks; matching container struct. 2nd container pos minus 1st container pos (211121 - 9519) is 30 bytes larger than the first container size. Thus this size is the size of ALL remaining blocks in the container after decoding the container .struct. It's currently described as "byte size of the container data (blocks)", but maybe this could be more explicit. Landmark 1 is 1321, which is offset of slice - offset of compression header; or offset of slice - (offset of container + sizeof container struct). I don't really understand why it doesn't count from the end of the compression header, as you need to read this whichever slice you're seeking to, but it's been that way since before my involvement. The landmark description is woeful. It should be updated along the lines of "Landmarks are the distance to the start of each slice from the end of the container structure." |
For my own sake, cram_index on the above file shows: (ref-id pos-start pos-span container-offset slice-offset slice-size)
This 1321, 68927, 135560 in index are the same as the landmarks in the container and 1st slice size 67606 is slice 2/3 pos 78476-10870. I'll make sure to mention the connection to landmarks in the index update. |
Thank you for this clarification! I have a PR incorporating these suggestions: #398 |
- added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398
* Revert #1326 because it was correct before - added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398 - comments and rename CRAIEntry.sliceByteOffset - blockCount comments
This was added as clarification in samtools#398 after discussion in samtools#396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
This was added as clarification in samtools#398 after discussion in samtools#396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
This was added as clarification in #398 after discussion in #396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
Containers consist of a Container Header and one or more Blocks. There are three scenarios for the division of Blocks in a Container:
Update the Container Header description to be explicit about how to derive the values for the following Container Header fields:
length: Does this refer strictly to Core and External Data Blocks, or does it include everything beyond the Container Header? (i.e. does it include the File Header and/or the Compression Header lengths, when applicable?)
number of blocks: same question
landmarks: Add text indicating that this refers strictly to Slice start boundaries, and no other types of blocks.
The text was updated successfully, but these errors were encountered: