Clarify use of "blocks" in the Container Header fields #396

jmthibault79 · 2019-03-25T18:42:22Z

Containers consist of a Container Header and one or more Blocks. There are three scenarios for the division of Blocks in a Container:

A BAM Header Block only, of Block Content Type FILE_HEADER.
A COMPRESSION_HEADER Block, plus one or more Slices, which each contain a SLICE_HEADER Block, a CORE_DATA Block, and any number of EXTERNAL_DATA Blocks.
A COMPRESSION_HEADER Block only, to indicate EOF.

Update the Container Header description to be explicit about how to derive the values for the following Container Header fields:

length: Does this refer strictly to Core and External Data Blocks, or does it include everything beyond the Container Header? (i.e. does it include the File Header and/or the Compression Header lengths, when applicable?)

number of blocks: same question

landmarks: Add text indicating that this refers strictly to Slice start boundaries, and no other types of blocks.

jkbonfield · 2019-03-28T15:07:49Z

This is all related to the complex series of pictures. It's always felt a bit strange that we have the BAM header inside a container as it gives two purposes for containers (reads and headers), but I guess it is what it is.

It's hard to describe these in a logical order. Maybe we should describe container structure (7.0), block structure (8.0 and 8.1). Then we can have a sections on "Header container" (sections 7.1 and 8.3) and "Sequence containers" (compression header block, 8.4), then slices, etc. It's a major rejuggling and I'm still not sure it's any clearer. Fundamentally our structures don't map too well to the order of data. Earlier CRAM descriptions matched what actually happens closer, but were clear as mud. (Ie a container is a series of blocks, and then some of the blocks are lumped together into slices; rather than an easier to understand view of containers -> slices -> blocks.)

We do need to be more explicit about container and block fields that have no meaning in certain cases too.

In answer to your questions, I think length is summation of all blocks in the container, regardless of whether it is bam header, compression header, slice header, CORE / EXTERNAL blocks, etc. Similarly for number of blocks.

The landmarks are offsets after the end of the container header struct. I've always found the term a bit confusing personally. I assume it meant something subtly different at some stage. Basically it means we can read the container header struct, and then seek L bytes into it to get to the Nth slice.

Eg a file with 1000 reads per slice, 3 slices per container, has this in cram_dump (culled for brevity).

Container pos 9519 size 201572
    No. recs:          3000
    No. blocks:        108
    No. landmarks:     3
    {1321, 68927, 135560}

    Container_header block pos 9549
    ...

    Slice 1/3, container offset 1321, file offset 10870
        No. records      1000
        No. blocks       36

    Slice 2/3, container offset 68927, file offset 78476
        No. records      1000
        No. blocks       33

    Slice 3/3, container offset 135560, file offset 145109
        No. records      1000
        No. blocks       35

Container pos 211121 size 188368

So the container struct is 30 bytes (compression header at 9549, struct at 9519).

The slices have 36, 33 and 35 blocks each (104), plus the slice header block (1 each x 3) and the compression header (1) => 108 blocks; matching container struct.

2nd container pos minus 1st container pos (211121 - 9519) is 30 bytes larger than the first container size. Thus this size is the size of ALL remaining blocks in the container after decoding the container .struct. It's currently described as "byte size of the container data (blocks)", but maybe this could be more explicit.

Landmark 1 is 1321, which is offset of slice - offset of compression header; or offset of slice - (offset of container + sizeof container struct). I don't really understand why it doesn't count from the end of the compression header, as you need to read this whichever slice you're seeking to, but it's been that way since before my involvement. The landmark description is woeful. It should be updated along the lines of "Landmarks are the distance to the start of each slice from the end of the container structure."

jkbonfield · 2019-03-28T15:22:34Z

For my own sake, cram_index on the above file shows:

(ref-id pos-start pos-span container-offset slice-offset slice-size)

0	10001	37484	9519	1321	67606
0	47388	25499	9519	68927	66633
0	72794	32420	9519	135560	66012
0	105121	44981	211121	1016	61529

This 1321, 68927, 135560 in index are the same as the landmarks in the container and 1st slice size 67606 is slice 2/3 pos 78476-10870. I'll make sure to mention the connection to landmarks in the index update.

jmthibault79 · 2019-03-29T19:05:23Z

Thank you for this clarification! I have a PR incorporating these suggestions: #398

- added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398

* Revert #1326 because it was correct before - added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398 - comments and rename CRAIEntry.sliceByteOffset - blockCount comments

This was added as clarification in samtools#398 after discussion in samtools#396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.

This was added as clarification in #398 after discussion in #396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.

jmthibault79 added the cram label Mar 25, 2019

jmthibault79 assigned cmnbroad Mar 25, 2019

jmthibault79 added a commit that referenced this issue Mar 29, 2019

Clarify container header fields. Resolves #396

9bc6025

yfarjoun pushed a commit that referenced this issue Mar 30, 2019

Clarify container header fields. Resolves #396

5e8e3a6

cmnbroad mentioned this issue Apr 1, 2019

Clarify container header fields. Resolves #396 #398

Merged

jmthibault79 added a commit to samtools/htsjdk that referenced this issue Apr 1, 2019

Revert #1326 because it was correct before

3c5de2a

- added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398

jmthibault79 closed this as completed in c689122 Apr 4, 2019

This was referenced Jun 27, 2023

Revert defn of CRAM container size to be sum of block sizes (PR#731) #731

Merged

in-place reheadering doesn't follow CRAM standard? samtools/samtools#1852

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify use of "blocks" in the Container Header fields #396

Clarify use of "blocks" in the Container Header fields #396

jmthibault79 commented Mar 25, 2019

jkbonfield commented Mar 28, 2019

jkbonfield commented Mar 28, 2019

jmthibault79 commented Mar 29, 2019

Clarify use of "blocks" in the Container Header fields #396

Clarify use of "blocks" in the Container Header fields #396

Comments

jmthibault79 commented Mar 25, 2019

jkbonfield commented Mar 28, 2019

jkbonfield commented Mar 28, 2019

jmthibault79 commented Mar 29, 2019