Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify use of "blocks" in the Container Header fields #396

Closed
jmthibault79 opened this issue Mar 25, 2019 · 3 comments
Closed

Clarify use of "blocks" in the Container Header fields #396

jmthibault79 opened this issue Mar 25, 2019 · 3 comments
Assignees
Labels

Comments

@jmthibault79
Copy link
Contributor

Containers consist of a Container Header and one or more Blocks. There are three scenarios for the division of Blocks in a Container:

  1. A BAM Header Block only, of Block Content Type FILE_HEADER.
  2. A COMPRESSION_HEADER Block, plus one or more Slices, which each contain a SLICE_HEADER Block, a CORE_DATA Block, and any number of EXTERNAL_DATA Blocks.
  3. A COMPRESSION_HEADER Block only, to indicate EOF.

Update the Container Header description to be explicit about how to derive the values for the following Container Header fields:

length: Does this refer strictly to Core and External Data Blocks, or does it include everything beyond the Container Header? (i.e. does it include the File Header and/or the Compression Header lengths, when applicable?)

number of blocks: same question

landmarks: Add text indicating that this refers strictly to Slice start boundaries, and no other types of blocks.

@jkbonfield
Copy link
Contributor

This is all related to the complex series of pictures. It's always felt a bit strange that we have the BAM header inside a container as it gives two purposes for containers (reads and headers), but I guess it is what it is.

It's hard to describe these in a logical order. Maybe we should describe container structure (7.0), block structure (8.0 and 8.1). Then we can have a sections on "Header container" (sections 7.1 and 8.3) and "Sequence containers" (compression header block, 8.4), then slices, etc. It's a major rejuggling and I'm still not sure it's any clearer. Fundamentally our structures don't map too well to the order of data. Earlier CRAM descriptions matched what actually happens closer, but were clear as mud. (Ie a container is a series of blocks, and then some of the blocks are lumped together into slices; rather than an easier to understand view of containers -> slices -> blocks.)

We do need to be more explicit about container and block fields that have no meaning in certain cases too.

In answer to your questions, I think length is summation of all blocks in the container, regardless of whether it is bam header, compression header, slice header, CORE / EXTERNAL blocks, etc. Similarly for number of blocks.

The landmarks are offsets after the end of the container header struct. I've always found the term a bit confusing personally. I assume it meant something subtly different at some stage. Basically it means we can read the container header struct, and then seek L bytes into it to get to the Nth slice.

Eg a file with 1000 reads per slice, 3 slices per container, has this in cram_dump (culled for brevity).

Container pos 9519 size 201572
    No. recs:          3000
    No. blocks:        108
    No. landmarks:     3
    {1321, 68927, 135560}

    Container_header block pos 9549
    ...

    Slice 1/3, container offset 1321, file offset 10870
        No. records      1000
        No. blocks       36

    Slice 2/3, container offset 68927, file offset 78476
        No. records      1000
        No. blocks       33

    Slice 3/3, container offset 135560, file offset 145109
        No. records      1000
        No. blocks       35

Container pos 211121 size 188368

So the container struct is 30 bytes (compression header at 9549, struct at 9519).

The slices have 36, 33 and 35 blocks each (104), plus the slice header block (1 each x 3) and the compression header (1) => 108 blocks; matching container struct.

2nd container pos minus 1st container pos (211121 - 9519) is 30 bytes larger than the first container size. Thus this size is the size of ALL remaining blocks in the container after decoding the container .struct. It's currently described as "byte size of the container data (blocks)", but maybe this could be more explicit.

Landmark 1 is 1321, which is offset of slice - offset of compression header; or offset of slice - (offset of container + sizeof container struct). I don't really understand why it doesn't count from the end of the compression header, as you need to read this whichever slice you're seeking to, but it's been that way since before my involvement. The landmark description is woeful. It should be updated along the lines of "Landmarks are the distance to the start of each slice from the end of the container structure."

@jkbonfield
Copy link
Contributor

For my own sake, cram_index on the above file shows:

(ref-id pos-start pos-span container-offset slice-offset slice-size)

0	10001	37484	9519	1321	67606
0	47388	25499	9519	68927	66633
0	72794	32420	9519	135560	66012
0	105121	44981	211121	1016	61529

This 1321, 68927, 135560 in index are the same as the landmarks in the container and 1st slice size 67606 is slice 2/3 pos 78476-10870. I'll make sure to mention the connection to landmarks in the index update.

@jmthibault79
Copy link
Contributor Author

Thank you for this clarification! I have a PR incorporating these suggestions: #398

jmthibault79 added a commit to samtools/htsjdk that referenced this issue Apr 1, 2019
- added comments clarifying the real situation
- see samtools/hts-specs#396
- see samtools/hts-specs#398
jmthibault79 added a commit to samtools/htsjdk that referenced this issue Apr 3, 2019
* Revert #1326 because it was correct before
- added comments clarifying the real situation
- see samtools/hts-specs#396
- see samtools/hts-specs#398
- comments and rename CRAIEntry.sliceByteOffset
- blockCount comments
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jun 27, 2023
This was added as clarification in samtools#398 after discussion in samtools#396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Sep 27, 2023
This was added as clarification in samtools#398 after discussion in samtools#396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
jkbonfield added a commit that referenced this issue Aug 6, 2024
This was added as clarification in #398 after discussion in #396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants