Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAM v2 idea tracker #240

Open
jkbonfield opened this issue Sep 12, 2017 · 17 comments
Open

BAM v2 idea tracker #240

jkbonfield opened this issue Sep 12, 2017 · 17 comments

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Sep 12, 2017

Given #40 has a new BAMv2 as an option, we should start thinking about what else may be suitable to add. Some ideas to get the ball rolling.

  1. A minor version number. Both major and minor are forcibly checked against, so we don't get in the same pickle as before.

  2. More than 65535 cigar operations, obviously!

  3. Remove the bin field. It doesn't work for anything with large chromosomes anyway, and only makes sense for BAI indices. Best computed on the fly. Suggestion: this frees up the extra bits needed for cigar, with a little shuffling of flag fields.

  4. Swap cigar and name fields. Right now we have all the fixed size fields together, followed by the variable size read name and the multiple of 4 sized cigar field. BAM was intended to be workable within C/C++ by simply loading it straight into memory and dereferencing, but some compiler optimisations generate SIMD instructions for the cigar fields and then crash as it's unaligned data. (In memory we resolve this by adding 1-3 nuls to the read name instead of just 1.) Swapping the two fields cures this problem.

What else?

@tfenne
Copy link
Member

tfenne commented Sep 12, 2017

We could consider adding .csi as the official index format for BAMv2 given that .bai doesn't support longer contigs.

@d-cameron
Copy link
Contributor

Supporting 64 bit reference sequence lengths would useful even if it takes a while for the rest of the ecosystem to catch up.

@daviesrob
Copy link
Member

@tfenne I think indexing is a separate issue, given that .csi works for BAM, and we should be recommending it anyway due to the limitations in .bai.

@d-cameron Are there any reference sequences that get close to or beyond 4 Gbases? Switching to unsigned is easy (in fact HTSlib already reads l_ref as unsigned); increasing the size is a bit more work, albeit not that difficult.

I would also suggest changing some other fields from signed to unsigned where this makes sense.

Would anyone want to increase the header length to 64-bit? I think I've seen at least one case where someone made a header that did not fit in the current limit, but it was a while ago and I can't easily find it at the moment. It was an assembly with a large number of contigs that had fairly long names if I remember correctly.

@tfenne
Copy link
Member

tfenne commented Sep 12, 2017

@daviesrob unsigned types are a real PITA for htsjdk. Java has fairly lame support for unsigned types, making them very hard to work with. From that side it's always preferable to just bump up to 64-bit signed and let compression do the job of squeezing out any empty bits.

I agree .csi indexing could be separate, but if we do a BAMv2 I think we should consider making that the standard index type instead of .bai.

@jkbonfield
Copy link
Contributor Author

Length of reference may be permitted to be unsigned, but POSition within a ref is still signed as it is 0-based with -1 as unmapped. Similarly TLEN doesn't work properly as has to be signed. So there is nothing to be gained by changing one field without changing all of them, hence >= 2Gb is the point to be considering. Does this happen? Not sure, but it seems likely (see discussion in https://www.biostars.org/p/12560/)

As far as compressing 64-bit values, I don't know the impact, but it's not completely free as gzip is pretty rubbish plus you have slightly fewer records per bgzf block. I expect it's only a minor impact though. Try it and see.

@daviesrob
Copy link
Member

One of the main reasons to use unsigned types would be to disallow negative values which make no sense, so even if going to 64 bits I would still want the types to be unsigned. We can add a note to state that the maximum usable value is limited due to the use of signed values elsewhere. I'd also expect other implementation-defined limits, for example HTSlib can't quite do the full l_read_name at the moment for internal reasons.

Allowing 64 bit r_len raises a few other issues. pos, next_pos and tlen also get bigger. We may want to shuffle a few fields so that they are properly aligned to an 8-byte boundary so we can avoid holes in the in-memory structure. This would allow for writing the record in a single chunk, which is a useful property of the original format (although at least one of BAM and BAM2 would need to be broken in this respect if they want to use the same data structures internally).

@d-cameron
Copy link
Contributor

d-cameron commented Sep 13, 2017

Are there any reference sequences that get close to or beyond 4 Gbases?

People tend to run into the 512Mb BAM index limit and chop up their reference so they can actually use programs.

Even moving from signed to unsigned doesn't give much wiggle room. With 100+Gbase genomes reported, there may well be real sequences greater that 4Gb so it would be good to be able to support them.

@daviesrob
Copy link
Member

I've had a go at implementing a BAM2 with 64-bit reference lengths. The results so far can be found in my bam2 branch if anyone wants to try it. Note that it's very much not for production - the implementation will change.

The impact on file size isn't too bad. On my test file (original 587575801 bytes), I got:

Uncompressed BAM: 2386240221 bytes
Uncompressed BAM2: 2466117766 bytes (3% bigger)
Compressed BAM: 587575801 bytes
Compressed BAM2: 591009870 bytes (0.6% bigger)

While implementing it, I decided that a very useful addition would be a bam_extras field in the alignment records. This would allow for future enhancements in a more backwards-compatible way. It would be a set of flags that indicate extensions to the base BAM2 format. Writers should set unused bits to zero. If readers find a bit that they cannot interpret set, they should return an error. I think this would make it much less likely that we would need to increment the BAM version number again.

@d-cameron
Copy link
Contributor

Can we just remove the TLEN field? From what I can see, nobody trusts the value written by any other tool so, practically speaking, it doesn't serve any purpose.

@daviesrob
Copy link
Member

@d-cameron Getting rid of TLEN would be a separate issue. It exists in SAM, so it has to be in BAM as well.

We can get rid of bin because it's easy to calculate and doesn't actually appear in SAM.

@jkbonfield
Copy link
Contributor Author

Bin also has he issue that it is inherently tied to the BAI index, so it leaks some of the index format into BAM format. This is an issue if we use CSI indices, which realistically we have to (or something equivalent) if we switch to longer chromosome lengths. Bin is basically dead and buried already as far as I'm concerned as it only works on Human data and smaller.

@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 21, 2017 via email

@lh3
Copy link
Member

lh3 commented Sep 21, 2017

I've had a go at implementing a BAM2 with 64-bit reference lengths. The results so far can be found in my bam2 branch if anyone wants to try it.

Have you pushed your local changes to github? It seems that bam1_core_t still uses 32-bit pos. Moving to 64-bit will require quite a lot of changes, including the header, CIGAR, .csi and APIs.

@daviesrob
Copy link
Member

@lh3 The purpose of the experiment was to see how much bigger the file got when writing 64 bit values (not much). In the implementation I kept the size of the internal structures the same for ABI compatibility, but wrote out 64 bit values for pos etc. to the file. At some point in the future we can increase the size of the internal structures, but there's no particular urgency so we can do that at a time of our choosing (and bundle in lots of other ABI-breaking changes at the same time). The important thing is that the format would no longer be the limitation on what we can do.

In HTSlib, n_cigar was made 32-bit internally on our last ABI-breaking release (1.4). The CSIv1 specification includes example code that uses int64_t values for the indexed range. I think CSI should be fine, as long as min_shift and depth are chosen to keep the number of bins to within the size of an int32_t.

@tfenne
Copy link
Member

tfenne commented Sep 22, 2017

@d-cameron I'd hate to lose TLEN. I know exactly where it comes from in my pipelines, am happy with it, and can't recompute it without co-locating read pairs which is a huge PITA in a non-queryname sorted BAM.

@jkbonfield
Copy link
Contributor Author

Agreed on TLEN. While it may be ill defined, within your own pipeline you can both define and use it in a robust manner. If it wasn't in its own column it'd need to be in an auxiliary tag instead. (Arguably better, but it's here already.)

@lh3
Copy link
Member

lh3 commented Sep 26, 2017

I am thinking about a different way to implement BAMv2. It works on two conditions:

  1. The BAM version string is put into an uncompressed BGZF block, such that it can be modified afterwards.

  2. A newer BAM reader can always parse older BAMs without looking at the version string in the header.

In this design, samtools writes BAMv1 by default. If samtools is writing to a file and there are long cigars, it writes the long cigars with, for example, the 0x8000 flag (or other approaches). At the end of the process, samtools seeks back and updates the version string to v2. If samtools is writing to a stream, it prints a warning in the presence of long cigars. Conversely, if a user sets the output format to BAMv2 but samtools doesn't see any long cigars, it either updates the version string to v1 (if to a file) or gives a warning at the end (if to a stream). We also develop a tool that reads through a BAM, analyzes each record and sets the version string to the minimal version that is compatible with all records. The 2nd condition is necessary here.

This design reduces the chance of writing BAMv2 unnecessarily. Even if the BAM version in the header is not appropriate, we can fix it later without too much computation. These will help to reduce fragmentation.

The above is just a brainstorm idea with details missing, but I think it could work, in theory.

EDIT: if we go further along condition 2, we may abandon global versioning. We give each record a local version instead. The version can be encoded with block_size, following @daviesrob's idea. For example, positive for v1, -12 for v2, -13 for v3, etc. The encoding of the following block is fully determined by block_size and described in the spec. For each record, a BAM writer is required to choose the oldest encoding that is compatible with the record.

Local versioning is essentially a generalization of option 3 in the other thread. It avoids the complication of setting the right global version on the command line (e.g. users may choose to always output BAMv2 even if not necessary) and allows us to extend BAM frequently without unnecessary fragmentation – ultimately, it is unnecessary fragmentation that worries me most. The drawback is also obvious, though, in that we don't know if a tool is compatible with a BAM by just looking at the header. Condition 1 will help, but it is not perfect, either, and is tricky to implement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants