BAM v2 idea tracker #240
Comments
We could consider adding |
Supporting 64-bit reference sequence lengths would be useful even if it takes a while for the rest of the ecosystem to catch up. |
@tfenne I think indexing is a separate issue, given that @d-cameron Are there any reference sequences that get close to or beyond 4 Gbases? Switching to unsigned is easy (in fact HTSlib already reads I would also suggest changing some other fields from signed to unsigned where this makes sense. Would anyone want to increase the header length to 64-bit? I think I've seen at least one case where someone made a header that did not fit in the current limit, but it was a while ago and I can't easily find it at the moment. It was an assembly with a large number of contigs that had fairly long names if I remember correctly. |
@daviesrob unsigned types are a real PITA for htsjdk. Java has fairly lame support for unsigned types, making them very hard to work with. From that side it's always preferable to just bump up to 64-bit signed and let compression do the job of squeezing out any empty bits. I agree .csi indexing could be separate, but if we do a BAMv2 I think we should consider making that the standard index type instead of .bai. |
Length of reference may be permitted to be unsigned, but POSition within a ref is still signed as it is 0-based with -1 as unmapped. Similarly TLEN doesn't work properly, as it has to be signed. So there is nothing to be gained by changing one field without changing all of them, hence >= 2Gb is the point to be considering. Does this happen? Not sure, but it seems likely (see discussion in https://www.biostars.org/p/12560/). As for compressing 64-bit values, I don't know the impact, but it's not completely free, as gzip is pretty rubbish and you get slightly fewer records per bgzf block. I expect it's only a minor impact though. Try it and see. |
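A rough way to "try it and see" for the compression question, as a minimal sketch: it packs the same synthetic, sorted positions as 32-bit and as 64-bit little-endian integers and deflates each with zlib. The positions are randomly generated stand-ins (an assumption, not real BAM data), and plain zlib is only an approximation of what happens inside bgzf blocks, so the numbers indicate a trend rather than a real file-size change.

```python
import random
import struct
import zlib

def packed_sizes(positions):
    """Return (raw32, deflated32, raw64, deflated64) byte counts for the same values."""
    as32 = b"".join(struct.pack("<i", p) for p in positions)
    as64 = b"".join(struct.pack("<q", p) for p in positions)
    return len(as32), len(zlib.compress(as32)), len(as64), len(zlib.compress(as64))

random.seed(0)
# Synthetic sorted positions standing in for POS in a coordinate-sorted file.
positions = sorted(random.randrange(0, 2**31 - 1) for _ in range(10000))
raw32, gz32, raw64, gz64 = packed_sizes(positions)
print(f"32-bit: {raw32} bytes raw, {gz32} bytes deflated")
print(f"64-bit: {raw64} bytes raw, {gz64} bytes deflated")
```

The upper four bytes of each 64-bit value are zero here, which is the redundancy the compressor is expected to squeeze out.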
One of the main reasons to use unsigned types would be to disallow negative values which make no sense, so even if going to 64 bits I would still want the types to be unsigned. We can add a note to state that the maximum usable value is limited due to the use of signed values elsewhere. I'd also expect other implementation-defined limits, for example HTSlib can't quite do the full Allowing 64 bit |
People tend to run into the 512Mb BAM index limit and chop up their reference so they can actually use programs. Even moving from signed to unsigned doesn't give much wiggle room. With 100+Gbase genomes reported, there may well be real sequences greater than 4Gb, so it would be good to be able to support them. |
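The limits being compared in this part of the thread can be put side by side with a short sketch. The 5 Gbase reference length is a hypothetical value chosen for illustration, and `fits` is a helper invented here; the 2^29 figure is the coordinate ceiling of the BAI binning scheme, i.e. the 512Mb limit mentioned above.

```python
import struct

INT32_MAX = 2**31 - 1    # current signed 32-bit fields (about 2 Gbases)
UINT32_MAX = 2**32 - 1   # what switching to unsigned would allow (about 4 Gbases)
INT64_MAX = 2**63 - 1    # signed 64-bit
BAI_MAX = 2**29          # coordinate ceiling of the BAI binning scheme (512 Mbases)

def fits(fmt, value):
    """True if value can be stored in the given little-endian struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False

length = 5_000_000_000  # hypothetical 5 Gbase reference sequence
for name, fmt in [("int32", "<i"), ("uint32", "<I"), ("int64", "<q")]:
    print(f"{name}: length fits = {fits(fmt, length)}")

# An unsigned field cannot hold the -1 sentinel used for unmapped positions.
print("uint32 holds -1:", fits("<I", -1))
```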
I've had a go at implementing a BAM2 with 64-bit reference lengths. The results so far can be found in my bam2 branch if anyone wants to try it. Note that it's very much not for production - the implementation will change. The impact on file size isn't too bad. On my test file (original 587575801 bytes), I got: Uncompressed BAM: 2386240221 bytes While implementing it, I decided that a very useful addition would be a |
Can we just remove the TLEN field? From what I can see, nobody trusts the value written by any other tool so, practically speaking, it doesn't serve any purpose. |
@d-cameron Getting rid of We can get rid of |
Bin also has the issue that it is inherently tied to the BAI index, so it leaks some of the index format into the BAM format. This is an issue if we use CSI indices, which realistically we have to (or something equivalent) if we switch to longer chromosome lengths. Bin is basically dead and buried already as far as I'm concerned, as it only works on human data and smaller. |
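To make the coupling between bin and the BAI scheme concrete, here is the `reg2bin` calculation from the SAM specification transcribed into Python, as a sketch. The `MAX_BIN` constant and the example coordinates are added here for illustration. The function takes a zero-based, half-open interval and assumes coordinates below 2^29; beyond that it returns numbers outside the scheme's range.

```python
def reg2bin(beg, end):
    """Bin number for the zero-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0

# Highest bin the scheme defines: the last 16 kb bin below 2**29.
MAX_BIN = ((1 << 15) - 1) // 7 + (1 << 15) - 1

print(reg2bin(0, 100))                    # a short read near the start of a reference
print(reg2bin(2**29, 2**29 + 100) > MAX_BIN)  # past 512 Mb the result leaves the valid range
```

Since the value is a pure function of start and end, a reader or indexer can compute it on the fly rather than store it in each record.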
The fact that you might want to recalculate TLEN doesn't mean that there shouldn't be a good place to store it.
…On Thu, Sep 21, 2017 at 6:36 PM, Daniel Cameron ***@***.***> wrote:
Can we just remove the TLEN field? From what I can see, nobody trusts the
value written by any other tool so, practically speaking, it doesn't serve
any purpose.
|
Have you pushed your local changes to github? It seems that |
@lh3 The purpose of the experiment was to see how much bigger the file got when writing 64 bit values (not much). In the implementation I kept the size of the internal structures the same for ABI compatibility, but wrote out 64 bit values for In HTSlib, |
@d-cameron I'd hate to lose TLEN. I know exactly where it comes from in my pipelines, am happy with it, and can't recompute it without co-locating read pairs which is a huge PITA in a non-queryname sorted BAM. |
Agreed on TLEN. While it may be ill defined, within your own pipeline you can both define and use it in a robust manner. If it wasn't in its own column it'd need to be in an auxiliary tag instead. (Arguably better, but it's here already.) |
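The cost of recomputing TLEN in a file that is not queryname sorted can be sketched as follows. This is an illustration under stated assumptions, not any tool's method: records are reduced to hypothetical `(qname, start, end)` tuples on a single reference, and TLEN follows one possible convention (leftmost start to rightmost end, positive for the leftmost mate, negative for the other). As noted above, other pipelines define it differently.

```python
def compute_tlens(records):
    """Return TLEN per record, in input order, for (qname, start, end) tuples.

    Coordinates are zero-based half-open. Unpaired records keep TLEN 0.
    """
    pending = {}                 # qname -> (index, start, end) of the mate seen first
    tlens = [0] * len(records)
    for i, (qname, start, end) in enumerate(records):
        if qname not in pending:
            pending[qname] = (i, start, end)   # must be held until the mate shows up
            continue
        j, mstart, mend = pending.pop(qname)
        span = max(end, mend) - min(start, mstart)
        if mstart <= start:      # the earlier-seen mate is leftmost
            tlens[j], tlens[i] = span, -span
        else:
            tlens[j], tlens[i] = -span, span
    return tlens

records = [("a", 100, 150), ("b", 120, 170), ("a", 300, 350), ("b", 50, 100)]
print(compute_tlens(records))
```

The `pending` dictionary is the co-location step: in a coordinate-sorted file it can grow large when mates map far apart, which is the practical argument for keeping the value stored.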
I am thinking about a different way to implement BAMv2. It works on two conditions:
In this design, samtools writes BAMv1 by default. If samtools is writing to a file and there are long cigars, it writes the long cigars with, for example, the 0x8000 flag (or other approaches). At the end of the process, samtools seeks back and updates the version string to v2. If samtools is writing to a stream, it prints a warning in the presence of long cigars. Conversely, if a user sets the output format to BAMv2 but samtools doesn't see any long cigars, it either updates the version string to v1 (if to a file) or gives a warning at the end (if to a stream). We also develop a tool that reads through a BAM, analyzes each record and sets the version string to the minimal version that is compatible with all records. The 2nd condition is necessary here. This design reduces the chance of writing BAMv2 unnecessarily. Even if the BAM version in the header is not appropriate, we can fix it later without too much computation. These will help to reduce fragmentation. The above is just a brainstorm idea with details missing, but I think it could work, in theory. EDIT: if we go further along condition 2, we may abandon global versioning. We give each record a local version instead. The version can be encoded with Local versioning is essentially a generalization of option 3 in the other thread. It avoids the complication of setting the right global version on the command line (e.g. users may choose to always output BAMv2 even if not necessary) and allows us to extend BAM frequently without unnecessary fragmentation – ultimately, it is unnecessary fragmentation that worries me most. The drawback is also obvious, though, in that we don't know if a tool is compatible with a BAM by just looking at the header. Condition 1 will help, but it is not perfect, either, and is tricky to implement. |
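The brainstorm above can be sketched as two small functions. All names here are hypothetical and nothing below reflects actual samtools behaviour: `minimal_version` picks the lowest version able to hold every record, using the 16-bit cigar count limit as the only v2 trigger, and `finalize` decides what to do when the version declared in the header turns out to differ from the one needed.

```python
MAX_V1_CIGAR_OPS = 65535  # n_cigar_op is a 16-bit field in BAM v1

def minimal_version(cigar_op_counts):
    """Lowest (hypothetical) BAM version compatible with every record."""
    return 2 if any(n > MAX_V1_CIGAR_OPS for n in cigar_op_counts) else 1

def finalize(declared, needed, seekable):
    """Action at end of writing: keep, patch the header in place, or warn."""
    if declared == needed:
        return "ok"
    if seekable:
        return f"patch header to v{needed}"   # seek back and rewrite the version string
    return f"warn: stream declared v{declared} but records need v{needed}"

counts = [12, 300, 70000]   # hypothetical per-record cigar operation counts
needed = minimal_version(counts)
print(needed, "|", finalize(1, needed, seekable=True))
print(finalize(1, needed, seekable=False))
```

The same `minimal_version` scan is what a standalone fix-up tool would run over an existing file before rewriting its version string.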
Given #40 has a new BAMv2 as an option, we should start thinking about what else may be suitable to add. Some ideas to get the ball rolling.
A minor version number. Both major and minor are forcibly checked against, so we don't get in the same pickle as before.
More than 65535 cigar operations, obviously!
Remove the bin field. It doesn't work for anything with large chromosomes anyway, and only makes sense for BAI indices. Best computed on the fly. Suggestion: this frees up the extra bits needed for cigar, with a little shuffling of flag fields.
Swap cigar and name fields. Right now we have all the fixed size fields together, followed by the variable size read name and the multiple of 4 sized cigar field. BAM was intended to be workable within C/C++ by simply loading it straight into memory and dereferencing, but some compiler optimisations generate SIMD instructions for the cigar fields and then crash as it's unaligned data. (In memory we resolve this by adding 1-3 nuls to the read name instead of just 1.) Swapping the two fields cures this problem.
What else?
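The alignment problem behind the cigar/name swap suggested above can be shown with a little offset arithmetic. This is a sketch assuming the BAM v1 layout of 32 bytes of fixed-size fields after block_size, followed by the NUL-terminated read name and then the cigar array, and assuming the record itself starts on a 4-byte boundary in memory. The function names are made up for illustration.

```python
FIXED_FIELDS = 32  # bytes of fixed-size fields after block_size in a BAM v1 record

def cigar_offset_v1(read_name):
    """Cigar offset in the current layout: fixed fields, then name plus one NUL."""
    return FIXED_FIELDS + len(read_name) + 1

def cigar_offset_swapped():
    """Cigar offset if the cigar array came straight after the fixed fields."""
    return FIXED_FIELDS

for name in ["abc", "read1", "HWI-ST123:4:1101:1234:5678"]:
    off = cigar_offset_v1(name)
    print(f"{name!r}: cigar at byte {off}, 4-byte aligned = {off % 4 == 0}")
print("swapped layout aligned =", cigar_offset_swapped() % 4 == 0)
```

In the current layout the alignment of the 32-bit cigar words depends entirely on the read name length; with the fields swapped it holds for every record.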