-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lift internal CIGAR operation count limit from 16 bits to 32 bits #437
Comments
The problem is indeed in the BAM specification: the format only has 16 bits available to store the number of CIGAR operations. This is samtools/hts-specs#40. |
Squashing X and = operators into M's and breaking alignments at On Mon, Nov 21, 2016 at 3:59 PM, John Marshall [email protected]
|
There was discussion on how to avoid this by abusing the internal "bin" field instead, given this can be computed on-the-fly unlike cigar_len. Io_lib does something like this already it looks (grep for BAM_CIGAR32 flag), and I'm sure this was based on a discussion on the forums where this was considered to be the appropriate solution. [Edit no matter - just saw the linked post. Yeah that! It's already implemented elsewhere.] I think perhaps we could also consider changing the macros (thus breaking the ABI) to use an extra 32-bit variable rather than a 16-bit one. This permits formats other than BAM to then start supporting more than 65536 cigar operations. Right now CRAM supports 2 billion (if not 4), but gets squashed back to 64k due to having to construct a BAM record in memory. |
Let's keep discussion of how to shoehorn this into the BAM format over in the hts-specs issue. (It's interesting to hear the repurpose-bin proposal has actually been implemented in io_lib.) @jkbonfield Thanks for confirming that the CRAM format and its implementation in HTSlib lift this limit. So there's no reason why HTSlib shouldn't be capable of converting records with large CIGARs between SAM and CRAM. At present HTSlib's internal representation corresponds to the BAM format so shares its 64K CIGAR limit, but there is no intrinsic reason for this. So we should change |
Greetings,
We are commonly aligning contigs greater than 10Mb to chromosome arms. Aligners like Blasr use the 'X' operation in the cigar and this means the number of cigar operations exceed 16 bits. Thus, HTSLIB throws:
This leaves me with two options:
Cut contigs into pieces before alignment. This will result in lower alignment accuracy especially in repeat regions.
Create multiple alignment records from a single alignment.
Both of these options are not ideal.
I understand the problem is due to the spec. Perhaps someone should consider adding some extra bits for really large alignments?
Discussion on biostars.org:
https://www.biostars.org/p/223290/#223301
The text was updated successfully, but these errors were encountered: