Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lift internal CIGAR operation count limit from 16 bits to 32 bits #437

Closed
zeeev opened this issue Nov 21, 2016 · 4 comments
Closed

Lift internal CIGAR operation count limit from 16 bits to 32 bits #437

zeeev opened this issue Nov 21, 2016 · 4 comments

Comments

@zeeev
Copy link

zeeev commented Nov 21, 2016

Greetings,

We are commonly aligning contigs greater than 10Mb to chromosome arms. Aligners like Blasr use the 'X' operation in the cigar and this means the number of cigar operations exceed 16 bits. Thus, HTSLIB throws:

[E::sam_parse1] too many CIGAR operations
[W::sam_read1] parse error at line 201

This leaves me with two options:

  1. Cut contigs into pieces before alignment. This will result in lower alignment accuracy especially in repeat regions.

  2. Create multiple alignment records from a single alignment.

Both of these options are not ideal.

I understand the problem is due to the spec. Perhaps someone should consider adding some extra bits for really large alignments?

Discussion on biostars.org:

https://www.biostars.org/p/223290/#223301

@jmarshall
Copy link
Member

The problem is indeed in the BAM specification: the format only has 16 bits available to store the number of CIGAR operations. This is samtools/hts-specs#40.

@bioinformed
Copy link

Squashing X and = operators into M's and breaking alignments at
sufficiently long D's could help reduce the length, if encoding as BAM is
important.

On Mon, Nov 21, 2016 at 3:59 PM, John Marshall [email protected]
wrote:

The problem is indeed in the BAM specification: the format only has 16
bits available to store the number of CIGAR operations. This is
samtools/hts-specs#40 samtools/hts-specs#40.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#437 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABM-v73fF4AdJSijlUvLxHjMA3ybQ0Tvks5rAiJGgaJpZM4K4oVy
.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 22, 2016

There was discussion on how to avoid this by abusing the internal "bin" field instead, given this can be computed on-the-fly unlike cigar_len. Io_lib does something like this already it looks (grep for BAM_CIGAR32 flag), and I'm sure this was based on a discussion on the forums where this was considered to be the appropriate solution.

[Edit no matter - just saw the linked post. Yeah that! It's already implemented elsewhere.]

I think perhaps we could also consider changing the macros (thus breaking the ABI) to use an extra 32-bit variable rather than a 16-bit one. This permits formats other than BAM to then start supporting more than 65536 cigar operations. Right now CRAM supports 2 billion (if not 4), but gets squashed back to 64k due to having to construct a BAM record in memory.

@jmarshall
Copy link
Member

Let's keep discussion of how to shoehorn this into the BAM format over in the hts-specs issue. (It's interesting to hear the repurpose-bin proposal has actually been implemented in io_lib.)

@jkbonfield Thanks for confirming that the CRAM format and its implementation in HTSlib lift this limit. So there's no reason why HTSlib shouldn't be capable of converting records with large CIGARs between SAM and CRAM. At present HTSlib's internal representation corresponds to the BAM format so shares its 64K CIGAR limit, but there is no intrinsic reason for this. So we should change bam1_core_t's n_cigar field and/or its accessor macros to be able to represent 32 bits, and now is the time to do it as we are in an ABI-breaking window.

@jmarshall jmarshall reopened this Nov 22, 2016
@jmarshall jmarshall changed the title [E::sam_parse1] too many CIGAR operations Lift internal CIGAR operation count limit from 16 bits to 32 bits Nov 22, 2016
@jmarshall jmarshall added this to the 1.4 milestone Nov 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants