More than 65535 cigar opcodes #40

Closed
jkbonfield opened this issue Sep 11, 2014 · 195 comments

@jkbonfield
Contributor

There was discussion on the samtools-dev mailing list about this last year:

http://sourceforge.net/p/samtools/mailman/message/30672431/

The crux of that discussion was to reuse the 16-bit bin field as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.
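To make that concrete, here is a minimal reader-side sketch of the bin-reuse idea (this is only an illustration of the proposal being discussed, not an adopted change; BAM_FLONG_CIGAR is an invented name for the indicator bit):

```c
#include <stdint.h>

/* Hypothetical indicator bit ("the top flag bit" mentioned above). */
#define BAM_FLONG_CIGAR 0x8000u

/* Sketch: under the bin-reuse proposal, bin would supply the top 16 bits
 * of the operation count whenever the indicator bit is set. */
static inline uint32_t effective_n_cigar(uint16_t flag, uint16_t bin,
                                         uint16_t n_cigar)
{
    if (flag & BAM_FLONG_CIGAR)
        return ((uint32_t)bin << 16) | n_cigar;  /* bin = high half */
    return n_cigar;                              /* classic reading  */
}
```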

There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely a note to track the problem and collate ideas/fixes.

Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not far off for actual sequence reads. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but with higher indel rates, leading to substantially more CIGAR opcodes. A 320Kb read with a 10% indel rate would lead to roughly 2 transitions (D/I to M) every 10 bases, giving 64k CIGAR ops. (Those aren't figures for real technologies, but it's not inconceivable.)

@lh3
Member

lh3 commented Sep 11, 2014

We should implement this feature. I would prefer to keep bin for backward compatibility. This solution is a little awkward, but compatibility seems more important to me.

@nh13
Member

nh13 commented Sep 22, 2014

We do not use CSI indexes within HTSJDK and Picard, so I would have to think about how this affects our production pipeline, even as we are moving to CRAM. My apologies for not having a definitive yes/no answer at the moment.

@jkbonfield
Contributor Author

Do you trust and use the bin field though, or recompute it on the fly?

@nh13
Member

nh13 commented Sep 23, 2014

Looking at our code, it currently reads the bin from the BAM file. If validation stringency is turned on, which is most of the time, then it checks that the bin is correctly set. If the alignment is modified in memory, then the bin is recomputed before being written. We trust the bin fields since we are most likely the ones that computed them in the first place.

@jmarshall jmarshall added the sam label Nov 4, 2014
@nh13
Member

nh13 commented Dec 18, 2014

Looking through our code, we could also stop using the bin field and re-calculate it on the fly. We could then use a bit in the flag field to say whether or not the 16 bits in the bin field are reused. I would then suggest we get rid of bin in the specification. Please let us know how you would like us to proceed.
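For reference, recomputing bin on the fly is cheap — the SAM specification's own example code for the calculation is just a handful of comparisons (reproduced here; beg and end are the 0-based, half-open alignment interval):

```c
/* From the SAM specification: calculate the bin for an alignment
 * covering [beg, end) (zero-based, half-closed-half-open). */
int reg2bin(int beg, int end)
{
    --end;
    if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
    if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
    if (beg>>20 == end>>20) return ((1<<9)-1)/7  + (beg>>20);
    if (beg>>23 == end>>23) return ((1<<6)-1)/7  + (beg>>23);
    if (beg>>26 == end>>26) return ((1<<3)-1)/7  + (beg>>26);
    return 0;
}
```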

jmarshall added a commit to samtools/htslib that referenced this issue Dec 21, 2016
As various CIGAR-related loops and API functions use int, the practical
limit without further changes is 2^31-1 rather than 2^32-1.  As bam_init1()
uses calloc(), the unused field is initialised to 0 and we don't need to
bother re-zeroing it in sam_read1() et al.

Until the BAM format is extended to deal with it (samtools/hts-specs#40),
trying to write >64K CIGAR operations to a BAM file remains an error.

Implements and fixes #437.  (Update htslib/sam.h copyright notice for
changes in 2015 and 2016.  MinGW does not have EOVERFLOW.)
@nickloman

This is now an active issue with ONT data. I've provided a FASTA dataset here that will routinely run into this limit when aligned with BWA-MEM:

http://s3.climb.ac.uk/nanopore/ecoli_allreads.fasta

@jkbonfield
Contributor Author

jkbonfield commented Mar 6, 2017

Thanks for the real world examples Nick.

The htslib fix John made above solves this for SAM and CRAM, but not BAM due to file format limitations. I verified this on a CRAM file with 178801 cigar operations and the round trip from SAM -> CRAM -> SAM is identical in the latest develop branch.

Following discussions on samtools-devel several years back https://sourceforge.net/p/samtools/mailman/message/30668698/ I implemented this out of necessity in Staden "io_lib", so Scramble can read and write BAM records with more than 64k entries too. I'm not advocating using that, but it's a practical demonstration that the BAM spec could (and should?) be updated if there is pressure.

In the interim, have you thought of just using CRAM or compressed SAM? Note that you can use CRAM without needing to specify a reference, which may be more efficient with lots of reference differences anyway:

samtools view -O cram,no_ref -o test.cram test.sam

@jkbonfield
Contributor Author

A query on the proposed extra FLAG bit. Is this only the BAM format we are changing, or is it SAM also? If SAM, do we therefore expose it? E.g. a record with 100,000 cigar ops could have the SAM FLAG as >= 32768.

It feels like this is an internal BAM limitation and it should be in the BAM part of the spec and absent in the SAM part, as this is a binary encoding issue only.

@peterjc
Contributor

peterjc commented Mar 6, 2017

I would define the FLAG bit as reserved for BAM use as described, and say the bit should be clear in SAM.

@tfenne
Member

tfenne commented Mar 6, 2017

Given that we're likely to see more and more of this going forward, would it be crazy to consider a non-backwards compatible change to the BAM spec along with rev'ing the major version number? I think that would be much preferable in the long run to having a flag field that is reserved for BAM use and whose meaning is questionable for SAM/CRAM use. The latter means that all programs that convert between formats need to be smart about modifying the records prior to writing.

I would much prefer to see this done with a version bump in the spec tied to widening the ncigar field to 32 bits. Then implementations just need to know whether they are parsing a version < n or >= n BAM and expect the field of the appropriate width.

@lh3
Member

lh3 commented Mar 8, 2017

We have maintained backward and nearly forward compatibility of BAM for over 8 years. In my opinion, we should keep it this way as long as there is a solution -- I am always on the compatibility side. Quite a few tools specifically depend on an old version of samtools or use bamtools, native golang, javascript etc to parse BAMs. @jkbonfield's solution will keep them usable, which I think is very important.

@tfenne
Member

tfenne commented Mar 8, 2017

I guess I'm not sure I follow. Won't all those implementations break given @jkbonfield's solution and a BAM file that actually contains > 65535 cigar operations for at least one record? Unless those implementations (perhaps conditionally on a flag bit) know to use the 16 bits of ncigar plus the 16-bit bin field to make one 32-bit number, they'll just read the existing 16-bit ncigar field, interpret it as a number < 65535, and then fail when trying to parse the remainder of the cigar as something else.

The only kind of backwards compatibility it yields is that files that don't contain any cigars with ncigar > 65535 would still parse normally. I realize that's not a small fraction of files out there, but the point is that existing implementations would fail in bizarre ways given long cigars with reusing the bin field.

At the very least I think it would make sense to version the specification, and re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words). If we did this we:
a) Would retain the same level of backwards compatibility
b) Have something a little clearer to point to when old implementations fail (are you looking at a V2 BAM?)
c) Make it explicit that the bin field is gone
d) Have a somewhat less ambiguous spec moving forward
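For reference, the two words in question are currently packed like this (a sketch using the BAM spec's field names; the suggestion above is essentially to document these 8 bytes as one unit rather than two unrelated words):

```c
#include <stdint.h>

/* Existing BAM packing of the two 32-bit words under discussion. */
typedef struct {
    uint32_t bin_mq_nl; /* bin<<16 | MAPQ<<8 | l_read_name */
    uint32_t flag_nc;   /* FLAG<<16 | n_cigar              */
} bam_core_words;
```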

@lh3
Member

lh3 commented Mar 8, 2017

Yes, forward compatibility is only achievable when there are less than 65536 CIGAR operations, but this is the whole point: we want BAMs without long CIGARs usable by all the old tools, including tools relying on 3rd-party BAM parsing libraries written in other languages.

re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words).

I don't think this is possible without breaking compatibility. If I am right, bin and n_cigar are adjacent in memory, but in the wrong order to be read as a single 32-bit integer. We can't use one 32-bit integer to replace the two 16-bit integers. For compatibility, we can only say the higher 16 bits go to one field and the lower 16 bits to the other.

Make it explicit that the bin field is gone

bin is only gone when there are long CIGARs; otherwise it has the same meaning as before, again for compatibility.

@jkbonfield
Contributor Author

Yes, the order was wrong for a 32-bit integer (neither big- nor little- nor even PDP-endian). It's all in that email discussion from the original post. I think that also shows it was Heng's idea originally, not mine. I just implemented it. :)

Frankly, if we were going to break binary compatibility for ALL data, then we'd change lots of BAM things (e.g. swapping cigar and name around so we don't have to play games to avoid unaligned memory accesses), but it's just a fool's errand. We may as well start again with a new format, or work on improving the next CRAM iteration, rather than changing the BAM format in a non-compatible manner.

The only problem I see with using a flag is that you never really know whether you'll hit a record that old tools can't cope with until you find it, unlike changing the format. Hence I'd say we should bump the minor version number and leave the major version as is, keeping the file backwards compatible and stating that any tools that explicitly want to conform to the new minor version have to cope with the additional flag.

I did produce a test file using Scramble to see how samtools/htslib copes: badly. It started decoding the remaining CIGAR ops as sequence, emitted a bunch of binary qualities and then returned a 0 exit status for good measure. So we know this will break things horribly IF we encounter more than 64k cigar ops, but it's already hopelessly broken in that scenario anyway (re: impossible).

@colinhercus

colinhercus commented Mar 9, 2017 via email

@jkbonfield
Contributor Author

jkbonfield commented Mar 9, 2017

Agreed it's unclear. It would be friendlier to be in the main title of the document. I suspect the version with examples containing different HD lines was an accident, but clearly this is best avoided by putting the actual version at the top rather than in an example.

Largely the changes boil down to two things: substantive changes in file contents and/or meanings, and minor language adjustments or clarifications. The latter do not change the minor version number as the file format is unchanged; however, it would be good if there were some sub-revision counter somewhere so you know which era of document you are reading. Note also that SAM auxiliary tags have now been moved to their own document. Previously, adding new tags would likely have meant increasing the SAM minor version number.

It gets more confusing though with BAM. BAM has a magic number "BAM\1" and no minor version number at all. I assume it is expected to be gleaned from the "HD" header, but this isn't a required field.

For the purposes of this proposed change, it will be challenging unless we go down the route of losing forwards compatibility and switch to "BAM\2". The SAM specification should not, IMO, be adjusted to indicate the size change of ncigar or add the flag as this is purely an internal BAM representation and not a change needed to the SAM specification itself (which is entirely textual). Mind you, the same is true for the bin field too - it's entirely a BAM optimisation and not in SAM.

@colinhercus

colinhercus commented Mar 9, 2017 via email

@tfenne
Member

tfenne commented Mar 9, 2017

@colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read.

There are other options we could consider, rather than just going with the one solution proposed. E.g. we might consider that if you have > 65k cigar operations, the right way to handle it is to push the remaining cigar operations into an optional tag, and either terminate the cigar field with a soft-clip that covers the remaining bases, or with a new symbolic cigar operator that would a) inform up-to-date tools that there is more cigar elsewhere, and b) cause existing tools to terminate with an error because they don't recognize the cigar operator.

Re: ordering of bits and the bin field, I think I wasn't clear enough. I was suggesting two things:

  1. Given that there are situations in which the bin field cannot be relied upon, that the Sanger folks already don't seem to trust it, and that it can be re-computed from the record, I would prefer going forward not to have bin in the format description.
  2. I was suggesting re-writing the document to collapse those two descriptions of 32-bit chunks of memory to a single description of a 64-bit chunk of memory, and then to document more clearly how each byte or word is used, following the solution you guys outlined to use two 16-bit words for ncigar.

@jkbonfield
Contributor Author

jkbonfield commented Mar 9, 2017

If we were to bump the BAM version number then I'd say the default should be the current BAM\1 value (unless the input data was BAM\2, in which case retain that) with command line switches to specifically request the newer format. That keeps the vast bulk of data in a well supported format. It's still problematic though.

There is certainly merit in using auxiliary fields to work around this issue in a completely compatible manner - a nice idea. However, it migrates the problem from being BAM-specific to also affecting SAM and CRAM, unless we simply patch the issue in BAM only: when the BAM encoder meets a SAM cigar it cannot handle, it stores it in a new tag, and on decode it reverses this so that it is transparent to the users. This can also make it transparent to the APIs.

Regarding bin, it cannot be relied on in some cases at least; e.g. when using CSI because you have an extra-large chromosome. That means it's unreliable and best off computed on the fly using whatever index algorithm you have. We've had subtle bugs as well (cough, my bugs in CRAM!) which filled it out incorrectly in a few boundary cases.

@colinhercus

colinhercus commented Mar 10, 2017 via email

@lh3
Member

lh3 commented Jul 5, 2017

I am reviving this thread again. With the release of NA12878 ultra-long nanopore reads, long CIGARs are becoming more frequent. After some thought, I now prefer the following solution, as suggested by @tfenne. When converting an alignment with >65535 CIGAR operations to a record in a BAM file, we mark the alignment as unmapped (flag 0x4), set bam1_core_t::n_cigar to zero and move the entire CIGAR string to a tag CG:B,I. When newer htslib sees an unmapped record with a CG:B,I tag, it seamlessly generates the right CIGAR from the CG tag and removes the tag while reading. Normal htslib users would not notice any differences between short and long CIGARs. Older htslib will take records with >65535 CIGAR operations as unmapped and won't crash. Older htslib loses such long alignments, but it can't work with them anyway.

An alternative to CG:B,I is to simply use CG:Z, storing long CIGARs in their text form. I mildly prefer CG:B,I because it makes code faster (moving a long CIGAR can be achieved with memmove calls) and we usually don't see this tag written to a SAM file.
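A rough reader-side sketch of this scheme, assuming htslib's aux API (the real fix-up would happen in place inside the BAM reader; this stand-alone version just extracts a copy of the operations):

```c
#include <stdlib.h>
#include <htslib/sam.h>

/* Sketch only: pull the real CIGAR out of a CG:B,I tag for a record
 * written under this proposal (flag 0x4 set, n_cigar == 0).  Returns a
 * freshly allocated copy of the packed op|len words, or NULL if the
 * record does not carry a long CIGAR. */
uint32_t *cigar_from_cg_tag(const bam1_t *b, uint32_t *n_ops)
{
    uint8_t *cg;
    uint32_t i, n, *ops;

    if (!(b->core.flag & BAM_FUNMAP) || b->core.n_cigar != 0)
        return NULL;
    cg = bam_aux_get(b, "CG");
    if (cg == NULL || cg[0] != 'B' || cg[1] != 'I')   /* must be a B,I array */
        return NULL;

    n = bam_auxB_len(cg);
    if ((ops = malloc(n * sizeof(*ops))) == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        ops[i] = (uint32_t)bam_auxB2i(cg, i);         /* same packing as bam_get_cigar() */

    *n_ops = n;
    return ops;
}
```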

[EDIT for clarity]

@peterjc
Contributor

peterjc commented Jul 5, 2017

@lh3 you said in SAM, but I presume you mean in BAM here?:

"For an alignment with a long CIGAR in SAM, we mark the alignment as unmapped (flag 0x4), set ..."

i.e. There is no change when such long mapped reads are stored in SAM format which has no limits on the number of CIGAR operators.

[update - fixed]

@lh3
Member

lh3 commented Jul 5, 2017

@peterjc see edit for clarity.

@peterjc
Contributor

peterjc commented Jul 5, 2017

@lh3 that's better now, thanks!

@jkbonfield
Contributor Author

jkbonfield commented Jul 5, 2017

Thumbs up to this idea.

I think it's a workable way of addressing the problem in BAM. For SAM and CRAM no change needs to be made. I agree CG:B,I is appropriate as this field should never be present in SAM or CRAM, only BAM (we could even state that in the specification), in which case the normal BAM-specific encoding is the natural way to store it.

As per the SAM spec:

If [flag] 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ and bits 0x2, 0x100 and 0x800

By setting flag 0x4 we aren't technically breaking the specification, but we are adding some meaning to those fields, as the flag bit may later be removed, resurrecting those other fields once more. We'd need some footnote to explain this caveat.

@yfarjoun
Contributor

yfarjoun commented Jul 5, 2017 via email

@sjackman

sjackman commented Oct 26, 2017

!? Are you saying that IGV already uses the CIGAR in the CG tag if it's present?

@lh3
Member

lh3 commented Oct 26, 2017

No, it doesn't. I mean IGV recognizes 100000S200000N and displays the read as a solid bar as if the read were perfectly aligned. It does not show mismatches and gaps because it is not reading CG.

EDIT: displaying something, albeit clearly wrong, is better than displaying nothing on the screen.

@jkbonfield
Contributor Author

jkbonfield commented Oct 26, 2017

Would an old version of samtools sort and samtools index work, and avoid the problem of silently creating an incorrect file? What doesn't work if all the alignments in the BAM are long CIGAR? By doesn't work, I mean silently produces slightly incorrect results. I'm okay with no results or an error.

I think the main thing is depth, mpileup and similar tools. They'll omit the soft-clipped read, essentially acting as if the long reads had been filtered out. Mostly this won't give a different call, but it may. Realistically it'll only likely completely lose differences on poor quality calls, but it may also turn a call with qual 31 to qual 29 and make it vanish on a filter of QUAL >= 30; such is the nature of using hard thresholds. The chances are slim and probably irrelevant, but you can never tell when it mattered. As Heng points out there, there are several other pitfalls that have the same impact, such as running an older copy of the aligner or simply an inappropriate one for long reads.

Basically using old tools is equivalent to filtering out the long reads first, but with the added nuance of (probably) not realising you're doing it. Using BAM2 is the equivalent of all your old tools failing hard, forcing you to either upgrade if available or manually remove the long cigars and generate BAM1 before running the old tools again.

@sjackman

My preference is to use entirely long cigars or no long cigars to avoid silently dropping just the long alignments.

@lh3
Member

lh3 commented Oct 26, 2017

Update to my proposal with inputs from James and Shaun:

  • If there is a CIGAR with >65535 ops, write <readLen>S<refLen>N in the CIGAR field and move the real CIGAR to the CG tag (see the sketch at the end of this comment).

  • If there is a CG tag, the SAM header is required to have CG:Y on the HD line; otherwise, samtools aborts when it writes a BAM with long CIGARs.

With this solution:

  • Old tools that don't look at CIGAR will work properly. This includes sort, merge and flagstat

  • Old tools that only use CIGAR to compute the end of alignment will work properly. This includes index, random retrieval (e.g. bedtools2 intersectBED).

  • Old IGV semi-works. It displays an alignment without gaps or mismatches. At least you know there is something weird there.

  • When using new samtools, you have to take actions to generate a BAM containing long cigars. For example, you will need to run minimap2 with -L or ngmlr with --bam-fix in order to generate output convertible to BAM. This will make users more aware of the issue.

Of course, you can still make up examples that lead to silent loss of data, but as I said above, there are much easier ways to make much bigger mistakes, for example, by using an old mapper. We should not be obsessed with "absolute robustness at any cost". Some costs, like the risk of fragmenting the BAM ecosystem without careful planning, are too high for us to afford.

The above is the best solution.
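A rough writer-side sketch of the placeholder construction (assuming htslib's CIGAR macros; moving the real operations into the CG:B,I tag and adding the HD CG:Y line are omitted):

```c
#include <htslib/sam.h>

/* Sketch only: build the two-operation placeholder CIGAR that stands in
 * for a long CIGAR in BAM under this proposal.  The real operations would
 * then be moved into a CG:B,I tag (not shown).  ref_len is the number of
 * reference bases consumed by the real CIGAR, so <readLen>S<refLen>N
 * preserves both the query length and the alignment end coordinate. */
static void make_placeholder_cigar(const uint32_t *real_cigar, uint32_t n_real,
                                   int32_t l_qseq, uint32_t placeholder[2])
{
    uint32_t i, ref_len = 0;
    for (i = 0; i < n_real; i++)
        if (bam_cigar_type(bam_cigar_op(real_cigar[i])) & 2)  /* consumes reference */
            ref_len += bam_cigar_oplen(real_cigar[i]);

    placeholder[0] = bam_cigar_gen(l_qseq, BAM_CSOFT_CLIP);   /* <readLen>S */
    placeholder[1] = bam_cigar_gen(ref_len, BAM_CREF_SKIP);   /* <refLen>N  */
}
```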

@lh3
Member

lh3 commented Oct 26, 2017

@sjackman I don't object. I will leave header tag and/or clipping every read to you and others after we settle on full clipping. We will move forward quickly after that.

@dkj
Member

dkj commented Oct 27, 2017

Meta-status:

  • everyone wants a clear decision from the format maintainers on the proposals at their meeting next week
  • this thread is valuable
    • there are risks with both leading proposals, but many are now mitigated to varying degrees thanks to this thread
    • this recorded process of challenging proposals, finding, assessing and mitigating their risks, shows there is rigor to the SAM/BAM/CRAM standard
      • I hope it will make other proposed standards, which are less amenable to the research world, less likely to take hold - particularly important with the impending tsunami of clinical data
      • there's scope for more quantification of the risks
  • we're missing input from vendors and public archives aren't we?

@lh3
Member

lh3 commented Oct 27, 2017

EBI had archiveBAM. It was proposed in early 2011 but seems not to be used any more. ArchiveBAM was researcher-oriented. Even though it didn't fly, I think the same idea might work in the clinical world, where we care about robustness a lot more.

We could have clinicalSAM, a strict subset of SAM with additional restrictions such as mandated HD-SO, SQ-M5, RG-PL, no non-standard tags, etc. We ought to implement a tool that validates and converts to clinicalSAM. GA4GH can then set it as a standard for clinical sequence data. The research and clinical worlds are quite different. From the discussion here, it seems to me that a single standard won't work well for both worlds at the same time. ClinicalSAM might be one of the potential solutions. Thoughts?

we're missing input from vendors and public archives aren't we?

Derek from PacBio (author of bamtools and the most significant contributor to pbbam) is largely neutral. Generally, vendors and archives rely on public standards and implementations. Their interest is to make their tools/data seamlessly interchangeable with the rest of the world. It is good to hear from them, but this is not critical to the discussion in my opinion.

@pezmaster31

Not speaking officially for PacBio, of course. But yes, you nailed it, Heng.

Our interest is mostly that the issue is solved and standardized. Something we can code to and ship data/software out with minimal concern. Unless/until I hear differently from upstairs, the specific implementation details are less critical.

@jkbonfield
Contributor Author

In preparation for the Thursday conference call I've produced a summary of the two main proposals for a solution to this problem. It's long and has had a round of review, so it's a Google doc.

https://docs.google.com/document/d/1PzH9k5umvraI_ounpcKcUdawj7Ufs-N-jul77xDs1dI

The aim of writing this document is primarily to list the solutions and their associated risks so that we can discuss our preferred implementation on Thursday from a basis of knowledge rather than gut feelings or unsupported conjecture.

Assessing risk likelihood and severity is of course subjective, although I've tried to be neutral in this, but irrespective of that, just listing the potential problems and ways of mitigating them is important so we can make our implementation as robust as possible. Also note that most likelihoods are conditional - conditional on not updating software and obviously conditional on being one of the affected people with extra-long cigars present in their data.

@lh3
Member

lh3 commented Oct 31, 2017

I have commented on an earlier version of this document. Some of my concerns have been addressed. Here are some different views on estimating likelihood and severity as I can't comment on the document any more.

The expected number of ultra-long read users who actually make a mistake by using older software will be 10s at most. In the next year or two, the expected number of users who may see BAM2, even if unnecessarily, is in the 100s at least, likely 1000s - a couple of orders of magnitude higher.

I strongly disagree with the severity ratings because many of them can't be measured and are too subjective. You can consider the following scenarios:

  1. In the worst case, full clipping leads to 1% of randomly missing bases. This reduces variant sensitivity by ~0.01% given that you need multiple occurrences to call a variant.
  2. Using an old version of a mapper has a broadly similar effect: sometimes lower (e.g. no effect with rare bug fixes), sometimes higher (e.g. recent minimap2 improves direct RNA junction accuracy and sensitivity by ~5%).
  3. Changing a reference genome (e.g. with hs37d5/unplaced contigs or not), replacing a mapper, applying quality recalibration or not all affect the final variant accuracy at the scale of ~1%.
  4. It is known that changing short-read RNA-seq mappers has a more dramatic effect.

You can objectively compare the severity of these scenarios, at least the first 3; scenario 1 is the lowest. Now consider what may happen when you break the BAM ecosystem:

  5. Your pipeline is broken because some tools don't read BAM2 and won't receive updates.
  6. In many cases, you can mitigate the problem above by converting BAM2 to BAM1 first, which takes several hours for a human data set.
  7. In other cases, you have to wait for several months or years until the tools catch up with BAM2. In particular, ultra-long read users need to wait.
  8. We ask tool developers to upgrade multiple times: first with long-cigar support and last with full-fledged BAM. Tools will stop working with each other during this process.

How do you compare a broken pipeline vs changing a mapper? How do you weigh waiting several hours to downgrade BAM2 vs using an old version of software? The severity of points 5–8 is highly subjective and varies a lot between users. To me, the severity of 1 is the lowest and 2–4 is lower than 5–8, but the document clearly thinks differently.

dkj added a commit to dkj/hts-specs that referenced this issue Nov 2, 2017
dkj added a commit to dkj/hts-specs that referenced this issue Nov 2, 2017
@jkbonfield
Contributor Author

A quick summary for the benefit of those not on the conference call:

We discussed this at length on the GA4GH conference call and I'm pleased to say, 4 years after the initial report(!), we came to a unanimous agreement between the 3 SAM spec maintainers (albeit definitely not unanimous within the audience).

Our solution is to go ahead with CG tag (#227). While the concept has now been accepted, we still need to review the text in the specification itself, and some code PRs need updating to incorporate the "N" cigar idea. Hopefully this will not be a lengthy process.

(Keeping this issue open until spec PR #227 is merged.)

@nh13
Member

nh13 commented Nov 2, 2017

@jkbonfield who are the current three spec maintainers? Kudos to them for a unanimous decision.

@jmarshall
Member

@nh13: See MAINTAINERS.md.

lh3 added a commit to lh3/hts-specs that referenced this issue Nov 18, 2017
This commit addresses samtools#40. It added optional tag `CG` and
explained the workaround to store alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.
@jkbonfield
Contributor Author

Resolved in #227 some time ago.

@yfarjoun
Contributor

yfarjoun commented May 7, 2018

htsjdk supports this now samtools/htsjdk@a00b958
