More than 65535 cigar opcodes #40

Closed
jkbonfield opened this issue Sep 11, 2014 · 195 comments

@jkbonfield
Contributor

There was discussion on the samtools-dev mailing list about this last year:

http://sourceforge.net/p/samtools/mailman/message/30672431/

The crux of that discussion was to reuse the 16-bit bin field as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.
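To make that concrete, here is a minimal reader-side sketch of the bin-reuse idea (this is only an illustration of the proposal being discussed, not an adopted change; BAM_FLONG_CIGAR is an invented name for the indicator bit):

```c
#include <stdint.h>

/* Hypothetical indicator bit ("the top flag bit" mentioned above). */
#define BAM_FLONG_CIGAR 0x8000u

/* Sketch: under the bin-reuse proposal, bin would supply the top 16 bits
 * of the operation count whenever the indicator bit is set. */
static inline uint32_t effective_n_cigar(uint16_t flag, uint16_t bin,
                                         uint16_t n_cigar)
{
    if (flag & BAM_FLONG_CIGAR)
        return ((uint32_t)bin << 16) | n_cigar;  /* bin = high half */
    return n_cigar;                              /* classic reading  */
}
```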

There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely a note to track the problem and collate ideas/fixes.

Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not far off for actual sequence reads. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but with higher indel rates, leading to substantially more CIGAR opcodes. A 320Kb read with a 10% indel rate would lead to roughly 2 transitions (D/I to M) every 10 bases, giving 64k CIGAR ops. (Those aren't figures for real technologies, but it's not inconceivable.)

@lh3
Member

lh3 commented Sep 11, 2014

We should implement this feature. I would prefer to keep bin for backward compatibility. This solution is a little awkward, but compatibility seems more important to me.

@nh13
Member

nh13 commented Sep 22, 2014

We do not use CSI indexes within HTSJDK and Picard, so I would have to think about how this affects our production pipeline, even as we are moving to CRAM. My apologies for not having a definitive yes/no answer at the moment.

@jkbonfield
Contributor Author

Do you trust and use the bin field though, or recompute it on the fly?

@nh13
Member

nh13 commented Sep 23, 2014

Looking at our code, it currently reads the bin from the BAM file. If validation stringency is turned on, which is most of the time, then it checks that the bin is correctly set. If the alignment is modified in memory, then the bin is recomputed before being written. We trust the bin fields since we are most likely the ones that computed them in the first place.

@jmarshall jmarshall added the sam label Nov 4, 2014
@nh13
Member

nh13 commented Dec 18, 2014

Looking through our code, we could also stop using the bin field and re-calculate it on the fly. We could then use a bit in the flag field to say whether or not the 16 bits in the bin field are reused. I would then suggest we get rid of bin in the specification. Please let us know how you would like us to proceed.
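For reference, recomputing bin on the fly is cheap — the SAM specification's own example code for the calculation is just a handful of comparisons (reproduced here; beg and end are the 0-based, half-open alignment interval):

```c
/* From the SAM specification: calculate the bin for an alignment
 * covering [beg, end) (zero-based, half-closed-half-open). */
int reg2bin(int beg, int end)
{
    --end;
    if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
    if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
    if (beg>>20 == end>>20) return ((1<<9)-1)/7  + (beg>>20);
    if (beg>>23 == end>>23) return ((1<<6)-1)/7  + (beg>>23);
    if (beg>>26 == end>>26) return ((1<<3)-1)/7  + (beg>>26);
    return 0;
}
```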

jmarshall added a commit to samtools/htslib that referenced this issue Dec 21, 2016
As various CIGAR-related loops and API functions use int, the practical
limit without further changes is 2^31-1 rather than 2^32-1.  As bam_init1()
uses calloc(), the unused field is initialised to 0 and we don't need to
bother re-zeroing it in sam_read1() et al.

Until the BAM format is extended to deal with it (samtools/hts-specs#40),
trying to write >64K CIGAR operations to a BAM file remains an error.

Implements and fixes #437.  (Update htslib/sam.h copyright notice for
changes in 2015 and 2016.  MinGW does not have EOVERFLOW.)
@nickloman

This is now an active issue with ONT data. I've provided a FASTA dataset here that will routinely run into this limit when aligned with BWA-MEM:

http://s3.climb.ac.uk/nanopore/ecoli_allreads.fasta

@jkbonfield
Contributor Author

jkbonfield commented Mar 6, 2017

Thanks for the real world examples Nick.

The htslib fix John made above solves this for SAM and CRAM, but not BAM due to file format limitations. I verified this on a CRAM file with 178801 cigar operations and the round trip from SAM -> CRAM -> SAM is identical in the latest develop branch.

Following discussions on samtools-devel several years back https://sourceforge.net/p/samtools/mailman/message/30668698/ I implemented this out of necessity in Staden "io_lib", so Scramble can read and write BAM records with more than 64k entries too. I'm not advocating using that, but it's a practical demonstration that the BAM spec could (and should?) be updated if there is pressure.

In the interim, have you thought of just using CRAM or compressed SAM? Note that you can use CRAM without needing to specify a reference, which may be more efficient with lots of reference differences anyway:

samtools view -O cram,no_ref -o test.cram test.sam

@jkbonfield
Contributor Author

A query on the proposed extra FLAG bit. Is this only the BAM format we are changing, or is it SAM also? If SAM, do we therefore expose it? E.g. a record with 100,000 cigar ops could have the SAM FLAG as >= 32768.

It feels like this is an internal BAM limitation and it should be in the BAM part of the spec and absent in the SAM part, as this is a binary encoding issue only.

@peterjc
Contributor

peterjc commented Mar 6, 2017

I would define the FLAG bit as reserved for BAM use as described, and say the bit should be clear in SAM.

@tfenne
Member

tfenne commented Mar 6, 2017

Given that we're likely to see more and more of this going forward, would it be crazy to consider a non-backwards compatible change to the BAM spec along with rev'ing the major version number? I think that would be much preferable in the long run to having a flag field that is reserved for BAM use and whose meaning is questionable for SAM/CRAM use. The latter means that all programs that convert between formats need to be smart about modifying the records prior to writing.

I would much prefer to see this done with a version bump in the spec tied to widening the ncigar field to 32 bits. Then implementations just need to know whether they are parsing a version < n or >= n BAM and expect the field of the appropriate width.

@lh3
Member

lh3 commented Mar 8, 2017

We have maintained backward and nearly forward compatibility of BAM for over 8 years. In my opinion, we should keep it this way as long as there is a solution -- I am always on the compatibility side. Quite a few tools specifically depend on an old version of samtools or use bamtools, native golang, javascript etc to parse BAMs. @jkbonfield's solution will keep them usable, which I think is very important.

@tfenne
Member

tfenne commented Mar 8, 2017

I guess I'm not sure I follow. Won't all those implementations break given @jkbonfield's solution and a BAM file that actually contains > 65535 cigar operations for at least one record? Unless those implementations (perhaps conditionally on a flag bit) know to use the 16 bits of ncigar plus the 16-bit bin field to make one 32-bit number, they'll just read the existing 16-bit ncigar field, interpret it as a number < 65535, and then fail when trying to parse the remainder of the cigar as something else.

The only kind of backwards compatibility it yields is that files that don't contain any cigars with ncigar > 65535 would still parse normally. I realize that's not a small fraction of files out there, but the point is that existing implementations would fail in bizarre ways given long cigars with reusing the bin field.

At the very least I think it would make sense to version the specification, and re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words). If we did this we:
a) Would retain the same level of backwards compatibility
b) Have something a little clearer to point to when old implementations fail (are you looking at a V2 BAM?)
c) Make it explicit that the bin field is gone
d) Have a somewhat less ambiguous spec moving forward
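For reference, the two words in question are currently packed like this (a sketch using the BAM spec's field names; the suggestion above is essentially to document these 8 bytes as one unit rather than two unrelated words):

```c
#include <stdint.h>

/* Existing BAM packing of the two 32-bit words under discussion. */
typedef struct {
    uint32_t bin_mq_nl; /* bin<<16 | MAPQ<<8 | l_read_name */
    uint32_t flag_nc;   /* FLAG<<16 | n_cigar              */
} bam_core_words;
```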

@lh3
Member

lh3 commented Mar 8, 2017

Yes, forward compatibility is only achievable when there are less than 65536 CIGAR operations, but this is the whole point: we want BAMs without long CIGARs usable by all the old tools, including tools relying on 3rd-party BAM parsing libraries written in other languages.

re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words).

I don't think this is possible without breaking compatibility. If I am right, bin and n_cigar are adjacent in memory, but in the wrong order to be read as a single 32-bit integer. We can't use one 32-bit integer to replace the two 16-bit integers. For compatibility, we can only say the higher 16 bits go to one field and the lower 16 bits to the other.

Make it explicit that the bin field is gone

bin is only gone when there are long CIGARs; otherwise it has the same meaning as before, again for compatibility.

@jkbonfield
Contributor Author

Yes, the order was wrong for a 32-bit integer (neither big- nor little- nor even PDP-endian). It's all in that email discussion from the original post. I think that also shows it was Heng's idea originally, not mine. I just implemented it. :)

Frankly, if we were going to break binary compatibility for ALL data, then we'd change lots of BAM things (e.g. swapping cigar and name around so we don't have to play games to avoid unaligned memory accesses), but it's just a fool's errand. We may as well start again with a new format, or work on improving the next CRAM iteration, rather than changing the BAM format in a non-compatible manner.

The only problem I see with using a flag is that you never really know whether you'll hit a record that old tools can't cope with until you find it, unlike changing the format. Hence I'd say we should bump the minor version number and leave the major version as is, keeping the file backwards compatible and stating that any tools that explicitly want to conform to the new minor version have to cope with the additional flag.

I did produce a test file using Scramble to see how samtools/htslib copes: badly. It started decoding the remaining CIGAR ops as sequence, emitted a bunch of binary qualities and then returned a 0 exit status for good measure. So we know this will break things horribly IF we encounter more than 64k cigar ops, but it's already hopelessly broken in that scenario anyway (re: impossible).

@colinhercus

colinhercus commented Mar 9, 2017 via email

@jkbonfield
Contributor Author

jkbonfield commented Mar 9, 2017

Agreed it's unclear. It would be friendlier to be in the main title of the document. I suspect the version with examples containing different HD lines was an accident, but clearly this is best avoided by putting the actual version at the top rather than in an example.

Largely the changes boil down to two things: substantive changes in file contents and/or meanings, and minor language adjustments or clarifications. The latter do not change the minor version number as the file format is unchanged; however, it would be good if there were some sub-revision counter somewhere so you know which era of document you are reading. Note also that SAM auxiliary tags have now been moved to their own document. Previously, adding new tags would likely have meant increasing the SAM minor version number.

It gets more confusing though with BAM. BAM has a magic number "BAM\1" and no minor version number at all. I assume it is expected to be gleaned from the "HD" header, but this isn't a required field.

For the purposes of this proposed change, it will be challenging unless we go down the route of losing forwards compatibility and switch to "BAM\2". The SAM specification should not, IMO, be adjusted to indicate the size change of ncigar or add the flag as this is purely an internal BAM representation and not a change needed to the SAM specification itself (which is entirely textual). Mind you, the same is true for the bin field too - it's entirely a BAM optimisation and not in SAM.

@colinhercus

colinhercus commented Mar 9, 2017 via email

@tfenne
Member

tfenne commented Mar 9, 2017

@colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read.

There are other options we could consider, rather than just going with the one solution proposed. E.g. we might consider that if you have > 65k cigar operations, the right way to handle it is to push the remaining cigar operations into an optional tag, and either terminate the cigar field with a soft-clip that covers the remaining bases, or with a new symbolic cigar operator that would a) inform up-to-date tools that there is more cigar elsewhere, and b) cause existing tools to terminate with an error because they don't recognize the cigar operator.

Re: ordering of bits and the bin field, I think I wasn't clear enough. I was suggesting two things:

  1. Given that there are situations in which the bin field cannot be relied upon, that the Sanger folks already don't seem to trust it, and that it can be re-computed from the record, I would prefer going forward not to have bin in the format description.
  2. I was suggesting re-writing the document to collapse those two descriptions of 32-bit chunks of memory to a single description of a 64-bit chunk of memory, and then to document more clearly how each byte or word is used, following the solution you guys outlined to use two 16-bit words for ncigar.

@jkbonfield
Contributor Author

jkbonfield commented Mar 9, 2017

If we were to bump the BAM version number then I'd say the default should be the current BAM\1 value (unless the input data was BAM\2, in which case retain that) with command line switches to specifically request the newer format. That keeps the vast bulk of data in a well supported format. It's still problematic though.

There is certainly merit in using auxiliary fields to work around this issue in a completely compatible manner - a nice idea. However, it migrates the problem from being BAM-specific to also affecting SAM and CRAM, unless we simply patch the issue in BAM only: when the BAM encoder meets a SAM cigar it cannot handle, it stores it in a new tag, and on decode it reverses this so that it is transparent to the users. This can also make it transparent to the APIs.

Regarding bin, it cannot be relied on in some cases at least; e.g. when using CSI because you have an extra-large chromosome. That means it's unreliable and best off computed on the fly using whatever index algorithm you have. We've had subtle bugs as well (cough, my bugs in CRAM!) which filled it out incorrectly in a few boundary cases.

@colinhercus

colinhercus commented Mar 10, 2017 via email

@lh3
Member

lh3 commented Jul 5, 2017

I am reviving this thread again. With the release of NA12878 ultra-long nanopore reads, long CIGARs are becoming more frequent. After some thought, I now prefer the following solution, as suggested by @tfenne. When converting an alignment with >65535 CIGAR operations to a record in a BAM file, we mark the alignment as unmapped (flag 0x4), set bam1_core_t::n_cigar to zero and move the entire CIGAR string to a tag CG:B,I. When newer htslib sees an unmapped record with a CG:B,I tag, it seamlessly generates the right CIGAR from the CG tag and removes the tag while reading. Normal htslib users would not notice any differences between short and long CIGARs. Older htslib will take records with >65535 CIGAR operations as unmapped and won't crash. Older htslib loses such long alignments, but it can't work with them anyway.

An alternative to CG:B,I is to simply use CG:Z, storing long CIGARs in their text form. I mildly prefer CG:B,I because it makes code faster (moving a long CIGAR can be achieved with memmove calls) and we usually don't see this tag written to a SAM file.
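A rough reader-side sketch of this scheme, assuming htslib's aux API (the real fix-up would happen in place inside the BAM reader; this stand-alone version just extracts a copy of the operations):

```c
#include <stdlib.h>
#include <htslib/sam.h>

/* Sketch only: pull the real CIGAR out of a CG:B,I tag for a record
 * written under this proposal (flag 0x4 set, n_cigar == 0).  Returns a
 * freshly allocated copy of the packed op|len words, or NULL if the
 * record does not carry a long CIGAR. */
uint32_t *cigar_from_cg_tag(const bam1_t *b, uint32_t *n_ops)
{
    uint8_t *cg;
    uint32_t i, n, *ops;

    if (!(b->core.flag & BAM_FUNMAP) || b->core.n_cigar != 0)
        return NULL;
    cg = bam_aux_get(b, "CG");
    if (cg == NULL || cg[0] != 'B' || cg[1] != 'I')   /* must be a B,I array */
        return NULL;

    n = bam_auxB_len(cg);
    if ((ops = malloc(n * sizeof(*ops))) == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        ops[i] = (uint32_t)bam_auxB2i(cg, i);         /* same packing as bam_get_cigar() */

    *n_ops = n;
    return ops;
}
```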

[EDIT for clarity]

@peterjc
Contributor

peterjc commented Jul 5, 2017

@lh3 you said in SAM, but I presume you mean in BAM here?:

"For an alignment with a long CIGAR in SAM, we mark the alignment as unmapped (flag 0x4), set ..."

i.e. There is no change when such long mapped reads are stored in SAM format which has no limits on the number of CIGAR operators.

[update - fixed]

@lh3
Member

lh3 commented Jul 5, 2017

@peterjc see edit for clarity.

@peterjc
Contributor

peterjc commented Jul 5, 2017

@lh3 that's better now, thanks!

@jkbonfield
Contributor Author

jkbonfield commented Jul 5, 2017

Thumbs up to this idea.

I think it's a workable way of addressing the problem in BAM. For SAM and CRAM no change needs to be made. I agree CG:B,I is appropriate as this field should never be present in SAM or CRAM, only BAM (we could even state that in the specification), in which case the normal BAM-specific encoding is the natural way to store it.

As per the SAM spec:

If [flag] 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ and bits 0x2, 0x100 and 0x800

By setting flag 0x4 we aren't technically breaking the specification, but we are adding some meaning to those fields, as the flag bit may later be removed, resurrecting those other fields once more. We'd need some footnote to explain this caveat.

@yfarjoun
Contributor

yfarjoun commented Jul 5, 2017 via email

@sjackman

sjackman commented Oct 26, 2017

!? Are you saying that IGV already uses the CIGAR in the CG tag if it's present?

@lh3
Member

lh3 commented Oct 26, 2017

No, it doesn't. I mean IGV recognizes 100000S200000N and displays the read as a solid bar as if the read were perfectly aligned. It does not show mismatches and gaps because it is not reading CG.

EDIT: displaying something, albeit clearly wrong, is better than displaying nothing on the screen.

@jkbonfield
Contributor Author

jkbonfield commented Oct 26, 2017

Would an old version of samtools sort and samtools index work, and avoid the problem of silently creating an incorrect file? What doesn't work if all the alignments in the BAM are long CIGAR? By doesn't work, I mean silently produces slightly incorrect results. I'm okay with no results or an error.

I think the main thing is depth, mpileup and similar tools. They'll omit the soft-clipped read, essentially acting as if the long reads had been filtered out. Mostly this won't give a different call, but it may. Realistically it'll only likely completely lose differences on poor quality calls, but it may also turn a call with qual 31 to qual 29 and make it vanish on a filter of QUAL >= 30; such is the nature of using hard thresholds. The chances are slim and probably irrelevant, but you can never tell when it mattered. As Heng points out there, there are several other pitfalls that have the same impact, such as running an older copy of the aligner or simply an inappropriate one for long reads.

Basically using old tools is equivalent to filtering out the long reads first, but with the added nuance of (probably) not realising you're doing it. Using BAM2 is the equivalent of all your old tools failing hard, forcing you to either upgrade if available or manually remove the long cigars and generate BAM1 before running the old tools again.

@sjackman

My preference is to use entirely long cigars or no long cigars to avoid silently dropping just the long alignments.

@lh3
Member

lh3 commented Oct 26, 2017

Update to my proposal with inputs from James and Shaun:

  • If there is a CIGAR with >65535 ops, write <readLen>S<refLen>N in the CIGAR field and move the real CIGAR to the CG tag (see the sketch at the end of this comment).

  • If there is a CG tag, the SAM header is required to have CG:Y on the HD line; otherwise, samtools aborts when it writes a BAM with long CIGARs.

With this solution:

  • Old tools that don't look at CIGAR will work properly. This includes sort, merge and flagstat

  • Old tools that only use CIGAR to compute the end of alignment will work properly. This includes index, random retrieval (e.g. bedtools2 intersectBED).

  • Old IGV semi-works. It displays an alignment without gaps or mismatches. At least you know there is something weird there.

  • When using new samtools, you have to take actions to generate a BAM containing long cigars. For example, you will need to run minimap2 with -L or ngmlr with --bam-fix in order to generate output convertible to BAM. This will make users more aware of the issue.

Of course, you can still make up examples that lead to silent loss of data, but as I said above, there are much easier ways to make much bigger mistakes, for example, by using an old mapper. We should not be obsessed with "absolute robustness at any cost". Some costs, like the risk of fragmenting the BAM ecosystem without careful planning, are too high for us to afford.

The above is the best solution.
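A rough writer-side sketch of the placeholder construction (assuming htslib's CIGAR macros; moving the real operations into the CG:B,I tag and adding the HD CG:Y line are omitted):

```c
#include <htslib/sam.h>

/* Sketch only: build the two-operation placeholder CIGAR that stands in
 * for a long CIGAR in BAM under this proposal.  The real operations would
 * then be moved into a CG:B,I tag (not shown).  ref_len is the number of
 * reference bases consumed by the real CIGAR, so <readLen>S<refLen>N
 * preserves both the query length and the alignment end coordinate. */
static void make_placeholder_cigar(const uint32_t *real_cigar, uint32_t n_real,
                                   int32_t l_qseq, uint32_t placeholder[2])
{
    uint32_t i, ref_len = 0;
    for (i = 0; i < n_real; i++)
        if (bam_cigar_type(bam_cigar_op(real_cigar[i])) & 2)  /* consumes reference */
            ref_len += bam_cigar_oplen(real_cigar[i]);

    placeholder[0] = bam_cigar_gen(l_qseq, BAM_CSOFT_CLIP);   /* <readLen>S */
    placeholder[1] = bam_cigar_gen(ref_len, BAM_CREF_SKIP);   /* <refLen>N  */
}
```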

@lh3
Member

lh3 commented Oct 26, 2017

@sjackman I don't object. I will leave header tag and/or clipping every read to you and others after we settle on full clipping. We will move forward quickly after that.

@dkj
Member

dkj commented Oct 27, 2017

Meta-status:

  • everyone wants a clear decision from the format maintainers on the proposals at their meeting next week
  • this thread is valuable
    • there are risks with both leading proposals, but many are now mitigated to varying degrees thanks to this thread
    • this recorded process of challenging proposals, finding, assessing and mitigating their risks, shows there is rigor to the SAM/BAM/CRAM standard
      • I hope it will make other proposed standards, which are less amenable to the research world, less likely to take hold - particularly important with the impending tsunami of clinical data
      • there's scope for more quantification of the risks
  • we're missing input from vendors and public archives aren't we?

@lh3
Member

lh3 commented Oct 27, 2017

EBI had archiveBAM. It was proposed in early 2011 but seems not to be used any more. ArchiveBAM was researcher-oriented. Even though it didn't fly, I think the same idea might work in the clinical world, where we care about robustness a lot more.

We could have clinicalSAM, a strict subset of SAM with additional restrictions such as mandated HD-SO, SQ-M5, RG-PL, no non-standard tags, etc. We ought to implement a tool that validates and converts to clinicalSAM. GA4GH can then set it as a standard for clinical sequence data. The research and clinical worlds are quite different. From the discussion here, it seems to me that a single standard won't work well for both worlds at the same time. ClinicalSAM might be one of the potential solutions. Thoughts?

we're missing input from vendors and public archives aren't we?

Derek from PacBio (author of bamtools and the most significant contributor to pbbam) is largely neutral. Generally, vendors and archives rely on public standards and implementations. Their interest is to make their tools/data seamlessly interchangeable with the rest of the world. It is good to hear from them, but this is not critical to the discussion in my opinion.

@pezmaster31

Not speaking officially for PacBio, of course. But yes, you nailed it, Heng.

Our interest is mostly that the issue is solved and standardized. Something we can code to and ship data/software out with minimal concern. Unless/until I hear differently from upstairs, the specific implementation details are less critical.

@jkbonfield
Contributor Author

In preparation for the Thursday conference call I've produced a summary of the two main proposals for a solution to this problem. It's long and has had a round of review, so it's a Google doc.

https://docs.google.com/document/d/1PzH9k5umvraI_ounpcKcUdawj7Ufs-N-jul77xDs1dI

The aim of writing this document is primarily to list the solutions and their associated risks so that we can discuss our preferred implementation on Thursday from a basis of knowledge rather than gut feelings or unsupported conjecture.

Assessing risk likelihood and severity is of course subjective, although I've tried to be neutral in this, but irrespective of that, just listing the potential problems and ways of mitigating them is important so we can make our implementation as robust as possible. Also note that most likelihoods are conditional - conditional on not updating software and obviously conditional on being one of the affected people with extra-long cigars present in their data.

@lh3
Member

lh3 commented Oct 31, 2017

I have commented on an earlier version of this document. Some of my concerns have been addressed. Here are some different views on estimating likelihood and severity as I can't comment on the document any more.

The expected number of ultra-long read users who actually make a mistake by using older software will be 10s at most. In the next year or two, the expected number of users who may see BAM2, even if unnecessarily, is in the 100s at least, likely 1000s - a couple of orders of magnitude higher.

I strongly disagree with the severity ratings because many of them can't be measured and are too subjective. You can consider the following scenarios:

  1. In the worst case, full clipping leads to 1% of randomly missing bases. This reduces variant sensitivity by ~0.01% given that you need multiple occurrences to call a variant.
  2. Using an old version of a mapper has a broadly similar effect: sometimes lower (e.g. no effect with rare bug fixes), sometimes higher (e.g. recent minimap2 improves direct RNA junction accuracy and sensitivity by ~5%).
  3. Changing a reference genome (e.g. with hs37d5/unplaced contigs or not), replacing a mapper, applying quality recalibration or not all affect the final variant accuracy at the scale of ~1%.
  4. It is known that changing short-read RNA-seq mappers has a more dramatic effect.

You can objectively compare the severity of these scenarios, at least the first 3; scenario 1 is the lowest. Now consider what may happen when you break the BAM ecosystem:

  5. Your pipeline is broken because some tools don't read BAM2 and won't receive updates.
  6. In many cases, you can mitigate the problem above by converting BAM2 to BAM1 first, which takes several hours for a human data set.
  7. In other cases, you have to wait for several months or years until the tools catch up with BAM2. In particular, ultra-long read users need to wait.
  8. We ask tool developers to upgrade multiple times: first with long-cigar support and last with full-fledged BAM. Tools will stop working with each other during this process.

How do you compare a broken pipeline vs changing a mapper? How do you weigh waiting several hours to downgrade BAM2 vs using an old version of software? The severity of points 5–8 is highly subjective and varies a lot between users. To me, the severity of 1 is the lowest and 2–4 is lower than 5–8, but the document clearly thinks differently.

dkj added a commit to dkj/hts-specs that referenced this issue Nov 2, 2017
dkj added a commit to dkj/hts-specs that referenced this issue Nov 2, 2017
@jkbonfield
Contributor Author

A quick summary for the benefit of those not on the conference call:

We discussed this at length on the GA4GH conference call and I'm pleased to say, 4 years after the initial report(!), we came to a unanimous agreement between the 3 SAM spec maintainers (albeit definitely not unanimous within the audience).

Our solution is to go ahead with CG tag (#227). While the concept has now been accepted, we still need to review the text in the specification itself, and some code PRs need updating to incorporate the "N" cigar idea. Hopefully this will not be a lengthy process.

(Keeping this issue open until spec PR #227 is merged.)

@nh13
Member

nh13 commented Nov 2, 2017

@jkbonfield who are the current three spec maintainers? Kudos to them for a unanimous decision.

@jmarshall
Member

@nh13: See MAINTAINERS.md.

lh3 added a commit to lh3/hts-specs that referenced this issue Nov 18, 2017
This commit addresses samtools#40. It added optional tag `CG` and
explained the workaround to store alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.
@jkbonfield
Contributor Author

Resolved in #227 some time ago.

@yfarjoun
Contributor

yfarjoun commented May 7, 2018

htsjdk supports this now samtools/htsjdk@a00b958
