Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work-around for CIGARs with >64k operations in BAM files #560

Closed
wants to merge 15 commits into from

Conversation

lh3
Copy link
Member

@lh3 lh3 commented Jul 6, 2017

This PR implements the idea discussed in samtools/hts-specs#40.

EDIT: test data are available here.

See also samtools/hts-spec#40.
Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't cloned, compiled and checked it, but have a few comments from manually reviewing the code.

sam.c Outdated
if ((CG = bam_aux_get(b, "CG")) == 0) return;
if (CG[0] != 'B' || CG[1] != 'I') return;
cigar_st = (uint8_t*)bam_get_cigar(b) - b->data;
c->n_cigar = *(uint32_t*)(CG + 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could do with some endianness protection via ed_swap_4p(&c->n_cigar).

sam.c Outdated
b->m_data = b->l_data;
kroundup32(b->m_data);
b->data = (uint8_t*)realloc(b->data, b->m_data);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What aren't I following here. We migrate an 4-byte B array from aux tag to cigar array, but the total number of 4-byte values is the same. The only difference should be we have one less aux tag header, so we drop the CGBI<size> bytes. Memory allocation should be sufficient as it becomes smaller after migration. Are you reallocing purely for the purpose of making the shuffle easier? If so I think allocating a temporary buffer may be easier to understand, although it requires two copies.

If you want to grow and do inline, please check the return value from malloc.

Also I can't see anywhere that checks the aux data is large enough. We could have a truncated aux array where the size is larger than the aux tag permits. We ought to error rather than crash in that scenario, or worse it's a buffer overrun and potential exploit due to the memmove writing beyond the end of the buffer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which function will create a truncated aux array? This would be a bug in htslib. In addition, bam_aux_append() doesn't check aux data, either.

realloc is to reduce future heap allocations. If there were many long CIGARs, you would need to allocate a temporary buffer for swapping every time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on realloc. If we're in the usual pipeline of reusing the same bam then it'll not realloc very often so thumbs up on that idea.

Regarding truncated array, my worry is about malicious data where a malformed BAM file has been loaded. My understanding is that the BAM reader just reads block_size bytes, but the contents of the aux data may not be consistent with the decoded block_size value.

It may not actually be a buffer write overrun though as by reallocing we're growing the array first before doing the memcpy, but it may lead to exposing uninitialised data. Consider a broken aux field of CGBI<0xff 0x00 0x00 0x00>foo . It claims to be 255 ints long, so nearly 1Kb, but only contains "foo" bytes. The BAM record size itself could be correct, ending at the "foo" point, but the actual decoding of the BAM record would expose an error. This code however will grow the BAM record by ~1Kb to make way for 255 cigar elements and then memcpy ~1Kb from CGBI onwards to that cigar array, which possibly includes nearly 1Kb of random contents from a previous malloc/free on the heap.

If bam_aux_append has the same issue then it needs fixing too, but on a quick glance it seems fine to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it all back. :-)

The bam_aux_get call already checks (since 95b1034 by Rob I think) the data being queried doesn't overflow the buffer and returns NULL if so, making the entire bam_tag2cigar function bail out at this point. Alarm averted! :-)

sam.c Outdated
@@ -421,6 +448,7 @@ int bam_read1(BGZF *fp, bam1_t *b)
bgzf_read(fp, b->data + c->l_qname, b->l_data - c->l_qname) != b->l_data - c->l_qname)
return -4;
if (fp->is_be) swap_data(c, b->l_data, b->data, 0);
bam_tag2cigar(b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the memory allocation, please make bam_tag2cigar return int and check the value here.

sam.c Outdated
cigar_en = cigar_st + c->n_cigar * 4;
if (ok) ok = (bgzf_write(fp, &b->data[cigar_en], b->l_data - cigar_en) >= 0); // write data after CIGAR
memcpy(&x[0], "CGBI", 4);
x[1] = c->n_cigar;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needs byte swapping code.

sam.c Outdated
@@ -1143,6 +1179,7 @@ int sam_parse1(kstring_t *s, bam_hdr_t *h, bam1_t *b)
} else _parse_err_param(1, "unrecognized type %c", type);
}
b->data = (uint8_t*)str.s; b->l_data = str.l; b->m_data = str.m;
bam_tag2cigar(b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add check for return value.

@jkbonfield
Copy link
Contributor

I see you've already noticed the endianness, great! Niggles aside, the basic implementation looks like the right approach to me and the sensible place to do it. I like the idea of calling bam_tag2cigar when parsing SAM too incase it's been processed by an another BAM reader or older code. I wonder if we need this logic applying when decoding CRAM too for similar reasons (BAM->CRAM with outdated software).

Also, we need an ancillary fix for the spec too to reserve the new tag type (even if it's just BAM only, we have to prevent us from reusing that tag).

@lh3
Copy link
Member Author

lh3 commented Jul 6, 2017

Also, we need an ancillary fix for the spec too to reserve the new tag type (even if it's just BAM only, we have to prevent us from reusing that tag).

Yes, hts-specs needs to be changed first. I am posting this PR just to show how the idea in samtools/hts-specs#40 works in practice. It does not require much code change.

@lh3 lh3 changed the title Support CIGARs with >64k operations Work-around for supporting CIGARs with >64k operations in BAM files Jul 6, 2017
@lh3 lh3 changed the title Work-around for supporting CIGARs with >64k operations in BAM files Work-around for CIGARs with >64k operations in BAM files Jul 6, 2017
@sjackman
Copy link

sjackman commented Jul 9, 2017

Commented at samtools/hts-specs#40 (comment)

lh3 added a commit to lh3/hts-specs that referenced this pull request Jul 14, 2017
This commit addresses samtools#40. It added optional tag `CG` and
explained the workaround to store alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.
because older tools may put a wrong bin in BAM. Although htslib does not care,
3rd-party tools may have problems.
@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

Updated to write fake cigar like <readLength>S<refLength>N if there are >65535 CIGAR operators. The reading code is more general. It works as long as the first CIGAR operation is <readLength>S.

sam.c Outdated
if (ori_len > CG_en) // move data after the CG tag
memmove(b->data + CG_st + n_cigar4 - fake_bytes, b->data + CG_en + n_cigar4 - fake_bytes, ori_len - CG_en);
b->l_data -= n_cigar4 + 8; // 8: CGBI (4 bytes) and CGBI length (4)
b->core.bin = hts_reg2bin(b->core.pos, b->core.pos + bam_cigar2rlen(b->core.n_cigar, bam_get_cigar(b)), 14, 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the main decoding loop has a bam_cigar2qlen check always performed, it makes sense to combine these into a (hypothetical) bam_cigar2rqlens(b->core.n_cigar, bam_get_cigar(b), &qlen, &rlen) call that only scans the cigar string once. Summing both lengths is minimal difference in CPU time than just summing one, so we can exploit this to do some better error checking.

This can be done in the main loop, rather than bam_tag2cigar, and we may as well validate (or just set if ncigar large) reg2bin in all cases as computing bin is a cheap operation compared to scanning cigars.

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

One remaining question: it would be good to call bam_tag2cigar() immediately after reading a CRAM record. Where should we add this call? I am thinking to add to cram_get_bam_seq() in cram_decode.c. Is that a good place?

@jkbonfield
Copy link
Contributor

I don't think it should be anywhere in SAM or CRAM reading code as this is purely a BAM workaround.

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

When older tools converting long-cigar BAM to SAM or CRAM, they will leave the CG tag in SAM/CRAM unintentionally. Calling bam_tag2cigar() allows the latest samtools to work with such SAM/CRAM correctly.

@jkbonfield
Copy link
Contributor

Fair enough, but when doing so we should emit a warning (once only ideally), as it is an explicit indication that there is an out-dated part of your pipeline which you need to be aware of.

As for where the call should go, I'd say just after the call to cram_get_bam_seq as then the function can be static and doesn't leak out of sam.c.

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

It is easy to print a warning for every offending read. Does htslib have a mechanism to output one type of warning only once? Or I will just call the hts_log_warning() macro?

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

As htslib does not have a mechanism to print a type of warning only once, I will not invent one with this PR. The latest change gives a warning for each offending record. There are few such reads, so the repeated message to stderr won't be that bad. When long-cigar reads become common and this message becomes annoying, the users should really update their other tools anyway.

Test data can be found here. There is also a precompiled samtools-1.6 plus a backport of this PR.

Any more concerns?

@sjackman
Copy link

sjackman commented Nov 3, 2017

May I suggest…

if (give_warning) {
	static int warned;
	if (!warned)
		hts_log_error("%s encodes a CIGAR with %d operators at the CG tag", bam_get_qname(b), c->n_cigar);
	warned = 1;
}

@sjackman
Copy link

sjackman commented Nov 3, 2017

Checking and setting warned ought to use an atomic primitive if available. The result of the race condition is simply to print the error message more than once, so not so terrible.

@lh3
Copy link
Member Author

lh3 commented Nov 3, 2017

I thought about that. In theory, it is not the right thing to do with a library, but if @jkbonfield is ok with this solution, I can live with that.

@sjackman
Copy link

sjackman commented Nov 3, 2017

The biggest downside to the above approach is that it will be printed only once per run of the program, rather than once per SAM file, which would be more intuitive.

@sjackman
Copy link

sjackman commented Nov 3, 2017

For that reason, perhaps one warning per offending read is better.

@lh3
Copy link
Member Author

lh3 commented Nov 6, 2017

If there are no more concerns, could someone review this PR and get it merged? Note that Travis reported an error, but it seems unrelated to my changes. The error is:

Error: Fetching /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core failed!
...
The command "if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew update    ; fi" failed and exited with 1 during .

It is a homebrew error.

@jmarshall
Copy link
Member

It would be preferable for the merging of implementation PRs to hold off until the specification PR samtools/hts-specs#227 is merged or closer to merging. While there is agreement on the general approach, IMHO as a spec maintainer there are details still remaining to be worked out.

@lh3
Copy link
Member Author

lh3 commented Nov 6, 2017

It seems to me there are no pending issues with samtools/hts-specs#227 so far – CG:Y has been pushed back, which is not my preference anyway. Could you comment on the other PR about specific concerns?

@jkbonfield
Copy link
Contributor

Agreed on Travis error - it's just one of the many MacOS X glitches it has.

Regarding merging, I took a quick look at the code and commented a bit but haven't yet stress tested it.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 9, 2017

As I feared, it does indeed have potential security issues when given deliberately broken data. My nasty test has a longer (actual) CIGAR string than CG:B,I tag. The expectation is that we're replacing a short bit of memory by a large bit of memory and the memmoves are written accordingly, but if we reverse this assumption it starts to shift outside its memory buffers.

There are also issues with signed vs unsigned types. m_data is unsigned while l_data is signed. I'm not sure why this is so, but the check for reallocating is wrong.

(gdb) n
389	    b->l_data += n_cigar4 - fake_bytes; // we need c->n_cigar-fake_bytes bytes to swap CIGAR to the right place
(gdb) p n_cigar4
$4 = 4
(gdb) p fake_bytes
$5 = 408
(gdb) n
390	    if (b->m_data < b->l_data) {
(gdb) ptype b->m_data
type = unsigned int
(gdb) ptype b->l_data
type = int

My example file was produced from this one liner:

perl -e 'print "\@SQ\tSN:a\tLN:100\nfoo\t16\ta\t1\t1\t1S1N", "1N" x 100, "\t*\t0\t0\tN\t*\tCG:B,I,0\n"' > a.sam

I think we need to fuzz test this basically, with both BAM and SAM inputs so we can check binary and textual parsing of cigar and CG tags. I don't mind if it simply gives nonsensical output from nonsensical input (eg the CG:B,I tag makes no sense and uses illegal cigar opcodes), but it should never be permitted to crash and definitely not to overrun any buffers.

The basic principle of how it works however is fine. We just need a few more input validation checks and detection of integer overflows.

Edit: I forgot to say that n_cigar4 and fake_bytes are both unsigned, so n_cigar4 - fake_bytes is 4 - 408 which wraps around to become approx 4 billion.

@lh3
Copy link
Member Author

lh3 commented Nov 9, 2017

We can change

if (CG_len == 0) return 0;

to

if (CG_len < c->n_cigar) return 0;

to prevent this crash. If you are ok with this, I can push another commit.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 9, 2017

It cuts out one of the issues. I'm still concerned over the realloc and unsigned vs signed variables though.

I think maybe if we really push the boat out we can get a 2Gb CG tag, wrap around b->l_data, make the if (b->m_data < b->l_data) check succeed and avoid the realloc, and then do a memmove outside of the allocated memory.

It's extreme, but I think it still exists as a possibility unless there is something else that forbids a 2Gb aux tag. It could perhaps be guarded against simply by adding a maximum length of CG too to prevent this scenario being possible. Eg:

if (CG_len < c->n_cigar || CG_len >= (1u<<31)) return 0;

Edit, maybe +8 for good measure. I'd need to think through the code logic some more!

@jkbonfield
Copy link
Contributor

Actually that should have been CG_len*4, or more robust CG_len >= (1u<<29) I guess.

We also wondered about the impact of faked up CG binary data, eg where the CG tag claims to be 1Mb long, but actually isn't. There are no checks here about going outside the end of the array, however I think this cannot happen as bam_aux_get itself specifically checks for this case and returns NULL so it simply pretends that the truncated aux tag doesn't exist.

In short, you've probably nailed it, but I'll try and remember how to run afl again.

sam.c Outdated
hts_log_error("CIGAR and query sequence lengths differ for %s",
bam_get_qname(b));
return -4;
bam_tag2cigar(b, 0, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bam_tag2cigar returns an error (0 or -1), but every call to it doesn't bother to check the return value. Please fix this.

@jkbonfield
Copy link
Contributor

Thanks for the updates Heng. I can confirm that AFL was finding the problems before and no longer is. I'll leave it running over the weekend and assuming it's fine merge Monday hopefully

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 13, 2017

Rebased, squashed and merged as aea349a.

(AFL ran the entire weekend without any more problems.)

@lh3
Copy link
Member Author

lh3 commented Nov 13, 2017

Thanks a lot, @jkbonfield!

lh3 added a commit to lh3/hts-specs that referenced this pull request Nov 18, 2017
This commit addresses samtools#40. It added optional tag `CG` and
explained the workaround to store alignments with >65535 CIGAR operations in
BAM files. The proposal is implemented in samtools/htslib#560.
lh3 added a commit to lh3/hts-specs that referenced this pull request Nov 18, 2017
It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
lh3 added a commit to lh3/hts-specs that referenced this pull request Nov 18, 2017
It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
lh3 added a commit to lh3/hts-specs that referenced this pull request Nov 18, 2017
It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
lh3 added a commit to lh3/hts-specs that referenced this pull request Nov 18, 2017
It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
jkbonfield pushed a commit to samtools/hts-specs that referenced this pull request Nov 28, 2017
It works by encoding the real CIGAR at the CG tag and writing a fake
CIGAR `<readLen>S<refLen>N` as CIGAR in BAM. samtools/htslib#560 has
implemented the method and been merged.
uint32_t cigar_st, n_cigar4, CG_st, CG_en, ori_len = b->l_data, *cigar0, CG_len, fake_bytes;
uint8_t *CG;

// test where there is a real CIGAR in the CG tag to move
Copy link

@TransGirlCodes TransGirlCodes Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lh3, I'm adding support for cigars in CG tags into BioJulia/BioAlignments.jl, and am currently implementing the test for the fake cigar, which is defined as [k<<4|4,m<<4|3] in the spec, where k is l_seq and m is the length of the reference in the alignment. I don't see a field I can say obviously corresponds to m in BAM records.

In your C of bam_tag2cigar, if I understand it correctly, I see you check whether the first op is a softclip, and if it's length == l_seq: if (bam_cigar_op(cigar0[0]) != BAM_CSOFT_CLIP || bam_cigar_oplen(cigar0[0]) != c->l_qseq) return 0;
But I don't see in the C where the second cigar op of "mN" is checked, is it therefore something that is a given if kS is present then mN will be present?

Thanks,
Ben.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a field directly in the BAM structure to define 'm' as it is the number of reference bases spanned by the sequence and not query (so not l_qseq). The corresponding code for writing it is also in sam.c:

        cigar[0] = (uint32_t)c->l_qseq << 4 | BAM_CSOFT_CLIP;
        cigar[1] = (uint32_t)bam_cigar2rlen(c->n_cigar, bam_get_cigar(b)) << 4 | BAM_CREF_SKIP;

You'll see the bam_cigar2rlen function is the appropriate code for working out 'm'.

Copy link

@TransGirlCodes TransGirlCodes Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbonfield Ah I see, so if I'm properly understanding, since m is computed with bam_cigar2rlen, using the actual long cigar, then when reading in a BAM record, in bam_tag2cigar, only the first cigar op is checked for k and BAM_CSOFT_CLIP here:

// test where there is a real CIGAR in the CG tag to move
    if (c->n_cigar == 0 || c->tid < 0 || c->pos < 0) return 0;
    cigar0 = bam_get_cigar(b);
    if (bam_cigar_op(cigar0[0]) != BAM_CSOFT_CLIP || bam_cigar_oplen(cigar0[0]) != c->l_qseq) return 0;
    fake_bytes = c->n_cigar * 4;

Preciely because you can't yet compute m as the real cigar has not been read yet, in order to do so?

Copy link
Contributor

@jkbonfield jkbonfield Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the generous interpretation, but it probably boils down to not bothering to check. :-) Note it's not solely entire-soft-clip, but also the presence of the CG tag. The two combined are pretty specific, although it's not a bad idea to also validate that CG and cigar are consistent with each other.

Note the main reason for the existence of the m part of kSmN was so that tools that wanted to do coverage graphs could still use the minimal cigar string to derive the alignment end coordinate. The suggestion started with just kS and so the htslib implementation probably shows some of that history in how it is structured.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants