htsjdk.samtools.SAMException: Unexpected number of metadata chunks 3 #643

Open
travc opened this issue Jun 17, 2016 · 4 comments

Comments

@travc
Contributor

travc commented Jun 17, 2016

I'm getting an exception thrown by multiple tools that use htsjdk (IGV and picard at least). GATK is also choking on the bam file, but its errors don't get logged as well in my workflow.

Example: running IGV 2.3.63 on a bam file generated with bwa mem 0.7.5a-r405 from paired-end reads... when I zoom in far enough to show alignments it throws:

Error encountered querying alignments: htsjdk.samtools.SAMException: Unexpected number of metadata chunks 4

I've also seen "Unexpected number of metadata chunks 3" on different bam files (after merging with sambamba).

Various recent versions of htslib seem to be involved.
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
OS: Ubuntu 14.04 LTS (with 128GB of RAM)

I'm thinking that the problem might have to do with the fact that one of the scaffolds in my reference is 552137040 bp long. 8*552137040 > 2^32, but I'm not sure why that would be an issue. Maybe something to do with indexing, but I'm just guessing.

There is nothing obvious in logs. I'm currently working on chopping up that big scaffold in the reference to see if the issues persist, but I suspect they will. Essentially the same workflow has run fine with a different reference.

@yfarjoun
Contributor

I'm pretty sure there's a known issue with reference sequences that are longer than INT_MAX_VALUE, so that is probably the problem you are seeing. I agree that the logging could be better, but I doubt this will be fixed soon.

@travc
Contributor Author

travc commented Jun 18, 2016

I think I've found at least part of the problem actually...
According to the SAM reference docs:
https://samtools.github.io/hts-specs/SAMv1.pdf

In the BAI format, each bin may span 2^29, 2^26, 2^23, 2^20, 2^17 or 2^14 bp. Bin 0 spans a 512Mbp region, bins 1–8 span 64Mbp, 9–72 8Mbp, 73–584 1Mbp, 585–4680 128Kbp, and bins 4681–37448 span 16Kbp regions.
This implies that this index format does not support reference chromosome sequences longer than 2^29 − 1.
The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as are supported by SAM and BAM.
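
To make the binning limit concrete, here is a minimal Python sketch of the `reg2bin` calculation given in the SAM specification (ported from the spec's C code, not taken from htsjdk). The constant names `MAX_BAI_BIN` and `SCAFFOLD_LEN` are made up for this illustration; the scaffold length is the one reported earlier in this thread.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest BAI bin containing the 0-based, half-open interval [beg, end).

    Port of the reg2bin C function from the SAM specification.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16Kbp bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128Kbp bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1Mbp bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8Mbp bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64Mbp bins start at 1
    return 0                                       # the single 512Mbp bin


MAX_BAI_BIN = 37448        # last 16Kbp bin listed in the spec (illustrative name)
SCAFFOLD_LEN = 552137040   # length of the long scaffold from this thread

# Last base that still fits under the 2^29 limit.
last_ok = reg2bin((1 << 29) - 1, 1 << 29)
# Last base of the long scaffold.
overflow = reg2bin(SCAFFOLD_LEN - 1, SCAFFOLD_LEN)

print(last_ok, last_ok <= MAX_BAI_BIN)    # 37448 True
print(overflow, overflow <= MAX_BAI_BIN)  # 38380 False
```

A read at the far end of the scaffold is assigned a bin number past the last bin the BAI layout defines, which is consistent with the spec's statement that BAI cannot index sequences longer than 2^29 − 1.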

Looks like I need to use CSI indexes.

Note that the limit isn't quite INT_MAX_VALUE, it is (2^29)-1

That was throwing me off since I'm well below (2^32)-1.
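
The thresholds mentioned in this thread can be compared directly. This is just an arithmetic check in Python; the variable names are made up for the illustration.

```python
SCAFFOLD_LEN = 552137040   # longest scaffold in the reference (from this thread)

BAI_MAX = 2**29 - 1        # longest sequence a BAI index supports, per the SAM spec
INT32_MAX = 2**31 - 1      # largest signed 32-bit integer
UINT32_MAX = 2**32 - 1     # largest unsigned 32-bit integer

print(SCAFFOLD_LEN > BAI_MAX)      # True: too long for BAI
print(SCAFFOLD_LEN < INT32_MAX)    # True: still well inside a signed 32-bit int
print(SCAFFOLD_LEN < UINT32_MAX)   # True
print(SCAFFOLD_LEN - BAI_MAX)      # 15266129 bases past the BAI limit
```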
I was also a bit confused by "references" and "reference sequences"... I thought that referred to the total length, not the length per contig. Obvious in retrospect, of course. The SAM docs use the term "reference chromosome sequences", which is unambiguous.

@peterjc

peterjc commented Mar 10, 2017

Clearly the error handling could be improved when attempting to view a reference/chromosome sequence over 512Mbp using the BAI index.

See also #447 for supporting CSI indexing, needed if you have a reference/chromosome longer than 512Mbp.

@lbergelson
Member

We should fix that error message so that it's not confusing and useless. I opened a more explicit ticket: #823
