Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI index support for BAM files #1040

Merged
merged 16 commits into from
Jan 16, 2019
Merged

Conversation

valeriuo
Copy link
Contributor

Description

  1. Adds the ability to read and use CSI indexes for BAM files (no write support, yet). BAI indexes still have priority. Thus, if both .bai and .csi files are present alongside a BAM file, the BAI index will be used.
  2. Includes Bugfix for BAM index bins #1024.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link

codecov-io commented Nov 22, 2017

Codecov Report

Merging #1040 into master will increase coverage by 0.603%.
The diff coverage is 72.089%.

@@               Coverage Diff               @@
##              master     #1040       +/-   ##
===============================================
+ Coverage     68.336%   68.939%   +0.603%     
- Complexity      8001      8208      +207     
===============================================
  Files            541       546        +5     
  Lines          32681     32971      +290     
  Branches        5536      5558       +22     
===============================================
+ Hits           22333     22730      +397     
+ Misses          8122      7987      -135     
- Partials        2226      2254       +28
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/BamFileIoUtils.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...rc/main/java/htsjdk/samtools/BAMIndexMetaData.java 73.832% <0%> (+19.546%) 22 <0> (+5) ⬆️
...ain/java/htsjdk/samtools/SAMFileWriterFactory.java 64.189% <100%> (+2.027%) 48 <0> (-2) ⬇️
src/main/java/htsjdk/samtools/SamReader.java 82.558% <100%> (+1.308%) 0 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/BinWithOffset.java 100% <100%> (ø) 2 <2> (?)
src/main/java/htsjdk/samtools/BAMFileWriter.java 65.625% <100%> (ø) 16 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/CRAMFileReader.java 74.5% <100%> (ø) 53 <0> (ø) ⬇️
...rc/main/java/htsjdk/samtools/BAMFileConstants.java 80% <100%> (+13.333%) 1 <0> (ø) ⬇️
...n/java/htsjdk/samtools/IndexFileBufferFactory.java 33.333% <33.333%> (ø) 1 <1> (?)
src/main/java/htsjdk/samtools/SamIndexes.java 73.077% <42.857%> (-6.923%) 13 <0> (+1)
... and 157 more

@lbergelson
Copy link
Member

@valeriuo I'm sorry we haven't gotten to this yet. I've been very busy and this is obviously a complicated PR. It's a really valuable PR though, lots of people want CSI support, so we will definitely get around to reviewing it when someone has bandwidth.

@valeriuo
Copy link
Contributor Author

@lbergelson I realize there is a lot of code in here. Some of it is duplicated from AbstractBAMFileIndex class, which I couldn't extend, although that might have seemed like the natural approach.
We could arrange a call, if you need me to present the solution and talk about potential improvements.

@yfarjoun
Copy link
Contributor

yfarjoun commented Jun 6, 2018

@lbergelson could this be prioritized, there's a picard PR waiting on it. broadinstitute/picard#998

@lbergelson
Copy link
Member

@yfarjoun Yes, I was planning on prioritizing this as the next thing after going through a bunch of small stuff that has been in flight for a while.

@lbergelson
Copy link
Member

@valeriuo I've begun to take a look at this. Sorry for the extremely long delay. Are you still around and interested in helping get this in?

@valeriuo
Copy link
Contributor Author

@lbergelson, definitely.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo I'm still looking through this, I just wanted to post what I'd gotten through so you can see what I'm thinking. Almost all of my comments are structural / code style things rather than anything broken about the implementation. It looks pretty good to me, but I'm not super familiar with the indexes, so I may have to come back to you with questions in a bit.

I'm increasingly annoyed by the BAMIndex class hierarchy. It seems like we should be able to implement this as a subclass of AbstractBamIndex, and the fact that it isn't obvious how to do that is problematic. I think we'll need to consider moving some things around.

I haven't taken a serious look at the tests yet, but I suspect I'm going to want to add additional test cases. It looks like you have good unit test coverage, and BamIndexValidator might be checking everything I want, but as a heads up I might want more tests that open bam readers on various combinations of indexes, particularly ones accessible only through nio.Path APIs.

* @param endPos 1-based end of target region, inclusive.
* @return bit set for each bin that may contain SAMRecords in the target region.
*/
protected BitSet regionToBins(final int startPos, final int endPos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of deleting this, please deprecate it and have it call through to the preferred one.

import java.nio.channels.FileChannel;
import java.util.*;

public class BAMCSIFileIndex implements BrowseableBAMIndex {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this to CSIIndex. Although I guess that's a bit like ATMMachine...

@Override
public int getLevelForBin(Bin bin) {
if(bin == null || bin.getBinNumber() > getMaxBins()) {
throw new SAMException("Tried to get level for invalid bin.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include the invalid value in the error message.

@Override
public int getFirstLocusInBin(Bin bin) {
if(bin == null || bin.getBinNumber() > getMaxBins()) {
throw new SAMException("Tried to get first locus for invalid bin.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@Override
public void close() {}

private void verifyBAMMagicNumber(final String sourceName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be renamed to verifyCSIMagicNumer

* Constructors
*/

private BAMCSIFileIndex(final SeekableStream stream, final SAMSequenceDictionary dictionary)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the constructors should be refactored. I would unify them by a private constructor
private BAMCSIFileIndex(IndexFileBuffer buffer, String Source, SAMSequenceDictionary dictionary) and then have the other constructers call though to that. In the end we want r 3 public constructors.

One that takes SeekableStream as the input, one for Path, and one for File.

Copy link
Contributor Author

@valeriuo valeriuo Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, Path would be memory mapped and File would be randomly accessed. Is this what you had in mind?

* Statistics include count of aligned and unaligned reads for each reference sequence
* and a count of all records with no start coordinate
*/
static public BAMIndexMetaData[] getCSIIndexStats(final BAMFileReader bam) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index hierarchy is such a mess. I think we should consider restructuring it to some degree to make this method unnecessary.

metaDataSeen = true;
continue; // don't create a Bin
} else {
skipBytes(16 * nChunks);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets extract a named constant here

return mIndexBuffer.position();
}

private abstract static class IndexFileBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should not duplicate this code. Lets pull out this and the implementers as package protected classes instead. While we're at it we can make this an interface and have it extend Closeable.

@@ -34,6 +34,7 @@
public interface BAMIndex extends Closeable {

public static final String BAMIndexSuffix = ".bai";
public static final String BAMIndexSuffix2 = ".csi";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It bothers me that these don't follow constant naming convetion. I would rename this to CSI_INDEX_SUFFIX , deprecate the existing BAMIndexSuffix and add a new BAM_INDEX_SUFFIX

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BAI_INDEX_SUFFIX would probably be more appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that's better.

…elNumber is too big.

2. Fixes a bug: the level size for the last level was miscalculated. It should be 32768 (2^15).
3. Removes the unused method regionToBins, whose code was already implemented in the class GenomicIndexUtil.
Change `BAMIndexSuffix2` to `CSI_INDEX_SUFFIX`.
Deprecate `BAMIndexSuffix` and use `BAI_INDEX_SUFFIX` instead.
Make setters of `CSIIndex` private.
…vate interface `IndexFileBuffer` and classes `IndexFileBuffer`, `MemoryMappedFileBuffer` and `RandomAccessFileBuffer`.

Refactor the code to use the new classes.
… `AbstractBAMFileIndex`.

Fix a few index arithmetic bugs.
@valeriuo
Copy link
Contributor Author

valeriuo commented Jun 21, 2018

@lbergelson I've addressed all the topics mentioned in your review. I didn't add any unit tests, but I wanted to push the latest code version, because I am going on a short holiday until next week and I thought you could have another look over it in the meantime.

I've managed to refactor the BAM index class hierarchy by opening slightly the AbstractBAMFileIndex class, without exposing anything to the public interface or braking its contract, in order to allow CSIIndex to extend it, thus removing some duplicate code.

I've tested the CSI index functionality by running Picard BamIndexStats command against some indexes generated with samtools. The results were OK, but the inconvenient is that samtools generates compressed CSI files, by default, and I had to decompress them manually so that Picard could read them. For that matter, BAI files are uncompressed. I don't know if Picard can read compressed index files, but we'll have to address this limitation somehow, as I can see some usability issue.

Also, I've rebased the PR onto master.

@lbergelson
Copy link
Member

@valeriuo Oh, that's awesome! I will take a look. That's interesting about the compression. It doesn't seem to mention it anywhere in the spec. I assume they're just gzip? Or are they bgzip? We should be able to add detection of the compression and automatically decompress. There was just recently some code added to make that sort of detection easier as well.

@jkbonfield
Copy link

Htslib writes CSI using bgzf (but I doubt it matters whether it is this or pure gzip as we have no random access capability to our indices anyway). You are correct in that it is not documented. I note that Tabix does document this, as bgzf, so I assume this is just an omission in the CSI documentation.

Note that what CSI does to BAI - allowing for longer chromosomes - needs to be applied to tabix TBI also. There is no point having long chromosome support in BAM if we then cannot analyse it and handle bgzipped VCF. Bcftools does this via an intermediate CSI/TBI hybrid, which is sadly also undocumented! It uses a tabix header as the meta-data for a CSI file, which means it doesn't have to have every #contig line present in the index and reference numbering is based on presence in the file rather than presence in the VCF header. I believe this is an attempt to handle the case where the VCF header does not exist (#contig lines are optional), but I can't really be sure. Maybe it's simply an accident of sharing code paths!

Note, I've just added CSI and BAI support for bgzipped SAM in Htslib, mainly for completeness but also as a safe workaround for when BAM just doesn't cut it (eg a chromosome more than 2Gb long). Does htsjdk support this? I haven't got TBI support for it, but arguably it's not helpful anyway given it also has the same size limit.

@jrobinso
Copy link
Contributor

jrobinso commented Nov 1, 2018

Ping, @lbergelson et al I wonder what the status of CSI support is? Just asking so I know how to respond to IGV tickets, not requesting anything.

@valeriuo
Copy link
Contributor Author

valeriuo commented Nov 2, 2018

I don't have anything else to add to the PR, at this point. I did some basic tests with Picard and they looked OK.

@lbergelson
Copy link
Member

It's been stalled forever waiting on me. I need to do a review but I never find the time.

@jrobinso
Copy link
Contributor

jrobinso commented Nov 2, 2018

@lbergelson I understand believe me. If I can help let me know, again no pressure I just need to respond to a ticket on my side.

@lbergelson
Copy link
Member

I would like to get this into the next release, but I've been saying that to myself for every release...

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo I started doing what should hopefully be the final review pass on this. It's looking really good. It's great that you figured out how to merge it with the existing index class hierarchy. I still haven't looked over quite everything yet, I just wanted to checkpoint what I've done so far though. Thank you for your overtaxed patience.

@@ -39,7 +39,18 @@

static final byte[] BAM_MAGIC = "BAM\1".getBytes();
/**
* BAM index file magic number.
* BAM index file magic numbers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* BAM index file magic numbers.
* BAM index file magic numbers.
* @deprecated prefer {@link BAI_INDEX_MAGIC}

@@ -366,27 +366,15 @@ protected int getMaxAddressibleGenomicLocation() {
}

/**
* @deprecated Use {@link GenomicIndexUtil#regionToBins(int, int)} instead.
*
* Get candidate bins for the specified region
* @param startPos 1-based start of target region, inclusive.
* @param endPos 1-based end of target region, inclusive.
* @return bit set for each bin that may contain SAMRecords in the target region.
*/
protected BitSet regionToBins(final int startPos, final int endPos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected BitSet regionToBins(final int startPos, final int endPos) {
@Deprecated
protected BitSet regionToBins(final int startPos, final int endPos) {

Don't forget the actual deprecation tag.


verifyBAMMagicNumber(stream.getSource());
int [] sequenceIndexes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int [] sequenceIndexes;
int[] sequenceIndexes;


verifyBAMMagicNumber(file.getName());
private AbstractBAMFileIndex(final IndexFileBuffer indexFileBuffer, final String source, final SAMSequenceDictionary dictionary) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: revisit this comment before commiting
There's something funny about the constructor heirachy. It feels like verifyBAMMagicNumber should probably be made abstract and called always.

else
return GenomicIndexUtil.LEVEL_STARTS[levelNumber+1]-GenomicIndexUtil.LEVEL_STARTS[levelNumber];
if (levelNumber >= getNumIndexLevels()) {
throw new SAMException("Level number is too big (" + levelNumber + ").");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is duplicated it might be worth extracting a method assertLevelIsValid method.


@Override
public int position() {
return (int)mCompressedStream.getPosition();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably check for > maxInt and throw in that case. Unlikely with an index file... but you never know...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on this. BGZF compressed streams use virtual offsets, which are 64 bit by construction and usually exceed Integer.MAX_VALUE. I will change position return type and seek argument type to long.

@Override
public void readBytes(final byte[] bytes) {
try {
mCompressedStream.read(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a subtle bug here. mCompressedStream.read(bytes) makes no guarantee that it will read as many bytes as you ask for. You have to check the return value of read and continue reading until you either have read as many bytes as you require, or it returns -1 in which case you should probably throw.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need something like readFully or readBytes from BinaryCodec here I think.

import java.io.IOException;


class CompressedIndexFileBuffer implements IndexFileBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs class level javadoc


class CompressedIndexFileBuffer implements IndexFileBuffer {

private BlockCompressedInputStream mCompressedStream;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this final


import java.io.Closeable;

interface IndexFileBuffer extends Closeable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably get some javadoc.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of comments. Sorry there are so many. It's a lot of complex code.

BAMIndexMetaData[] data = getIndexStats(bam);

BAMIndexMetaData[] data = null;
if (bam.getIndexType().equals(SamIndexes.BAI) || bam.getIndexType().equals(SamIndexes.CSI)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check necessary or is it effectively handled by the null -> throw check below?

@@ -43,49 +43,73 @@ public static int exhaustivelyTestIndex(final SamReader reader) { // throws Exce
// look at all chunk offsets in a linear index to make sure they are valid

if (reader.indexing().hasBrowseableIndex()) {
if (SamIndexes.BAI.fileNameSuffix.endsWith(reader.type().indexExtension())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole file could be simplified a lot by extracting methods and rearranging things. If you don't want to do that as part of this pr we can file an issue and add this as is.

* @param levelNumber Level for which to compute the size.
* @return
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra new line

Suggested change

* Extends the functionality of {@link AbstractBAMFileIndex#getFirstBinInLevel(int)} ,
* which cannot be overridden due to its static nature.
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Suggested change

* which cannot be overridden due to its static nature.
*/

public int getFirstBinOnLevel (final int levelNumber) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might make sense to just pull this implementation up to the base class. You would have to pull up getBinDepth as well. I guess there might be some performance concerns but I doubt this is any slower than the nested ifs and array lookups that it uses.

If you think that's not a good solution, this needs to be renamed to something more clearly different. Maybe getFirstBinInLevelForCSI

seek(BAMFileConstants.CSI_MINSHIFT_OFFSET);
}
setMinShift(readInteger()); // min_shift
setBinDepth(readInteger() + 1); // depth - HTSlib doesn't count the first level (bin 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this comment.

} else if (indexBin == getMaxBins() + 1) {
// meta data - build the bin so that the count of bins is correct;
// but don't attach meta chunks to the bin, or normal queries will be off
for (int ci = 0; ci < nChunks; ci++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets pull out a readChunks in the superclass and use it in all 4 places this block of code is repeated.


/** Continually reads from the provided {@link SeekableStream} into the buffer until the specified number of bytes are read, or
* until the stream is exhausted, throwing a {@link RuntimeIOException}. */
private static void readFully(final SeekableStream in, final byte[] buffer, final int offset, final int length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to use BinaryCodec instead of reimplementing readFully.

@Override
public void readBytes(final byte[] bytes) {
try {
mCompressedStream.read(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need something like readFully or readBytes from BinaryCodec here I think.


@Override
public int readInteger() {
final byte[] intbuff = new byte[4];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating a new byte[] for each integer read is going to be super inefficient. It would make sense to reuse the buffer. This method suffers from the same problem with read not guaranteeing a definite number of bytes.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a test that actually opens a bam reader and exercises the various query methods with a file that only has a CSI index.

Fixed `CSIIndex.getStartOfLastLinearBin` method.
Fixed read methods in `CompressedIndexFileBuffer` class.
Changed `seek` and `position` methods to work with `long` offsets.
Simplified the constructor hierarchy in `AbstractBAMFileIndex`.
Added more unit tests in `BAMFileReaderTest`.
Added javadoc and annotations.
@valeriuo
Copy link
Contributor Author

valeriuo commented Nov 28, 2018

I've addressed most of the change requests and also added a new test class BAMFileReaderTest, that makes use of the query methods in the BAMFileReader for comparing the results of BAI and CSI indexes.
The test harness currently fails due to some connection issues with the Broad ftp server.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo I think this is good to go. Thank you for your extreme patience and perseverance. 👍

@lbergelson lbergelson merged commit 16a4e37 into samtools:master Jan 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants