
Add support for reading and writing splitting BAM index files. #1138

Merged: 7 commits into samtools:master on Dec 11, 2018

Conversation

tomwhite
Contributor

@tomwhite tomwhite commented Jun 4, 2018

Move the logic for reading and writing splitting-bai files from Hadoop-BAM into htsjdk, since these files are useful for distributed processing in general (and perhaps for other tools). The format here is different from (and incompatible with) the one in Hadoop-BAM: it adds a header with a magic number (for file discovery and versioning) as well as a field for granularity. Also, the offsets are written in little-endian format for consistency with the rest of the BAM spec.
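To make the layout described above concrete, here is a minimal self-contained sketch (plain Java, no htsjdk dependency) of writing a header plus little-endian offsets. The magic bytes and the exact field set are illustrative assumptions for this sketch; the real SBI header defined in the spec discussion carries additional fields.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: a reduced header (magic + granularity + offset count)
// followed by little-endian virtual offsets. The normative SBI header has more
// fields; this just shows the byte-order idea from the PR description.
public class SbiSketch {
    // Assumed magic value for illustration, not necessarily the normative one.
    static final byte[] MAGIC = {'S', 'B', 'I', 1};

    public static byte[] write(long granularity, long[] offsets) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 8 * offsets.length)
                .order(ByteOrder.LITTLE_ENDIAN); // BAM and SBI are little-endian
        buf.put(MAGIC);
        buf.putLong(granularity);
        buf.putLong(offsets.length);
        for (long offset : offsets) {
            buf.putLong(offset);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] bytes = write(4096, new long[]{0L, 655360L});
        System.out.println(bytes.length); // 4 + 8 + 8 + 2*8 = 36
    }
}
```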

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@tomwhite
Contributor Author

tomwhite commented Jun 4, 2018

See #1112

@codecov-io

codecov-io commented Jun 4, 2018

Codecov Report

Merging #1138 into master will increase coverage by 0.108%.
The diff coverage is 63.83%.

@@              Coverage Diff               @@
##             master     #1138       +/-   ##
==============================================
+ Coverage     68.38%   68.488%   +0.108%     
- Complexity     8013      8063       +50     
==============================================
  Files           541       545        +4     
  Lines         32717     32902      +185     
  Branches       5531      5554       +23     
==============================================
+ Hits          22372     22534      +162     
- Misses         8134      8144       +10     
- Partials       2211      2224       +13
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/BAMFileReader.java 64.306% <100%> (+0.877%) 41 <2> (+3) ⬆️
src/main/java/htsjdk/samtools/SAMUtils.java 59.148% <33.333%> (-0.196%) 126 <1> (+1)
src/main/java/htsjdk/samtools/SBIIndex.java 59.091% <59.091%> (ø) 16 <16> (?)
src/main/java/htsjdk/samtools/SBIIndexWriter.java 69.231% <69.231%> (ø) 8 <8> (?)
src/main/java/htsjdk/samtools/BAMSBIIndexer.java 75% <75%> (ø) 1 <1> (?)
.../variant/variantcontext/StructuralVariantType.java 80% <0%> (-20%) 2% <0%> (+1%)
...samtools/util/AsyncBlockCompressedInputStream.java 72% <0%> (-4%) 12% <0%> (-1%)
src/main/java/htsjdk/samtools/SamReader.java 80% <0%> (-1.25%) 0% <0%> (ø)
...dk/samtools/seekablestream/SeekableHTTPStream.java 57.353% <0%> (-0.223%) 13% <0%> (ø)
...rc/main/java/htsjdk/samtools/util/BinaryCodec.java 69.266% <0%> (ø) 56% <0%> (+1%) ⬆️
... and 21 more

@magicDGS
Member

magicDGS commented Jun 4, 2018

@tomwhite - do you think it might be worth adding to the specs (https://github.com/samtools/hts-specs)? That would ensure that the format is well defined and not specific to HTSJDK (and would allow integration with frameworks in other languages).

@yfarjoun
Contributor

yfarjoun commented Jun 4, 2018

@magicDGS This was actually just discussed in the last GA4GH meeting. I think that @lbergelson will be proposing a change to the spec to document this type of indexing scheme.

@magicDGS
Member

magicDGS commented Jun 4, 2018

Thanks @yfarjoun - nice to hear that!

@tomwhite
Contributor Author

tomwhite commented Jun 4, 2018

@magicDGS @yfarjoun absolutely. I'm in the process of drafting something. I'll work with @lbergelson and others on this.

tomwhite added a commit to tomwhite/hts-specs that referenced this pull request Jun 5, 2018

…that used in Hadoop-BAM. An htsjdk implementation can be found in samtools/htsjdk#1138.
@lbergelson
Member

We should hold off on reviewing this until we work through the issues in the spec discussion.

@tomwhite
Contributor Author

@lbergelson I think this is ready for review now as the spec discussion has converged. Also, I've successfully used the code in this PR to build SBI files (with granularity 1 - i.e. every read start position) for benchmarking count reads on BAM for speed and accuracy (see https://github.com/tomwhite/disq-benchmarks).
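Granularity controls how many reads each recorded offset spans: granularity 1 (as used in the benchmark above) records every read start. The offset-count arithmetic can be sketched in plain Java, with no htsjdk dependency; `numOffsets` here is an illustrative helper, not an htsjdk API.

```java
// Sketch of the granularity arithmetic: with granularity g, the indexer records
// one virtual offset per g reads, so an index over n reads holds ceil(n / g)
// offsets. g = 1 records an offset for every read start position.
public class GranularityDemo {
    public static long numOffsets(long numReads, long granularity) {
        return (numReads + granularity - 1) / granularity; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(numOffsets(100_000_000L, 1));    // 100000000: one per read
        System.out.println(numOffsets(100_000_000L, 4096)); // 24415
    }
}
```

This is why low granularity can produce very large offset arrays (on the order of 10^8 entries for a big BAM), which motivates the temp-file buffering strategy discussed later in this review.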

tomwhite added a commit to tomwhite/disq-original that referenced this pull request Jul 12, 2018
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 12, 2018
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Jul 26, 2018
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Aug 27, 2018
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Sep 6, 2018
tomwhite added a commit to broadinstitute/gatk that referenced this pull request Sep 10, 2018
Contributor

@yfarjoun yfarjoun left a comment

some comments

src/main/java/htsjdk/samtools/SBIIndex.java (resolved)
src/main/java/htsjdk/samtools/SBIIndex.java (resolved)
}

private static SBIIndex readIndex(final InputStream in) {
BinaryCodec binaryCodec = new BinaryCodec(in);
Contributor

final here and below

Contributor Author

Done

* @return a list of contiguous, non-overlapping, sorted chunks that cover the whole data file
* @see #getChunk(long, long)
*/
public List<Chunk> split(long splitSize) {
Contributor

finals here and below

Contributor Author

Done
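The `split(long splitSize)` contract quoted above (contiguous, non-overlapping, sorted chunks covering the whole file) can be sketched in plain Java. The real method works on BGZF virtual offsets via `Chunk`; this illustrative version uses plain byte ranges and a hypothetical `long[]{start, end}` pair instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split(long) contract: carve [0, fileLength) into contiguous,
// non-overlapping [start, end) ranges of at most splitSize bytes that together
// cover the whole file. Plain byte offsets stand in for BGZF virtual offsets.
public class SplitDemo {
    public static List<long[]> split(long fileLength, long splitSize) {
        List<long[]> chunks = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            chunks.add(new long[]{start, Math.min(start + splitSize, fileLength)});
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (long[] chunk : split(10_000, 4_096)) {
            System.out.println(chunk[0] + "-" + chunk[1]);
        }
    }
}
```

Each chunk's end equals the next chunk's start, and the last chunk is clamped to the file length, so the ranges are exactly a partition of the file.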

@tomwhite
Contributor Author

tomwhite commented Oct 2, 2018

Thanks for the reviews @lbergelson and @yfarjoun. I've addressed all the comments.

// the end, once we know the number of offsets. This is more efficient than using a List<Long> for very
// large numbers of offsets (e.g. 10^8, which is possible for low granularity), since the list resizing
// operation is slow.
this.tempOffsetsFile = File.createTempFile("offsets-", ".headerless.sbi");
Contributor

It would be nice to use Path over File in new code where possible. E.g. use Files.createTempFile() here and Files.newOutputStream() below.

Contributor Author

Thanks for the suggestion @pshapiro4broad. I've made the change in the latest version of this PR.
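The suggested swap is small: `java.nio.file.Files` factory methods in place of the `File` API. A self-contained sketch of the pattern (the file name and payload are illustrative, not the PR's actual code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the Path-based approach: Files.createTempFile and
// Files.newOutputStream instead of File.createTempFile and FileOutputStream.
public class TempOffsetsDemo {
    public static long writeAndMeasure() {
        try {
            Path tempOffsets = Files.createTempFile("offsets-", ".headerless.sbi");
            try (OutputStream out = Files.newOutputStream(tempOffsets)) {
                out.write(new byte[]{1, 2, 3}); // placeholder offset bytes
            }
            long size = Files.size(tempOffsets);
            Files.delete(tempOffsets);
            return size;
        } catch (IOException e) {
            return -1; // keep the sketch exception-free for callers
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndMeasure()); // 3
    }
}
```

Besides being the more modern API, `Path` also composes with non-default filesystems, which matters for code that may later run against remote storage.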

tomwhite added a commit to broadinstitute/gatk that referenced this pull request Oct 8, 2018
@yfarjoun yfarjoun added Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond and removed Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond labels Oct 16, 2018
Contributor

@yfarjoun yfarjoun left a comment

Thanks for this PR @tomwhite.

Could you add final to the variables that can be final? I commented next to some of the places, but not all.

blockIn.seek(recordStart);
// Create a buffer for reading the BAM record lengths. BAM is little-endian.
final ByteBuffer byteBuffer = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
SBIIndexWriter indexWriter = new SBIIndexWriter(out, granularity);
Contributor

final

public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Header header = (Header) o;
Contributor

final

src/main/java/htsjdk/samtools/SBIIndex.java (resolved)
throw new RuntimeException(String.format("Cannot read SBI with more than %s offsets.", Integer.MAX_VALUE));
}
final int numOffsets = (int) numOffsetsLong;
long[] virtualOffsets = new long[numOffsets];
Contributor

final

if (!Arrays.equals(buffer, SBI_MAGIC)) {
throw new RuntimeException("Invalid file header in SBI: " + new String(buffer) + " (" + Arrays.toString(buffer) + ")");
}
long fileLength = binaryCodec.readLong();
Contributor

more finals
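The magic-number check in the diff above (compare the leading bytes against `SBI_MAGIC`, fail fast with a descriptive error otherwise) can be sketched in a self-contained way. The byte values used here are an assumption for illustration, not the normative SBI magic:

```java
import java.util.Arrays;

// Sketch of a magic-number check: compare the first header bytes against the
// expected magic so corrupt or foreign files are rejected before any parsing.
public class MagicCheckDemo {
    // Assumed value for illustration only.
    static final byte[] SBI_MAGIC = {'S', 'B', 'I', 1};

    public static boolean isValid(byte[] header) {
        return header.length >= SBI_MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, SBI_MAGIC.length), SBI_MAGIC);
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[]{'S', 'B', 'I', 1, 0})); // true
        System.out.println(isValid(new byte[]{'B', 'A', 'M', 1}));    // false
    }
}
```

Including a version byte in the magic, as this PR's format does, lets readers distinguish future incompatible revisions from corruption.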

finish(header, finalVirtualOffset);
}

void finish(SBIIndex.Header header, long finalVirtualOffset) {
Contributor

finals

List<SAMRecord> allReads = Iterables.slurp(samReader);

List<SAMRecord> allReadsFromChunks = new ArrayList<>();
for (Chunk chunk : chunks) {
Contributor

final chunk

}
Assert.assertEquals(allReadsFromChunks, allReads);

List<Chunk> optimizedChunks = Chunk.optimizeChunkList(chunks, 0);
Contributor

final lists

public void testIndexersProduceSameIndexes() throws Exception {
long bamFileSize = BAM_FILE.length();
for (long g : new long[] { 1, 2, 10, SBIIndexWriter.DEFAULT_GRANULARITY }) {
SBIIndex index1 = fromBAMFile(BAM_FILE, g);
Contributor

final


@Test
public void testIndexersProduceSameIndexes() throws Exception {
long bamFileSize = BAM_FILE.length();
Contributor

finals

@tomwhite
Contributor Author

@yfarjoun thanks for taking a look. I've added the final modifier to the places you suggested, and many more that I found.

Contributor

@yfarjoun yfarjoun left a comment

A few more nits. Thanks!

@tomwhite
Contributor Author

@yfarjoun @lbergelson this should be ready to go in. I've addressed all the feedback. Thanks!

Member

@lbergelson lbergelson left a comment

This looks good to me.

@lbergelson lbergelson dismissed yfarjoun’s stale review December 11, 2018 01:38

Tom responded to Yossi's comments.

@lbergelson
Member

We should integrate this with SAMFileWriterFactory so it's easy to write an SBI as part of writing a BAM file.

@lbergelson lbergelson merged commit 28dde96 into samtools:master Dec 11, 2018
@tomwhite
Contributor Author

Thanks @lbergelson

tomwhite added a commit to tomwhite/disq that referenced this pull request Jan 17, 2019
tomwhite added a commit to disq-bio/disq that referenced this pull request Jan 19, 2019