
Follow-up to #1276: Remove multiple versions of Slice/Container getCRAIEntries() #1329

Merged: 18 commits into master, Apr 1, 2019

Conversation

jmthibault79 (Contributor) commented Mar 15, 2019:

  • Encapsulate Container.byteOffset and distribute to Slices on set
  • Enforce prerequisites to index entry creation


Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@@ -159,7 +159,7 @@ void recordMetaData(final Slice slice) {
unAlignedRecords += slice.unmappedReadsCount;
}

final long start = slice.offset;
jmthibault79 (author):

renamed the Slice indexing parameters to make their function clearer.

@@ -125,14 +124,12 @@ public void processContainer(final Container container, final ValidationStringen

int sliceIndex = 0;
for (final Slice slice : container.slices) {
slice.containerOffset = container.offset;
jmthibault79 (author):

this is now set elsewhere

jmthibault79 (author):

Container.setSlices() sets this value. If the slices exist prior to the setting of this value, then Container.setByteOffset() sets the slices' values.
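The invariant described in this thread can be sketched with simplified stand-ins for the htsjdk classes (the class shapes below are illustrative assumptions, not the actual htsjdk API): whether the offset or the slices are set first, the slices always end up carrying the container's byte offset.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for htsjdk's Container and Slice; illustrative only.
class Slice {
    long containerByteOffset = -1; // -1 means "not yet set"
}

class Container {
    private long byteOffset = -1;
    private List<Slice> slices = new ArrayList<>();

    // Setting the slices also distributes the container's current offset,
    // so new slices can never be observed with a stale offset.
    void setSlices(final List<Slice> newSlices) {
        this.slices = newSlices;
        distributeByteOffset();
    }

    // Setting the offset updates the container and any existing slices together.
    void setByteOffset(final long byteOffset) {
        this.byteOffset = byteOffset;
        distributeByteOffset();
    }

    private void distributeByteOffset() {
        for (final Slice s : slices) {
            s.containerByteOffset = byteOffset;
        }
    }

    List<Slice> getSlices() { return slices; }
}

public class OffsetPropagationSketch {
    public static void main(final String[] args) {
        final Container container = new Container();
        container.setByteOffset(100L);          // offset known before slices exist
        final List<Slice> slices = new ArrayList<>();
        slices.add(new Slice());
        slices.add(new Slice());
        container.setSlices(slices);            // slices pick up the offset on set
        for (final Slice s : container.getSlices()) {
            if (s.containerByteOffset != 100L) throw new AssertionError("offset not propagated");
        }
        container.setByteOffset(250L);          // a later change propagates too
        if (container.getSlices().get(0).containerByteOffset != 250L) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Either ordering works because both setters funnel through the same distribute step, which is the point of the encapsulation.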

@@ -59,7 +59,7 @@ public void writeIndex(final OutputStream os) {
* @param container the container to index
*/
public void processContainer(final Container container) {
addEntries(container.getCRAIEntriesSplittingMultiRef());
addEntries(container.getCRAIEntries());
jmthibault79 (author):

we always split multiref now

@@ -58,8 +58,7 @@ public Container buildContainer(final List<CramCompressionRecord> records) {
slices.add(slice);
}

final Container container = Container.initializeFromSlices(slices);
container.compressionHeader = compressionHeader;
final Container container = Container.initializeFromSlices(slices, compressionHeader);
jmthibault79 (author):

We need the compressionHeader when we call getCRAIEntries(), so requiring it here helps ensure it's valid.
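A minimal sketch of this idea, with hypothetical stand-in names: requiring the header as a factory-method argument turns a missing compressionHeader from a latent indexing failure into an immediate error.

```java
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Illustrative stand-ins; the point is that requiring the compression header at
// construction time means a Container can never exist in a state where
// getCRAIEntries() would later fail for lack of one.
class CompressionHeader {}
class Slice {}

public class InitializeSketch {
    static String initializeFromSlices(final List<Slice> slices, final CompressionHeader header) {
        Objects.requireNonNull(header, "a Container cannot be indexed without its CompressionHeader");
        return "container with " + slices.size() + " slice(s)"; // stand-in for the real Container
    }

    public static void main(final String[] args) {
        boolean threw = false;
        try {
            initializeFromSlices(Collections.<Slice>emptyList(), null);
        } catch (final NullPointerException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected construction without a header to fail");
        System.out.println(initializeFromSlices(Collections.<Slice>emptyList(), new CompressionHeader()));
    }
}
```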

/**
* Retrieve the list of CRAI Index entries corresponding to this Container.
*
* TODO: investigate why we sometimes split multi-ref Slices
jmthibault79 (author):

now we always split

* TODO: investigate why we sometimes split multi-ref Slices
* into different entries and sometimes do not
*
* TODO: clearly identify and enforce preconditions, e.g.
jmthibault79 (author):

precondition enforcement is in craiIndexInitializationCheck()

if (referenceContext.isMultiRef()) {
final Map<ReferenceContext, AlignmentSpan> spans = getMultiRefAlignmentSpans(compressionHeader, ValidationStringency.DEFAULT_STRINGENCY);

return spans.entrySet().stream()
.map(e -> new CRAIEntry(e.getKey().getSerializableId(),
e.getValue().getStart(),
e.getValue().getSpan(),
containerStartByteOffset,
jmthibault79 (author):

We can now trust the Slice for these two values, instead of relying on the Container.

if (referenceContext.isMultiRef()) {
final Map<ReferenceContext, AlignmentSpan> spans = getMultiRefAlignmentSpans(compressionHeader, ValidationStringency.DEFAULT_STRINGENCY);

return spans.entrySet().stream()
.map(e -> new CRAIEntry(e.getKey().getSerializableId(),
e.getValue().getStart(),
e.getValue().getSpan(),
containerStartByteOffset,
landmarks[index],
size))
jmthibault79 (author):

same member variable. just renamed.

sequenceId,
alignmentStart,
alignmentSpan,
containerByteOffset,
jmthibault79 (author):

containerByteOffset newly uses the Slice member variable.
byteOffsetFromContainer and byteSize are continuing uses of the Slice member variables, renamed.

codecov-io commented Mar 15, 2019:

Codecov Report

Merging #1329 into master will increase coverage by 0.026%.
The diff coverage is 82.278%.

@@               Coverage Diff               @@
##              master     #1329       +/-   ##
===============================================
+ Coverage     67.822%   67.848%   +0.026%     
- Complexity      8258      8273       +15     
===============================================
  Files            562       562               
  Lines          33663     33678       +15     
  Branches        5644      5653        +9     
===============================================
+ Hits           22831     22850       +19     
+ Misses          8658      8653        -5     
- Partials        2174      2175        +1
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/cram/CRAIEntry.java 73.684% <ø> (ø) 25 <0> (ø) ⬇️
...mtools/cram/build/CramContainerHeaderIterator.java 100% <100%> (ø) 2 <1> (ø) ⬇️
...samtools/cram/build/CramSpanContainerIterator.java 71.053% <100%> (-2.118%) 9 <0> (ø)
...a/htsjdk/samtools/cram/build/ContainerFactory.java 96.552% <100%> (-0.115%) 5 <0> (ø)
...va/htsjdk/samtools/cram/build/ContainerParser.java 96.774% <100%> (ø) 10 <0> (ø) ⬇️
...ava/htsjdk/samtools/CRAMContainerStreamWriter.java 72.34% <100%> (-0.117%) 54 <0> (ø)
src/main/java/htsjdk/samtools/CRAMFileReader.java 76.02% <100%> (ø) 53 <0> (ø) ⬇️
...rc/main/java/htsjdk/samtools/BAMIndexMetaData.java 77.67% <100%> (ø) 22 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/cram/CRAIIndex.java 88.571% <100%> (+0.166%) 26 <1> (ø) ⬇️
...ain/java/htsjdk/samtools/cram/structure/Slice.java 67.593% <100%> (+3.307%) 43 <10> (+9) ⬆️
... and 9 more

cmnbroad (Collaborator) left a comment:

One question (see comments) to start off with.

for (final Slice slice : slices) {
slice.containerByteOffset = byteOffset;
}
}
cmnbroad (Collaborator):

Is there actually a case where this is called and slices == null ? If so, and the slices are somehow populated at a later time, how can we ensure that the offset will be propagated then ?

cmnbroad (Collaborator):

@jmthibault79 It seems like it should be possible to reduce the number of call sites that need to update these offsets. Every call to this method (except for the one in ContainerStreamWriter, when a new container is being created from records) is immediately preceded by a call (either directly or indirectly), to ContainerIO.readContainer. It would be much simpler to make ContainerIO responsible for all of those cases, even if it means passing the offset in (although I think it could often derive the offset from the stream position). Would that work ? I think it would reduce the complexity around understanding who/when the slice offsets get propagated, etc.

jmthibault79 (author):

I do think this is possible, and I have started an attempt at this. I'll work on that some more.

Good point about propagation. I will have to think more about that.

jmthibault79 (author):

Addressed both of these in a new commit.

catch (IOException e) {
throw new RuntimeIOException("Error writing CRAI index to output stream");
// get the first container
Container container = ContainerIO.readContainer(cramVersion, cramStream);
jmthibault79 (author):

offset is set inside readContainer() by querying the stream
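The pattern being described, deriving the container's start offset from the stream position instead of threading it through callers, can be sketched with a counting wrapper in the spirit of htsjdk's CountingInputStream (the names and the 4-byte "header" here are illustrative assumptions, not htsjdk's actual signatures):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal counting wrapper in the spirit of htsjdk's CountingInputStream.
class CountingStream extends FilterInputStream {
    private long count = 0;

    CountingStream(final InputStream in) { super(in); }

    @Override
    public int read() throws IOException {
        final int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    long getCount() { return count; }
}

public class ReadContainerSketch {
    // Illustrative: capture the container's byte offset from the stream itself,
    // instead of requiring every caller to pass it in. The pretend container
    // "header" here is just 4 bytes.
    static long readContainerOffset(final CountingStream in) throws IOException {
        final long containerByteOffset = in.getCount(); // position before the header
        for (int i = 0; i < 4; i++) {
            in.read(); // consume the pretend 4-byte header
        }
        return containerByteOffset;
    }

    public static void main(final String[] args) throws IOException {
        final CountingStream in = new CountingStream(new ByteArrayInputStream(new byte[16]));
        final long first = readContainerOffset(in);   // 0
        final long second = readContainerOffset(in);  // 4: where the first read stopped
        if (first != 0L || second != 4L) throw new AssertionError();
        System.out.println(first + " " + second);
    }
}
```

For a SeekableStream the same effect would come from querying the seek position rather than a byte count; either way, the caller no longer carries the offset.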

@@ -400,7 +400,7 @@ public SAMRecordIterator assertSorted(final SortOrder sortOrder) {
seekableStream.seek(0);
newIterator = new CRAMIterator(seekableStream, referenceSource, validationStringency);
seekableStream.seek(startOfLastLinearBin >>> 16);
final Container container = ContainerIO.readContainerHeader(newIterator.getCramHeader().getVersion().major, seekableStream);
final Container container = ContainerHeaderIO.readContainerHeader(newIterator.getCramHeader().getVersion().major, seekableStream);
jmthibault79 (author):

this was driving me nuts

jmthibault79 (author):

TODO obviously: move this to a new ContainerHeader class. Just not today.

* @param countingStream the {@link CountingInputStream} to read from
* @return The next Container's header from the stream, returned as a Container.
*/
@Override
jmthibault79 (author):

This tripped me up!

Hopefully this explicit Override and Javadoc will save some future grief.

cmnbroad (Collaborator):

Yeah, good catch.

@@ -17,20 +17,14 @@
private CountingInputStream countingInputStream;
private Container nextContainer;
private boolean eof = false;
private long offset = 0;
jmthibault79 (author):

containers can now take care of this. don't need it here.

cmnbroad (Collaborator):

Yay!

@@ -293,36 +297,39 @@ private static SAMFileHeader readSAMFileHeader(final Version version, InputStrea
} else {
/*
* pending issue: container.containerByteSize inputStream 2 bytes shorter
jmthibault79 (author):

I have not investigated this

block = Block.read(version.major, inputStream);
}
}

inputStream = new ByteArrayInputStream(block.getUncompressedContent());
byte[] bytes;
jmthibault79 (author):

not really related to this PR. Just a little cleanup while I'm here.

final Container c = ContainerIO.readContainer(cramHeader.getVersion(), seekableStream);
c.offset = offset;
return c;
return ContainerIO.readContainer(cramHeader.getVersion(), seekableStream);
jmthibault79 (author):

readContainer() takes care of setting the offset by querying the SeekableStream

@@ -65,13 +65,55 @@
public CompressionHeader compressionHeader;

// slices found in the container:
public Slice[] slices;
private Slice[] slices;
jmthibault79 (author):

Can't set the slices without also setting their byteOffsets now

cmnbroad (Collaborator):

Excellent!

@@ -70,9 +77,42 @@ public static Container readContainerHeader(final int major, final InputStream i
if (major >= 3)
container.checksum = CramInt.readInt32(inputStream);

container.setByteOffset(containerByteOffset);
jmthibault79 (author):

this is now the only remaining external call to setByteOffset()

return container;
}

// convenience methods for SeekableStream and CountingInputStream
// TODO: merge these two classes?
jmthibault79 (author):

these 2 seem VERY similar. It's annoying to have to account for both.

jmthibault79 (author):

Maybe Seekable can inherit from Counting?

public static Container readContainer(final Version version,
final InputStream inputStream,
final long containerByteOffset) {
Container container = readContainerInternal(version.major, inputStream, containerByteOffset);
jmthibault79 (author):

Another version of this readContainer() consolidation is in #1306

cmnbroad (Collaborator) commented Mar 28, 2019:

This readContainer method can be made private now.

if (container.isEOF()) {
return container;
}

container.compressionHeader = CompressionHeader.read(major, inputStream);

howManySlices = Math.min(container.landmarks.length, howManySlices);
jmthibault79 (author):

howManySlices and fromSlice were never used. See #1306 for more details.

* @param outputStream the stream to write to
* @return the number of bytes written
*/
public static int writeContainerHeader(final int major, final Container container, final OutputStream outputStream) {
jmthibault79 (author):

Why did this exist?

final CramHeader cramHeader = CramIO.readCramHeader(bais);
final Container container = ContainerIO.readContainer(cramHeader.getVersion(), bais);
final Slice slice = container.slices[0];
Container container;
jmthibault79 (author):

more unrelated misc cleanup

cmnbroad (Collaborator) left a comment:

Ran into some new questions, per our slack conversation. Checkpointing what I have so far.



src/main/java/htsjdk/samtools/cram/build/CramIO.java (outdated thread, resolved)
slices.add(SliceIO.read(major, inputStream));
}

container.populateSlicesAndIndexingParameters(slices);
container.populateSlicesAndIndexingParameters(slices, containerByteOffset);
cmnbroad (Collaborator):

This method has the same signature as setSlices - could you just call that instead, and have it figure out whether to call populateSlicesAndIndexingParameters based on whether landmarks are set? populateSlicesAndIndexingParameters could then be private. You'd have to normalize array->list, but maybe the setSlices signature should be the one to change to accommodate.

Or even better, this could just call setSlices, and defer populating the index info for the slices since it's only needed when indexing happens on the container. It might be cleaner to defer the index info calculation until we actually need it to create an index, by which time there should always be landmarks.

jmthibault79 (author):

Updated by removing its parameters entirely (and renaming to distributeIndexingParametersToSlices). Slices must now be set before calling this method.

Added tests to ContainerTest and CRAMBAIIndexerTest.
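A rough sketch of what distributing indexing parameters from landmarks might look like, under the assumption that landmark[i] is slice i's byte offset from the container start (class and method names here are illustrative, not htsjdk's actual code):

```java
// Illustrative sketch (not htsjdk's actual code): each landmark is a slice's
// byte offset from the container start, so a slice's byte size is the distance
// to the next landmark, or to the container's end for the last slice.
class SliceIndexInfo {
    int index;
    int byteOffsetFromContainer;
    int byteSize;
}

public class DistributeSketch {
    static SliceIndexInfo[] distribute(final int[] landmarks, final int containerByteSize) {
        final SliceIndexInfo[] slices = new SliceIndexInfo[landmarks.length];
        for (int i = 0; i < landmarks.length; i++) {
            final SliceIndexInfo s = new SliceIndexInfo();
            s.index = i;
            s.byteOffsetFromContainer = landmarks[i];
            final int end = (i + 1 < landmarks.length) ? landmarks[i + 1] : containerByteSize;
            s.byteSize = end - landmarks[i];
            slices[i] = s;
        }
        return slices;
    }

    public static void main(final String[] args) {
        // Three slices starting at offsets 20, 120, and 300 within a 450-byte container.
        final SliceIndexInfo[] slices = distribute(new int[] {20, 120, 300}, 450);
        if (slices[0].byteSize != 100 || slices[1].byteSize != 180 || slices[2].byteSize != 150) {
            throw new AssertionError("unexpected slice sizes");
        }
        System.out.println("ok");
    }
}
```

This is also why the method can require slices (and landmarks) to be set first: without both, there is nothing to distribute.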

// BAI only
public int index = UNINITIALIZED_INDEXING_PARAMETER;

void baiIndexInitializationCheck() {
cmnbroad (Collaborator):

This looks like a good check, but it doesn't appear to be called from anywhere except within one test (craiIndexInitializationCheck below is called before indexing).

cmnbroad (Collaborator):

I started looking at this a bit more. It looks like populateSlicesAndIndexingParameters is only called when we're creating an index for a CRAM file that we've read in from disk, but I'm unclear on how the slice index info gets populated in the case where we're creating the index on-the-fly, when writing a new CRAM file. Since the tests were passing, I assumed it was happening somehow. So, I naively tried adding a call to this method from within CRAMBAIIndexer.processContainer, and then ran as many of the indexing tests as I easily could. 6 of them failed. So this led to a few questions:

  • Where should the call to the baiIndexInitializationCheck method go?
  • Does the equivalent of populateSlicesAndIndexingParameters happen in the case where we're writing a CRAM out from SAMRecords?
  • If we aren't populating the slice index info when writing a new CRAM file on this branch, how/why is it that all of the index tests are still passing?
  • Is there a similar issue already on master that isn't being caught by tests?

comments and clarification for CRAMBAIIndexer
- Encapsulate Container.byteOffset and distribute to Slices on set

Commit message 1:
move Container.setByteOffset() call inside ContainerIO.readContainer()
- also ContainerHeaderIO.readContainerHeader()
- restrict access to setByteOffset() and encapsulate Container.slices

Commit message 2: oops
Commit message 3: a little unrelated cleanup
Commit message 4: oops
Commit message 5: better CRAIEntryTest
Commit message 6: test improvements and a fix
Commit message 7: comment
Commit message 8: javadoc
Commit message 9: review comments
Commit message 10: comments and clarification for CRAMBAIIndexer

…Slices
- called on Container write now as well
cmnbroad (Collaborator) left a comment:

This is looking better and better. I proposed a couple more method consolidations that I think are simplifying - haven't looked at tests yet though, since they may change based on the outcome of this round.

final long containerSizeInBytes = countingInputStream.getCount() - offset;

nextContainer.offset = offset;
offset += containerSizeInBytes;
cmnbroad (Collaborator):

This is cleaning up nicely.


if (container == null || container.isEOF()) {
return;
}

container.distributeIndexingParametersToSlices();
cmnbroad (Collaborator):

It looks like there is no corresponding call to this in the CRAIIndexer ? But rather than add one there, I'd propose that we should actually remove this one, and just have validation checks in both the BAI and CRAI indexers, instead of actual distribute calls. In the current strategy, the slice index info is always propagated as soon as we have it (either on read or on write). If for any reason we ever reach this point in the code and it hasn't already been done, it indicates an error somewhere. Calling this here could mask that. I think this should just be a validation check, rather than an actual distribute operation. Also, I would make both BAI and CRAI validation container-level (and check all slices), rather than slice granularity, towards the goal of eventually making slice hidden from container consumers.

jmthibault79 (author):

Good catch: yes this can be removed.

There is a validation check later on, in processAsSingleReferenceSlice() because that is where all BAI code paths end up.

Since the indexing values all live in Slice (currently) I feel that's the appropriate place to validate them. Maybe in the future that will change, but that's beyond the scope of this PR.
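A minimal sketch of such a fail-fast validation check, with illustrative field names modeled on the ones discussed in this thread (not copied from htsjdk): reaching indexing with uninitialized parameters indicates a bug elsewhere, so the check throws rather than quietly redistributing.

```java
// Illustrative sketch of a fail-fast indexing precondition check; field names
// are modeled on the discussion above, not copied from htsjdk.
class Slice {
    static final int UNINITIALIZED_INDEXING_PARAMETER = -1;

    long containerByteOffset = UNINITIALIZED_INDEXING_PARAMETER;
    int byteOffsetFromContainer = UNINITIALIZED_INDEXING_PARAMETER;
    int byteSize = UNINITIALIZED_INDEXING_PARAMETER;

    // Throw instead of silently distributing: reaching indexing with
    // uninitialized parameters indicates a bug elsewhere.
    void craiIndexInitializationCheck() {
        if (containerByteOffset == UNINITIALIZED_INDEXING_PARAMETER
                || byteOffsetFromContainer == UNINITIALIZED_INDEXING_PARAMETER
                || byteSize == UNINITIALIZED_INDEXING_PARAMETER) {
            throw new IllegalStateException("Slice is not ready for CRAI indexing");
        }
    }
}

public class ValidationSketch {
    public static void main(final String[] args) {
        final Slice slice = new Slice();
        boolean threw = false;
        try {
            slice.craiIndexInitializationCheck();
        } catch (final IllegalStateException e) {
            threw = true;
        }
        if (!threw) throw new AssertionError("expected check to fail on an uninitialized slice");

        slice.containerByteOffset = 100L;
        slice.byteOffsetFromContainer = 20;
        slice.byteSize = 80;
        slice.craiIndexInitializationCheck(); // passes once parameters are distributed
        System.out.println("ok");
    }
}
```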

*/
void setSlices(final List<Slice> slices, final long byteOffset) {
this.slices = slices.toArray(new Slice[0]);
setByteOffsetInSlices(byteOffset);
cmnbroad (Collaborator):

This method has a byteoffset that it propagates to slices, but doesn't set it on the container itself (setByteOffset does that). I don't see any meaningful distinction now between setByteOffset and setByteOffsetInSlices. However the byteoffset is established, it should always be set on both the container and on the slices (if they exist) at the same time, in complete lockstep, so there should only be one method that always does both.

jmthibault79 (author):

sounds good

public static Container initializeFromSlices(final List<Slice> containerSlices) {
public static Container initializeFromSlices(final List<Slice> containerSlices,
final CompressionHeader compressionHeader,
final long containerByteOffset) {
cmnbroad (Collaborator):

The only (non test) external caller to setSlices that exists has the slices, the containerByteOffset, and the compression header, and could call this method directly. I think setSlices and initializeFromSlices can be consolidated into a single method that does everything ?

cmnbroad (Collaborator) left a comment:

I've reviewed all the files now including tests, a few more changes requested.


}
}

public void craiIndexInitializationCheck() {
cmnbroad (Collaborator):

This could be private except for tests, so it could be package-protected/VisibleForTesting. But I'm not sure it's useful to have separate methods for the BAI and CRAI validation anyway. They're only subtly different (which is hard to spot - I thought they were the same when I first saw them), but the slice is either ready for indexing or it isn't. Whether it fails for BAI or CRAI seems like a minor detail - you could remove a bunch of code and test code if they were consolidated into a single method that checked for everything.

jmthibault79 (author):

I'll add a comment to make the difference clearer.

I'd prefer to keep them separate, since they both test for exactly what is needed under their own requirements.

cmnbroad (Collaborator):

I don't see why the distinction, or the increased complexity and code duplication, is useful - there is only one distributeIndexingParametersToSlices that establishes these values. But ok.

* finish() is called at the end.
* 1) Building a bam index (BAI) while building the CRAM file
* 2) Building a bam index (BAI) from an existing CRAI file
*
cmnbroad (Collaborator):

There is a third pattern implemented in this file, which is creating an index from an existing CRAM file (see createIndex). But createIndex appears to be unused anywhere, except in CRAMFileBAIIndexerTest. Which means CRAMFileBAIIndexerTest is somewhat bogus, in that it's testing index streams that are not created using real code. (This is confirmed in my test inventory notes from a couple of days ago - I just didn't realize when I did that inventory that the method being used by this test was not actually used anywhere else.)

jmthibault79 (author):

Right - that's why I didn't include it. I can add it here with a note that it's unused.

src/main/java/htsjdk/samtools/CRAMBAIIndexer.java (outdated thread, resolved)
src/test/java/htsjdk/samtools/cram/CRAIEntryTest.java (outdated thread, resolved)
@jmthibault79 jmthibault79 merged commit aa89809 into master Apr 1, 2019
@jmthibault79 jmthibault79 deleted the jt_crai_consolidation branch April 1, 2019 14:51
3 participants