Follow-up to #1276: Remove multiple versions of Slice/Container getCRAIEntries() #1329
Conversation
@@ -159,7 +159,7 @@ void recordMetaData(final Slice slice) {
        unAlignedRecords += slice.unmappedReadsCount;
    }

    final long start = slice.offset;
renamed the Slice indexing parameters to make their function clearer.
@@ -125,14 +124,12 @@ public void processContainer(final Container container, final ValidationStringen

    int sliceIndex = 0;
    for (final Slice slice : container.slices) {
        slice.containerOffset = container.offset;
this is now set elsewhere
Container.setSlices() sets this value. If the slices exist prior to the setting of this value, then Container.setByteOffset() sets the slices' values.
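The lockstep propagation described here can be sketched as a toy model. ToyContainer and ToySlice are hypothetical names, not the actual htsjdk classes; the point is only that whichever setter runs first, the byte offset ends up on both the container and any slices it holds.

```java
import java.util.ArrayList;
import java.util.List;

class ToySlice {
    long containerByteOffset = -1;
}

class ToyContainer {
    private final List<ToySlice> slices = new ArrayList<>();
    private long byteOffset = -1;

    // Setting slices also (re)propagates the container's offset to each slice.
    void setSlices(final List<ToySlice> newSlices, final long offset) {
        slices.clear();
        slices.addAll(newSlices);
        setByteOffset(offset);
    }

    // Setting the offset updates the container AND any existing slices,
    // so the two can never disagree.
    void setByteOffset(final long offset) {
        this.byteOffset = offset;
        for (final ToySlice s : slices) {
            s.containerByteOffset = offset;
        }
    }

    long getByteOffset() { return byteOffset; }
}
```

Under this invariant, callers never touch slice offsets directly.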
@@ -59,7 +59,7 @@ public void writeIndex(final OutputStream os) {
 * @param container the container to index
 */
public void processContainer(final Container container) {
-    addEntries(container.getCRAIEntriesSplittingMultiRef());
+    addEntries(container.getCRAIEntries());
we always split multiref now
@@ -58,8 +58,7 @@ public Container buildContainer(final List<CramCompressionRecord> records) {
        slices.add(slice);
    }

-    final Container container = Container.initializeFromSlices(slices);
-    container.compressionHeader = compressionHeader;
+    final Container container = Container.initializeFromSlices(slices, compressionHeader);
We need the compressionHeader when we call getCraiEntries(), so requiring it here helps ensure it's valid.
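The fail-fast idea can be sketched independently of htsjdk. HeaderedContainer and ToyHeader are hypothetical stand-ins, not the real signatures: requiring the header at construction time means later index generation can assume it is present.

```java
import java.util.List;
import java.util.Objects;

class ToyHeader { }

class HeaderedContainer {
    final ToyHeader compressionHeader;
    final List<String> slices;

    private HeaderedContainer(final List<String> slices, final ToyHeader header) {
        this.slices = slices;
        this.compressionHeader = header;
    }

    // Demanding the header at build time moves the failure from index
    // generation (late, far from the cause) to construction (early, obvious).
    static HeaderedContainer initializeFromSlices(final List<String> slices,
                                                  final ToyHeader header) {
        Objects.requireNonNull(header, "compressionHeader is required for index entry generation");
        return new HeaderedContainer(slices, header);
    }
}
```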
/**
 * Retrieve the list of CRAI Index entries corresponding to this Container.
 *
 * TODO: investigate why we sometimes split multi-ref Slices
now we always split
 * TODO: investigate why we sometimes split multi-ref Slices
 * into different entries and sometimes do not
 *
 * TODO: clearly identify and enforce preconditions, e.g.
precondition enforcement is in craiIndexInitializationCheck()
if (referenceContext.isMultiRef()) {
    final Map<ReferenceContext, AlignmentSpan> spans = getMultiRefAlignmentSpans(compressionHeader, ValidationStringency.DEFAULT_STRINGENCY);

    return spans.entrySet().stream()
            .map(e -> new CRAIEntry(e.getKey().getSerializableId(),
                    e.getValue().getStart(),
                    e.getValue().getSpan(),
                    containerStartByteOffset,
can trust the Slice now instead of relying on the Container, for these 2 values
if (referenceContext.isMultiRef()) {
    final Map<ReferenceContext, AlignmentSpan> spans = getMultiRefAlignmentSpans(compressionHeader, ValidationStringency.DEFAULT_STRINGENCY);

    return spans.entrySet().stream()
            .map(e -> new CRAIEntry(e.getKey().getSerializableId(),
                    e.getValue().getStart(),
                    e.getValue().getSpan(),
                    containerStartByteOffset,
                    landmarks[index],
                    size))
Same member variable, just renamed.
sequenceId,
alignmentStart,
alignmentSpan,
containerByteOffset,
containerByteOffset newly uses the Slice member variable. byteOffsetFromContainer and byteSize are continuing uses of the Slice member variables, renamed.
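As a rough illustration of the multi-ref fan-out shown in the hunks above, here is a self-contained sketch. ToyEntry and ToySpan are stand-ins for CRAIEntry and AlignmentSpan, not the real htsjdk types: each reference context in a multi-ref slice's span map yields one index entry, and all of them share the same container byte offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ToySpan {
    final int start, span;
    ToySpan(int start, int span) { this.start = start; this.span = span; }
}

class ToyEntry {
    final int sequenceId, alignmentStart, alignmentSpan;
    final long containerByteOffset;

    ToyEntry(int sequenceId, int alignmentStart, int alignmentSpan, long containerByteOffset) {
        this.sequenceId = sequenceId;
        this.alignmentStart = alignmentStart;
        this.alignmentSpan = alignmentSpan;
        this.containerByteOffset = containerByteOffset;
    }

    // One entry per reference context: a multi-ref slice fans out into
    // as many index entries as it has reference spans.
    static List<ToyEntry> fromMultiRefSpans(final Map<Integer, ToySpan> spans,
                                            final long containerByteOffset) {
        final List<ToyEntry> entries = new ArrayList<>();
        for (final Map.Entry<Integer, ToySpan> e : spans.entrySet()) {
            entries.add(new ToyEntry(e.getKey(), e.getValue().start,
                    e.getValue().span, containerByteOffset));
        }
        return entries;
    }
}
```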
Codecov Report
@@ Coverage Diff @@
## master #1329 +/- ##
===============================================
+ Coverage 67.822% 67.848% +0.026%
- Complexity 8258 8273 +15
===============================================
Files 562 562
Lines 33663 33678 +15
Branches 5644 5653 +9
===============================================
+ Hits 22831 22850 +19
+ Misses 8658 8653 -5
- Partials 2174 2175 +1
Force-pushed 68ed944 to 72b71cd.
One question (see comments) to start off with.
    for (final Slice slice : slices) {
        slice.containerByteOffset = byteOffset;
    }
}
Is there actually a case where this is called and slices == null? If so, and the slices are somehow populated at a later time, how can we ensure that the offset will be propagated then?
@jmthibault79 It seems like it should be possible to reduce the number of call sites that need to update these offsets. Every call to this method (except for the one in ContainerStreamWriter, when a new container is being created from records) is immediately preceded by a call (either directly or indirectly) to ContainerIO.readContainer. It would be much simpler to make ContainerIO responsible for all of those cases, even if it means passing the offset in (although I think it could often derive the offset from the stream position). Would that work? I think it would reduce the complexity around understanding who/when the slice offsets get propagated, etc.
I do think this is possible, and I have started an attempt at this. I'll work on that some more.
Good point about propagation. I will have to think more about that.
Addressed both of these in a new commit.
catch (IOException e) {
    throw new RuntimeIOException("Error writing CRAI index to output stream");
// get the first container
Container container = ContainerIO.readContainer(cramVersion, cramStream);
offset is set inside readContainer() by querying the stream
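A minimal sketch of that pattern, under the assumption that a counting wrapper tracks bytes consumed (ToyCountingStream is a stand-in, not htsjdk's CountingInputStream): the reader snapshots the stream position itself, so callers no longer need to pass the offset in.

```java
import java.io.ByteArrayInputStream;

// Counts bytes as they are consumed; ByteArrayInputStream.read() declares
// no checked exception, so neither does the override.
class ToyCountingStream extends ByteArrayInputStream {
    private long count = 0;

    ToyCountingStream(byte[] buf) { super(buf); }

    @Override
    public synchronized int read() {
        final int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    long getCount() { return count; }
}

class ToyContainerReader {
    // The reader, not the caller, records where the container starts:
    // the offset is simply the stream position before any bytes are read.
    static long readContainerAtOffset(final ToyCountingStream stream,
                                      final int containerLength) {
        final long offset = stream.getCount();
        for (int i = 0; i < containerLength; i++) {
            stream.read();
        }
        return offset;
    }
}
```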
@@ -400,7 +400,7 @@ public SAMRecordIterator assertSorted(final SortOrder sortOrder) {
    seekableStream.seek(0);
    newIterator = new CRAMIterator(seekableStream, referenceSource, validationStringency);
    seekableStream.seek(startOfLastLinearBin >>> 16);
-   final Container container = ContainerIO.readContainerHeader(newIterator.getCramHeader().getVersion().major, seekableStream);
+   final Container container = ContainerHeaderIO.readContainerHeader(newIterator.getCramHeader().getVersion().major, seekableStream);
this was driving me nuts
TODO obviously: move this to a new ContainerHeader class. Just not today.
 * @param countingStream the {@link CountingInputStream} to read from
 * @return The next Container's header from the stream, returned as a Container.
 */
@Override
This tripped me up!
Hopefully this explicit Override and Javadoc will save some future grief.
Yeah, good catch.
@@ -17,20 +17,14 @@
    private CountingInputStream countingInputStream;
    private Container nextContainer;
    private boolean eof = false;
-   private long offset = 0;
containers can now take care of this. don't need it here.
Yay!
@@ -293,36 +297,39 @@ private static SAMFileHeader readSAMFileHeader(final Version version, InputStrea
    } else {
        /*
         * pending issue: container.containerByteSize inputStream 2 bytes shorter
I have not investigated this
        block = Block.read(version.major, inputStream);
    }
}

inputStream = new ByteArrayInputStream(block.getUncompressedContent());
byte[] bytes;
not really related to this PR. Just a little cleanup while I'm here.
-final Container c = ContainerIO.readContainer(cramHeader.getVersion(), seekableStream);
-c.offset = offset;
-return c;
+return ContainerIO.readContainer(cramHeader.getVersion(), seekableStream);
readContainer() takes care of setting the offset by querying the SeekableStream
@@ -65,13 +65,55 @@
    public CompressionHeader compressionHeader;

    // slices found in the container:
-   public Slice[] slices;
+   private Slice[] slices;
Can't set the slices without also setting their byteOffsets now
Excellent!
@@ -70,9 +77,42 @@ public static Container readContainerHeader(final int major, final InputStream i
    if (major >= 3)
        container.checksum = CramInt.readInt32(inputStream);

    container.setByteOffset(containerByteOffset);
this is now the only remaining external call to setByteOffset()
    return container;
}

// convenience methods for SeekableStream and CountingInputStream
// TODO: merge these two classes?
these 2 seem VERY similar. It's annoying to have to account for both.
Maybe Seekable can inherit from Counting?
public static Container readContainer(final Version version,
                                      final InputStream inputStream,
                                      final long containerByteOffset) {
    Container container = readContainerInternal(version.major, inputStream, containerByteOffset);
Another version of this readContainer() consolidation is in #1306
This readContainer method can be made private now.
if (container.isEOF()) {
    return container;
}

container.compressionHeader = CompressionHeader.read(major, inputStream);

howManySlices = Math.min(container.landmarks.length, howManySlices);
howManySlices and fromSlice were never used. See #1306 for more details.
 * @param outputStream the stream to write to
 * @return the number of bytes written
 */
public static int writeContainerHeader(final int major, final Container container, final OutputStream outputStream) {
why did this exist?
final CramHeader cramHeader = CramIO.readCramHeader(bais);
final Container container = ContainerIO.readContainer(cramHeader.getVersion(), bais);
final Slice slice = container.slices[0];
Container container;
more unrelated misc cleanup
Ran into some new questions, per our slack conversation. Checkpointing what I have so far.
        slices.add(SliceIO.read(major, inputStream));
    }

-   container.populateSlicesAndIndexingParameters(slices);
+   container.populateSlicesAndIndexingParameters(slices, containerByteOffset);
This method has the same signature as setSlices - could you just call that instead, and have it figure out whether to call populateSlicesAndIndexingParameters based on whether landmarks are set? populateSlicesAndIndexingParameters could then be private. You'd have to normalize array->list, but maybe the setSlices signature should be the one to change to accommodate.
Or even better, this could just call setSlices, and defer populating the index info for the slices since it's only needed when indexing happens on the container. It might be cleaner to defer the index info calculation until we actually need it to create an index, by which time there should always be landmarks.
Updated by removing its parameters entirely (and renaming to distributeIndexingParametersToSlices). Slices must now be set before calling this method.
Added tests to ContainerTest and CRAMBAiIndexerTest
// BAI only
public int index = UNINITIALIZED_INDEXING_PARAMETER;

void baiIndexInitializationCheck() {
This looks like a good check, but it doesn't appear to be called from anywhere except within one test (craiIndexInitializationCheck below is called before indexing).
I started looking at this a bit more. It looks like populateSlicesAndIndexingParameters is only called when we're creating an index for a CRAM file that we've read in from disk, but I'm unclear on how the slice index info gets populated in the case where we're creating the index on-the-fly, when writing a new CRAM file. Since the tests were passing, I assumed it was happening somehow. So, I naively tried adding a call to this method from within CRAMBAIIndexer.processContainer, and then ran as many of the indexing tests as I easily could. 6 of them failed. So this led to a few questions:
- Where should the call to the baiIndexInitializationCheck method go?
- Does the equivalent of populateSlicesAndIndexingParameters happen in the case where we're writing a cram out from SAMRecords?
- If we aren't populating the slice index info when writing a new CRAM file on this branch, how/why is it that all of the index tests are still passing?
- Is there a similar issue already on master that isn't being caught by tests?
comments and clarification for CRAMBAIIndexer
- Encapsulate Container.byteOffset and distribute to Slices on set
# This is the 1st commit message:
move Container.setByteOffset() call inside ContainerIO.readContainer() - also ContainerHeaderIO.readContainerHeader(); restrict access to setByteOffset() and encapsulate Container.slices
# This is the commit message #2: oops
# This is the commit message #3: a little unrelated cleanup
# This is the commit message #4: oops
# This is the commit message #5: better CRAIEntryTest
# This is the commit message #6: test improvements and a fix
# This is the commit message #7: comment
# This is the commit message #8: javadoc
# This is the commit message #9: review comments
# This is the commit message #10: comments and clarification for CRAMBAIIndexer
…Slices - called on Container write now as well
Force-pushed 59980f4 to 1e1f348.
This is looking better and better. I proposed a couple more method consolidations that I think are simplifying - haven't looked at tests yet though, since they may change based on the outcome of this round.
final long containerSizeInBytes = countingInputStream.getCount() - offset;

nextContainer.offset = offset;
offset += containerSizeInBytes;
This is cleaning up nicely.
if (container == null || container.isEOF()) {
    return;
}

container.distributeIndexingParametersToSlices();
It looks like there is no corresponding call to this in the CRAIIndexer? But rather than add one there, I'd propose that we should actually remove this one, and just have validation checks in both the BAI and CRAI indexers, instead of actual distribute calls. In the current strategy, the slice index info is always propagated as soon as we have it (either on read or on write). If for any reason we ever reach this point in the code and it hasn't already been done, it indicates an error somewhere. Calling this here could mask that. I think this should just be a validation check, rather than an actual distribute operation. Also, I would make both BAI and CRAI validation container-level (and check all slices), rather than slice granularity, towards the goal of eventually making slice hidden from container consumers.
Good catch: yes this can be removed.
There is a validation check later on, in processAsSingleReferenceSlice(), because that is where all BAI code paths end up.
Since the indexing values all live in Slice (currently) I feel that's the appropriate place to validate them. Maybe in the future that will change, but that's beyond the scope of this PR.
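The validate-rather-than-repair approach under discussion could look roughly like this (hypothetical names, not the real htsjdk methods): a slice with uninitialized indexing parameters is treated as a bug and reported, never silently fixed up.

```java
import java.util.List;

class IndexableSlice {
    static final int UNINITIALIZED = -1;
    int byteOffsetFromContainer = UNINITIALIZED;
    int byteSize = UNINITIALIZED;
}

class SliceIndexValidator {
    // Container-level check over all slices, as the review suggests:
    // if any slice was never given its indexing parameters, fail loudly
    // instead of re-running the distribute step and masking the bug.
    static void indexInitializationCheck(final List<IndexableSlice> slices) {
        for (final IndexableSlice s : slices) {
            if (s.byteOffsetFromContainer == IndexableSlice.UNINITIALIZED
                    || s.byteSize == IndexableSlice.UNINITIALIZED) {
                throw new IllegalStateException(
                        "slice indexing parameters were never distributed");
            }
        }
    }
}
```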
 */
void setSlices(final List<Slice> slices, final long byteOffset) {
    this.slices = slices.toArray(new Slice[0]);
    setByteOffsetInSlices(byteOffset);
This method has a byteOffset that it propagates to slices, but doesn't set it on the container itself (setByteOffset does that). I don't see any meaningful distinction now between setByteOffset and setByteOffsetInSlices. However the byteOffset is established, it should always be set on both the container and on the slices (if they exist) at the same time, in complete lockstep, so there should only be one method that always does both.
sounds good
-public static Container initializeFromSlices(final List<Slice> containerSlices) {
+public static Container initializeFromSlices(final List<Slice> containerSlices,
+                                             final CompressionHeader compressionHeader,
+                                             final long containerByteOffset) {
The only (non-test) external caller to setSlices that exists has the slices, the containerByteOffset, and the compression header, and could call this method directly. I think setSlices and initializeFromSlices can be consolidated into a single method that does everything?
I've reviewed all the files now including tests, a few more changes requested.
    }
}

public void craiIndexInitializationCheck() {
This could be private except for tests, so it could be package-protected/VisibleForTesting. But I'm not sure it's useful to have separate methods for the BAI and CRAI validation anyway. They're only subtly different (which is hard to spot - I thought they were the same when I first saw them), but the slice is either ready for indexing or it isn't. Whether it fails for BAI or CRAI seems like a minor detail - you could remove a bunch of code and test code if they were consolidated into a single method that checked for everything.
I'll add a comment to make the difference clearer.
I'd prefer to keep them separate, since they both test for exactly what is needed under their own requirements.
I don't see why the distinction, or the increased complexity and code duplication, is useful - there is only one distributeIndexingParametersToSlices that establishes these values. But ok.
 * finish() is called at the end.
 * 1) Building a bam index (BAI) while building the CRAM file
 * 2) Building a bam index (BAI) from an existing CRAI file
 *
There is a third pattern implemented in this file, which is creating an index from an existing CRAM file (see createIndex). But createIndex appears to be unused anywhere, except in CRAMFileBAIIndexerTest. Which means CRAMFileBAIIndexerTest is somewhat bogus in that it's testing index streams that are not created using real code. (This is confirmed in my test inventory notes from a couple of days ago - I just didn't realize when I did that inventory that the method being used by this test was not actually used anywhere else.)
Right - that's why I didn't include it. I can add it here with a note that it's unused.