Cram implementation reports validation errors at container granularity #1091

Contributor

@SathemBite SathemBite commented Feb 20, 2018

Description

Associated issue: #1076

Motivation

  • The CRAM implementation differs from the BAM implementation: it validates all records in a container when the container is first opened, whereas the BAM implementation validates each record as it is retrieved.
  • The CRAM iterator made an unnecessary up-front allocation of 10000 elements for ArrayList<SAMRecord> records.

Solution

The first problem was solved by removing validation from the cramIterator.nextContainer() method and adding it to cramIterator.next(), so that records are validated as they are retrieved.
The second problem was solved by removing the capacity reservation.
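
As a rough illustration of the change described above, the sketch below validates each record only when `next()` returns it, rather than validating a whole batch when it is loaded. The `Rec` type, the `LazyValidatingIterator` class, and the "empty name is invalid" rule are hypothetical stand-ins for illustration, not htsjdk code.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a record; an empty name counts as invalid here.
final class Rec {
    final String name;
    Rec(final String name) { this.name = name; }
    boolean isValid() { return !name.isEmpty(); }
}

// Validates each record only when it is handed to the caller, so an invalid
// record deep in a batch does not cause a failure when the batch is loaded.
final class LazyValidatingIterator implements Iterator<Rec> {
    private final Iterator<Rec> delegate;
    private long recordIndex = 0; // 1-based position of the last record fetched

    LazyValidatingIterator(final List<Rec> batch) {
        // No validation here: constructing the iterator never throws.
        this.delegate = batch.iterator();
    }

    @Override
    public boolean hasNext() { return delegate.hasNext(); }

    @Override
    public Rec next() {
        final Rec rec = delegate.next();
        recordIndex++;
        if (!rec.isValid()) {
            throw new IllegalStateException("Invalid record at index " + recordIndex);
        }
        return rec;
    }
}
```

With this shape, a caller that stops before reaching an invalid record never sees a validation failure, which mirrors the per-record behavior the description attributes to BAM.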

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@SathemBite SathemBite force-pushed the TASK-298-CRAM_implementation_reports_validation_errors_at_container_granularity branch 2 times, most recently from 17dd6f2 to b3e4f77 Compare February 20, 2018 11:45
@codecov-io

codecov-io commented Feb 20, 2018

Codecov Report

Merging #1091 into master will increase coverage by 0.003%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##             master     #1091       +/-   ##
==============================================
+ Coverage      68.7%   68.703%   +0.003%     
- Complexity     8060      8061        +1     
==============================================
  Files           542       542               
  Lines         32728     32725        -3     
  Branches       5537      5536        -1     
==============================================
- Hits          22484     22483        -1     
  Misses         8043      8043               
+ Partials       2201      2199        -2
Impacted Files                                          Coverage Δ                Complexity Δ
src/main/java/htsjdk/samtools/CRAMIterator.java         79.72% <100%> (+0.953%)   33 <0> (ø) ⬇️
...samtools/util/AsyncBlockCompressedInputStream.java   72% <0%> (-4%)            12% <0%> (-1%)
src/main/java/htsjdk/samtools/SAMRecord.java            67.795% <0%> (+0.117%)    315% <0%> (+1%) ⬆️
src/main/java/htsjdk/samtools/SAMUtils.java             59.848% <0%> (+0.505%)    126% <0%> (+1%) ⬆️

@@ -106,7 +106,7 @@ public CRAMIterator(final SeekableStream seekableStream, final CRAMReferenceSour
 this.containerIterator = containerIterator;

 firstContainerOffset = containerIterator.getFirstContainerOffset();
-records = new ArrayList<SAMRecord>(10000);
+records = new ArrayList<SAMRecord>();
Member

How many records are expected in each container? This could have some performance effect, but I understand that there is no explanation for the magic number. Should it be extracted as a default constant at the class level, or maybe as an argument to the iterator?

Contributor Author

@magicDGS Hello, thanks for your review. I've extracted it as a constant at the class level.

@vadimzalunin
Contributor

vadimzalunin commented Feb 20, 2018 via email

@SathemBite SathemBite force-pushed the TASK-298-CRAM_implementation_reports_validation_errors_at_container_granularity branch from ea5521b to 410011b Compare February 21, 2018 15:35
Collaborator

@cmnbroad cmnbroad left a comment

Did a first pass, with a few requests.

@@ -43,6 +43,7 @@

 public class CRAMIterator implements SAMRecordIterator {
 private static final Log log = Log.getInstance(CRAMIterator.class);
+private static final int DEFAULT_CONTAINER_CAPACITY = 10000;
Collaborator

There are existing constants named DEFAULT_RECORDS_PER_SLICE and DEFAULT_SLICES_PER_CONTAINER (defined in CRAMContainerStreamWriter!). We're planning to refactor this code soon to fix a lot of this (i.e., those constants and much of this code should live in the Container and Slice classes, and since it's reading an existing container, the size of the list can be based on the size of the container being read rather than a constant). For now, if anything, I'd say to just use the existing DEFAULT_RECORDS_PER_SLICE.
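
A minimal sketch of the two sizing options mentioned in this comment: a named default versus a capacity taken from the container being read. `ContainerInfo`, `recordCount`, `RecordBuffers`, and the value 10000 are assumptions made for illustration; the real constants and classes in htsjdk may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical description of a container whose header has already been read.
final class ContainerInfo {
    final int recordCount;
    ContainerInfo(final int recordCount) { this.recordCount = recordCount; }
}

final class RecordBuffers {
    // Named default instead of an unexplained literal; the value is assumed.
    static final int DEFAULT_RECORDS_PER_SLICE = 10000;

    private RecordBuffers() {}

    // Option 1: fixed default capacity.
    static <T> List<T> withDefaultCapacity() {
        return new ArrayList<>(DEFAULT_RECORDS_PER_SLICE);
    }

    // Option 2: capacity based on the container actually being read.
    // A negative count is clamped to 0 so the constructor cannot throw.
    static <T> List<T> sizedFor(final ContainerInfo container) {
        return new ArrayList<>(Math.max(0, container.recordCount));
    }
}
```

Either way the list still grows on demand, so the capacity is only a hint that avoids repeated resizing.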

Contributor Author

@cmnbroad hello, thanks for your review, all requested changes are done

*
* @param samRecord - validated record
*/
public void validateRecord(SAMRecord samRecord){
Collaborator

This doesn't need to be public, or even a separate method at all, since it's only called once.

*/
public void validateRecord(SAMRecord samRecord){
final List<SAMValidationError> validationErrors = samRecord.isValid();
SAMUtils.processValidationErrors(validationErrors, samRecordIndex, validationStringency);
Collaborator

I don't think that in this context, samRecordIndex bears any relationship to the record being processed.



/**
* @author [email protected], EPAM Systems, Inc.
Collaborator

This is a test file, so it doesn't matter, but block comments like "author" need to go after the main javadoc or else the entire javadoc text becomes part of the author block.
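
The ordering described here can be sketched as follows: the main description comes first and block tags such as `@author` come last. The class name, method, and author name are hypothetical and used only for illustration.

```java
/**
 * Checks that record validation is deferred until records are retrieved.
 * The main description comes first so that it is not absorbed into a block tag.
 *
 * @author Jane Doe (hypothetical name for illustration)
 */
final class JavadocOrderingExample {
    /**
     * Returns a fixed status string; present only so the class has documented behavior.
     *
     * @return the string "ok"
     */
    static String status() {
        return "ok";
    }
}
```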


public class CRAMIteratorTest extends HtsjdkTest {
private static final File refFile = new File("src/test/resources/htsjdk/samtools/cram/ce.fa");
private static final File cramFile = new File("src/test/resources/htsjdk/samtools/cram/ce#supp.3.0.cram");
Collaborator

Presumably this file has some records that fail validation. It would be better to make a copy of it with a name that indicates that, and reinforce it by using a variable name here that reflects it.

Collaborator

Also, although many of the tests are written this same way, it would be preferable for each test to do its own setup rather than relying on class-level state.

public class CRAMIteratorTest extends HtsjdkTest {
private static final File refFile = new File("src/test/resources/htsjdk/samtools/cram/ce.fa");
private static final File cramFile = new File("src/test/resources/htsjdk/samtools/cram/ce#supp.3.0.cram");
ReferenceSource source = new ReferenceSource(refFile);
Collaborator

We don't publish a style guide, but for new code we try to use "final" everywhere possible (parameters, locals, etc.).

getCramFileIterator(cramFile, source, ValidationStringency.STRICT);

while (cramIter.hasNext())
{
Collaborator

We don't publish a style guide, but we generally try to put opening braces on the line with the previous statement.

Collaborator

One other comment I forgot to make; since this test is expected to throw an exception, it would be a good idea to use a try/finally block to close the reader.


private SAMRecordIterator getCramFileIterator(File cramFile,
ReferenceSource source,
ValidationStringency valStrigency) {
Collaborator

valStrigency -> valStringency

}

@Test(expectedExceptions = SAMException.class)
public void shouldThrowExceptionIfCRAMFileContainsInvalidRecods() {
Collaborator

maybe rename to throwOnRecordValidationFailure.

Collaborator

@cmnbroad cmnbroad left a comment

Sorry for the delay on reviewing this. The changes are pretty nicely simplified now - just a couple of test cleanup comments.

final ReferenceSource source = new ReferenceSource(refFile);
final SAMRecordIterator cramIteratorOverInvalidRecords =
getCramFileIterator(cramFileWithInvalidRecs, source, ValidationStringency.STRICT);
try{
Collaborator

try-with-resources

final File cramFileWithInvalidRecs = new File("src/test/resources/htsjdk/samtools/cram/ce#containsInvalidRecords.3.0.cram");
final ReferenceSource source = new ReferenceSource(refFile);
final SAMRecordIterator cramIteratorOverInvalidRecords =
getCramFileIterator(cramFileWithInvalidRecs, source, ValidationStringency.STRICT);
Collaborator

Most of the code up to this point is duplicated in both test methods. Can you factor that out (maybe into getCramFileIterator).

-return iterator.next();
+SAMRecord samRecord = iterator.next();
+if (validationStringency != ValidationStringency.SILENT) {
+SAMUtils.processValidationErrors(samRecord.isValid(), samRecordIndex++, validationStringency);
Collaborator

Since samRecordIndex is a class field rather than a local variable, can you add a comment to the declaration saying that it is only used when the validation stringency is not SILENT, and that otherwise it isn't valid? (This was always true even before this PR, but it would be good to make that clear.)

public class CRAMIteratorTest extends HtsjdkTest {

@Test(description = "This test checks that records validation is deferred until they are retrieved")
public void notThrowOnOpeningContainerWithInvalidRecords() {
Collaborator

Maybe rename this to "noValidationFailureOnContainerOpen".

getCramFileIterator(cramFileWithInvalidRecs, source, ValidationStringency.STRICT);

Assert.assertTrue(cramIteratorOverInvalidRecords.hasNext());
cramIteratorOverInvalidRecords.close();
Collaborator

SAMRecordIterator is closable, so you should be able to use try-with-resources here.
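
A minimal sketch of the try-with-resources pattern suggested here, using a hypothetical `TrackingIterator` in place of SAMRecordIterator. It only assumes that the iterator type implements `AutoCloseable`; the class names and the "empty item is invalid" rule are invented for illustration.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical closeable iterator; records whether close() was called.
final class TrackingIterator implements Iterator<String>, AutoCloseable {
    private final Iterator<String> delegate;
    boolean closed = false;

    TrackingIterator(final List<String> items) { this.delegate = items.iterator(); }

    @Override
    public boolean hasNext() { return delegate.hasNext(); }

    @Override
    public String next() {
        final String item = delegate.next();
        if (item.isEmpty()) {
            throw new IllegalStateException("invalid item");
        }
        return item;
    }

    @Override
    public void close() { closed = true; }
}

final class TryWithResourcesExample {
    private TryWithResourcesExample() {}

    // Drains the iterator; close() runs even if next() throws.
    static int drain(final TrackingIterator iterator) {
        int count = 0;
        try (TrackingIterator it = iterator) {
            while (it.hasNext()) {
                it.next();
                count++;
            }
        }
        return count;
    }
}
```

This matters most in a test that expects an exception, since the resource is released on the failure path without an explicit finally block.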

Contributor Author

@cmnbroad Hello! Thank you for review, requested changes are done

@cmnbroad
Collaborator

@AntonMazur I think this is pretty close now, but it fails to compile on travis. CRAMIteratorTest.java imports a junit package, but we use testng. I suspect if you replace that import it will compile, and then I can do a final review.

Collaborator

@cmnbroad cmnbroad left a comment

@AntonMazur Sorry again for the long delay on getting this branch reviewed. It looks pretty good with the last set of changes; however, there is one comment that's a bit awkwardly written. Since the tests didn't run last time this was submitted, it would be good to update that, and then rebase onto current master since this branch is pretty far behind. This will rerun the CI tests to make sure everything passes, then we can get this merged. Alternatively, I'd be happy to make that last change for you if you prefer - just let me know. Thanks again for doing this.

@cmnbroad
Collaborator

cmnbroad commented Oct 22, 2018

@AntonMazur Just pinging you one more time on this. I'll plan to update and rebase this branch if I don't hear back from you soon, since I think it's pretty close, but it has conflicts that need to be resolved.

@cmnbroad cmnbroad added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Oct 22, 2018
@SathemBite SathemBite force-pushed the TASK-298-CRAM_implementation_reports_validation_errors_at_container_granularity branch from f80ba81 to cba9220 Compare October 30, 2018 08:39
…ation_errors_at_container_granularity

# Conflicts:
#	src/main/java/htsjdk/samtools/CRAMIterator.java
@SathemBite SathemBite force-pushed the TASK-298-CRAM_implementation_reports_validation_errors_at_container_granularity branch from 6439592 to cb13b65 Compare October 30, 2018 09:06
Member

@lbergelson lbergelson left a comment

👍 @cmnbroad Can you make the fields we talked about final in a follow up?

Labels
cram Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond
7 participants