Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for partial VCF writing #904

Merged
merged 23 commits into from
Sep 12, 2017
Merged

Conversation

amilamanoj
Copy link
Contributor

@amilamanoj amilamanoj commented Jun 20, 2017

Description

This PR addresses Issue #902.

Added support for setting and writing the header in two different steps for the classes derived from VariantContextWriter.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link

codecov-io commented Jun 20, 2017

Codecov Report

Merging #904 into master will increase coverage by 0.964%.
The diff coverage is 80%.

@@               Coverage Diff               @@
##              master      #904       +/-   ##
===============================================
+ Coverage     65.048%   66.012%   +0.964%     
- Complexity      7256      7471      +215     
===============================================
  Files            528       529        +1     
  Lines          31938     31973       +35     
  Branches        5454      5462        +8     
===============================================
+ Hits           20775     21106      +331     
+ Misses          9019      8725      -294     
+ Partials        2144      2142        -2
Impacted Files Coverage Δ Complexity Δ
...ontext/writer/SortingVariantContextWriterBase.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...riantcontext/writer/AsyncVariantContextWriter.java 66.667% <100%> (+35.897%) 6 <1> (+4) ⬆️
...sjdk/variant/variantcontext/writer/BCF2Writer.java 83.974% <82.353%> (+0.082%) 34 <4> (+2) ⬆️
...tsjdk/variant/variantcontext/writer/VCFWriter.java 87.654% <88.889%> (+0.988%) 17 <3> (+2) ⬆️
...in/java/htsjdk/samtools/util/StringLineReader.java 0% <0%> (-96.154%) 0% <0%> (-10%)
src/main/java/htsjdk/samtools/SAMTextWriter.java 69.318% <0%> (-6.818%) 12% <0%> (ø)
src/main/java/htsjdk/samtools/BAMIndexer.java 76.768% <0%> (-4.953%) 17% <0%> (ø)
...c/main/java/htsjdk/tribble/index/IndexFactory.java 75.658% <0%> (-3.009%) 26% <0%> (ø)
...rc/main/java/htsjdk/samtools/util/BinaryCodec.java 67.89% <0%> (-2.671%) 55% <0%> (ø)
.../java/htsjdk/samtools/util/BufferedLineReader.java 68.421% <0%> (-2.547%) 6% <0%> (-2%)
... and 74 more

@@ -221,6 +221,11 @@ public void close() {
super.close();
}

@Override
public void setVcfHeader(VCFHeader header) {
//no-op
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a "no-op"? This doesn't seem right. The add() method in BCF2Writer will blow up without a header, since it calls vc = vc.fullyDecode(header, false);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented.

@@ -49,4 +49,6 @@
public boolean checkError();

public void add(VariantContext vc);

void setVcfHeader(VCFHeader header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadoc for the new interface method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@Override
public void setVcfHeader(VCFHeader header) {
this.underlyingWriter.setVcfHeader(header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

untested according to code cov

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the test case

@@ -132,6 +132,11 @@ public synchronized void add(VariantContext vc) {
emitSafeRecords();
}

@Override
public void setVcfHeader(VCFHeader header) {
innerWriter.setVcfHeader(header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also untested

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but don't worry about this one since SortingVariantContextWriterBase is basically ununsed.

@@ -26,10 +26,13 @@
package htsjdk.variant.variantcontext.writer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need tests to prove that the setVcfHeader() works for all concrete implementations of VariantContextWriter

Copy link
Contributor

@droazen droazen Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also test interaction between setVcfHeader() and writeHeader() (ie., test calling one and then the other for the various implementations).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Override
public void setVcfHeader(VCFHeader header) {
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField);
Copy link
Contributor

@droazen droazen Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this duplicates code in the existing writeHeader() method, would it be possible to have writeHeader() call into setVcfHeader(), or call into a shared helper method?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we should guard against the problematic case where both setHeader and writeHeader are called, in any order, using different headers. The simplest way would be to just reject either call if a header has already been established. That would ensure consistency. @amilamanoj would that work for your use case ? If so, we should include that contract in the javadoc here at the interface level, and include tests that verify that for each implementation. If not we may need to resort to something more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen I've changed it so that writeHeader() calls into setVcfHeader().
@cmnbroad Yes that should work for our case, since we don't need to change the header once it's set. When rejecting, do we still allow repeated calls of writeHeader() and setVcfHeader() (individually, without calling them both) or just simply check if header has been set once before?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj I originally thought the right thing would be for both methods to reject if the header is already set, but changing writeHeader to work that way could break existing code. So perhaps we should just make setVcfHeader enforce that, and leave writeHeader's current promiscuous behavior as is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj Actually, on giving this more thought, any code that currently calls writeHeader more than once is already on a bad path, so I would vote for rejecting in either case if a header exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Thanks for the feedback. I've also discussed this with @cyenyxe. We also think changing header should be prevented if the header (or just a part of body, in our use case) is already written to the output stream, since there's only one output stream per writer instance. But setting the header many times should not matter as long as nothing is written to the output stream yet. I've added some checks along with tests, please take a look and see if it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad @droazen @yfarjoun Added the same checks for BCF2Writer as well. I think with this commit I've addressed all the comments so far. Waiting for further feedback.

@droazen droazen self-assigned this Jun 27, 2017
@Test(dataProvider = "vcfExtensionsDataProvider")
public void testWriteAndReadVCFBody(final String extension) throws IOException {
final File fakeVCFFile = File.createTempFile("testWriteAndReadVCFBody.", extension);
fakeVCFFile.deleteOnExit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use method VariantBaseTest.createTempFile() here, which does this for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

writer.setVcfHeader(header);
writer.add(createVC(header));
writer.add(createVC(header));
writer.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a try-with-resources block rather than an explicit close() call, to ensure that the writer gets closed even if there's an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

writer.close();

final VCFCodec codec = new VCFCodec();
codec.setVCFHeader(header, VCFHeaderVersion.VCF4_0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 4.0? The writer above will write a 4.2 vcf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used 4.2.

if (".vcf.gz".equals(extension)) {
iterator = new AsciiLineReaderIterator(new AsciiLineReader(new BlockCompressedInputStream(fakeVCFFile)));
} else {
iterator = new AsciiLineReaderIterator(new AsciiLineReader(new FileInputStream(fakeVCFFile)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All input stream creations should be wrapped in a try-with-resources to ensure the streams get closed.

Copy link
Contributor

@droazen droazen Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it possible to use AbstractFeatureReader.getFeatureReader() here to get a reader on the vcf file? It will handle the block-gzipped case for you automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractFeatureReader.getFeatureReader() cannot be used here because it includes a check for header is the vcf file. Since we are reading a data-block, calling this method throws htsjdk.tribble.TribbleException$MalformedFeatureFile.

@droazen droazen requested a review from cmnbroad June 29, 2017 16:47
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj Thanks for doing this - it actually raises a few interesting issues. I did a first review pass and added a few requests, along with a couple questions. I expect it will take another round or two of review to get this finished. It would be helpful for the next round if you could squash any subsequent commits down to a single, incremental commit with just the changes based on the code review. Thanks.

@Override
public void setVcfHeader(VCFHeader header) {
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we should guard against the problematic case where both setHeader and writeHeader are called, in any order, using different headers. The simplest way would be to just reject either call if a header has already been established. That would ensure consistency. @amilamanoj would that work for your use case ? If so, we should include that contract in the javadoc here at the interface level, and include tests that verify that for each implementation. If not we may need to resort to something more complicated.

@@ -0,0 +1,253 @@
/*
* Copyright (c) 2012 The Broad Institute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we need the copyright notice, but it should probably be 2017 ?

/**
* @author amila
* <p/>
* Class VCFWriterUnitTest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't match the class name.

* <p/>
* Class VCFWriterUnitTest
* <p/>
* This class tests out the ability of the VCF writer to correctly write VCF files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for testing the BCF writer/BCF files, right ?

private Set<String> additionalColumns;

private File tempDir;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these variables be scoped locally to each method that uses them ?

@@ -0,0 +1,169 @@
/*
* Copyright (c) 2012 The Broad Institute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
public static VCFHeader createFakeHeader(final Set<VCFHeaderLine> metaData, final Set<String> additionalColumns,
final SAMSequenceDictionary sequenceDict) {
metaData.add(new VCFHeaderLine(VCFHeaderVersion.VCF4_0.getFormatString(), VCFHeaderVersion.VCF4_0.getVersionString()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be removed, or at the least use VCF4_2.

f.delete();
}
tempDir.delete();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, use deleteOnExit.

int counter = 0;
while (iterator.hasNext()) {
VariantContext context = codec.decode(iterator.next());
if (context != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Can this be removed ?

int counter = 0;
while (iterator.hasNext()) {
VariantContext context = codec.decode(iterator.next());
if (context != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this is guarding against ? If hasNet returns true I think next and decode should succeed. Can this be removed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the null check.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj I think we're getting closer. There are still a couple of logic issues/questions, and a few more minor requests. For some of the comments (marking args final, making sure all public methods have javadoc, and breaking up long tests into individual test methods), please make those changes wherever applicable - I didn't comment on every individual instance). Thanks.


// is at least one part of body written to the output stream?
private boolean isBodyWritten;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj Is there a reason that the writers need to keep two state variables to track what is essentially a single binary state (the header is frozen) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of providing two distinct error messages based on which one (header/body) is written (In our use case, for example, we don't write the header). But I can merge the two variables if it's not important.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about using an int variable for different error messages? If different to -1, there is an error, and that can be switch to different messages...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS I didn't understand how exactly you meant using an int variable. For now I merged the two variables. We can change it later if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind @amilamanoj, it was just a proposal if you really want to be more verbose about the error and what happened. Just FYI, this was what I meant:

// this integer indicates the status for writing the header down
// -1 indicates no problem to write the header
// 0 means header was already written
// 1 means body was already written
// otherwise, this is an unexpected error
private int shouldWriteHeader = -1;

@Override
public void setVCFHeader(final VCFHeader header) {
    if (shouldWriteHeader != -1) {
        final String msg;
        switch(shouldWriteHeader) {
            case 0: msg = "header already written"; break;
            case 1: msg = "body already written"; break;
            default: msg = "unexpected error code (" + shouldWriteHeader + ")";
        }
        throw new IllegalStateException("The header cannot be modified after the header or variants have been written to the output stream. Error was: " + msg);
    }

    ....

}

for ( int i = 0; i < dict.size(); i++ ) {
stringDictionaryMap.put(dict.get(i), i);
if (isHeaderWritten) {
throw new IllegalStateException("Header is already written");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block seems redundant - the next statement is a call to setVcfheader, which handles this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will remove this.


// setup the field encodings
fieldManager.setup(header, encoder, stringDictionaryMap);
setVcfHeader(header);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, even though htsjdk is not completely consistent about this, can we name these methods "setVCFHeader" instead of "setVcfHeader"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -221,6 +206,43 @@ public void close() {
super.close();
}

@Override
public void setVcfHeader(VCFHeader header) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the param should be marked final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// make sure the header is sorted correctly
header = new VCFHeader(header.getMetaDataInSortedOrder(), header.getGenotypeSamples());

this.header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should need to make a copy of the header twice here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like this? this.header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : new VCFHeader(header.getMetaDataInSortedOrder(), header.getGenotypeSamples());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj Yes, thats what I meant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Fixed.

@@ -53,4 +53,10 @@ public void writeHeader(final VCFHeader header) {
public boolean checkError() {
return false;
}

@Override
public void setVcfHeader(VCFHeader header) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code isn't consistent about this, but we generally mark arguments as final in new code. Thanks.


try {
// write out the header into a byte stream, get its length, and write everything to the file
final ByteArrayOutputStream capture = new ByteArrayOutputStream();
final OutputStreamWriter writer = new OutputStreamWriter(capture);
this.header = VCFWriter.writeHeader(header, writer, doNotWriteGenotypes, VCFWriter.getVersionLine(), "BCF2 stream");
this.header = VCFWriter.writeHeader(header, writer, VCFWriter.getVersionLine(), "BCF2 stream");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this code is using the wrong header now. Previous to this PR, this method updated the "header" argument variable (questionable practice) with the updated header, and then passed that to the writer. With these changes, it call setVcfHeader, which updates the header member variable with the updated header, but then passes the (now stale) input argument "header" to the writer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to use the member variable.

* <p/>
* Class BCF2WriterUnitTest
* <p/>
* This class tests out the ability of the VCF writer to correctly write VCF files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is still wrong - this is testing that the BCF writer works correctly.

writer2.writeHeader(header);
Assert.fail("Should not allow writing header twice");
} catch (Exception e) {
Assert.assertTrue(e instanceof IllegalStateException);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test (prevent writing the header twice) should be factored out into a separate test method that declares IllegalStateException as an expectedException in the test annotation (see example here). It eliminates the need to write a try/catch block and cleanly separates the individual tests from each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Assert.fail("Should not allow changing header if header is already written");
} catch (Exception e) {
Assert.assertTrue(e instanceof IllegalStateException);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here, and in the other test files. Each individual test should be a separate method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

…t cases, renamed setVCFHeader() method for consistency,
this.mHeader = writeHeader(header, writer, doNotWriteGenotypes, getVersionLine(), getStreamName());
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField);
setVCFHeader(header);
writeHeader(header, writer, getVersionLine(), getStreamName());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj This one also is using the stale (local) header. It should use the member variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Set<VCFHeaderLine> metaData = new HashSet<VCFHeaderLine>();
Set<String> additionalColumns = new HashSet<String>();
final SAMSequenceDictionary sequenceDict = createArtificialSequenceDictionary();
final VCFHeader header = createFakeHeader(metaData, additionalColumns, sequenceDict);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all of the calls to createFakeHeader pass in the same empty metadata and additional columns lists, and fake seq dictionary, so this can be much simplified by just removing those and returning the fake header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
private VariantContext createVC(final VCFHeader header) {

return createVCGeneral(header, "1", 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should just be collapsed into createVCGeneral (it seems like all calls are identical).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.build();
writer2.writeHeader(header);
writer2.setVCFHeader(header);
writer2.add(createVC(header));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is unnecessary for the test - the exception gets thrown in the previous stmt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

public void testChangeHeaderAfterWritingHeader() {
final File bcfOutputFile = VariantBaseTest.createTempFile("testWriteAndReadVCF.", ".bcf");

Tribble.indexFile(bcfOutputFile).deleteOnExit();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I suggested creating a temporary directory in the earlier review pass was so this kind of call would be unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// prevent changing header if it's already written
final VariantContextWriter writer2 = new VariantContextWriterBuilder()
.setOutputFile(bcfOutputFile).setReferenceDictionary(sequenceDict)
.setOptions(EnumSet.of(Options.INDEX_ON_THE_FLY))
Copy link
Collaborator

@cmnbroad cmnbroad Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both the sequence dictionary and the INDEX_ON_THE_FLY options are unnecessary for this test, and can be removed. We're just trying to test that we throw if you change the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed INDEX_ON_THE_FLY option. But sequence dictionary has to be set because INDEX_ON_THE_FLY is a default option anyway, and without setting the sequence dictionary, writer cannot be created (an exception gets thrown).

@cmnbroad
Copy link
Collaborator

@amilamanoj I added a couple more comments on the tests (most of which apply to both test files, though I only put comment on one of them), since I think they can be simplified a bit. In general we should only include the minimal code necessary for the specific test. Thanks in advance for your patience in seeing these through.

@amilamanoj
Copy link
Contributor Author

@cmnbroad @droazen I think all comments so far have been addressed. Do you have more feedback or does the PR look OK now?

@cmnbroad
Copy link
Collaborator

@amilamanoj I need to do another review pass on the updated changes. Sorry for the long turnaround time, but we have quite a backlog of PRs. I'll do it as soon as I can; hopefully it will be this week.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj There are a couple of minor things in the code itself in this round, but most of these comments are on the test code, which I think can be simplified a lot. Also, I realize I steered you wrong on the deleteOnExit comment last time - my apologies for that (see the comments).

Many of the test comments are just about reducing the test case code to the minimal set of things that are necessary for the specific test at hand; though some are stylistic. In many cases the comments apply in places other than where I mentioned them (i.e., final, try-with-resources), so please apply them wherever possible. I do expect we'll need at least one more round of review for this - thanks for your patience.

fieldManager.setup(header, encoder, stringDictionaryMap);

}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to ensure that this PR doesn't introduce any subtle changes that are not essential to the core change. There are still several places here where the stale header that was passed in is being used, which makes it different from the previous implementation (which updated the header and used the updated version after that). Its not clear to me if the differences are significant or not; but thats just all the more reason to make sure we don't change it unless its necessary.

*/
public static VCFHeader createFakeHeader(final SAMSequenceDictionary sequenceDict) {
Set<VCFHeaderLine> metaData = new HashSet<VCFHeaderLine>();
Set<String> additionalColumns = new HashSet<String>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These an be final. Also, you don't need to include the types in the constructor calls - you can just use the diamond operator ("<>") and they will be inferred. The should be removed everywhere in this file/PR.

ret.addMetaDataLine(new VCFFormatHeaderLine("BB", 1, VCFHeaderLineType.String, "x"));
ret.addMetaDataLine(new VCFFormatHeaderLine("GQ", 1, VCFHeaderLineType.String, "x"));
ret.setSequenceDictionary(sequenceDict);
return ret;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please name the header variable something like "header" instead of "ret". Also, I don't think all these column/info/format lines are adding anything relevant to this test (?). Maybe just one of each to simplify this, unless I'm missing something.

*
* @return a fake VCF header
*/
public static VCFHeader createFakeHeader(final SAMSequenceDictionary sequenceDict) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be private. Also, it looks like every single call to this method use the same fake sequenceDictionary. If so, that should all be collapsed to the minimum amount of code possible.

@Test
public void testWriteAndReadBCF() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir);
Tribble.indexFile(bcfOutputFile);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies - I think I steered you wrong on this in the last review round. This indexFile() call was previously here so that deleteOnExit could be called on the resulting index file, which we do need to do if we're creating an index, even if the file is created in the temp dir. So the deleteOnExit calls need to be restored for these index files everywhere (my bad).

However, as I mentioned before, we could get rid of this index problem (and the need for a sequence dictionary) altogether in most of these tests by turning off index creation (call ".unsetOption(Options.INDEX_ON_THE_FLY)" on the builder. It would be useful to have one separate test that does create an index, just to make sure we have test coverage for that, but then that would be the only place we have to deal with deleting the index file. Then the rest of these tests could be very minimal.

.build();
writer2.writeHeader(header);
writer2.setVCFHeader(header);
writer2.close();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try

try (FileInputStream fis = new FileInputStream(fakeVCFFile)) {
AsciiLineReaderIterator iterator;

iterator = new AsciiLineReaderIterator(new AsciiLineReader(fis));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please combine lines 96 and 98 into one line.

final VCFCodec codec = new VCFCodec();
codec.setVCFHeader(header, VCFHeaderVersion.VCF4_2);

try (FileInputStream fis = new FileInputStream(fakeVCFFile)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

iterator = new AsciiLineReaderIterator(new AsciiLineReader(bcis));
} else {
iterator = new AsciiLineReaderIterator(new AsciiLineReader(fis));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ternary operator cold make this all much cleaner and easier to read:

AsciiLineReaderIterator iterator =
    new AsciiLineReaderIterator(new AsciiLineReader(".vcf.gz".equals(extension) ? bcis : fis));




}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of all of the blank lines?

@amilamanoj
Copy link
Contributor Author

@cmnbroad Thanks for the feedback. I've addressed your comments.

@amilamanoj
Copy link
Contributor Author

@cmnbroad I'm waiting for any further feedback.

@cmnbroad
Copy link
Collaborator

@amilamanoj I haven't forgotten that this needs another review pass, and will get to it as soon as possible. Realistically, that will not be until next week though.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Sep 5, 2017

This still needs a final review. In addition to enabling the original motivating (htsget) use case, it would also allow us to get rid of this Hadoop-BAM hack.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amilamanoj Thanks for doing this, I have a few nitpicky comments and then this looks good to merge 👍

@@ -49,4 +55,13 @@
public boolean checkError();

public void add(VariantContext vc);

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the javadoc should mention that you should call exactly one of writeHeader() / setHeader()

@@ -122,6 +122,10 @@
private VCFHeader lastVCFHeaderOfUnparsedGenotypes = null;
private boolean canPassOnUnparsedGenotypeDataForLastVCFHeader = false;

// is the header or body written to the output stream?
private boolean isWrittenToOutput;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name outputHasBeenWritten would be more clear. Can you change this in all places where this variable has been added?

*/
@Test
public void testWriteAndReadBCF() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you mark this for deletion?

*/
@Test
public void testWriteAndReadBCFWithIndex() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with this

*/
@Test
public void testWriteAndReadBCFHeaderless() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadBCFWithHeader.", ".bcf", tempDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with these


@Test(expectedExceptions = IllegalStateException.class)
public void testChangeHeaderAfterWritingHeader() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here


@Test(expectedExceptions = IllegalStateException.class)
public void testChangeHeaderAfterWritingBody() throws IOException {
final File bcfOutputFile = File.createTempFile("testWriteAndReadVCF.", ".bcf", tempDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here


final VCFHeader header = createFakeHeader();
// prevent writing header twice
try (final VariantContextWriter writer1 = new VariantContextWriterBuilder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop the number from these writers 1, 2, and 3

/** test, using the writer and reader, that we can output and input a VCF body without problems */
@Test
public void testWriteAndReadAsyncVCFHeaderless() throws IOException {
final File fakeVCFFile = VariantBaseTest.createTempFile("testWriteAndReadAsyncVCFHeaderless.", ".vcf");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, sadly all the test files have to be manually marked as deleteOnExit()

@@ -77,28 +80,21 @@
@BeforeClass
private void createTemporaryDirectory() {
tempDir = TestUtil.getTempDirectory("VCFWriter", "StaleIndex");
tempDir.deleteOnExit();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this pattern works. The older deleteTemporaryDirectory method that was called previously seems more likely to work. I would just mark all of the temporary files as deleteOnExit()

@amilamanoj
Copy link
Contributor Author

@lbergelson Thanks for the feedback. I've addressed your comments.

@lbergelson
Copy link
Member

@amilamanoj Thank you for sticking through with this! This is a useful change, I'm not sure how it became so complicated with so many reviewers.

@lbergelson lbergelson merged commit 9991e55 into samtools:master Sep 12, 2017
@lbergelson
Copy link
Member

@tomwhite This is now merged into htjskd and we may want to update hadoop bam to take advantage of it. Although that will probably require a release and a coordinated change with GATK since this breaks our GVCF writer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants