Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCFHeader, VCFCodec and VCFHeaderLine refactoring for VCF4.3/BCF2.2 #835

Closed
wants to merge 1 commit into from

Conversation

cmnbroad
Copy link
Collaborator

This is the first of 3 PRs for adding support for VCF4.3 and BCF 2.2; the first PR is mostly a refactoring of VCFHeader and the VCFHeaderLine class hierarchy in order to fix a bunch of bugs, and to provide the foundation for supporting new VCF versions.

There are a number of TODO comments embedded in these changes that are there either to:

  • underscore differences from prior behavior
  • ask a question about how some case should be handled
  • provide an explanation to help the reviewer understand the changes
  • note other changes that need to be fixed independent of this PR.

Fixes:

Changes to VCFHeader:

  • Always maintains a VCFHeaderVersion, and validates any header lines against the validation rules for that version.
  • Delegates management of the header line list to the VCFMetaDataLines class, which uses a single LinkedHashMap of header lines. The VCFMetaDataLines class is also used to merge headers.
  • No longer removes/restores file format lines from the header line list.

Changes to the VCFHeaderLine hierarchy:

  • All structured header lines classes maintain and round-trips all tags
  • All header lines are now immutable
  • All VCFHeaderLine classes have a validateForVersion method that validates the line against a particular version (this is called by VCFHeader when the line is added to a header, or when a header version is changed)
  • Replaced VCFSimpleHeaderLine/VCFIDHeaderLine with VCFStructuredHeaderLine, which is the base class for all “ID” header lines. Its no longer necessary to use “instanceof” tests to determine header line type. The VCFHeaderLine base class now has an “isStructuredHeaderLine” and a “getID” method; getID is valid iff isStructuredHeaderLine returns true.

VcfUtils.smartMergeHeaders delegates to VCFHeader, which in turn sometimes delegates to VCFHeaderLine classes to resolve conflicts when merging headers

Non-backward compatible API changes:

  • Removed VCFSimpleHeaderLine class, VCFIDHeaderLine interface
  • Removed the SupportedLineType enum from VCFCompoundHeaderLine
  • Removed public sameLineTypeAndName and equalExcludingDescription methods from VCFCompoundHeaderLine
  • Removed the VCFContigHeaderLine constructor that allowed the caller to specify a key
  • Also deprecated VCFUtils.makeContigHeaderLine

Notes: These changes have been run against the Picard and GATK test suites. All of the tests pass, though some require minor changes; these are mostly tests that do line-by-line comparisons against expected results files that have duplicate ID header lines. These now fail because the new implementation no longer retains/roundtrips the duplicate lines. One tool (LiftOverVcf) requires a minor code change because the version validation framework when it attempts to set a header version on a header that contains a conflicting file format line (which would previously have been stripped out by VCFHeader).

Outstanding question: There are several places where code nw either logs or throws on input that is not conformant for the version; was previously silently dropped; was previously silently retained when it shouldn't have been, or has correct behavior that differs from the previous implementation. For now, the throwing/logging is based on “strict reporting” mode setting that is controlled by a global variable. We need to determine the right level and mechanism for opt-out/opt-in should be for these cases.

@codecov-io
Copy link

codecov-io commented Mar 29, 2017

Codecov Report

Merging #835 into master will increase coverage by 0.13%.
The diff coverage is 76.281%.

@@              Coverage Diff               @@
##              master      #835      +/-   ##
==============================================
+ Coverage     64.882%   65.011%   +0.13%     
- Complexity      7198      7308     +110     
==============================================
  Files            527       533       +6     
  Lines          31781     32056     +275     
  Branches        5424      5442      +18     
==============================================
+ Hits           20620     20840     +220     
- Misses          9016      9060      +44     
- Partials        2145      2156      +11
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/variant/bcf2/BCF2Codec.java 74.847% <ø> (ø) 31 <0> (ø) ⬇️
...c/main/java/htsjdk/variant/vcf/VCFRecordCodec.java 0% <0%> (ø) 0 <0> (ø) ⬇️
src/main/java/htsjdk/variant/bcf2/BCF2Utils.java 71.845% <100%> (-2.194%) 47 <0> (-1)
...n/java/htsjdk/variant/vcf/VCFFormatHeaderLine.java 100% <100%> (+30%) 7 <6> (+3) ⬆️
...ava/htsjdk/variant/vcf/VCFStandardHeaderLines.java 76.389% <100%> (+0.333%) 11 <5> (ø) ⬇️
src/main/java/htsjdk/variant/vcf/VCFConstants.java 91.667% <100%> (+91.667%) 1 <0> (+1) ⬆️
src/main/java/htsjdk/variant/vcf/VCFCodec.java 23.077% <23.077%> (-42.832%) 5 <5> (-8)
src/main/java/htsjdk/utils/Utils.java 50% <50%> (ø) 3 <3> (?)
src/main/java/htsjdk/variant/vcf/VCF3Codec.java 40% <50%> (-13.659%) 4 <2> (-3)
src/main/java/htsjdk/variant/vcf/VCFUtils.java 11.905% <55.556%> (-12.801%) 4 <4> (-2)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3a75d02...1efa91e. Read the comment docs.

@cmnbroad cmnbroad force-pushed the cn_vcf_43 branch 5 times, most recently from 7cf14e6 to 9408a7a Compare April 6, 2017 19:45
@cmnbroad
Copy link
Collaborator Author

This branch also fixes #934.

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments (not in-depth reviewed)

@@ -150,6 +150,10 @@ public FeatureCodecHeader readHeader( final PositionalBufferedStream inputStream

if ( bcfVersion.getMajorVersion() != ALLOWED_MAJOR_VERSION )
error("BCF2Codec can only process BCF2 files, this file has major version " + bcfVersion.getMajorVersion());

// TODO: fixing this breaks GATK GenomicsDB integration/tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't like that a block for fixing a bug is based on a downstream project. I will rather vote for fixing it here and fix the tests in GATK when it is updated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It needs to be fixed, but that will go into the next PR. This is one several places in this PR where I added notes like this for things like this.

@@ -206,6 +210,10 @@ public FeatureCodecHeader readHeader( final PositionalBufferedStream inputStream

@Override
public boolean canDecode( final String path ) {
// TODO: this is broken in a couple of ways:
// First, the version check is too permissive - it accepts any minor version, including BCF 2.2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the version check, I suggest to extract the checking code in readHeader to a private method returning an Optional<String> with the error message/code; if isPresent() returns true, cannot be decode. In the case of the readHeader(), if the optional is present it is thrown with the error message.

// TODO: this is broken in a couple of ways:
// First, the version check is too permissive - it accepts any minor version, including BCF 2.2,
// which it shouldn't. Second, it doesn't recognize that BCF can be block gzipped, so it rejects
// those files because the header never matches, but only because the stream isn't decompressed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A block-gzipped file cannot be read by using any of the codecs implemented in htsjdk unless it has a gz or bgz extension, even if this method returns true for them. Maybe AbstractFeatureReader should use BlockCompressedInputStream.isValidFile() to open properly this kind of files and not base the code on the extension...But I guess that it is part of other PR.

@@ -141,7 +146,7 @@ public void writeHeader(final VCFHeader header) {
}

public static String getVersionLine() {
return VERSION_LINE;
return DEFAULT_VERSION_LINE;
}

public static VCFHeader writeHeader(VCFHeader header,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you are already here, what's about adding the writeHeader method to the encoder to have a better encapsulation? This will fit the encoder interface that I'm working in my tribble-writing PR (#822)...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already a lot of changes in this PR, so I'd prefer to keep this focused for now.
(I like the the general idea of #822, but that one also will need some discussion).

}
}
if (VCFUtils.getVerboseVCFLogging()) {
logger.warn(message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning does not says anything about the compatibility checked in the previous if clause. Maybe it will be nice to add this to the warning to do not scare users writing compatible versions...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure I get what you're saying. Are you saying to only log this message if the versions are incompatible ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that the message could include information about the compatibility...

public enum VCFHeaderLineCount {
INTEGER, A, R, G, UNBOUNDED;

// A default int value used to represent an integral count value (not a count *type*) when the
// actual count is derived and not a fixed integer (i.e., when isFixedCount()==false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make this a javadoc and not a comment, because it is public.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point.

import java.util.*;
import java.util.stream.Collectors;

//TODO: should this class be public so consumers can use in place of Set<VCFHeaderLine>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is any other change planned for the next PRs? If not, maybe it is worthy to make it public in this PR to make people familiar with the API from the beginning...

// Allow duplicate non-ID keys, unless fileformat, reference ?

// TODO: Should we reject attempts to add two contig header lines with the same contigIndex ?
// TODO: GATK VcfUtilsUnitTest.createHeaderLines test creates headers with contig lines with identical (0) indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think that it is not a good idea to block htsjdk development for failing tests in downstream projects...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, these are just notes for myself, and for discussion. The GATK comment is only there to reinforce the idea that this would be a breaking change.

// TODO: duplicate ID lines, which will have to be modified in order for the tests to pass.
//
// TODO: Is it sufficient to log a warning for this case (duplicate key) ? Should we add code
// TODO: here to throw only for v4.3+ ? Only if VCFUtils.getStrictVCFVersionValidation is set ? Or both ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading through the whole PR, I start to think that VCF requires a ValidationStringency enum similar to SAM/BAM parsing to use in different parts, with STRICT value by default (maybe SILENT until the 3 PRs are in to keep the actual behaviour).


//TODO: Is this useful ? It returns the first match for the given key, no matter how many
//TODO: there are. Should we deprecate it (and add a new one that returns a collection)
//TODO: or just change this one to return a collection ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for a breaking change to return a Collection, because anyway this PR will introduce lots of backwards incompatible changes...

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 4, 2019

How about we close this PR and move the code to htsjdk3?

@yfarjoun
Copy link
Contributor

Closing due complexity.

author will work with htsjdk-next-beta to include these great features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants