-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCFHeader, VCFCodec and VCFHeaderLine refactoring for VCF4.3/BCF2.2 #835
Conversation
Codecov Report
@@ Coverage Diff @@
## master #835 +/- ##
==============================================
+ Coverage 64.882% 65.011% +0.13%
- Complexity 7198 7308 +110
==============================================
Files 527 533 +6
Lines 31781 32056 +275
Branches 5424 5442 +18
==============================================
+ Hits 20620 20840 +220
- Misses 9016 9060 +44
- Partials 2145 2156 +11
Continue to review full report at Codecov.
|
7cf14e6
to
9408a7a
Compare
…or VCF4.3/BCF2.2 and bug fixes.
This branch also fixes #934. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments (not in-depth reviewed)
@@ -150,6 +150,10 @@ public FeatureCodecHeader readHeader( final PositionalBufferedStream inputStream | |||
|
|||
if ( bcfVersion.getMajorVersion() != ALLOWED_MAJOR_VERSION ) | |||
error("BCF2Codec can only process BCF2 files, this file has major version " + bcfVersion.getMajorVersion()); | |||
|
|||
// TODO: fixing this breaks GATK GenomicsDB integration/tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't like that a block for fixing a bug is based on a downstream project. I will rather vote for fixing it here and fix the tests in GATK when it is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It needs to be fixed, but that will go into the next PR. This is one several places in this PR where I added notes like this for things like this.
@@ -206,6 +210,10 @@ public FeatureCodecHeader readHeader( final PositionalBufferedStream inputStream | |||
|
|||
@Override | |||
public boolean canDecode( final String path ) { | |||
// TODO: this is broken in a couple of ways: | |||
// First, the version check is too permissive - it accepts any minor version, including BCF 2.2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the version check, I suggest to extract the checking code in readHeader
to a private method returning an Optional<String>
with the error message/code; if isPresent()
returns true, cannot be decode. In the case of the readHeader()
, if the optional is present it is thrown with the error message.
// TODO: this is broken in a couple of ways: | ||
// First, the version check is too permissive - it accepts any minor version, including BCF 2.2, | ||
// which it shouldn't. Second, it doesn't recognize that BCF can be block gzipped, so it rejects | ||
// those files because the header never matches, but only because the stream isn't decompressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A block-gzipped file cannot be read by using any of the codecs implemented in htsjdk unless it has a gz
or bgz
extension, even if this method returns true for them. Maybe AbstractFeatureReader
should use BlockCompressedInputStream.isValidFile()
to open properly this kind of files and not base the code on the extension...But I guess that it is part of other PR.
@@ -141,7 +146,7 @@ public void writeHeader(final VCFHeader header) { | |||
} | |||
|
|||
public static String getVersionLine() { | |||
return VERSION_LINE; | |||
return DEFAULT_VERSION_LINE; | |||
} | |||
|
|||
public static VCFHeader writeHeader(VCFHeader header, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you are already here, what's about adding the writeHeader
method to the encoder to have a better encapsulation? This will fit the encoder interface that I'm working in my tribble-writing PR (#822)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are already a lot of changes in this PR, so I'd prefer to keep this focused for now.
(I like the the general idea of #822, but that one also will need some discussion).
} | ||
} | ||
if (VCFUtils.getVerboseVCFLogging()) { | ||
logger.warn(message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning does not says anything about the compatibility checked in the previous if clause. Maybe it will be nice to add this to the warning to do not scare users writing compatible versions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure I get what you're saying. Are you saying to only log this message if the versions are incompatible ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that the message could include information about the compatibility...
public enum VCFHeaderLineCount { | ||
INTEGER, A, R, G, UNBOUNDED; | ||
|
||
// A default int value used to represent an integral count value (not a count *type*) when the | ||
// actual count is derived and not a fixed integer (i.e., when isFixedCount()==false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make this a javadoc and not a comment, because it is public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point.
import java.util.*; | ||
import java.util.stream.Collectors; | ||
|
||
//TODO: should this class be public so consumers can use in place of Set<VCFHeaderLine> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is any other change planned for the next PRs? If not, maybe it is worthy to make it public in this PR to make people familiar with the API from the beginning...
// Allow duplicate non-ID keys, unless fileformat, reference ? | ||
|
||
// TODO: Should we reject attempts to add two contig header lines with the same contigIndex ? | ||
// TODO: GATK VcfUtilsUnitTest.createHeaderLines test creates headers with contig lines with identical (0) indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think that it is not a good idea to block htsjdk development for failing tests in downstream projects...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, these are just notes for myself, and for discussion. The GATK comment is only there to reinforce the idea that this would be a breaking change.
// TODO: duplicate ID lines, which will have to be modified in order for the tests to pass. | ||
// | ||
// TODO: Is it sufficient to log a warning for this case (duplicate key) ? Should we add code | ||
// TODO: here to throw only for v4.3+ ? Only if VCFUtils.getStrictVCFVersionValidation is set ? Or both ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading through the whole PR, I start to think that VCF requires a ValidationStringency
enum similar to SAM/BAM parsing to use in different parts, with STRICT value by default (maybe SILENT until the 3 PRs are in to keep the actual behaviour).
|
||
//TODO: Is this useful ? It returns the first match for the given key, no matter how many | ||
//TODO: there are. Should we deprecate it (and add a new one that returns a collection) | ||
//TODO: or just change this one to return a collection ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for a breaking change to return a Collection, because anyway this PR will introduce lots of backwards incompatible changes...
How about we close this PR and move the code to htsjdk3? |
Closing due complexity. author will work with htsjdk-next-beta to include these great features. |
This is the first of 3 PRs for adding support for VCF4.3 and BCF 2.2; the first PR is mostly a refactoring of VCFHeader and the VCFHeaderLine class hierarchy in order to fix a bunch of bugs, and to provide the foundation for supporting new VCF versions.
There are a number of TODO comments embedded in these changes that are there either to:
Fixes:
Changes to VCFHeader:
Changes to the VCFHeaderLine hierarchy:
VcfUtils.smartMergeHeaders delegates to VCFHeader, which in turn sometimes delegates to VCFHeaderLine classes to resolve conflicts when merging headers
Non-backward compatible API changes:
Notes: These changes have been run against the Picard and GATK test suites. All of the tests pass, though some require minor changes; these are mostly tests that do line-by-line comparisons against expected results files that have duplicate ID header lines. These now fail because the new implementation no longer retains/roundtrips the duplicate lines. One tool (LiftOverVcf) requires a minor code change because the version validation framework when it attempts to set a header version on a header that contains a conflicting file format line (which would previously have been stripped out by VCFHeader).
Outstanding question: There are several places where code nw either logs or throws on input that is not conformant for the version; was previously silently dropped; was previously silently retained when it shouldn't have been, or has correct behavior that differs from the previous implementation. For now, the throwing/logging is based on “strict reporting” mode setting that is controlled by a global variable. We need to determine the right level and mechanism for opt-out/opt-in should be for these cases.