-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support VCF v4.3 (read only). #1359
Conversation
5a4e48b
to
a9dae56
Compare
Codecov Report
@@ Coverage Diff @@
## master #1359 +/- ##
===============================================
+ Coverage 67.863% 67.959% +0.096%
- Complexity 8289 8326 +37
===============================================
Files 563 570 +7
Lines 33718 33797 +79
Branches 5659 5660 +1
===============================================
+ Hits 22882 22968 +86
+ Misses 8656 8649 -7
Partials 2180 2180
|
4ed4850
to
d28da51
Compare
public void testVCF43ReadUTF8Attributes() { | ||
final Tuple<VCFHeader, List<VariantContext>> entireVCF = VariantContextTestUtils.readEntireVCFIntoMemory( | ||
Paths.get(TEST_PATH + TEST_43_UTF8_FILE)); | ||
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a,"COMMENT"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whitespace
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a,"COMMENT"); | |
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a, "COMMENT"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* needs to be done. | ||
*/ | ||
public class VCFTextTransformer { | ||
final static private String ENCODING_SENTNEL_STRING = "%"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
final static private String ENCODING_SENTNEL_STRING = "%"; | |
final static private String ENCODING_SENTINEL_STRING = "%"; |
@@ -44,6 +44,7 @@ | |||
mapping.put(VCFHeaderVersion.VCF4_0,new VCF4Parser()); | |||
mapping.put(VCFHeaderVersion.VCF4_1,new VCF4Parser()); | |||
mapping.put(VCFHeaderVersion.VCF4_2,new VCF4Parser()); | |||
mapping.put(VCFHeaderVersion.VCF4_3,new VCF4Parser()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mapping.put(VCFHeaderVersion.VCF4_3,new VCF4Parser()); | |
mapping.put(VCFHeaderVersion.VCF4_3, new VCF4Parser()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Looks sane to me. I have some worries about baking utf-8 characters into our source files, we've seen problems with them before. I think we should try to set the source file encoding properly so we no longer have issues around that before we include them.
@@ -146,6 +167,52 @@ public VCFHeader(final Set<VCFHeaderLine> metaData, final List<String> genotypeS | |||
buildVCFReaderMaps(genotypeSampleNames); | |||
} | |||
|
|||
/** | |||
* Establish the header version for this header. If the header version has already been established | |||
* for this header, the new version will be subject to version transition vaidation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* for this header, the new version will be subject to version transition vaidation. | |
* for this header, the new version will be subject to version transition validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* Throw if {@code fromVersion} is not compatible with a {@code toVersion}. Generally, any version before | ||
* version 4.2 can be up-converted to version 4.2, but not to version 4.3. Once a header is established as | ||
* version 4.3, it cannot be up or down converted, and it must remain at version 4.3. | ||
* @param fromVersion version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe mention something about the behavior of null fromVersion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (fromVersion != null) { | ||
if (toVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3)) { | ||
if (!fromVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3)) { | ||
// we're trying to go from pre-v43 to v43+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd write it v4.3 for slightly more clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
map.put(VCFHeaderVersion.VCF4_3, new VCF4Parser()); | ||
map.put(VCFHeaderVersion.VCF3_3, new VCF3Parser()); | ||
map.put(VCFHeaderVersion.VCF3_2, new VCF3Parser()); | ||
mapping = Collections.unmodifiableMap(map); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
import java.util.ArrayList; | ||
import java.util.List; | ||
|
||
public class VariantContextTestUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding this new class I would just push this down into VariantBaseTest. Unless you think it's generally useful for downstream users.. in which case we should make this a final class with a private constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
private static final long serialVersionUID = 1L; | ||
|
||
public VCFSampleHeaderLine(String line, VCFHeaderVersion version) { | ||
super(VCFConstants.SAMPLE_HEADER_KEY, new VCF4Parser().parseLine(line, null)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VCF4Parser is stateless so maybe there should be a static instance of it instead of creating new ones all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, VCFHeaderLineTranslator already keeps static ones. Done.
@@ -114,10 +115,12 @@ private static String clean(String s) { | |||
*/ | |||
public boolean isAtLeastAsRecentAs(final VCFHeaderVersion target) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I being dumb, or can this method be replaced by a >= sign on the ordinal value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the enum is kept in order, yes. That seems less error prone too, as I almost missed making this change at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, with tests to keep it from regressing.
* @return a string with all encoded characters replaced with the corresponding character | ||
* @throws TribbleException if the the encoding is uninterpretable | ||
*/ | ||
protected static String decodePercentEncodedChars(final String rawText) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not totally clear to me if general percent encoding is allowed by the vcf 4.3 spec or only percent encoding of specific characters.
using the capitalized percent encoding:
%3A : (colon)
%3B ; (semicolon)
%3D = (equal sign)
%25 % (percent sign)
%2C , (comma)
%0D CR
%0A LF
%09 TAB
* @return the decoded string | ||
* @throws TribbleException if the the encoding is uninterpretable | ||
*/ | ||
public String transformEncodedText(final String rawPart) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be more clearly named decode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/* | ||
************************************************************************************************** | ||
* IMPORTANT NOTE: this class contains string constants that contain embedded non-ASCII characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that this will cause problems similar to broadinstitute/gatk#5934. I think we might want to figure out how to fix the encoding in the build.gradle so it doesn't use system specific encodings during build.
I think we can probably fix it with something like
tasks.withType(JavaCompile) {
options.encoding = "UTF-8"
}
tasks.withType(Javadoc) {
options.addStringOption('encoding', 'UTF-8')
}
added to build.gradle but I haven't reproduced the issue locally to test yet.
Back to you @lbergelson. I relaxed the percent encoding to be more permissive and only transfor encodings if the |
@lbergelson Waiting for a final 👍 from you on this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed I had some comments from a review I didn't post. Here they are, up to you if you want to do anything with them.
@@ -258,10 +260,21 @@ public void add(final VariantContext context) { | |||
|
|||
@Override | |||
public void setHeader(final VCFHeader header) { | |||
rejectVCFV43Headers(header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why check here as well as in write
? Is it illegal to create a VCFWriter
in this state, even if it's never used to write a VCF?
if (outputHasBeenWritten) { | ||
throw new IllegalStateException("The header cannot be modified after the header or variants have been written to the output stream."); | ||
} | ||
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header; | ||
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField); | ||
} | ||
|
||
// writing vcf v4.3 is not implemented | ||
private static void rejectVCFV43Headers(final VCFHeader targetHeader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of adding future checks, how about calling this ensureValidToWrite()
or something like that? Personally, I use 'ensure' for (non-testing) methods that assert a state and throw an error if the state isn't valid. 'check' is for methods that assert but return a result code instead of throwing. Then the specific comment about checking for 4.3 can be moved below, to the if
.
|
||
public static final String PEDIGREE_HEADER_KEY = "PEDIGREE"; | ||
public static final String PEDIGREE_HEADER_START = VCFHeader.METADATA_INDICATOR + PEDIGREE_HEADER_KEY; | ||
public static final int PEDIGREE_HEADER_OFFSET = PEDIGREE_HEADER_START.length() + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for not continuing the use of magic numbers here!
* @param targetVersion the version for which a transformer is bing requested | ||
* @return a {@link VCFTextTransformer} suitable for the targetVersion | ||
*/ | ||
private static VCFTextTransformer getTextTransformerForVCFVersion(final VCFHeaderVersion targetVersion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is not overridable, what about moving it to VCFTextTransformer
and make it a factory method, e.g. VCFTextTransformer.fromVersion()
.
Another possible place for this is VCFHeaderVersion
, e.g. VCFHeaderVersion.createTextTransformer()
.
Read-only support for VCF v4.3:
VCFSimpleHeaderLine
-derivedVCAltHeaderLine
class (previously these were modeled directly withVCFSimpleHeaderLine
objects). SAMPLE, META and PEDIGREE header lines are also modeled withVCFSimpleHeaderLine
-derived classes (previously these would have been tolerated on read, even for v4.x files, and modeled asVCFHeaderLine
). VCFHeaderLine does not allow attribute queries.VCFHeaderVersion
field to VCF header; once a VCFHeader is established as v4.3, it cannot be transitioned to any other version.