Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support VCF v4.3 (read only). #1359

Merged
merged 5 commits into from
May 31, 2019
Merged

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Apr 30, 2019

Read-only support for VCF v4.3:

  • ALT header lines are now modeled with a new VCFSimpleHeaderLine-derived VCAltHeaderLine class (previously these were modeled directly with VCFSimpleHeaderLine objects). SAMPLE, META and PEDIGREE header lines are also modeled with VCFSimpleHeaderLine-derived classes (previously these would have been tolerated on read, even for v4.x files, and modeled as VCFHeaderLine). VCFHeaderLine does not allow attribute queries.
  • Supports reading files with UTF8 character encoding (v4.3).
  • Supports percent-encoded values (v43).
  • Adds a VCFHeaderVersion field to VCF header; once a VCFHeader is established as v4.3, it cannot be transitioned to any other version.
  • Merging a v4.3 header with other headers is only allowed if all headers being merged are v4.3.
  • VCFWriter throws on any attempt to write out a version v43 header.
  • Except for the class change of ALT header lines, these changes should be invisible to existing callers.

@cmnbroad cmnbroad force-pushed the cn_vcf43_scratch branch 2 times, most recently from 5a4e48b to a9dae56 Compare April 30, 2019 23:03
@codecov-io
Copy link

codecov-io commented Apr 30, 2019

Codecov Report

Merging #1359 into master will increase coverage by 0.096%.
The diff coverage is 94.161%.

@@               Coverage Diff               @@
##              master     #1359       +/-   ##
===============================================
+ Coverage     67.863%   67.959%   +0.096%     
- Complexity      8289      8326       +37     
===============================================
  Files            563       570        +7     
  Lines          33718     33797       +79     
  Branches        5659      5660        +1     
===============================================
+ Hits           22882     22968       +86     
+ Misses          8656      8649        -7     
  Partials        2180      2180
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/variant/vcf/VCFEncoder.java 86.935% <ø> (ø) 72 <0> (ø) ⬇️
src/main/java/htsjdk/variant/vcf/VCFCodec.java 77.273% <0%> (ø) 17 <0> (+1) ⬆️
...n/java/htsjdk/variant/vcf/VCFSampleHeaderLine.java 100% <100%> (ø) 1 <1> (?)
src/main/java/htsjdk/variant/vcf/VCFHeader.java 90.058% <100%> (+1.243%) 76 <8> (+10) ⬆️
...main/java/htsjdk/variant/vcf/VCFHeaderVersion.java 71.875% <100%> (+7.986%) 13 <2> (-5) ⬇️
...va/htsjdk/variant/vcf/VCFHeaderLineTranslator.java 93.976% <100%> (+1.383%) 3 <1> (ø) ⬇️
src/main/java/htsjdk/variant/vcf/VCFConstants.java 87.5% <100%> (+87.5%) 1 <0> (+1) ⬆️
src/main/java/htsjdk/variant/vcf/VCFUtils.java 44.444% <100%> (+6.391%) 17 <6> (+6) ⬆️
...ava/htsjdk/variant/vcf/VCFStandardHeaderLines.java 77.632% <100%> (+1.243%) 13 <6> (+2) ⬆️
...main/java/htsjdk/variant/vcf/VCFAltHeaderLine.java 100% <100%> (ø) 2 <2> (?)
... and 30 more

@cmnbroad cmnbroad force-pushed the cn_vcf43_scratch branch 4 times, most recently from 4ed4850 to d28da51 Compare May 6, 2019 14:36
public void testVCF43ReadUTF8Attributes() {
final Tuple<VCFHeader, List<VariantContext>> entireVCF = VariantContextTestUtils.readEntireVCFIntoMemory(
Paths.get(TEST_PATH + TEST_43_UTF8_FILE));
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a,"COMMENT");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace

Suggested change
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a,"COMMENT");
final List<VCFHeaderLine> headerLines = getIDHeaderLinesWithKey(entireVCF.a, "COMMENT");

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/main/java/htsjdk/variant/vcf/VCFTextTransformer.java Outdated Show resolved Hide resolved
src/main/java/htsjdk/variant/vcf/VCFTextTransformer.java Outdated Show resolved Hide resolved
* needs to be done.
*/
public class VCFTextTransformer {
final static private String ENCODING_SENTNEL_STRING = "%";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Suggested change
final static private String ENCODING_SENTNEL_STRING = "%";
final static private String ENCODING_SENTINEL_STRING = "%";

src/main/java/htsjdk/variant/vcf/VCFHeaderVersion.java Outdated Show resolved Hide resolved
@@ -44,6 +44,7 @@
mapping.put(VCFHeaderVersion.VCF4_0,new VCF4Parser());
mapping.put(VCFHeaderVersion.VCF4_1,new VCF4Parser());
mapping.put(VCFHeaderVersion.VCF4_2,new VCF4Parser());
mapping.put(VCFHeaderVersion.VCF4_3,new VCF4Parser());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mapping.put(VCFHeaderVersion.VCF4_3,new VCF4Parser());
mapping.put(VCFHeaderVersion.VCF4_3, new VCF4Parser());

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Looks sane to me. I have some worries about baking utf-8 characters into our source files, we've seen problems with them before. I think we should try to set the source file encoding properly so we no longer have issues around that before we include them.

@@ -146,6 +167,52 @@ public VCFHeader(final Set<VCFHeaderLine> metaData, final List<String> genotypeS
buildVCFReaderMaps(genotypeSampleNames);
}

/**
* Establish the header version for this header. If the header version has already been established
* for this header, the new version will be subject to version transition vaidation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* for this header, the new version will be subject to version transition vaidation.
* for this header, the new version will be subject to version transition validation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Throw if {@code fromVersion} is not compatible with a {@code toVersion}. Generally, any version before
* version 4.2 can be up-converted to version 4.2, but not to version 4.3. Once a header is established as
* version 4.3, it cannot be up or down converted, and it must remain at version 4.3.
* @param fromVersion version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mention something about the behavior of null fromVersion here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (fromVersion != null) {
if (toVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3)) {
if (!fromVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3)) {
// we're trying to go from pre-v43 to v43+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd write it v4.3 for slightly more clarity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

map.put(VCFHeaderVersion.VCF4_3, new VCF4Parser());
map.put(VCFHeaderVersion.VCF3_3, new VCF3Parser());
map.put(VCFHeaderVersion.VCF3_2, new VCF3Parser());
mapping = Collections.unmodifiableMap(map);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

import java.util.ArrayList;
import java.util.List;

public class VariantContextTestUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this new class I would just push this down into VariantBaseTest. Unless you think it's generally useful for downstream users.. in which case we should make this a final class with a private constructor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

private static final long serialVersionUID = 1L;

public VCFSampleHeaderLine(String line, VCFHeaderVersion version) {
super(VCFConstants.SAMPLE_HEADER_KEY, new VCF4Parser().parseLine(line, null));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VCF4Parser is stateless so maybe there should be a static instance of it instead of creating new ones all over the place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, VCFHeaderLineTranslator already keeps static ones. Done.

@@ -114,10 +115,12 @@ private static String clean(String s) {
*/
public boolean isAtLeastAsRecentAs(final VCFHeaderVersion target) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I being dumb, or can this method be replaced by a >= sign on the ordinal value?

Copy link
Collaborator Author

@cmnbroad cmnbroad May 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the enum is kept in order, yes. That seems less error prone too, as I almost missed making this change at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, with tests to keep it from regressing.

* @return a string with all encoded characters replaced with the corresponding character
* @throws TribbleException if the the encoding is uninterpretable
*/
protected static String decodePercentEncodedChars(final String rawText) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not totally clear to me if general percent encoding is allowed by the vcf 4.3 spec or only percent encoding of specific characters.

using the capitalized percent encoding:
%3A : (colon)
%3B ; (semicolon)
%3D = (equal sign)
%25 % (percent sign)
%2C , (comma)
%0D CR
%0A LF
%09 TAB

* @return the decoded string
* @throws TribbleException if the the encoding is uninterpretable
*/
public String transformEncodedText(final String rawPart) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be more clearly named decode?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/*
**************************************************************************************************
* IMPORTANT NOTE: this class contains string constants that contain embedded non-ASCII characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this will cause problems similar to broadinstitute/gatk#5934. I think we might want to figure out how to fix the encoding in the build.gradle so it doesn't use system specific encodings during build.

I think we can probably fix it with something like

 tasks.withType(JavaCompile) {
   options.encoding = "UTF-8"
 }

 tasks.withType(Javadoc) {
    options.addStringOption('encoding', 'UTF-8')
 }

added to build.gradle but I haven't reproduced the issue locally to test yet.

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented May 23, 2019

Back to you @lbergelson. I relaxed the percent encoding to be more permissive and only transfor encodings if the % is followed by two legitimate hex-digit values, otherwise the raw characters are passed through untranslated.

@droazen
Copy link
Contributor

droazen commented May 30, 2019

@lbergelson Waiting for a final 👍 from you on this one.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lbergelson lbergelson merged commit d5ac863 into samtools:master May 31, 2019
Copy link
Contributor

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed I had some comments from a review I didn't post. Here they are, up to you if you want to do anything with them.

@@ -258,10 +260,21 @@ public void add(final VariantContext context) {

@Override
public void setHeader(final VCFHeader header) {
rejectVCFV43Headers(header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why check here as well as in write? Is it illegal to create a VCFWriter in this state, even if it's never used to write a VCF?

if (outputHasBeenWritten) {
throw new IllegalStateException("The header cannot be modified after the header or variants have been written to the output stream.");
}
this.mHeader = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
this.vcfEncoder = new VCFEncoder(this.mHeader, this.allowMissingFieldsInHeader, this.writeFullFormatField);
}

// writing vcf v4.3 is not implemented
private static void rejectVCFV43Headers(final VCFHeader targetHeader) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of adding future checks, how about calling this ensureValidToWrite() or something like that? Personally, I use 'ensure' for (non-testing) methods that assert a state and throw an error if the state isn't valid. 'check' is for methods that assert but return a result code instead of throwing. Then the specific comment about checking for 4.3 can be moved below, to the if.


public static final String PEDIGREE_HEADER_KEY = "PEDIGREE";
public static final String PEDIGREE_HEADER_START = VCFHeader.METADATA_INDICATOR + PEDIGREE_HEADER_KEY;
public static final int PEDIGREE_HEADER_OFFSET = PEDIGREE_HEADER_START.length() + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for not continuing the use of magic numbers here!

* @param targetVersion the version for which a transformer is bing requested
* @return a {@link VCFTextTransformer} suitable for the targetVersion
*/
private static VCFTextTransformer getTextTransformerForVCFVersion(final VCFHeaderVersion targetVersion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not overridable, what about moving it to VCFTextTransformer and make it a factory method, e.g. VCFTextTransformer.fromVersion().

Another possible place for this is VCFHeaderVersion, e.g. VCFHeaderVersion.createTextTransformer().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants