Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantsToTable: Include all fields when none specified #7911

Merged
merged 7 commits into from
Jun 30, 2022

Conversation

orlicohen
Copy link
Contributor

Fixes #7677. Defaults to include all fields when no field arguments specified.

@orlicohen orlicohen requested a review from jamesemery June 22, 2022 15:23
@orlicohen orlicohen changed the title Include all fields when none specified VariantsToTable: Include all fields when none specified Jun 22, 2022
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments and a request to look at the tests again. Otherwise this looks good.

@@ -136,6 +134,9 @@ public final class VariantsToTable extends VariantWalker {
doc="The name of a standard VCF field or an INFO field to include in the output table", optional=true)
protected List<String> fieldsToTake = new ArrayList<>();



Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious extra lines

@@ -0,0 +1,117 @@
##fileformat=VCFv4.2
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample">
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename this to be clear exactly how it was modified.


// add all FORMAT fields present in VCF header
for (final VCFFormatHeaderLine formatLine : inputHeader.getFormatHeaderLines()) {
if(formatLine.getID().equals("GT")){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment as to why this line is here.


// add all mandatory VCF fields (except INFO)
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){
if(header.name() != "INFO")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we stylistically use the if { ... }
format after if statements

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using reference equality on strings, but should instead use .equals(). It should also use the string from the field itself:

Suggested change
if(header.name() != "INFO")
if(!header.name().equals(VCFHeader.HEADER_FIELDS.INFO.name()))

logger.warn("No fields were specified. All fields will be included in output table.");

// add all mandatory VCF fields (except INFO)
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this name to headerLine

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - not sure if headerLine is better, since these are not header lines, they're fields. Maybe headerField ?


final String[] args = new String[] {"--variant", inputFile.getAbsolutePath(),
"-O", outputFile.getAbsolutePath()};
runCommandLine(args);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests don't actually assert anything. Typically (even for "this shouldn't crash" type tests) we would want to assert that the files match the expected outputs. It looks like you are most of the way there though. You already have checked in some output files that look reasonable, now all you need to do is check that the output of THIS run matches those. One way to do that is with IntegrationTestSpec.assertEqualTextFiles(). So have the tool output to a tmp file and then assert that temp file matches the new ones you checked in.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of comments for things I noticed when looking at the header line code .


// add all mandatory VCF fields (except INFO)
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){
if(header.name() != "INFO")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using reference equality on strings, but should instead use .equals(). It should also use the string from the field itself:

Suggested change
if(header.name() != "INFO")
if(!header.name().equals(VCFHeader.HEADER_FIELDS.INFO.name()))

@@ -30,6 +27,7 @@
import java.util.function.Function;

import static org.broadinstitute.hellbender.utils.Utils.split;
import static org.broadinstitute.hellbender.utils.Utils.stream;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed since it appears to be unused. (Also, for future reference, we generally try to not use static imports, although it looks like this file already has one...).

logger.warn("No fields were specified. All fields will be included in output table.");

// add all mandatory VCF fields (except INFO)
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - not sure if headerLine is better, since these are not header lines, they're fields. Maybe headerField ?

@gatk-bot
Copy link

gatk-bot commented Jun 27, 2022

Github actions tests reported job failures from actions build 2570768583
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 11 2570768583.12 logs
integration 8 2570768583.0 logs

@codecov
Copy link

codecov bot commented Jun 27, 2022

Codecov Report

Merging #7911 (cb09924) into master (d22b752) will increase coverage by 0.139%.
The diff coverage is 97.297%.

@@               Coverage Diff               @@
##              master     #7911       +/-   ##
===============================================
+ Coverage     86.933%   87.072%   +0.139%     
- Complexity     36950     36956        +6     
===============================================
  Files           2221      2215        -6     
  Lines         173833    173728      -105     
  Branches       18778     18783        +5     
===============================================
+ Hits          151118    151268      +150     
+ Misses         16083     15831      -252     
+ Partials        6632      6629        -3     
Impacted Files Coverage Δ
...er/tools/walkers/variantutils/VariantsToTable.java 93.333% <92.308%> (-0.063%) ⬇️
...s/variantutils/VariantsToTableIntegrationTest.java 100.000% <100.000%> (ø)
...on/MultisampleMultidimensionalKernelSegmenter.java 89.375% <0.000%> (-4.375%) ⬇️
...stitute/hellbender/tools/walkers/vqsr/Tranche.java 62.222% <0.000%> (-3.210%) ⬇️
...lkers/vqsr/VariantRecalibratorIntegrationTest.java 98.315% <0.000%> (-0.481%) ⬇️
...er/tools/walkers/sv/CollectSVEvidenceUnitTest.java 97.521% <0.000%> (-0.010%) ⬇️
...itute/hellbender/tools/LocalAssemblerUnitTest.java 92.448% <0.000%> (ø)
...itute/hellbender/tools/sv/BafEvidenceUnitTest.java 100.000% <0.000%> (ø)
...ls/walkers/vqsr/GatherTranchesIntegrationTest.java 100.000% <0.000%> (ø)
...s/copynumber/models/AlleleFractionLikelihoods.java 100.000% <0.000%> (ø)
... and 31 more

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen This looks great! Just a few minor issues to address, then this can go in.

if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()) {
samples = Collections.emptySortedSet();
} else {
// if no fields specified, default to include all fields listed in header into table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tool documentation at the top of the class, document the fact that if the tool is run without specifying any fields it defaults to including all fields declared in the VCF header. Also include an example command line for that case in the "Usage Example" section.

} else {
// if no fields specified, default to include all fields listed in header into table
if(fieldsToTake.isEmpty() && genotypeFieldsToTake.isEmpty() && asFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()){
logger.warn("No fields were specified. All fields will be included in output table.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"All fields will be included in output table" -> "All fields declared in the VCF header will be included in the output table"

// add all FORMAT fields present in VCF header
for (final VCFFormatHeaderLine formatLine : inputHeader.getFormatHeaderLines()) {
// ensure GT field listed as first FORMAT field
if(formatLine.getID().equals("GT")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"GT" -> VCFConstants.GENOTYPE_KEY

}

// if fields specified, but none are genotype fields, set samples to empty
if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty() && (!fieldsToTake.isEmpty() || !asFieldsToTake.isEmpty())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The && (!fieldsToTake.isEmpty() || !asFieldsToTake.isEmpty()) part of this condition here is unnecessary, I believe. If the genotype fields are empty, then either the user originally specified some non-genotype fields, or we hit your new code that populates the non-genotype fields with (at a minimum) the mandatory fields from the VCF.

@Test
public void testNoFieldsSpecified() throws IOException {
final File inputFile = new File(getToolTestDataDir(), "extraheaderlinesdeleted_dbsnp_138.snippet.vcf");
final File outputFile = createTempFile(getToolTestDataDir(), "noFieldsSpecifiedOutput.table");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first argument to createTempFile() is supposed to be a prefix to the file name, and the second argument should be the file extension. So, eg., in this case you want something like:

final File outputFile = createTempFile("noFieldsSpecifiedOutput", ".table");

The temp file creation code will take care of appending a unique string to the name so that it doesn't collide with other temp files.

This same comment also applies to the outputFile in testNoFieldsSpecifiedWithSamples() below.


IntegrationTestSpec.assertEqualTextFiles(outputFile, expectedFile);
}

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing the changes to the tool, I think you should add a third integration test with a VCF that contains declarations for FORMAT fields in its header, but no samples. This will exercise the following case:

            if (samples.isEmpty()) {
                genotypeFieldsToTake.clear();
                asGenotypeFieldsToTake.clear();

It's a bit weird, but theoretically possible. You should be able to create this by making a copy of your dbsnp_138.snippet.vcf file and manually adding in a declaration for a FORMAT field to its header. Note that you can use git add to explicitly add new files to a commit.

@droazen droazen assigned orlicohen and unassigned droazen and jamesemery Jun 28, 2022
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen Back to you with some final cleanup tasks, then we'll be ready to merge

@@ -236,4 +237,44 @@ public void testMoltenOutputWithMultipleAlleles() throws IOException {
spec.setTrimWhiteSpace(false);
spec.executeTest("testMoltenOutputWithMultipleAlleles", this);
}

@Test
public void testNoFieldsSpecified() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case should probably be renamed testNoFieldsSpecifiedNoSamples() for clarity (and rename the corresponding expected output file)

public void testNoFieldsSpecifiedFormatFieldInHeaderNoSamples() throws IOException {
final File inputFile = new File(getToolTestDataDir(), "VCFWithoutGenotypesWithFormatField_dbsnp_138.snippet.vcf");
final File outputFile = createTempFile("noFieldsSpecifiedNoSamplesOutput", ".table");
final File expectedFile = new File(getToolTestDataDir(), "expected.noFieldsSpecifiedNoSamples.table");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case should produce output identical to the first test case above, so you should use the same expected output file as that test uses to make this clear.

@@ -0,0 +1,48 @@
##fileformat=VCFv4.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file (1000G.phase3.snippet.vcf) is now unused. You can remove it from your branch using the git rm command.

@@ -0,0 +1,117 @@
##fileformat=VCFv4.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file (extraheaderlinesdeleted_dbsnp_138.snippet.vcf) is also unused now -- you can delete it with git rm

if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()) {
samples = Collections.emptySortedSet();
} else {
samples = Collections.emptySortedSet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is now indented one level too deep.

@droazen droazen merged commit 6596ea8 into master Jun 30, 2022
@droazen droazen deleted the oc_fix7677_variantstotabledefaultallfields branch June 30, 2022 19:52
orlicohen added a commit that referenced this pull request Jul 18, 2022
VariantsToTable now outputs all fields declared in the VCF header when no fields are selected. 

Added integration tests to cover this new functionality

Fixes #7677
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simple way to have VariantsToTable output all fields. Just want the entire VCF as a table.
5 participants