-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VariantsToTable: Include all fields when none specified #7911
Conversation
…ified, integration tests and expected output files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments and a request to look at the tests again. Otherwise this looks good.
@@ -136,6 +134,9 @@ public final class VariantsToTable extends VariantWalker { | |||
doc="The name of a standard VCF field or an INFO field to include in the output table", optional=true) | |||
protected List<String> fieldsToTake = new ArrayList<>(); | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spurious extra lines
@@ -0,0 +1,117 @@ | |||
##fileformat=VCFv4.2 | |||
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this to be clear exactly how it was modified.
|
||
// add all FORMAT fields present in VCF header | ||
for (final VCFFormatHeaderLine formatLine : inputHeader.getFormatHeaderLines()) { | ||
if(formatLine.getID().equals("GT")){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment as to why this line is here.
|
||
// add all mandatory VCF fields (except INFO) | ||
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){ | ||
if(header.name() != "INFO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we stylistically use the if { ... }
format after if statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using reference equality on strings, but should instead use .equals()
. It should also use the string from the field itself:
if(header.name() != "INFO") | |
if(!header.name().equals(VCFHeader.HEADER_FIELDS.INFO.name())) |
logger.warn("No fields were specified. All fields will be included in output table."); | ||
|
||
// add all mandatory VCF fields (except INFO) | ||
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this name to headerLine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm - not sure if headerLine
is better, since these are not header lines, they're fields. Maybe headerField
?
|
||
final String[] args = new String[] {"--variant", inputFile.getAbsolutePath(), | ||
"-O", outputFile.getAbsolutePath()}; | ||
runCommandLine(args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests don't actually assert anything. Typically (even for "this shouldn't crash" type tests) we would want to assert that the files match the expected outputs. It looks like you are most of the way there though. You already have checked in some output files that look reasonable, now all you need to do is check that the output of THIS run matches those. One way to do that is with IntegrationTestSpec.assertEqualTextFiles().
So have the tool output to a tmp file and then assert that temp file matches the new ones you checked in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a couple of comments for things I noticed when looking at the header line code .
|
||
// add all mandatory VCF fields (except INFO) | ||
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){ | ||
if(header.name() != "INFO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using reference equality on strings, but should instead use .equals()
. It should also use the string from the field itself:
if(header.name() != "INFO") | |
if(!header.name().equals(VCFHeader.HEADER_FIELDS.INFO.name())) |
@@ -30,6 +27,7 @@ | |||
import java.util.function.Function; | |||
|
|||
import static org.broadinstitute.hellbender.utils.Utils.split; | |||
import static org.broadinstitute.hellbender.utils.Utils.stream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed since it appears to be unused. (Also, for future reference, we generally try to not use static imports, although it looks like this file already has one...).
logger.warn("No fields were specified. All fields will be included in output table."); | ||
|
||
// add all mandatory VCF fields (except INFO) | ||
for(VCFHeader.HEADER_FIELDS header : VCFHeader.HEADER_FIELDS.values()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm - not sure if headerLine
is better, since these are not header lines, they're fields. Maybe headerField
?
Github actions tests reported job failures from actions build 2570768583
|
Codecov Report
@@ Coverage Diff @@
## master #7911 +/- ##
===============================================
+ Coverage 86.933% 87.072% +0.139%
- Complexity 36950 36956 +6
===============================================
Files 2221 2215 -6
Lines 173833 173728 -105
Branches 18778 18783 +5
===============================================
+ Hits 151118 151268 +150
+ Misses 16083 15831 -252
+ Partials 6632 6629 -3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen This looks great! Just a few minor issues to address, then this can go in.
if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()) { | ||
samples = Collections.emptySortedSet(); | ||
} else { | ||
// if no fields specified, default to include all fields listed in header into table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the tool documentation at the top of the class, document the fact that if the tool is run without specifying any fields it defaults to including all fields declared in the VCF header. Also include an example command line for that case in the "Usage Example" section.
} else { | ||
// if no fields specified, default to include all fields listed in header into table | ||
if(fieldsToTake.isEmpty() && genotypeFieldsToTake.isEmpty() && asFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()){ | ||
logger.warn("No fields were specified. All fields will be included in output table."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"All fields will be included in output table" -> "All fields declared in the VCF header will be included in the output table"
// add all FORMAT fields present in VCF header | ||
for (final VCFFormatHeaderLine formatLine : inputHeader.getFormatHeaderLines()) { | ||
// ensure GT field listed as first FORMAT field | ||
if(formatLine.getID().equals("GT")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"GT" -> VCFConstants.GENOTYPE_KEY
} | ||
|
||
// if fields specified, but none are genotype fields, set samples to empty | ||
if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty() && (!fieldsToTake.isEmpty() || !asFieldsToTake.isEmpty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The && (!fieldsToTake.isEmpty() || !asFieldsToTake.isEmpty())
part of this condition here is unnecessary, I believe. If the genotype fields are empty, then either the user originally specified some non-genotype fields, or we hit your new code that populates the non-genotype fields with (at a minimum) the mandatory fields from the VCF.
@Test | ||
public void testNoFieldsSpecified() throws IOException { | ||
final File inputFile = new File(getToolTestDataDir(), "extraheaderlinesdeleted_dbsnp_138.snippet.vcf"); | ||
final File outputFile = createTempFile(getToolTestDataDir(), "noFieldsSpecifiedOutput.table"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first argument to createTempFile()
is supposed to be a prefix to the file name, and the second argument should be the file extension. So, eg., in this case you want something like:
final File outputFile = createTempFile("noFieldsSpecifiedOutput", ".table");
The temp file creation code will take care of appending a unique string to the name so that it doesn't collide with other temp files.
This same comment also applies to the outputFile
in testNoFieldsSpecifiedWithSamples()
below.
|
||
IntegrationTestSpec.assertEqualTextFiles(outputFile, expectedFile); | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing the changes to the tool, I think you should add a third integration test with a VCF that contains declarations for FORMAT fields in its header, but no samples. This will exercise the following case:
if (samples.isEmpty()) {
genotypeFieldsToTake.clear();
asGenotypeFieldsToTake.clear();
It's a bit weird, but theoretically possible. You should be able to create this by making a copy of your dbsnp_138.snippet.vcf
file and manually adding in a declaration for a FORMAT field to its header. Note that you can use git add
to explicitly add new files to a commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen Back to you with some final cleanup tasks, then we'll be ready to merge
@@ -236,4 +237,44 @@ public void testMoltenOutputWithMultipleAlleles() throws IOException { | |||
spec.setTrimWhiteSpace(false); | |||
spec.executeTest("testMoltenOutputWithMultipleAlleles", this); | |||
} | |||
|
|||
@Test | |||
public void testNoFieldsSpecified() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case should probably be renamed testNoFieldsSpecifiedNoSamples()
for clarity (and rename the corresponding expected output file)
public void testNoFieldsSpecifiedFormatFieldInHeaderNoSamples() throws IOException { | ||
final File inputFile = new File(getToolTestDataDir(), "VCFWithoutGenotypesWithFormatField_dbsnp_138.snippet.vcf"); | ||
final File outputFile = createTempFile("noFieldsSpecifiedNoSamplesOutput", ".table"); | ||
final File expectedFile = new File(getToolTestDataDir(), "expected.noFieldsSpecifiedNoSamples.table"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case should produce output identical to the first test case above, so you should use the same expected output file as that test uses to make this clear.
@@ -0,0 +1,48 @@ | |||
##fileformat=VCFv4.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file (1000G.phase3.snippet.vcf
) is now unused. You can remove it from your branch using the git rm
command.
@@ -0,0 +1,117 @@ | |||
##fileformat=VCFv4.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file (extraheaderlinesdeleted_dbsnp_138.snippet.vcf
) is also unused now -- you can delete it with git rm
if (genotypeFieldsToTake.isEmpty() && asGenotypeFieldsToTake.isEmpty()) { | ||
samples = Collections.emptySortedSet(); | ||
} else { | ||
samples = Collections.emptySortedSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is now indented one level too deep.
VariantsToTable now outputs all fields declared in the VCF header when no fields are selected. Added integration tests to cover this new functionality Fixes #7677
Fixes #7677. Defaults to include all fields when no field arguments specified.