Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VCFFuncotation output order bug #6178

Merged
merged 5 commits into from
Oct 4, 2019
Merged

Conversation

lbergelson
Copy link
Member

Fixes #6173

I added a single line test that shows the problem, and regenerated the other test files. I didn't write a specific unit test that proves it's solved. Feel free to do so if you think it's necessary.
I validated the output by adding the field name to the output value and checking it by eye against the header lines. This could fairly simply made into a unit test if desired.

I'm not sure if there are other large files that need to be regenerated.

I had initially said this bug only affects vcf but I think it happens to Maf output as well.

I removed a weakly typed method that allowed this bug to occur.

@lbergelson
Copy link
Member Author

@jonn-smith Handing this one off to you for now. I wasn't sure what to do about test exactly.

@jonn-smith
Copy link
Collaborator

jonn-smith commented Sep 24, 2019

@lbergelson The trouble with the large tests is that they can be automatically generated and have lots of output so it's hard to check the details.

Do you have an example of where this was causing an error before? If so we can build a unit or integration test out of that variant to ensure this doesn't regress.

If you can provide that kind of a locus I could go in and add in a test for it.

@lbergelson
Copy link
Member Author

I added a new test file with 1 line that exhibits the problem. If you run it with the old code the dbSnp annotations are shuffled. It's easy to see if you edit the output renderer to include the field name in the field value.

@lbergelson
Copy link
Member Author

I think a good test would be to add a new test vcf datasource with a number of fields, and the values of each field are just the field header. It's easy to assert that and obvious when there's an error.

@tomwhite
Copy link
Contributor

This PR also fixes the ordering issue when run under Java 11, see #6119

@@ -333,10 +335,13 @@ private static Funcotation mergeDuplicateFuncotationFactoryVariant(final Funcota
final LinkedHashSet<String> allFieldNames = funcotation1.getFieldNames();
allFieldNames.addAll(funcotation2.getFieldNames());

final Map<String, Object> mergedFieldsMap = allFieldNames.stream()
.collect(Collectors.toMap(f -> f, f -> mergeFuncotationValue(f, funcotation1, funcotation2, VcfFuncotationFactory::renderFieldConflicts)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the Collectors.toMap() overloaded method that takes a Supplier and pass in LinkedHashMap::new. The default uses HashMap::new.

In this case, using streaming doesn't add much, but for longer chains of stream operations this would be a good change. Might be worth having a utility function in GATK to make it easy.

See https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#toMap-java.util.function.Function-java.util.function.Function-java.util.function.BinaryOperator-java.util.function.Supplier-

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could do that. You end up with a funny unused merge function if you do because this is assuming that you've pre-uniqued the inputs so I figured I'd just switch it to a for loop.

Fixes #6173

I added a single line test that shows the problem, and regenerated the other test files.  I didn't write a specific unit test that proves it's solved.  Feel free to do so if you think it's necessary.
I validated the output by adding the field name to the output value and checking it by eye against the header lines.  This could fairly simply made into a unit test if desired.

I'm not sure if there are other large files that need to be regenerated.

I had initially said this bug only affects vcf but I think it happens to Maf output as well.
@lbergelson lbergelson marked this pull request as ready for review October 3, 2019 21:17
@lbergelson
Copy link
Member Author

@jonn-smith This should be good to go now. I added a non-regenerated integration test for the specific problem.

@@ -696,6 +706,10 @@ public void exhaustiveArgumentTest(final String dataSourcesPath,
arguments.addArgument(FuncotatorArgumentDefinitions.ANNOTATION_OVERRIDES_LONG_NAME, "Oreganno_Build:BUILDED_GOOD_REAL_BIG");

// Run the beast:
final File tmp = new File(System.getProperty("java.io.tmpdir"));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

um.. this wasn't meant to be here.. let me fix that

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for adding in that test.

When the integration tests pass, feel free to merge

@lbergelson lbergelson merged commit 0b36dd3 into master Oct 4, 2019
@lbergelson lbergelson deleted the lb_funcotator_vcffix branch October 4, 2019 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Funcotator outputs fields in wrong order in some cases when writing vcf
3 participants