-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GenomicsDBImport: add the ability to specify explicit index locations via the sample name map file #7967
Conversation
@rickymagner / @meganshand, here you go! @lbergelson, @mlathara, and one of @rickymagner / @meganshand, please review |
… the sample name map file The sample name map file accepted by GenomicsDBImport can now optionally contain a third column giving an explicit path to an index for the corresponding VCF. It is allowed to specify an explicit index in some lines of the sample name map and not others. Added comprehensive unit and integration tests.
b57a45b
to
745273b
Compare
Codecov Report
@@ Coverage Diff @@
## master #7967 +/- ##
================================================
+ Coverage 52.260% 86.691% +34.431%
- Complexity 29146 38496 +9350
================================================
Files 2310 2311 +1
Lines 180344 180590 +246
Branches 19840 19863 +23
================================================
+ Hits 94247 156555 +62308
+ Misses 80124 17090 -63034
- Partials 5973 6945 +972
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen I have a couple of comments that you may or may not want to address. I think it looks solid though. If something in the tests cases was mixed up I very well may have mixed it up too and not noticed it. I didn't see any obvious things though.
final Path firstHeaderPath = IOUtils.getPath(sampleNameToVcfPath.entrySet().iterator().next().getValue().toString()); | ||
final VCFHeader header = getHeaderFromPath(firstHeaderPath); | ||
// The SampleNameMap class guarantees that the samples will be sorted correctly. | ||
sampleNameMap = new SampleNameMap(IOUtils.getPath(sampleNameMapFile), bypassFeatureReader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make the input a GATKPath at some point probably but it doesn't have to happen here.
// that --bypass-feature-reader wasn't also specified: | ||
if ( sampleNameMap != null && sampleNameMap.indicesSpecified() && bypassFeatureReader ) { | ||
throw new UserException("Indices were specified for some VCFs in the sample name map file, but --" + BYPASS_FEATURE_READER + | ||
" was also specified. Specifying explicit indices is not supported when running with --" + BYPASS_FEATURE_READER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should talk with the GenomicsDB team about that. That's probably something they could support without much trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, explicit indices are not supported by GenomicsDB. @mlathara, anything else to add?
} | ||
|
||
private VCFHeader getHeaderFromPath(final Path variantPath) { | ||
try(final FeatureReader<VariantContext> reader = getReaderFromPath(variantPath)) { | ||
private VCFHeader getHeaderFromPath(final Path variantPath, final Path variantIndexPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unnecessary because of course you don't need an index to find a header. Probably need some new method in htsjdk or somewhere that just rips out the header without doing the rest.
final int updatedBatchSize = (batchSize == DEFAULT_ZERO_BATCH_SIZE) ? sampleCount : batchSize; | ||
final ImportConfig importConfig = createImportConfig(updatedBatchSize); | ||
|
||
GenomicsDBImporter importer; | ||
try { | ||
importer = new GenomicsDBImporter(importConfig); | ||
// Modify importer directly from updateImportProtobufVidMapping. | ||
org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBUtils.updateImportProtobufVidMapping(importer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good move to rename this so there's no conflict...
@@ -993,10 +924,13 @@ private SortedMap<String, FeatureReader<VariantContext>> getFeatureReadersSerial | |||
* @return Feature reader | |||
* @param variantPath | |||
*/ | |||
private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath) { | |||
private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath, final Path variantIndexPath) { | |||
// TODO: we repeatedly convert between URI, Path, and String in this tool. Is this necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's gross. Switching to the beta api's would probably fix it but I don't think there's an easy fix using the current tribble interfaces.
|
||
@DataProvider | ||
public Object[][] dataForTestExplicitIndicesInSampleNameMapInTheCloud() { | ||
final String GVCFS_WITH_INDICES_BUCKET = "gs://hellbender/test/resources/org/broadinstitute/hellbender/tools/genomicsdb/gvcfs_with_indices/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These paths ideally would all be built using BaseTest.getGCPTestInputPath
which in theory makes it possible to reproduce our test data on your own system if you want to, although good luck to anyone who tries...
{"Sample1"}, // 1 column no delimiter | ||
{"\tfile"}, // empty first token | ||
{" \tfile"}, // first token only whitespace | ||
{"Sample1\tfile1\t"}, // extra tab |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to add Sample1\t\index1
? and Sample1\t \index1
} | ||
|
||
for (final String line : lines) { | ||
final String[] split = line.split("\\t",-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to look up how -1 is different than nothing every single time.
} | ||
|
||
@Test(expectedExceptions = UserException.class) | ||
public void testCheckVcfIsCompressedAndIndexed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a positive test where this is specified and doesn't' fail?
} | ||
|
||
@DataProvider | ||
public Object[][] badInputsToAddSample() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be matching rejection tests for the 3 arg addSample ?
Didn't look at the |
Hi @droazen, I tested the changes out locally using VCFs with indices in different cloud buckets, and it works. This is perfect for the applications we have in mind. Thanks! |
The sample name map file accepted by GenomicsDBImport can now optionally contain a third
column giving an explicit path to an index for the corresponding GVCF. It is allowed to
specify an explicit index in some lines of the sample name map and not others.
Added comprehensive unit and integration tests.