Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenomicsDBImport: add the ability to specify explicit index locations via the sample name map file #7967

Merged
merged 1 commit into from
Oct 11, 2022

Conversation

droazen
Copy link
Contributor

@droazen droazen commented Jul 29, 2022

The sample name map file accepted by GenomicsDBImport can now optionally contain a third
column giving an explicit path to an index for the corresponding GVCF. It is allowed to
specify an explicit index in some lines of the sample name map and not others.

Added comprehensive unit and integration tests.

@droazen
Copy link
Contributor Author

droazen commented Jul 29, 2022

@rickymagner / @meganshand, here you go!

@lbergelson, @mlathara, and one of @rickymagner / @meganshand, please review

@droazen droazen self-assigned this Jul 29, 2022
… the sample name map file

The sample name map file accepted by GenomicsDBImport can now optionally contain a third
column giving an explicit path to an index for the corresponding VCF. It is allowed to
specify an explicit index in some lines of the sample name map and not others.

Added comprehensive unit and integration tests.
@droazen droazen force-pushed the dr_genomicsdbimport_explicit_indices branch from b57a45b to 745273b Compare July 29, 2022 20:49
@codecov
Copy link

codecov bot commented Jul 29, 2022

Codecov Report

Merging #7967 (745273b) into master (c22972a) will increase coverage by 34.431%.
The diff coverage is 80.422%.

@@               Coverage Diff                @@
##              master     #7967        +/-   ##
================================================
+ Coverage     52.260%   86.691%   +34.431%     
- Complexity     29146     38496      +9350     
================================================
  Files           2310      2311         +1     
  Lines         180344    180590       +246     
  Branches       19840     19863        +23     
================================================
+ Hits           94247    156555     +62308     
+ Misses         80124     17090     -63034     
- Partials        5973      6945       +972     
Impacted Files Coverage Δ
...institute/hellbender/engine/FeatureDataSource.java 78.344% <ø> (ø)
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 84.746% <60.000%> (-3.762%) ⬇️
...llbender/tools/genomicsdb/GATKGenomicsDBUtils.java 84.685% <72.727%> (ø)
...ute/hellbender/tools/genomicsdb/SampleNameMap.java 85.915% <85.915%> (ø)
.../hellbender/tools/genomicsdb/GenomicsDBImport.java 83.636% <92.000%> (+0.586%) ⬆️
...bender/tools/genomicsdb/SampleNameMapUnitTest.java 92.000% <92.000%> (ø)
...roadinstitute/hellbender/tools/LocalAssembler.java 67.425% <0.000%> (+0.073%) ⬆️
...roadinstitute/hellbender/utils/read/ReadUtils.java 82.278% <0.000%> (+0.316%) ⬆️
...stitute/hellbender/cmdline/CommandLineProgram.java 84.516% <0.000%> (+0.645%) ⬆️
...lyBasedSVDiscoveryTestDataProviderForSimpleSV.java 100.000% <0.000%> (+0.894%) ⬆️
... and 590 more

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen I have a couple of comments that you may or may not want to address. I think it looks solid though. If something in the tests cases was mixed up I very well may have mixed it up too and not noticed it. I didn't see any obvious things though.

final Path firstHeaderPath = IOUtils.getPath(sampleNameToVcfPath.entrySet().iterator().next().getValue().toString());
final VCFHeader header = getHeaderFromPath(firstHeaderPath);
// The SampleNameMap class guarantees that the samples will be sorted correctly.
sampleNameMap = new SampleNameMap(IOUtils.getPath(sampleNameMapFile), bypassFeatureReader);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make the input a GATKPath at some point probably but it doesn't have to happen here.

// that --bypass-feature-reader wasn't also specified:
if ( sampleNameMap != null && sampleNameMap.indicesSpecified() && bypassFeatureReader ) {
throw new UserException("Indices were specified for some VCFs in the sample name map file, but --" + BYPASS_FEATURE_READER +
" was also specified. Specifying explicit indices is not supported when running with --" + BYPASS_FEATURE_READER);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should talk with the GenomicsDB team about that. That's probably something they could support without much trouble.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, explicit indices are not supported by GenomicsDB. @mlathara, anything else to add?

}

private VCFHeader getHeaderFromPath(final Path variantPath) {
try(final FeatureReader<VariantContext> reader = getReaderFromPath(variantPath)) {
private VCFHeader getHeaderFromPath(final Path variantPath, final Path variantIndexPath) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unnecessary because of course you don't need an index to find a header. Probably need some new method in htsjdk or somewhere that just rips out the header without doing the rest.

final int updatedBatchSize = (batchSize == DEFAULT_ZERO_BATCH_SIZE) ? sampleCount : batchSize;
final ImportConfig importConfig = createImportConfig(updatedBatchSize);

GenomicsDBImporter importer;
try {
importer = new GenomicsDBImporter(importConfig);
// Modify importer directly from updateImportProtobufVidMapping.
org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBUtils.updateImportProtobufVidMapping(importer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good move to rename this so there's no conflict...

@@ -993,10 +924,13 @@ private SortedMap<String, FeatureReader<VariantContext>> getFeatureReadersSerial
* @return Feature reader
* @param variantPath
*/
private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath) {
private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath, final Path variantIndexPath) {
// TODO: we repeatedly convert between URI, Path, and String in this tool. Is this necessary?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's gross. Switching to the beta api's would probably fix it but I don't think there's an easy fix using the current tribble interfaces.


@DataProvider
public Object[][] dataForTestExplicitIndicesInSampleNameMapInTheCloud() {
final String GVCFS_WITH_INDICES_BUCKET = "gs://hellbender/test/resources/org/broadinstitute/hellbender/tools/genomicsdb/gvcfs_with_indices/";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These paths ideally would all be built using BaseTest.getGCPTestInputPath which in theory makes it possible to reproduce our test data on your own system if you want to, although good luck to anyone who tries...

{"Sample1"}, // 1 column no delimiter
{"\tfile"}, // empty first token
{" \tfile"}, // first token only whitespace
{"Sample1\tfile1\t"}, // extra tab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add Sample1\t\index1? and Sample1\t \index1

}

for (final String line : lines) {
final String[] split = line.split("\\t",-1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to look up how -1 is different than nothing every single time.

}

@Test(expectedExceptions = UserException.class)
public void testCheckVcfIsCompressedAndIndexed() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a positive test where this is specified and doesn't' fail?

}

@DataProvider
public Object[][] badInputsToAddSample() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be matching rejection tests for the 3 arg addSample ?

@mlathara
Copy link
Contributor

mlathara commented Aug 7, 2022

Didn't look at the SampleNameMap changes in detail, but the GenomicsDBImport changes look good. We can look into what's needed to support this with --bypass-feature-reader down the line sometime...

@rickymagner
Copy link
Contributor

Hi @droazen, I tested the changes out locally using VCFs with indices in different cloud buckets, and it works. This is perfect for the applications we have in mind. Thanks!

@droazen droazen merged commit 19778c1 into master Oct 11, 2022
@droazen droazen deleted the dr_genomicsdbimport_explicit_indices branch October 11, 2022 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants