GenomicsDBImport: add the ability to specify explicit index locations via the sample name map file #7967

Merged

droazen merged 1 commit into master from dr_genomicsdbimport_explicit_indices

Oct 11, 2022

Contributor

droazen commented Jul 29, 2022

The sample name map file accepted by GenomicsDBImport can now optionally contain a third
column giving an explicit path to an index for the corresponding GVCF. An explicit index
may be specified on some lines of the sample name map and omitted on others.

Added comprehensive unit and integration tests.
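
For illustration, here is a minimal sketch of how a map line with an optional third column could be parsed. The names `SampleMapSketch`, `Entry`, and `parseLine` are hypothetical, and this is not the code in `SampleNameMap.java`. The rejection rules are assumed from the bad-input cases in the unit tests (missing column, empty or whitespace-only token, trailing tab), and the paths are made up.

```java
import java.util.Optional;

final class SampleMapSketch {
    /** One parsed line: sample name, GVCF path, and an optional explicit index path. */
    static final class Entry {
        final String sample;
        final String vcf;
        final Optional<String> index;

        Entry(final String sample, final String vcf, final Optional<String> index) {
            this.sample = sample;
            this.vcf = vcf;
            this.index = index;
        }
    }

    /** Parses "sample<TAB>vcf" or "sample<TAB>vcf<TAB>index"; throws on malformed lines. */
    static Entry parseLine(final String line) {
        // Limit -1 keeps trailing empty fields, so a stray trailing tab is detected.
        final String[] fields = line.split("\\t", -1);
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 tab-separated columns: " + line);
        }
        for (final String field : fields) {
            if (field.trim().isEmpty()) {
                throw new IllegalArgumentException("Empty or blank column in line: " + line);
            }
        }
        final Optional<String> index =
                fields.length == 3 ? Optional.of(fields[2]) : Optional.empty();
        return new Entry(fields[0], fields[1], index);
    }

    public static void main(final String[] args) {
        // Hypothetical paths; lines with and without an index can be mixed in one file.
        final Entry withIndex =
                parseLine("Sample1\tgs://bucket-a/s1.g.vcf.gz\tgs://bucket-b/s1.g.vcf.gz.tbi");
        final Entry withoutIndex = parseLine("Sample2\tgs://bucket-a/s2.g.vcf.gz");
        System.out.println(withIndex.index.isPresent());    // true
        System.out.println(withoutIndex.index.isPresent()); // false
    }
}
```

Keeping the index as an `Optional` per line makes it natural to mix lines with and without an explicit index in the same file.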

Contributor Author

droazen commented Jul 29, 2022

@rickymagner / @meganshand, here you go!

@lbergelson, @mlathara, and one of @rickymagner / @meganshand, please review

droazen requested review from lbergelson, rickymagner and meganshand

July 29, 2022 20:30

droazen self-assigned this


GenomicsDBImport: add ability to specify explicit index locations via the sample name map file

745273b

The sample name map file accepted by GenomicsDBImport can now optionally contain a third
column giving an explicit path to an index for the corresponding VCF. It is allowed to
specify an explicit index in some lines of the sample name map and not others.

Added comprehensive unit and integration tests.

droazen force-pushed the dr_genomicsdbimport_explicit_indices branch from b57a45b to 745273b Compare

July 29, 2022 20:49

codecov bot commented Jul 29, 2022 (edited)

Codecov Report

Merging #7967 (745273b) into master (c22972a) will increase coverage by 34.431%.
The diff coverage is 80.422%.

@@               Coverage Diff                @@
##              master     #7967        +/-   ##
================================================
+ Coverage     52.260%   86.691%   +34.431%     
- Complexity     29146     38496      +9350     
================================================
  Files           2310      2311         +1     
  Lines         180344    180590       +246     
  Branches       19840     19863        +23     
================================================
+ Hits           94247    156555     +62308     
+ Misses         80124     17090     -63034     
- Partials        5973      6945       +972

Impacted Files	Coverage Δ
...institute/hellbender/engine/FeatureDataSource.java	`78.344% <ø> (ø)`
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java	`84.746% <60.000%> (-3.762%)`	⬇️
...llbender/tools/genomicsdb/GATKGenomicsDBUtils.java	`84.685% <72.727%> (ø)`
...ute/hellbender/tools/genomicsdb/SampleNameMap.java	`85.915% <85.915%> (ø)`
.../hellbender/tools/genomicsdb/GenomicsDBImport.java	`83.636% <92.000%> (+0.586%)`	⬆️
...bender/tools/genomicsdb/SampleNameMapUnitTest.java	`92.000% <92.000%> (ø)`
...roadinstitute/hellbender/tools/LocalAssembler.java	`67.425% <0.000%> (+0.073%)`	⬆️
...roadinstitute/hellbender/utils/read/ReadUtils.java	`82.278% <0.000%> (+0.316%)`	⬆️
...stitute/hellbender/cmdline/CommandLineProgram.java	`84.516% <0.000%> (+0.645%)`	⬆️
...lyBasedSVDiscoveryTestDataProviderForSimpleSV.java	`100.000% <0.000%> (+0.894%)`	⬆️
... and 590 more

lbergelson approved these changes

Member

lbergelson left a comment

@droazen I have a couple of comments that you may or may not want to address, but I think it looks solid. If something in the test cases was mixed up, I may well have missed it too; I didn't see anything obviously wrong, though.

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java

-                          final Path firstHeaderPath = IOUtils.getPath(sampleNameToVcfPath.entrySet().iterator().next().getValue().toString());
-                          final VCFHeader header = getHeaderFromPath(firstHeaderPath);
+                          // The SampleNameMap class guarantees that the samples will be sorted correctly.
+                          sampleNameMap = new SampleNameMap(IOUtils.getPath(sampleNameMapFile), bypassFeatureReader);

Member

lbergelson Aug 4, 2022

We should probably make the input a GATKPath at some point, but it doesn't have to happen here.

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java

+                      // that --bypass-feature-reader wasn't also specified:
+                      if ( sampleNameMap != null && sampleNameMap.indicesSpecified() && bypassFeatureReader ) {
+                          throw new UserException("Indices were specified for some VCFs in the sample name map file, but --" + BYPASS_FEATURE_READER +
+                                  " was also specified. Specifying explicit indices is not supported when running with --" + BYPASS_FEATURE_READER);

Member

lbergelson Aug 4, 2022

We should talk with the GenomicsDB team about that. That's probably something they could support without much trouble.

Collaborator

nalinigans Aug 6, 2022

Yes, explicit indices are not supported by GenomicsDB. @mlathara, anything else to add?

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java

                   }
-                  private VCFHeader getHeaderFromPath(final Path variantPath) {
-                      try(final FeatureReader<VariantContext> reader = getReaderFromPath(variantPath)) {
+                  private VCFHeader getHeaderFromPath(final Path variantPath, final Path variantIndexPath) {

Member

lbergelson Aug 4, 2022

This should be unnecessary, since you don't need an index to read a header. We probably need a new method in htsjdk (or elsewhere) that extracts just the header without doing the rest.
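
To illustrate the point that a header can be read without an index, here is a minimal standard-library sketch that collects the leading `#` lines of an uncompressed VCF stream. `VcfHeaderSketch` and `readHeaderLines` are hypothetical names, not htsjdk or GATK API, and a block-compressed GVCF would first need to be wrapped in a decompressing stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class VcfHeaderSketch {
    /** Returns the leading '#' lines of a text VCF, stopping at the first record. */
    static List<String> readHeaderLines(final Reader source) {
        final List<String> header = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // Header lines all start with '#'; the first other line is a variant record.
            while ((line = reader.readLine()) != null && line.startsWith("#")) {
                header.add(line);
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    public static void main(final String[] args) {
        final String vcf = "##fileformat=VCFv4.2\n"
                + "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
                + "chr1\t100\t.\tA\tG\t.\t.\t.\n";
        System.out.println(readHeaderLines(new StringReader(vcf)).size()); // 2
    }
}
```

Only a sequential read from the start of the file is involved, so no random access (and hence no index) is needed for this step.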

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java

                       final int updatedBatchSize = (batchSize == DEFAULT_ZERO_BATCH_SIZE) ? sampleCount : batchSize;
                       final ImportConfig importConfig = createImportConfig(updatedBatchSize);
                       GenomicsDBImporter importer;
                       try {
                           importer = new GenomicsDBImporter(importConfig);
                           // Modify importer directly from updateImportProtobufVidMapping.
-                          org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBUtils.updateImportProtobufVidMapping(importer);

Member

lbergelson Aug 4, 2022

Good move to rename this so there's no conflict...

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImport.java

                    * @return  Feature reader
                    * @param variantPath
                    */
-                  private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath) {
+                  private FeatureReader<VariantContext> getReaderFromPath(final Path variantPath, final Path variantIndexPath) {
+                      // TODO: we repeatedly convert between URI, Path, and String in this tool. Is this necessary?

Member

lbergelson Aug 4, 2022

It's gross. Switching to the beta APIs would probably fix it, but I don't think there's an easy fix using the current tribble interfaces.

...est/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImportIntegrationTest.java

+                  @DataProvider
+                  public Object[][] dataForTestExplicitIndicesInSampleNameMapInTheCloud() {
+                      final String GVCFS_WITH_INDICES_BUCKET = "gs://hellbender/test/resources/org/broadinstitute/hellbender/tools/genomicsdb/gvcfs_with_indices/";

Member

lbergelson Aug 4, 2022

These paths would ideally all be built using BaseTest.getGCPTestInputPath, which in theory makes it possible to reproduce our test data on your own system, although good luck to anyone who tries...

src/test/java/org/broadinstitute/hellbender/tools/genomicsdb/SampleNameMapUnitTest.java

+                              {"Sample1"},                    // 1 column no delimiter
+                              {"\tfile"},                     // empty first token
+                              {" \tfile"},                    // first token only whitespace
+                              {"Sample1\tfile1\t"},           // extra tab

Member

lbergelson Aug 4, 2022

Do we want to add Sample1\t\tindex1 and Sample1\t \tindex1, i.e. an empty or whitespace-only file column followed by an index?

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/SampleNameMap.java

+                          }
+                          for (final String line : lines) {
+                              final String[] split = line.split("\\t",-1);

Member

lbergelson Aug 4, 2022

I have to look up how -1 is different than nothing every single time.
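
For reference, a small sketch of the difference, using only `java.lang.String.split`. The class name `SplitLimitDemo` and its helper methods are hypothetical. With no limit argument (equivalent to a limit of 0), trailing empty strings are dropped; with a negative limit, they are kept, which is what lets a trailing tab be detected.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    /** Default split (limit 0): trailing empty strings are removed from the result. */
    static String[] splitDefault(final String line) {
        return line.split("\\t");
    }

    /** Negative limit: every field is kept, including trailing empty ones. */
    static String[] splitKeepingEmpties(final String line) {
        return line.split("\\t", -1);
    }

    public static void main(final String[] args) {
        final String line = "Sample1\tfile1\t"; // note the trailing tab
        System.out.println(Arrays.toString(splitDefault(line)));        // [Sample1, file1]
        System.out.println(Arrays.toString(splitKeepingEmpties(line))); // [Sample1, file1, ]
    }
}
```

A leading empty field (as in `"\tfile"`) is kept by both forms, so only trailing empties depend on the limit.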

src/test/java/org/broadinstitute/hellbender/tools/genomicsdb/SampleNameMapUnitTest.java

+                  }
+                  @Test(expectedExceptions = UserException.class)
+                  public void testCheckVcfIsCompressedAndIndexed() {

Member

lbergelson Aug 4, 2022

Should there be a positive test where this is specified and doesn't fail?

src/test/java/org/broadinstitute/hellbender/tools/genomicsdb/SampleNameMapUnitTest.java

+                  }
+                  @DataProvider
+                  public Object[][] badInputsToAddSample() {

Member

lbergelson Aug 4, 2022

Should there be matching rejection tests for the 3-arg addSample?

Contributor

mlathara commented Aug 7, 2022

Didn't look at the SampleNameMap changes in detail, but the GenomicsDBImport changes look good. We can look into what's needed to support this with --bypass-feature-reader down the line sometime...

Contributor

rickymagner commented Aug 8, 2022

Hi @droazen, I tested the changes out locally using VCFs with indices in different cloud buckets, and it works. This is perfect for the applications we have in mind. Thanks!

rickymagner approved these changes

droazen merged commit 19778c1 into master

droazen deleted the dr_genomicsdbimport_explicit_indices branch

October 11, 2022 18:51
