Possible regression in merging bams on GCS in Spark #2287

Closed
droazen opened this issue Nov 30, 2016 · 21 comments
droazen (Collaborator) commented Nov 30, 2016

@vdauwera reports that the following ApplyBQSRSpark command fails with an NIO-related error on Dataproc:

time ./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://gatk-demo/TEST/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $APIKEY_ORTMP \
    -- \
    --sparkRunner GCS \
    --cluster gvda-test-bqsr \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 20g
java.nio.file.FileSystemNotFoundException: Provider "gs" not installed
    at java.nio.file.Paths.get(Paths.java:147)
    at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:30)
    at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:51)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:231)
    at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:153)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:247)
    at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:49)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:348)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:109)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:167)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:186)
    at org.broadinstitute.hellbender.Main.instanceMain(Main.java:76)
    at org.broadinstitute.hellbender.Main.main(Main.java:92)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/11/29 23:44:27 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/11/29 23:44:27 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [155c3731-4071-4aa9-bed8-f0eb3426b805] entered state [ERROR] while waiting for [DONE].
droazen (Collaborator, Author) commented Nov 30, 2016

One possible culprit is HadoopGenomics/Hadoop-BAM#111, which introduced the NIOFileUtil class to Hadoop-BAM.

@tomwhite @jean-philippe-martin Could you comment with your thoughts on this?

jean-philippe-martin (Contributor) commented:

Sorry, I won't be able to get to this until next week at the earliest.

tomwhite (Contributor) commented:

I'm not sure why the gs provider doesn't get installed (I think JP was seeing this before), but in any case there's a workaround in GATK for it: https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/utils/gcs/BucketUtils.java#L380-L390

However, since the Hadoop-BAM code is doing the path lookups, it can't call that code directly (the dependency points the wrong way). It would be best to fix the underlying problem, of course, so that the provider gets picked up properly. I wonder if this can be done by fixing the service provider file so that it survives relocation (see http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer).
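
For anyone following along, that workaround sidesteps Paths.get (which needs the "gs" provider to be registered) and builds the CloudStorageFileSystem by hand. A minimal sketch of the approach (not the exact BucketUtils code; the method name is just illustrative), assuming google-cloud-nio is on the classpath:

import java.nio.file.Path;
import java.util.Arrays;
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem;

// Sketch: build the GCS filesystem directly instead of relying on the
// installed-provider lookup that Paths.get(URI) performs.
static Path pathOnGcs(String gcsUrl) {
    // split with limit -1 so trailing empty tokens (e.g. a trailing slash) are kept
    final String[] split = gcsUrl.split("/", -1);
    final String bucket = split[2]; // gs://<bucket>/<object...>
    final String objectPath = String.join("/", Arrays.copyOfRange(split, 3, split.length));
    return CloudStorageFileSystem.forBucket(bucket).getPath(objectPath);
}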

BTW I'm afraid I'm travelling this week, so I won't have time to look at it until next week either :(

vdauwera (Contributor) commented:

There's no rush on my end, FYI.

droazen (Collaborator, Author) commented Dec 1, 2016

Next week is fine guys, no worries! :)

Does the GS NIO provider currently support writing, @jean-philippe-martin? This particular method is trying to write a merged bam file to a bucket, so I'm wondering whether it would work even if the gs provider were present...

jean-philippe-martin (Contributor) commented:

@droazen yes, the GS NIO provider supports writing (though I've spent most of my focus on the reading side). You have to pick reading or writing when you open the file; you cannot do both on the same stream at the same time (I don't think that's the problem here).

What might be going on here is confusion between the NIO paths (which start with "gs://") and the GCS connector paths (which also start with "gs://"). Last time I checked, the NIO paths were passed via a different input argument so we didn't confuse the two, but it's possible some wires got crossed somewhere.
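
To illustrate the write path (the bucket and object names here are made up), it's the usual NIO call once you have a path; a minimal sketch:

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystem;

// Hypothetical bucket/object; the stream is opened for writing only (not read+write).
Path out = CloudStorageFileSystem.forBucket("example-bucket").getPath("test/merged.bam");
try (OutputStream os = Files.newOutputStream(out)) {
    os.write("example contents".getBytes(StandardCharsets.UTF_8));
}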

tomwhite (Contributor) commented Dec 5, 2016

I checked inside the Spark JAR, and META-INF/services/java.nio.file.spi.FileSystemProvider contains

com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider
hdfs.jsr203.HadoopFileSystemProvider

So it looks like the service loader file is being created correctly.
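
For what it's worth, a quick way to check what the JVM actually loaded at runtime (as opposed to what's listed in the jar's service file) is something along these lines:

import java.nio.file.spi.FileSystemProvider;

// Lists every filesystem provider the JVM has installed; "gs" should appear
// here if CloudStorageFileSystemProvider was picked up by the service loader.
for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
    System.out.println(p.getScheme() + " -> " + p.getClass().getName());
}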

I tried to reproduce this on GCS by running the following (essentially the same as @vdauwera's command):

time ./gatk-launch ApplyBQSRSpark \
    -I gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam \
    -R gs://gatk-legacy-bundles/b37/human_g1k_v37.2bit \
    -O gs://gatk-demo-tom/TEST/gatk4-spark/recalibrated.bam \
    -bqsr gs://gatk-demo/TEST/gatk4-spark/recalibration.table \
    -apiKey $GOOGLE_APPLICATION_CREDENTIALS \
    -- \
    --sparkRunner GCS \
    --cluster cluster-tom \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 10g

But I got another error earlier on. Any ideas what this could be? (I can see the input bam with gsutil cp)

org.broadinstitute.hellbender.exceptions.UserException: A USER ERROR has occurred: Failed to read bam header from gs://hellbender/test/resources/benchmark/CEUTrio.HiSeq.WEx.b37.NA12892.bam
 Caused by:Error reading null at position 0
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:182)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:376)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:357)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:347)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:109)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:167)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:186)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:95)
	at org.broadinstitute.hellbender.Main.instanceMain(Main.java:102)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:115)
	at org.broadinstitute.hellbender.Main.main(Main.java:157)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Error reading null at position 0
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.openStream(SeekableGCSStream.java:126)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.seek(SeekableGCSStream.java:103)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.<init>(SeekableGCSStream.java:59)
	at com.google.cloud.genomics.dataflow.readers.bam.BAMIO.openBAMFile(BAMIO.java:67)
	at com.google.cloud.genomics.dataflow.readers.bam.BAMIO.openBAM(BAMIO.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:178)
	... 20 more
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
Bad Request
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:380)
	at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:4792)
	at com.google.cloud.genomics.dataflow.readers.bam.SeekableGCSStream.openStream(SeekableGCSStream.java:121)
	... 25 more
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [06be217b-bc58-49b4-afde-3cd5ec16f18d] entered state [ERROR] while waiting for [DONE].

tomwhite (Contributor) commented Dec 5, 2016

@droazen helped me out with this. It was because I wasn't specifying the API key correctly.

jean-philippe-martin (Contributor) commented:

Spark hiding the GCS-NIO jar is indeed a problem we ran into before (see IOUtils.getPath). Given that this never worked, I wonder how we were writing merged bams to GCS before.
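
Roughly, that fallback retries the lookup with a class loader that can actually see the provider. A sketch of the shape of it (not the exact IOUtils code):

import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Paths.get(URI) only consults providers registered with the system class loader,
// which Spark's classpath handling can hide; on failure, retry with the context
// class loader, which does see the gs provider.
static Path getPath(String uriString) throws IOException {
    final URI uri = URI.create(uriString);
    try {
        return Paths.get(uri);
    } catch (FileSystemNotFoundException e) {
        final ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return FileSystems.newFileSystem(uri, Collections.emptyMap(), cl)
                .provider().getPath(uri);
    }
}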

droazen (Collaborator, Author) commented Dec 5, 2016

@jean-philippe-martin Possibly we were using the GCS<->HDFS adapter previously, and something changed in the code to make us use NIO here instead? (possibly HadoopGenomics/Hadoop-BAM#111?)

jean-philippe-martin (Contributor) commented:

I concur; it looks like what we have here is code that switched (accidentally?) from the GCS adapter to GCS-NIO.

tomwhite (Contributor) commented Dec 6, 2016

Yes, Hadoop-BAM uses the NIO API to do file merging, whereas in GATK we were using the Hadoop APIs (and therefore the GCS<->HDFS adapter) to do it.

It looks like there are a couple of things needed in GCS-NIO to use the NIO API for this.

  1. Make CloudStorageFileSystem#newFileSystem more lenient in the URIs it accepts (googleapis/google-cloud-java#1450), so that we don't have to special-case gs URIs to strip everything except the scheme and host when looking up the filesystem (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L40).
  2. Implement directory listing (googleapis/google-cloud-java#813) so that path matching is supported (see https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/util/NIOFileUtil.java#L90).

There may be more, as I stopped there. The best way forward is probably to go back to the old code in GATK while the deficiencies in GCS-NIO are fixed and then released.
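
To make the two failure modes concrete, these are essentially the NIO calls involved (the bucket and object names are placeholders, and the glob pattern is only illustrative):

import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.util.Collections;

// 1. Hadoop-BAM hands a full object URI to newFileSystem; the GCS provider
//    currently rejects any URI that carries a path component.
FileSystem fs = FileSystems.newFileSystem(
        URI.create("gs://example-bucket/output.parts/_SUCCESS"),
        Collections.emptyMap());

// 2. It then matches the part files with a glob; getPathMatcher is currently
//    unsupported in the GCS provider.
PathMatcher matcher = fs.getPathMatcher("glob:**/part-r-*");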

The stacktrace I got for 1 was:

java.lang.IllegalArgumentException: GCS FileSystem URIs mustn't have: port, userinfo, path, query, or fragment: gs://gatk-demo-tom/TEST/markdups.parts/_SUCCESS
	at shaded.cloud-nio.com.google.common.base.Preconditions.checkArgument(Preconditions.java:146)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:192)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:83)
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:336)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:54)
	at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:230)

And for 2:

java.lang.UnsupportedOperationException
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystem.getPathMatcher(CloudStorageFileSystem.java:219)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.getFilesMatching(NIOFileUtil.java:94)
	at org.seqdoop.hadoop_bam.util.SAMFileMerger.mergeParts(SAMFileMerger.java:62)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReadsSingle(ReadsSparkSink.java:230)

The command I ran was:

time ./gatk-launch MarkDuplicatesSpark \
    -I gs://hellbender/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
    -O gs://gatk-demo-tom/TEST/markdups \
    --METRICS_FILE /tmp/markdups-metrics \
    -apiKey $APIKEY_ORTMP \
    -- \
    --sparkRunner GCS \
    --cluster cluster-tom \
    --num-executors 40 \
    --executor-cores 4 \
    --executor-memory 10g

tomwhite (Contributor) commented Dec 6, 2016

It would be helpful to have a failing unit test for this that runs on files in GCS.

droazen (Collaborator, Author) commented Dec 6, 2016

@tomwhite Can we get a quick fix here while we wait for the NIO improvements you suggested? Perhaps we could create a separate copy of the bam merger in GATK4 that uses the old code path, and use that copy for now instead of Hadoop-BAM's?

tomwhite (Contributor) commented Dec 8, 2016

@droazen That's probably the best way to do it, although I don't think we can be sure of a quick fix until we have a failing unit test for this on GCS. The turnaround time for manual testing is slow (for me at the moment). If someone could write such a test that would be very helpful.

droazen (Collaborator, Author) commented Dec 8, 2016

@tomwhite If you could put together a quick PR against GATK4 with a temporary copy of the bam merger that uses the old code path, I'd be happy to test it.

tomwhite (Contributor) commented Dec 9, 2016

@droazen thanks. I've created #2306, which reverts the commits that caused the problem on GCS. I've tested the above command (MarkDuplicatesSpark) on GCS and it works fine.

jean-philippe-martin (Contributor) commented Dec 9, 2016 via email

tomwhite (Contributor) commented:

@jean-philippe-martin thanks for fixing the two GCS problems. I've filed another issue to do with walking the filesystem here: googleapis/google-cloud-java#1519. /cc @droazen @lbergelson

droazen modified the milestones: beta, 4.0 release (May 30, 2017)
droazen (Collaborator, Author) commented May 30, 2017

@tomwhite @jean-philippe-martin Is this one done? Can we close this?

tomwhite (Contributor) commented:

Yes

droazen closed this as completed on May 31, 2017