[BUG][ORC] GpuInsertIntoHadoopFsRelationCommand should use staging directory for dynamic partition overwrite #7378

@abellina

Description

Spark treats Parquet differently from ORC when handling a dynamic partition overwrite (https://issues.apache.org/jira/browse/SPARK-20236).
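
For reference, a dynamic partition overwrite can be triggered along these lines (a minimal sketch; the `spark` session, output path, and partition value are illustrative):

    import org.apache.spark.sql.functions.lit

    // Only partitions present in the incoming data are replaced; the write is
    // expected to go through a .spark-staging-<jobId> directory at commit time.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    spark.range(10)
      .withColumn("my_partition", lit("PART"))
      .write
      .mode("overwrite")
      .partitionBy("my_partition")
      .orc("/tmp/orc_data")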

In this mode, Spark will use the FileOutputCommitter subclass (e.g. ParquetOutputCommitter) configured in SQLConf.OUTPUT_COMMITTER_CLASS, if one is set. Parquet writers set this config, but ORC does not (and neither does our plugin, since this setup code was ported over from Spark).
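
For context, here is a compressed sketch of that lookup, paraphrasing Spark's SQLHadoopMapReduceCommitProtocol.setupCommitter (not the verbatim source; the real code also handles committer constructors that take an output path):

    import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}

    // A committer class named in the SQL conf wins over the default committer
    // obtained from the output format.
    def setupCommitter(
        context: TaskAttemptContext,
        default: OutputCommitter): OutputCommitter = {
      // SQLConf.OUTPUT_COMMITTER_CLASS.key is "spark.sql.sources.outputCommitterClass"
      val clazz = context.getConfiguration.getClass(
        "spark.sql.sources.outputCommitterClass", null, classOf[OutputCommitter])
      if (clazz != null) {
        // Parquet's write path sets this key (ParquetOutputCommitter by default);
        // the ORC write path leaves it unset, so ORC keeps the default committer.
        clazz.getDeclaredConstructor().newInstance()
      } else {
        default
      }
    }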

The issue is that InsertIntoHadoopFsRelationCommand has special handling for dynamic partition overwrite, introduced as a bug fix in apache/spark#29000. Our GpuInsertIntoHadoopFsRelationCommand is missing part of that patch, so the fix is to bring it into our plugin, specifically: https://github.com/apache/spark/pull/29000/files#diff-15b529afe19e971b138fc604909bcab2e42484babdcea937f41d18cb22d9401dR167
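
For reference, the relevant part of that change looks roughly like the following (paraphrased, not verbatim; outputPath, jobId, fs, and qualifiedOutputPath are locals of the surrounding command):

    import org.apache.spark.internal.io.FileCommitProtocol

    // For dynamic partition overwrite, point the committer at the
    // .spark-staging-<jobId> directory; commitJob later renames the completed
    // partition directories from there into the final output path.
    val committerOutputPath = if (dynamicPartitionOverwrite) {
      FileCommitProtocol.getStagingDir(outputPath.toString, jobId)
        .makeQualified(fs.getUri, fs.getWorkingDirectory)
    } else {
      qualifiedOutputPath
    }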

Stack trace from the repro case for the test I added:

Caused by: java.io.IOException: Failed to rename hdfs://localhost:9000/tmp/pyspark_tests/main-15540-933867848/ORC_DATA/GPU/.spark-staging-931a039e-17b3-468b-8a20-30a3f90d562a/my_partition=PART to hdfs://localhost:9000/tmp/pyspark_tests/main-15540-933867848/ORC_DATA/GPU/my_partition=PART when committing files staged for overwriting dynamic partitions
  at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$13(HadoopMapReduceCommitProtocol.scala:224)
  at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$13$adapted(HadoopMapReduceCommitProtocol.scala:208)
  at scala.collection.immutable.Set$Set1.foreach(Set.scala:97)
  at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:208)
  at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$18(GpuFileFormatWriter.scala:256)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.nvidia.spark.TimingUtils$.timeTakenMs(TimingUtils.scala:25)
  at org.apache.spark.sql.rapids.GpuFileFormatWriter$.write(GpuFileFormatWriter.scala:256)
