[SPARK-28218][SQL] Migrate Avro to File Data Source V2 #25017
Conversation
This is the last migration for file source V2. It is a relatively simple one. Please help review it.

Test build #107051 has finished for PR 25017 at commit

Thank you for pinging me, @gengliangwang.

Retest this please.
Resolved review threads (outdated) on external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala
Test build #107144 has finished for PR 25017 at commit

Test build #107217 has finished for PR 25017 at commit
    job: Job,
    options: Map[String, String],
    dataSchema: StructType): OutputWriterFactory = {
  val parsedOptions = new AvroOptions(options, job.getConfiguration)
Previously, this was the following (sharedState.sparkContext.hadoopConfiguration + SQLConf). Is job.getConfiguration enough for Avro?
val parsedOptions = new AvroOptions(options, spark.sessionState.newHadoopConf())
Yes, it is enough. ORC/Parquet also use the configuration from the job.
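For context, a minimal sketch of the fallback pattern under discussion (hypothetical helper, not the actual `AvroOptions` implementation): the user-supplied option wins, and only then does the Hadoop configuration apply, so either way of obtaining the configuration works as long as it carries the session's settings.

```scala
// Hypothetical sketch of option resolution. `confGet` stands in for
// Hadoop's Configuration.get; the key name is the legacy Hadoop key that
// Avro's old MapReduce input format used (assumed here for illustration).
def resolveIgnoreExtension(
    parameters: Map[String, String],
    confGet: String => Option[String]): Boolean = {
  parameters.get("ignoreExtension")                                 // user option wins
    .orElse(confGet("avro.mapred.ignore.inputs.without.extension")) // then Hadoop conf
    .map(_.toBoolean)
    .getOrElse(false)                                               // default: respect the extension
}
```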
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroDataSourceV2.scala
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala
  val parsedOptions = new AvroOptions(options, conf)
  val userProvidedSchema = parsedOptions.schema.map(new Schema.Parser().parse)

  if (parsedOptions.ignoreExtension || partitionedFile.filePath.endsWith(".avro")) {
Shall we have the same comment above this line, so we don't forget it?

// TODO Removes this check once `FileFormat` gets a general file filtering interface method.
// Doing input file filtering is improper because we may generate empty tasks that process no
// input files but stress the scheduler. We should probably add a more general input file
// filtering mechanism for `FileFormat` data sources. See SPARK-16317.
Actually, there is an option pathGlobFilter for this. I have marked it as deprecated in #24518.
I think we can still support it in 3.0, so I am not sure what to comment here.
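For context, `pathGlobFilter` is a read option that prunes non-matching files when the file listing is planned; a usage sketch (assuming an active `SparkSession` named `spark` and a hypothetical path):

```scala
// Only files matching the glob are listed, so no tasks are scheduled for
// the skipped files at all, unlike the per-file extension check inside
// the reader shown above.
val df = spark.read
  .format("avro")
  .option("pathGlobFilter", "*.avro")
  .load("/path/to/dir-with-mixed-files")
```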
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroWriteBuilder.scala
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScan.scala
    paths: Seq[String],
    userSpecifiedSchema: Option[StructType],
    fallbackFileFormat: Class[_ <: FileFormat])
  extends FileTable(sparkSession, options, paths, userSpecifiedSchema) with Logging {
Let's remove `with Logging` and line 23.
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroTable.scala
dongjoon-hyun left a comment
I left a few comments. Could you update the PR, @gengliangwang ?
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroDataSourceV2.scala
import org.apache.spark.sql.avro.{AvroDeserializer, AvroOptions}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.execution.datasources.v2._
-import org.apache.spark.sql.execution.datasources.v2._
+import org.apache.spark.sql.execution.datasources.v2.{EmptyPartitionReader, FilePartitionReaderFactory, PartitionReaderWithPartitionValues}
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroTable.scala
Resolved review thread (outdated) on external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroWriteBuilder.scala
@dongjoon-hyun I have updated the code. Thanks for reviewing this during your vacation!

Test build #107263 has finished for PR 25017 at commit
dongjoon-hyun left a comment
+1, LGTM. Merged to master.
Thank you, @gengliangwang !
What changes were proposed in this pull request?
Migrate Avro to File source V2.
How was this patch tested?
Unit tests.
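A usage sketch of the public API after the migration (paths hypothetical; the V2 implementation is selected internally, so the user-facing calls are unchanged):

```scala
// Reading and writing Avro goes through the same DataFrame API as before;
// the migration swaps the implementation underneath, not the API surface.
// `spark` is assumed to be an active SparkSession.
val events = spark.read.format("avro").load("/data/events")
events.write.format("avro").save("/data/events_copy")
```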