[SPARK-25348][SQL] Data source for binary files #24354

WeichenXu123 · 2019-04-12T00:15:07Z

What changes were proposed in this pull request?

Implement binary file data source in Spark.

Format name: "binaryFile" (case-insensitive)

Schema:

content: BinaryType
status: StructType
- path: StringType
- modificationTime: TimestampType
- length: LongType

Options:

pathGlobFilter (instead of pathFilterRegex) to rely on GlobFilter behavior
maxBytesPerPartition is not implemented since it is controlled by two SQL confs: maxPartitionBytes and openCostInBytes.

How was this patch tested?

Unit test added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

WeichenXu123 · 2019-04-12T00:20:24Z

@mengxr

Discussion:

Change the option name pathFilterRegex? The Hadoop glob does not support full regex syntax.
Figure out a proper way to implement the option maxBytesPerPartition? The current data source determines the number of partitions automatically but does not provide an interface to control it.
For the option pathFilterRegex, do we need to specify a filter on the full path? In spark.read.format(...).load(path) we have already specified the directory path, so pathFilterRegex could apply only to the last part of the full path.
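To make the glob-versus-regex and "last part of the path" questions concrete, here is a minimal sketch. It uses the JDK's `java.nio` glob `PathMatcher` as a stand-in for Hadoop's `GlobFilter` (an assumption made to keep the example dependency-free; the two glob dialects are similar but not guaranteed identical). The object and method names are hypothetical.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobSketch {
  // Returns true when the LAST component of `path` (the file name)
  // matches `glob`. The directory part is ignored, which mirrors the
  // idea that load(path) already fixed the directory.
  def nameMatches(glob: String, path: String): Boolean = {
    // "glob:" selects glob syntax (wildcards such as * and ?), not a full regex.
    val matcher = FileSystems.getDefault.getPathMatcher("glob:" + glob)
    matcher.matches(Paths.get(path).getFileName)
  }
}
```

With this shape, `GlobSketch.nameMatches("*.png", "/data/images/cat.png")` accepts the file regardless of which directory it sits in, while a `.json` file in the same directory is rejected.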

mengxr · 2019-04-12T00:52:58Z

.../main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileDataSource.scala

+/**
+ * `binaryfile` package implements Spark SQL data source API for loading binary file data
+ * as `DataFrame`.
+ *


Please also document how to control the input partition size. cc: @cloud-fan

SparkQA · 2019-04-12T04:08:37Z

Test build #104531 has finished for PR 24354 at commit 42d1fc9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-12T04:58:40Z

Test build #104533 has finished for PR 24354 at commit 373af0f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2019-04-12T05:53:12Z

Jenkins, retest this please.

SparkQA · 2019-04-12T06:01:37Z

Test build #104534 has finished for PR 24354 at commit a7aed42.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-12T07:05:02Z

Test build #104538 has finished for PR 24354 at commit a7aed42.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

mengxr

Still need some minor changes.

mengxr · 2019-04-12T18:41:52Z

@cloud-fan @gatorsmile I think this PR is almost ready to merge. Could you make a pass?

SparkQA · 2019-04-12T22:41:48Z

Test build #104556 has finished for PR 24354 at commit 55a6858.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-12T22:53:15Z

Test build #104557 has finished for PR 24354 at commit c3d4411.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-04-14T07:57:50Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+      sparkSession: SparkSession,
+      options: Map[String, String],
+      path: Path): Boolean = {
+    false


Are we sure about this? Always returning false means one file per RDD partition.

I don't think binary partitions should be splittable like binaryFiles because usually one binary is a minimal logical unit for arbitrary binary files.

@cloud-fan Does it mean that the file itself cannot be split into multiple parts? It shouldn't lead to one file per partition.

If isSplitable returns false, then Spark can only read the entire file with a single thread, so it's one file per partition.

The file splitting is actually very complicated. For example, the text format splits the file w.r.t. the line boundary. A line of text will not be split into multiple partitions. I'm not sure how to define the file splitting logic for binary files.

"It shouldn't lead to one file per partition."

@mengxr, do you mean one binary file should be split into multiple parts? In that case, the splitting rule should be defined and fixed so that end users can process and use the parts. If users are not aware of the splitting rule, there is no way for them to use the result (an image, for instance).

I thought this will be implemented by maxBytesPerPartition, and by default one file per one partition.

@cloud-fan Why does a single thread lead to one file per partition? One partition can still have multiple files, but one file cannot be split into multiple records.

@HyukjinKwon We don't want to split a file into parts.

To be clear: one file per file partition. It's still possible that one RDD partition contains many files.

(Oops, I was confused that we were talking about one partition that has multiple parts)
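The distinction in this thread (a file is never split, yet one partition may hold many files) can be sketched as a simple packing loop. This is a simplified illustration, not Spark's actual code: it assumes the size limit is `maxPartitionBytes` taken directly, whereas Spark derives its effective limit from several settings. `FileInfo` and `packFiles` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object PackingSketch {
  final case class FileInfo(path: String, length: Long)

  // Packs whole files into partitions. A file is never split; a partition
  // is closed once adding the next file would exceed maxPartitionBytes.
  // openCostInBytes is charged per file so that many tiny files do not
  // all land in a single partition.
  def packFiles(
      files: Seq[FileInfo],
      maxPartitionBytes: Long,
      openCostInBytes: Long): Seq[Seq[FileInfo]] = {
    val partitions = ArrayBuffer.empty[Seq[FileInfo]]
    val current = ArrayBuffer.empty[FileInfo]
    var currentSize = 0L

    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    // Largest files first, so big files tend to get their own partition.
    files.sortBy(-_.length).foreach { file =>
      if (currentSize + file.length > maxPartitionBytes) closePartition()
      currentSize += file.length + openCostInBytes
      current += file
    }
    closePartition()
    partitions.toList
  }
}
```

Note that a file larger than `maxPartitionBytes` still ends up whole in its own partition, which matches the "one file cannot be split" position above.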

cloud-fan · 2019-04-14T07:58:59Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+
+      // TODO: Improve performance here: each file will recompile the glob pattern here.
+      val globFilter = if (pathGlobPattern.isEmpty) { null } else {
+        new GlobFilter(pathGlobPattern)


Is GlobFilter serializable? If it is then we can create it outside of the closure.

Not serializable so I put it inside.
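One common way around a non-serializable helper, shown here only as a hedged sketch and not as what this PR does, is to ship the pattern string and rebuild the matcher lazily on the receiving side. The example uses the JDK's `PathMatcher` (also not serializable) in place of Hadoop's `GlobFilter`; `LazyGlob` and `roundTrip` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.file.{FileSystems, PathMatcher, Paths}

object LazyGlobSketch {
  // Only the pattern string is serialized. The matcher is transient and
  // is compiled at most once per deserialized instance, on first use.
  class LazyGlob(val pattern: String) extends Serializable {
    @transient private lazy val matcher: PathMatcher =
      FileSystems.getDefault.getPathMatcher("glob:" + pattern)

    def accept(path: String): Boolean =
      matcher.matches(Paths.get(path).getFileName)
  }

  // Simulates shipping the object to another JVM via Java serialization.
  def roundTrip(glob: LazyGlob): LazyGlob = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(glob)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[LazyGlob] finally in.close()
  }
}
```

Compared with constructing the filter inside the per-file closure, this would compile the pattern once per deserialized copy rather than once per file, which is the cost the TODO in the diff refers to.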

cloud-fan · 2019-04-14T08:15:41Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+
+        val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)
+
+        val row = Row(Row(path, modificationTime, length), content)


Since the schema is simple, we can create InternalRow directly, instead of creating Row and using RowEncoder.

String values should be UTF8String, and timestamp values should be a long counting microseconds since January 1, 1970 UTC.

cloud-fan · 2019-04-14T08:17:16Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+      dataSchema: StructType,
+      partitionSchema: StructType,
+      requiredSchema: StructType,
+      filters: Seq[Filter],


are we going to leverage the filters here?

I can do that in a later PR.

HyukjinKwon

Sorry, but I have to ask this. What is our plan for adding new data sources? Will they be external modules like Avro or Kafka, or is it decided case by case? For instance, do only data sources with rather complex dependencies go into external modules?

WeichenXu123 · 2019-04-16T04:23:28Z

@HyukjinKwon Done. Thanks!

mengxr

final pass

mengxr · 2019-04-16T05:27:03Z

.../main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileDataSource.scala

+   * See doc in `BinaryFileDataSource`
+   */
+  val binaryFileSchema = StructType(
+    StructField("content", BinaryType, false)::


space before ::

Just a note: we might keep this column nullable to handle potential I/O failures.

mengxr · 2019-04-16T05:31:05Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+          content,
+          InternalRow(
+            UTF8String.fromString(path),
+            DateTimeUtils.fromJavaTimestamp(modificationTime),


Is it more straightforward to use fromMillis?

yes, we should use DateTimeUtils.fromMillis(fileStatus.getModificationTime())
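For context on why a millisecond-based conversion is the more direct route: the file status reports modification time as epoch milliseconds, and the internal timestamp representation discussed above is epoch microseconds, so the conversion is a scale by 1000. The sketch below is plain arithmetic under that assumption and is not Spark's `DateTimeUtils` code; `millisToMicros` is a hypothetical name.

```scala
object TimestampSketch {
  private val MicrosPerMilli = 1000L

  // Converts epoch milliseconds to epoch microseconds.
  // multiplyExact throws ArithmeticException on overflow instead of
  // silently wrapping around.
  def millisToMicros(millis: Long): Long =
    Math.multiplyExact(millis, MicrosPerMilli)
}
```

Going through `java.sql.Timestamp` first would allocate an intermediate object only to arrive at the same number.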

mengxr · 2019-04-16T05:32:44Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+   * only include files with path matching the glob pattern.
+   */
+  val pathGlobFilter: Option[String] = {
+    val filter = parameters.getOrElse("pathGlobFilter", null)


Just parameters.get("pathGlobFilter") should work
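The suggestion can be illustrated with a plain Scala `Map`. This is a sketch under two assumptions: that the truncated original wraps the nullable value in `Option(...)`, and that `parameters` behaves like an ordinary `Map[String, String]` (Spark's option maps are case-insensitive, which this sketch does not model). Both method names are hypothetical.

```scala
object OptionsSketch {
  // Roundabout version: fetch with a null default, then wrap the null away.
  def viaGetOrElse(parameters: Map[String, String]): Option[String] = {
    val filter = parameters.getOrElse("pathGlobFilter", null)
    Option(filter)
  }

  // Direct version: Map.get already returns Option[String].
  def viaGet(parameters: Map[String, String]): Option[String] =
    parameters.get("pathGlobFilter")
}
```

Both return the same result for present and absent keys; the second simply avoids the detour through `null`.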

SparkQA · 2019-04-16T07:05:01Z

Test build #104614 has finished for PR 24354 at commit 2b1780f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-16T07:05:02Z

Test build #104613 has finished for PR 24354 at commit aab4dcd.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-04-16T07:07:34Z

retest this please

cloud-fan · 2019-04-16T10:14:30Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+          requiredSchema.fieldNames.contains(a.name)
+        }
+
+        val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)


This does not help performance: we still read the file content even if the content column is not required.

This is OK for now; maybe we can leave a TODO and implement real column pruning in the future.
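What "real column pruning" could look like, as a hedged sketch: defer the expensive read behind a by-name parameter so it only runs when the content column is actually requested. This is an illustration, not the follow-up implementation; it flattens the schema to three columns for brevity, and `buildRow` is a hypothetical helper.

```scala
object PruningSketch {
  // Builds only the requested columns. `readContent` is a by-name
  // parameter, so the file is read only if "content" is in `required`.
  def buildRow(
      required: Seq[String],
      path: String,
      length: Long,
      readContent: => Array[Byte]): Map[String, Any] = {
    val pairs: Seq[(String, Any)] = required.map {
      case "path" => "path" -> path
      case "length" => "length" -> length
      case "content" => "content" -> readContent
      case other => throw new IllegalArgumentException(s"Unknown column: $other")
    }
    pairs.toMap
  }
}
```

A metadata-only query such as selecting just the path and length would then never open the file, whereas projecting after the read, as in the diff above, pays the I/O cost regardless.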

SparkQA · 2019-04-16T11:10:42Z

Test build #104618 has finished for PR 24354 at commit 2b1780f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-04-16T16:39:58Z

.../main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileDataSource.scala

+ * {{{
+ *   // Scala
+ *   val df = spark.read.format("binaryFile")
+ *     .option("pathGlobFilter", "*.txt")


Nit: how about changing the extension name "*.txt" in the example, e.g. *.png or *.jpg

gengliangwang · 2019-04-16T16:59:28Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+      val path = file.filePath
+      val fsPath = new Path(path)
+
+      // TODO: Improve performance here: each file will recompile the glob pattern here.


I think we should make it a general option that can be applied to all data sources. We should also pass the option to FileIndex, so that Spark can split file partitions more precisely.
We can have a follow-up PR for this.

gengliangwang · 2019-04-16T17:03:49Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+import org.apache.spark.util.SerializableConfiguration
+
+
+private[binaryfile] class BinaryFileFormat extends FileFormat with DataSourceRegister {


As per https://issues.apache.org/jira/browse/SPARK-16964, I think we can remove private[binaryfile]

gengliangwang · 2019-04-16T17:04:17Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+  }
+}
+
+private[binaryfile] class BinaryFileSourceOptions(


Remove private[binaryfile] here as well.

mengxr · 2019-04-16T19:07:10Z

@WeichenXu123 I sent you a PR at WeichenXu123#6 to address @HyukjinKwon 's comment on the docs.

SparkQA · 2019-04-16T21:49:38Z

Test build #104634 has finished for PR 24354 at commit 46a07e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class BinaryFileFormat extends FileFormat with DataSourceRegister
class BinaryFileSourceOptions(

HyukjinKwon · 2019-04-16T22:34:51Z

Thanks, @WeichenXu123 and @mengxr for bearing with me. I'm okay with this.

mengxr · 2019-04-16T22:43:26Z

LGTM. Merged into master. I created two follow-up tasks:

filter push down: SPARK-27473
user guide: SPARK-27472

SparkQA · 2019-04-17T00:07:00Z

Test build #104637 has finished for PR 24354 at commit dd8e8c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-04-24T14:50:51Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+        val stream = fs.open(fsPath)
+
+        val content = try {
+          ByteStreams.toByteArray(stream)


If I remember correctly, the usual behavior in Spark is to prefer a null value over throwing an exception. At this point, should we assign content a null value instead of throwing an exception?

Oh, we can control it with ignoreCorruptFiles.

viirya · 2019-04-24T14:52:33Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+        val content = try {
+          ByteStreams.toByteArray(stream)
+        } finally {
+          Closeables.close(stream, true)


Related to the comment above, should we not propagate IO exceptions?
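To make the exception question concrete, here is a dependency-free sketch of the read-then-close pattern in the diff, using only `java.io` rather than Guava's `ByteStreams` and `Closeables`. One deliberate difference, stated as an assumption of this sketch: a failure in `close()` propagates here, whereas `Closeables.close(stream, true)` in the diff swallows it. `readFully` is a hypothetical name.

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object ReadSketch {
  // Reads the whole stream into memory. The stream is closed in `finally`,
  // so it is released even when a read fails; the read failure itself
  // still propagates to the caller.
  def readFully(stream: InputStream): Array[Byte] = {
    try {
      val out = new ByteArrayOutputStream()
      val buffer = new Array[Byte](8192)
      var n = stream.read(buffer)
      while (n != -1) {
        out.write(buffer, 0, n)
        n = stream.read(buffer)
      }
      out.toByteArray
    } finally {
      stream.close()
    }
  }
}
```

Because the read error reaches the caller, a higher layer remains free to decide whether to fail the task or skip the file, which is the role `ignoreCorruptFiles` is said to play in the reply above.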

HyukjinKwon · 2019-04-28T15:56:03Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala

+      StructField("status", fileStatusSchema, false) :: Nil)
+}
+
+class BinaryFileSourceOptions(


Not a big deal at all, but let me leave a note before I forget: BinaryFileSourceOptions -> BinaryFileOptions, to be consistent with the [SourceName]Options pattern: TextOptions, OrcOptions, ParquetOptions, CSVOptions, JDBCOptions, ImageOptions, etc.

…or all file sources

## What changes were proposed in this pull request?

### Background:
The data source option `pathGlobFilter` is introduced for Binary file format: #24354 , which can be used for filtering file names, e.g. reading `.png` files only while there is `.json` files in the same directory.

### Proposal:
Make the option `pathGlobFilter` as a general option for all file sources. The path filtering should happen in the path globbing on Driver.

### Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly.

### Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.

## How was this patch tested?

Unit tests

Closes #24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang
Signed-off-by: HyukjinKwon

## What changes were proposed in this pull request?

Implement binary file data source in Spark.

Format name: "binaryFile" (case-insensitive)

Schema:
- content: BinaryType
- status: StructType
  - path: StringType
  - modificationTime: TimestampType
  - length: LongType

Options:
* pathGlobFilter (instead of pathFilterRegex) to rely on GlobFilter behavior
* maxBytesPerPartition is not implemented since it is controlled by two SQL confs: maxPartitionBytes and openCostInBytes.

## How was this patch tested?

Unit test added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#24354 from WeichenXu123/binary_file_datasource.

Lead-authored-by: WeichenXu
Co-authored-by: Xiangrui Meng
Signed-off-by: Xiangrui Meng

init pr (42d1fc9)

mengxr requested changes Apr 12, 2019

mengxr changed the title from "[SPARK-25348][SQL][ML] Data source for binary files" to "[SPARK-25348][SQL] Data source for binary files" Apr 12, 2019

WeichenXu123 added 2 commits April 11, 2019 18:46

update (373af0f)

address comments (a7aed42)

felixcheung reviewed Apr 12, 2019

mengxr requested changes Apr 12, 2019

address comments (55a6858)

mengxr reviewed Apr 12, 2019

address comments (c3d4411)

mengxr approved these changes Apr 12, 2019

cloud-fan reviewed Apr 14, 2019

HyukjinKwon reviewed Apr 14, 2019 (5 reviews)

update (2b1780f)

mengxr requested changes Apr 16, 2019

cloud-fan reviewed Apr 16, 2019

gengliangwang reviewed Apr 16, 2019

address comments (46a07e3)

Xiangrui address remaining comments (#6) (dd8e8c6)

mengxr approved these changes Apr 16, 2019

asfgit closed this in 1bb0c8e Apr 16, 2019

WeichenXu123 deleted the binary_file_datasource branch April 16, 2019 23:25

viirya reviewed Apr 24, 2019

HyukjinKwon reviewed Apr 28, 2019

gengliangwang mentioned this pull request May 2, 2019

[SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources #24518 (Closed)

