Contributor

WeichenXu123 commented Apr 12, 2019

What changes were proposed in this pull request?

Implement binary file data source in Spark.

Format name: "binaryFile" (case-insensitive)

Schema:

  • content: BinaryType
  • status: StructType
    • path: StringType
    • modificationTime: TimestampType
    • length: LongType

Options:

  • pathGlobFilter (instead of pathFilterRegex), to rely on GlobFilter behavior
  • maxBytesPerPartition is not implemented, since partition size is already controlled by two SQL confs: spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.

How was this patch tested?

Unit test added.

@WeichenXu123 (Contributor Author)

@mengxr

Discussion:

  • Should we rename the option pathFilterRegex? Hadoop globs do not support full regex syntax.
  • How should we implement the option maxBytesPerPartition? The current data source determines the number of partitions automatically but does not provide an interface to control it.
  • Does pathFilterRegex need to match the full path? spark.read.format(...).load(path) already specifies the directory, so the filter would only need to match the last part of the path.
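
To make the glob-versus-regex question concrete, here is a minimal sketch. It uses the JDK's `PathMatcher` as a stand-in for Hadoop's `GlobFilter`; the two glob dialects are similar but not identical, so treat it as an approximation. The object `GlobSketch` and its method names are hypothetical. The pattern is applied to the last path component only, as the third bullet suggests.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobSketch {
  // Apply a glob to the file name only (the last path component),
  // since load(path) already pins down the directory.
  def globAccepts(glob: String, fullPath: String): Boolean = {
    val matcher = FileSystems.getDefault.getPathMatcher("glob:" + glob)
    matcher.matches(Paths.get(fullPath).getFileName)
  }

  // A regex expressing the same intent needs explicit escaping of the dot.
  def regexAccepts(regex: String, fullPath: String): Boolean =
    Paths.get(fullPath).getFileName.toString.matches(regex)
}
```

With this sketch, `GlobSketch.globAccepts("*.png", "/data/images/cat.png")` and `GlobSketch.regexAccepts(".*\\.png", "/data/images/cat.png")` express the same filter, which is why naming a glob option "Regex" would be misleading.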

/**
* `binaryfile` package implements Spark SQL data source API for loading binary file data
* as `DataFrame`.
*
Contributor

Please also document how to control the input partition size. cc: @cloud-fan

mengxr changed the title from "[SPARK-25348][SQL][ML] Data source for binary files" to "[SPARK-25348][SQL] Data source for binary files" Apr 12, 2019

SparkQA commented Apr 12, 2019

Test build #104531 has finished for PR 24354 at commit 42d1fc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 12, 2019

Test build #104533 has finished for PR 24354 at commit 373af0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor Author)

Jenkins, retest this please.

SparkQA commented Apr 12, 2019

Test build #104534 has finished for PR 24354 at commit a7aed42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 12, 2019

Test build #104538 has finished for PR 24354 at commit a7aed42.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

mengxr left a comment

Still need some minor changes.

Contributor

mengxr commented Apr 12, 2019

@cloud-fan @gatorsmile I think this PR is almost ready to merge. Could you make a pass?

SparkQA commented Apr 12, 2019

Test build #104556 has finished for PR 24354 at commit 55a6858.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 12, 2019

Test build #104557 has finished for PR 24354 at commit c3d4411.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkSession: SparkSession,
options: Map[String, String],
path: Path): Boolean = {
false
Contributor

Are we sure about this? Always returning false means one file per RDD partition.

Member

HyukjinKwon commented Apr 14, 2019

I don't think binary partitions should be splittable like binaryFiles because usually one binary is a minimal logical unit for arbitrary binary files.

Contributor

@cloud-fan Does it mean that the file itself cannot be split into multiple parts? It shouldn't lead to one file per partition.

Contributor

If isSplitable returns false, then Spark can only read the entire file with a single thread, so it's one file per partition.

The file splitting is actually very complicated. For example, the text format splits the file w.r.t. the line boundary. A line of text will not be split into multiple partitions. I'm not sure how to define the file splitting logic for binary files.

Member

HyukjinKwon commented Apr 16, 2019

> It shouldn't lead to one file per partition.

@mengxr, do you mean one binary file should be split into multiple parts? In that case, the splitting rule should be defined and fixed so that end users can process it and use it. If users are not aware of the rule to split up, there wouldn't be a way for users to use it (for instance image).

I thought this will be implemented by maxBytesPerPartition, and by default one file per one partition.

Contributor

@cloud-fan Why does a single thread lead to one file per partition? One partition can still have multiple files, but one file cannot be split into multiple records.

@HyukjinKwon We don't want to split a file into parts.

Contributor

To be clear: one file per file partition. It's still possible that one RDD partition contains many files.
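
The distinction drawn here (files are never split, yet several whole files can share one partition) can be sketched as a small bin-packing routine. This is an illustration of how the two SQL confs are understood to interact, not Spark's actual code: `PackFiles`, `FileInfo`, and the parameter names are hypothetical, with `maxPartitionBytes` and `openCostInBytes` standing in for spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes.

```scala
import scala.collection.mutable.ArrayBuffer

object PackFiles {
  final case class FileInfo(path: String, length: Long)

  // Target partition size: capped by maxPartitionBytes, but shrunk when the
  // total input is small relative to the available parallelism.
  def maxSplitBytes(files: Seq[FileInfo], maxPartitionBytes: Long,
                    openCostInBytes: Long, parallelism: Int): Long = {
    val totalBytes = files.map(_.length + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Pack whole (unsplittable) files into partitions, largest first.
  // Each file is charged its length plus a fixed open cost.
  def pack(files: Seq[FileInfo], maxSplit: Long,
           openCostInBytes: Long): Seq[Seq[FileInfo]] = {
    val partitions = ArrayBuffer.empty[Seq[FileInfo]]
    var current = Vector.empty[FileInfo]
    var currentSize = 0L

    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current
      current = Vector.empty
      currentSize = 0L
    }

    files.sortBy(-_.length).foreach { f =>
      if (currentSize + f.length > maxSplit) closePartition()
      currentSize += f.length + openCostInBytes
      current = current :+ f
    }
    closePartition()
    partitions.toList
  }
}
```

A file larger than the target size still lands in a partition of its own, which is the "one file cannot be split" half of the discussion; small files are grouped, which is the "one partition can have many files" half.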

Member

(Oops, I was confused that we were talking about one partition that has multiple parts)


// TODO: Improve performance here: each file will recompile the glob pattern here.
val globFilter = if (pathGlobPattern.isEmpty) { null } else {
new GlobFilter(pathGlobPattern)
Contributor

Is GlobFilter serializable? If it is then we can create it outside of the closure.

Contributor Author

Not serializable so I put it inside.
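
One common way around a non-serializable helper, sketched here under assumptions, is to ship only the pattern string and rebuild the matcher lazily on whichever JVM first uses it. This is not what the PR does (it constructs the filter inside the closure); `LazyGlobFilter` is a hypothetical class, and the JDK's `PathMatcher` again stands in for Hadoop's `GlobFilter`.

```scala
import java.nio.file.{FileSystems, PathMatcher, Paths}

// Hypothetical wrapper: only `pattern` is serialized. The matcher is marked
// @transient, so it is skipped during serialization and recompiled on first
// use after deserialization, once per instance instead of once per file.
final class LazyGlobFilter(val pattern: String) extends Serializable {
  @transient private lazy val matcher: PathMatcher =
    FileSystems.getDefault.getPathMatcher("glob:" + pattern)

  def accept(fileName: String): Boolean =
    matcher.matches(Paths.get(fileName))
}
```

An instance created outside the closure could then be captured by it, which would also address the TODO about recompiling the glob pattern for every file.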


val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)

val row = Row(Row(path, modificationTime, length), content)
Contributor

since the schema is simple, we can create InternalRow directly, instead of creating Row and using RowEncoder.

The string type should be UTF8String; the timestamp type should be a long holding the number of microseconds since January 1, 1970 UTC.

dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
Contributor

cloud-fan commented Apr 14, 2019

are we going to leverage the filters here?

Contributor Author

I can put it in later PR.

Member

HyukjinKwon left a comment

Sorry, but I have to ask this: what's our plan for adding new data sources? Will they go into external modules like Avro or Kafka, or is it decided case by case? For instance, do only data sources with rather complex dependencies go into external modules?

@WeichenXu123 (Contributor Author)

@HyukjinKwon Done. Thanks!

Contributor

mengxr left a comment

final pass

* See doc in `BinaryFileDataSource`
*/
val binaryFileSchema = StructType(
StructField("content", BinaryType, false)::
Contributor

  • space before ::
  • just a note: we might keep this column nullable to handle potential I/O failures

content,
InternalRow(
UTF8String.fromString(path),
DateTimeUtils.fromJavaTimestamp(modificationTime),
Contributor

Is it more straightforward to use fromMillis?

Contributor

yes, we should use DateTimeUtils.fromMillis(fileStatus.getModificationTime())
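
The unit conversion behind this suggestion is small enough to show directly. The sketch below does not call Spark's internal `DateTimeUtils`; it only illustrates the millisecond-to-microsecond step that such a helper is assumed to perform. `TimestampSketch` and `millisToMicros` are hypothetical names.

```scala
import java.util.concurrent.TimeUnit

object TimestampSketch {
  // Hadoop's FileStatus.getModificationTime reports milliseconds since the
  // epoch, while the internal timestamp representation discussed above is a
  // Long of microseconds since the epoch.
  def millisToMicros(millis: Long): Long =
    TimeUnit.MILLISECONDS.toMicros(millis)
}
```

Converting the long directly avoids the detour through a `java.sql.Timestamp` object that `fromJavaTimestamp` would require.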

* only include files with path matching the glob pattern.
*/
val pathGlobFilter: Option[String] = {
val filter = parameters.getOrElse("pathGlobFilter", null)
Contributor

Just parameters.get("pathGlobFilter") should work

SparkQA commented Apr 16, 2019

Test build #104614 has finished for PR 24354 at commit 2b1780f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 16, 2019

Test build #104613 has finished for PR 24354 at commit aab4dcd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor)

retest this please

requiredSchema.fieldNames.contains(a.name)
}

val requiredColumns = GenerateUnsafeProjection.generate(requiredOutput, fullOutput)
Contributor

this does not help the performance. We still read the file content even if content column is not required.

Contributor

This is OK for now, maybe we can leave a TODO and implement the real column pruning in the future.
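
As a rough illustration of what "real column pruning" could mean here, the sketch below only reads the file bytes when the `content` column is requested. It is not the PR's implementation and uses a plain `Map` instead of Spark rows; `PruningSketch`, `buildRow`, and the `readContent` thunk are hypothetical.

```scala
object PruningSketch {
  // Build only the requested columns. `readContent` is a thunk, so the
  // (potentially expensive) file read happens only if "content" is required.
  def buildRow(required: Seq[String], path: String, length: Long,
               readContent: () => Array[Byte]): Map[String, Any] = {
    def cell(name: String): (String, Any) = name match {
      case "path"    => "path" -> path
      case "length"  => "length" -> length
      case "content" => "content" -> readContent()
      case other     => throw new IllegalArgumentException(s"unknown column: $other")
    }
    required.map(cell).toMap
  }
}
```

A query that selects only path and length would then never open the file, whereas projecting the columns away after the read still pays the full I/O cost.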

SparkQA commented Apr 16, 2019

Test build #104618 has finished for PR 24354 at commit 2b1780f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* {{{
* // Scala
* val df = spark.read.format("binaryFile")
* .option("pathGlobFilter", "*.txt")
Member

gengliangwang commented Apr 16, 2019

Nit: how about changing the extension name "*.txt" in the example, e.g. *.png or *.jpg

val path = file.filePath
val fsPath = new Path(path)

// TODO: Improve performance here: each file will recompile the glob pattern here.
Member

I think we should make it a general option, which can be applied in all data sources. Also we should pass the option to FileIndex, so that Spark can split file partition more precisely.
We can have a follow-up PR for this.

import org.apache.spark.util.SerializableConfiguration


private[binaryfile] class BinaryFileFormat extends FileFormat with DataSourceRegister {
Member

As per https://issues.apache.org/jira/browse/SPARK-16964, I think we can remove private[binaryfile]

}
}

private[binaryfile] class BinaryFileSourceOptions(
Member

Remove private[binaryfile] here as well.

Contributor

mengxr commented Apr 16, 2019

@WeichenXu123 I sent you a PR at WeichenXu123#6 to address @HyukjinKwon's comment on the docs.

SparkQA commented Apr 16, 2019

Test build #104634 has finished for PR 24354 at commit 46a07e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BinaryFileFormat extends FileFormat with DataSourceRegister
    • class BinaryFileSourceOptions(

@HyukjinKwon (Member)

Thanks, @WeichenXu123 and @mengxr for bearing with me. I'm okay with this.

Contributor

mengxr commented Apr 16, 2019

LGTM. Merged into master. I created two follow-up tasks:

@asfgit asfgit closed this in 1bb0c8e Apr 16, 2019
@WeichenXu123 WeichenXu123 deleted the binary_file_datasource branch April 16, 2019 23:25

SparkQA commented Apr 17, 2019

Test build #104637 has finished for PR 24354 at commit dd8e8c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val stream = fs.open(fsPath)

val content = try {
ByteStreams.toByteArray(stream)
Member

If I remember correctly, the usual behavior in Spark is not to throw an exception but prefers null value. At this point, should we assign content null value instead of throwing exception?

Member

Oh, we can control it with ignoreCorruptFiles.

val content = try {
ByteStreams.toByteArray(stream)
} finally {
Closeables.close(stream, true)
Member

Related to above comment, should we not propagate IO exceptions?
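
The two behaviours under discussion (propagate a read failure, or swallow it and produce no content) can be contrasted in a small sketch. It uses plain JDK streams rather than Hadoop or Guava, and it is not how Spark's ignoreCorruptFiles setting is implemented; `ReadSketch`, `readAll`, and `readOrNone` are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, IOException, InputStream}

object ReadSketch {
  // Read the whole stream. The stream is always closed; errors from read()
  // propagate to the caller.
  def readAll(open: () => InputStream): Array[Byte] = {
    val stream = open()
    try {
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      var n = stream.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        n = stream.read(buf)
      }
      out.toByteArray
    } finally {
      // Like Closeables.close(stream, true): a failure in close() is swallowed.
      try stream.close() catch { case _: IOException => () }
    }
  }

  // "Prefer null over an exception" variant, modelled with Option.
  def readOrNone(open: () => InputStream): Option[Array[Byte]] =
    try Some(readAll(open)) catch { case _: IOException => None }
}
```

Note that swallowing only the exception from `close()` does not hide a failed read: in `readAll`, an `IOException` thrown while reading still reaches the caller.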

StructField("status", fileStatusSchema, false) :: Nil)
}

class BinaryFileSourceOptions(
Member

HyukjinKwon commented Apr 28, 2019

Not a big deal at all but let me leave a note before I forget. BinaryFileSourceOptions -> BinaryFileOptions to be consistent with [SourceName]Options - TextOptions, OrcOptions, ParquetOptions, CSVOptions, JDBCOptions, ImageOptions, etc.

HyukjinKwon pushed a commit that referenced this pull request May 8, 2019
…or all file sources

## What changes were proposed in this pull request?

### Background:
The data source option `pathGlobFilter` was introduced for the binary file format in #24354. It can be used to filter file names, e.g. to read only `.png` files when there are also `.json` files in the same directory.

### Proposal:
Make `pathGlobFilter` a general option for all file sources. The path filtering should happen during path globbing on the driver.

### Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly.

### Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.

## How was this patch tested?

Unit tests

Closes #24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang
Signed-off-by: HyukjinKwon
lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
Closes apache#24354 from WeichenXu123/binary_file_datasource.

Lead-authored-by: WeichenXu
Co-authored-by: Xiangrui Meng
Signed-off-by: Xiangrui Meng
lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020
…or all file sources

Closes apache#24518 from gengliangwang/globFilter.

Authored-by: Gengliang Wang
Signed-off-by: HyukjinKwon