[SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource #24830

WeichenXu123 · 2019-06-10T09:49:33Z

What changes were proposed in this pull request?

Provide a way to recursively load data from datasource.
I add a "recursiveFileLookup" option.

When "recursiveFileLookup" option turn on, then partition inferring is turned off and all files from the directory will be loaded recursively.

If some datasource explicitly specify the partitionSpec, then if user turn on "recursive" option, then exception will be thrown.

How was this patch tested?

Unit tests.

Please review https://spark.apache.org/contributing.html before opening a pull request.

SparkQA · 2019-06-10T12:50:00Z

Test build #106348 has finished for PR 24830 at commit 1003e5d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

SparkQA · 2019-06-17T06:03:50Z

Test build #106570 has finished for PR 24830 at commit 784e63b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2019-06-17T12:55:20Z

Jenkins, retest this please.

SparkQA · 2019-06-17T16:07:55Z

Test build #106583 has finished for PR 24830 at commit 784e63b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2019-06-17T23:23:15Z

cc @gengliangwang @cloud-fan

mengxr · 2019-06-17T23:25:49Z

@cloud-fan @gengliangwang It would be great if you can make a pass. The PartitioningAwareFileIndex APIs do not seem to me to have clear semantics defined. So I'm not 100% sure the approach @WeichenXu123 took is correct.

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

gengliangwang · 2019-06-18T06:00:49Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

    val selectedPartitions = if (partitionSpec().partitionColumns.isEmpty) {
      PartitionDirectory(InternalRow.empty, allFiles().filter(isNonEmptyFile)) :: Nil
    } else {
+      if (recursiveFileLookup) {


This branch seems not reachable. Should we simply use assert here?

Oh, it is reachable I think.
See class PrunedInMemoryFileIndex which explicitly set partitionSpec.

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

gengliangwang · 2019-06-18T06:04:46Z

Nit: Please update the PR description

"recursive" option => "recursiveFileLookup" option
update the "How was this patch tested?" section.

SparkQA · 2019-06-18T12:47:59Z

Test build #106621 has finished for PR 24830 at commit d5090b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2019-06-18T15:40:01Z

@cloud-fan @gengliangwang PR updated. Thanks!

SparkQA · 2019-06-18T17:55:48Z

Test build #106630 has finished for PR 24830 at commit 5251ac7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala

gengliangwang

LGTM except one comment

SparkQA · 2019-06-19T03:37:24Z

Test build #106648 has finished for PR 24830 at commit 0072083.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-19T07:05:02Z

Test build #106656 has finished for PR 24830 at commit 6f2e75a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-06-19T07:35:16Z

retest this please.

SparkQA · 2019-06-19T09:50:13Z

Test build #106665 has finished for PR 24830 at commit 6f2e75a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2019-06-19T09:58:38Z

retest this please.

SparkQA · 2019-06-19T13:00:30Z

Test build #106672 has finished for PR 24830 at commit 6f2e75a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-06-20T04:35:21Z

thanks, merging to master!

cloud-fan · 2019-11-08T03:53:32Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala

  }

+  protected lazy val recursiveFileLookup = {
+    parameters.getOrElse("recursiveFileLookup", "false").toBoolean


shall we document the option in DataFrameReader?

@Ngone51 Could you submit a follow-up PR to document this? This affects all the built-in file sources. We need to update the documentation of both PySpark and Scala APIs.

FYI, there is a Jira about adding this documentation which you will want to reference: SPARK-29903

@nchammas Could you submit a PR to fix readwriter.py for supporting this new option?

Sure, will do. I suppose we'll do that separately from adding the docs, which will get their own PR, correct?

Guys, we should also update DataStreamReadaer and streaming.py.

Ok, I'll submit a PR to document it. @gatorsmile

FYI @Ngone51: #26718

HeartSaVioR · 2019-11-19T01:21:15Z

Sorry to late revisit and post-hoc reviewing. There's no document for this feature so I feel it's better to ask here; what's expected behavior if we have glob in source path like /a/b/c/*/e and turn on this feature? Does Spark read all possible paths starts with /a/b/c/*/e?

Btw, I guess this is related to SPARK-20568 (#22952) - previously we could assume the possible depths of source file based on the source path (even if it contains glob) to avoid checking the pattern, and this patch may expand the upper bound of depths as infinity. There's no problem, but just wanted to confirm my understanding is correct.

…Python DataFrameReader ### What changes were proposed in this pull request? As a follow-up to #24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API. ### Why are the changes needed? This PR maintains Python feature parity with Scala. ### Does this PR introduce any user-facing change? Yes. Before this PR, you'd only be able to use this option as follows: ```python spark.read.option("recursiveFileLookup", True).text("test-data").show() ``` With this PR, you can reference the option from within the format-specific method: ```python spark.read.text("test-data", recursiveFileLookup=True).show() ``` This option now also shows up in the Python API docs. ### How was this patch tested? I tested this manually by creating the following directories with dummy data: ``` test-data ├── 1.txt └── nested └── 2.txt test-parquet ├── nested │ ├── _SUCCESS │ ├── part-00000-...-.parquet ├── _SUCCESS ├── part-00000-...-.parquet ``` I then ran the following tests and confirmed the output looked good: ```python spark.read.parquet("test-parquet", recursiveFileLookup=True).show() spark.read.text("test-data", recursiveFileLookup=True).show() spark.read.csv("test-data", recursiveFileLookup=True).show() ``` `python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things. Closes #26718 from nchammas/SPARK-27990-recursiveFileLookup-python. Authored-by: Nicholas Chammas <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

…Python DataFrameReader ### What changes were proposed in this pull request? As a follow-up to apache#24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API. ### Why are the changes needed? This PR maintains Python feature parity with Scala. ### Does this PR introduce any user-facing change? Yes. Before this PR, you'd only be able to use this option as follows: ```python spark.read.option("recursiveFileLookup", True).text("test-data").show() ``` With this PR, you can reference the option from within the format-specific method: ```python spark.read.text("test-data", recursiveFileLookup=True).show() ``` This option now also shows up in the Python API docs. ### How was this patch tested? I tested this manually by creating the following directories with dummy data: ``` test-data ├── 1.txt └── nested └── 2.txt test-parquet ├── nested │ ├── _SUCCESS │ ├── part-00000-...-.parquet ├── _SUCCESS ├── part-00000-...-.parquet ``` I then ran the following tests and confirmed the output looked good: ```python spark.read.parquet("test-parquet", recursiveFileLookup=True).show() spark.read.text("test-data", recursiveFileLookup=True).show() spark.read.csv("test-data", recursiveFileLookup=True).show() ``` `python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things. Closes apache#26718 from nchammas/SPARK-27990-recursiveFileLookup-python. Authored-by: Nicholas Chammas <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC ### What changes were proposed in this pull request? This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation. - `recursiveFileLookup` at file sources: #24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627)) - `pathGlobFilter` at file sources: #24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990)) - `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412)) **Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete. ### Why are the changes needed? To document available options in sources properly. ### Does this PR introduce any user-facing change? In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`. ### How was this patch tested? Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only: ```bash $ ls -al tmp ... -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac -rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc ``` ```python >>> spark.read.text("tmp", pathGlobFilter="*c").show() ``` ``` +-----+ |value| +-----+ | ac| | cc| +-----+ ``` Closes #26958 from HyukjinKwon/doc-followup. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

…tasource Provide a way to recursively load data from datasource. I add a "recursiveFileLookup" option. When "recursiveFileLookup" option turn on, then partition inferring is turned off and all files from the directory will be loaded recursively. If some datasource explicitly specify the partitionSpec, then if user turn on "recursive" option, then exception will be thrown. Unit tests. Please review https://spark.apache.org/contributing.html before opening a pull request. Closes apache#24830 from WeichenXu123/recursive_ds. Authored-by: WeichenXu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

init pr

1003e5d

dongjoon-hyun added the NEW FEATURE label Jun 12, 2019

mengxr requested changes Jun 14, 2019

View reviewed changes

dongjoon-hyun added ML SQL and removed NEW FEATURE labels Jun 14, 2019

address comments

784e63b

WeichenXu123 changed the title ~~[SPARK-27990][SQL][ML][WIP] Provide a way to recursively load data from datasource~~ [SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource Jun 17, 2019

cloud-fan reviewed Jun 18, 2019

View reviewed changes

...e/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala Outdated Show resolved Hide resolved

gengliangwang mentioned this pull request Jun 18, 2019

[SPARK-28089][SQL] File source v2: support reading output of file streaming Sink #24900

Closed

gengliangwang reviewed Jun 18, 2019

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Show resolved Hide resolved

address comments

d5090b0

add tests

5251ac7

Merge branch 'master' into recursive_ds

0072083

cloud-fan reviewed Jun 19, 2019

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Show resolved Hide resolved

cloud-fan approved these changes Jun 19, 2019

View reviewed changes

gengliangwang reviewed Jun 19, 2019

View reviewed changes

mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala Outdated Show resolved Hide resolved

gengliangwang approved these changes Jun 19, 2019

View reviewed changes

remove redundant test

6f2e75a

cloud-fan closed this in b276788 Jun 20, 2019

WeichenXu123 deleted the recursive_ds branch June 20, 2019 05:01

cloud-fan reviewed Nov 8, 2019

View reviewed changes

LantaoJin mentioned this pull request Nov 14, 2019

[SPARK-29899][SQL] Recursively load data in Hive table via TBLPROPERTIES #26525

Closed

nchammas mentioned this pull request Nov 29, 2019

[SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader #26718

Closed

HyukjinKwon mentioned this pull request Dec 20, 2019

[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC #26958

Closed

lwwmanning mentioned this pull request Jan 9, 2020

Cherry pick SPARK-27291 and SPARK-27990 palantir/spark#631

Merged

HyukjinKwon mentioned this pull request Oct 11, 2022

[SPARK-40600] Support recursiveFileLookup for partitioned datasource #38053

Closed

[SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource #24830

[SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource #24830

Uh oh!

Conversation

WeichenXu123 commented Jun 10, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Jun 10, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SparkQA commented Jun 17, 2019

Uh oh!

WeichenXu123 commented Jun 17, 2019

Uh oh!

SparkQA commented Jun 17, 2019

Uh oh!

jiangxb1987 commented Jun 17, 2019

Uh oh!

mengxr commented Jun 17, 2019

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gengliangwang commented Jun 18, 2019

Uh oh!

SparkQA commented Jun 18, 2019

Uh oh!

WeichenXu123 commented Jun 18, 2019

Uh oh!

SparkQA commented Jun 18, 2019

Uh oh!

Uh oh!

Uh oh!

gengliangwang left a comment

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 19, 2019

Uh oh!

SparkQA commented Jun 19, 2019

Uh oh!

gengliangwang commented Jun 19, 2019

Uh oh!

SparkQA commented Jun 19, 2019

Uh oh!

WeichenXu123 commented Jun 19, 2019

Uh oh!

SparkQA commented Jun 19, 2019

Uh oh!

cloud-fan commented Jun 20, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HeartSaVioR commented Nov 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

WeichenXu123 commented Jun 10, 2019 •

edited

Loading

HeartSaVioR commented Nov 19, 2019 •

edited

Loading