[SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader #26718

nchammas · 2019-11-29T18:51:39Z

What changes were proposed in this pull request?

As a follow-up to #24830, this PR adds the recursiveFileLookup option to the Python DataFrameReader API.

Why are the changes needed?

This PR maintains Python feature parity with Scala.

Does this PR introduce any user-facing change?

Yes.

Before this PR, you'd only be able to use this option as follows:

spark.read.option("recursiveFileLookup", True).text("test-data").show()

With this PR, you can reference the option from within the format-specific method:

spark.read.text("test-data", recursiveFileLookup=True).show()

This option now also shows up in the Python API docs.
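
The forwarding pattern can be illustrated without a Spark installation. The following is a minimal sketch, assuming a simplified stand-in class (`MiniReader` is hypothetical and not part of PySpark) whose format-specific method passes a keyword argument on to the generic `option` call only when the caller sets it. It is not the actual `readwriter.py` implementation.

```python
class MiniReader:
    """Hypothetical stand-in for a DataFrameReader; it only records options."""

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        # Generic form: store the option and return self so calls can be chained.
        self.options[key] = value
        return self

    def _set_opts(self, **options):
        # Forward only the options the caller actually set (skip None).
        for key, value in options.items():
            if value is not None:
                self.option(key, value)

    def text(self, path, recursiveFileLookup=None):
        # Format-specific form: the keyword argument becomes an option.
        self._set_opts(recursiveFileLookup=recursiveFileLookup)
        return {"path": path, "options": dict(self.options)}


# Both spellings end up with the same recorded options.
via_option = MiniReader().option("recursiveFileLookup", True).text("test-data")
via_keyword = MiniReader().text("test-data", recursiveFileLookup=True)
print(via_option == via_keyword)
```

Leaving the keyword at its default of `None` records no option at all, so existing callers of the format-specific method are unaffected in this sketch.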

How was this patch tested?

I tested this manually by creating the following directories with dummy data:

test-data
├── 1.txt
└── nested
   └── 2.txt
test-parquet
├── nested
│  ├── _SUCCESS
│  ├── part-00000-...-.parquet
├── _SUCCESS
├── part-00000-...-.parquet

I then ran the following tests and confirmed the output looked good:

spark.read.parquet("test-parquet", recursiveFileLookup=True).show()
spark.read.text("test-data", recursiveFileLookup=True).show()
spark.read.csv("test-data", recursiveFileLookup=True).show()
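
The `test-data` layout above can be recreated with a short script. This is a sketch under stated assumptions: the file contents (`one`, `two`) and the helper names `build_test_data` and `list_files` are illustrative, and the script only walks the filesystem to show which files sit at the top level and which are nested. It does not exercise Spark itself.

```python
import os
import tempfile


def build_test_data(root):
    """Create test-data/1.txt and test-data/nested/2.txt under root."""
    base = os.path.join(root, "test-data")
    os.makedirs(os.path.join(base, "nested"))
    # File contents are placeholder dummy data (an assumption).
    files = [("1.txt", "one\n"), (os.path.join("nested", "2.txt"), "two\n")]
    for rel_path, content in files:
        with open(os.path.join(base, rel_path), "w") as f:
            f.write(content)
    return base


def list_files(base, recursive):
    """Return sorted paths relative to base, optionally descending into subdirectories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(base):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), base))
        if not recursive:
            dirnames.clear()  # stop os.walk from descending further
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    base = build_test_data(tmp)
    top_level_only = list_files(base, recursive=False)
    all_files = list_files(base, recursive=True)
    print(top_level_only)
    print(all_files)
```

Pointing the three `spark.read` calls at a directory built this way should make it easy to check whether rows from `nested/2.txt` appear in the output.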

python/pyspark/sql/tests/test_readwriter.py seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things.

SparkQA · 2019-11-29T18:59:10Z

Test build #114631 has finished for PR 26718 at commit b3689ee.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

nchammas · 2019-11-29T22:41:10Z

python/pyspark/sql/readwriter.py


    @since(1.4)
-    def parquet(self, *paths):
+    def parquet(self, *paths, **options):


To support Python 2, we need to say **options instead of recursiveFileLookup=None because Python 2 doesn't support keyword-only arguments.

HyukjinKwon · 2019-11-30

Seems fine.
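
The trade-off discussed in this thread can be seen in plain Python. The sketch below uses two hypothetical free functions (`parquet_py2_style` and `parquet_py3_style`, not the real PySpark methods) to contrast the `**options` signature with a keyword-only parameter, which Python 2 rejects as a syntax error.

```python
def parquet_py2_style(*paths, **options):
    # Works on Python 2 and 3: pull the option out of the catch-all dict.
    recursive = options.get("recursiveFileLookup", None)
    return paths, recursive


def parquet_py3_style(*paths, recursiveFileLookup=None):
    # Python 3 only: a named parameter after *paths is keyword-only (PEP 3102).
    return paths, recursiveFileLookup


print(parquet_py2_style("a", "b", recursiveFileLookup=True))
print(parquet_py3_style("a", "b", recursiveFileLookup=True))

# A misspelled keyword is silently ignored by the **options form ...
print(parquet_py2_style("a", recursiveFilelookup=True))

# ... but rejected by the keyword-only form.
try:
    parquet_py3_style("a", recursiveFilelookup=True)
except TypeError as exc:
    print("TypeError:", exc)
```

The `**options` form keeps Python 2 compatibility at the cost of accepting unknown keywords without complaint, unless the function validates them itself.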

SparkQA · 2019-11-29T23:11:08Z

Test build #114638 has finished for PR 26718 at commit 28d5162.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nchammas · 2019-11-30T01:13:17Z

cc @gatorsmile and @cloud-fan, per the discussion here.

For some reason, I can't add you as reviewers using the GitHub interface.

HyukjinKwon

@nchammas, can we at least update/test in streaming.py too?

nchammas · 2019-11-30T16:35:56Z

Certainly! Oversight on my part.

SparkQA · 2019-11-30T18:35:05Z

Test build #114673 has finished for PR 26718 at commit 72f33da.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

Haven't closely checked but seems fine. If this doesn't get merged, I will take a look again and get this in.

HyukjinKwon · 2019-12-04T01:10:05Z

Merged to master.

…Python DataFrameReader

Closes apache#26718 from nchammas/SPARK-27990-recursiveFileLookup-python.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

add recursiveFileLookup to python DataFrameReader

b3689ee

add API docs

101fb43


remove test string

28d5162

nchammas marked this pull request as ready for review November 30, 2019 01:11

HyukjinKwon changed the title ~~[SPARK-27990] [SPARK-29903] Add recursiveFileLookup option to Python DataFrameReader~~ [SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader Nov 30, 2019

HyukjinKwon reviewed Nov 30, 2019


update streaming.py too

72f33da

HyukjinKwon reviewed Dec 1, 2019


nchammas mentioned this pull request Dec 2, 2019

[SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource #24830

Closed

HyukjinKwon closed this in 3dd3a62 Dec 4, 2019

nchammas deleted the SPARK-27990-recursiveFileLookup-python branch December 4, 2019 01:19

nchammas mentioned this pull request Dec 20, 2019

[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC #26958

Closed

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks
