[SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader #26718
Conversation
Test build #114631 has finished for PR 26718 at commit
@since(1.4)
- def parquet(self, *paths):
+ def parquet(self, *paths, **options):
To support Python 2, we need to accept `**options` instead of a keyword-only `recursiveFileLookup=None` parameter, because Python 2 doesn't support keyword-only arguments.
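A minimal sketch of the idea (the class below is a hypothetical stand-in for illustration, not the actual PySpark source): Python 3 would allow a keyword-only parameter after `*paths`, but Python 2 rejects that syntax, so the option has to come in through `**options`.

```python
class DataFrameReaderSketch(object):  # hypothetical stand-in, for illustration only
    # Python 3 only (SyntaxError under Python 2):
    #     def parquet(self, *paths, recursiveFileLookup=None): ...
    # Works under both Python 2 and 3:
    def parquet(self, *paths, **options):
        recursiveFileLookup = options.get("recursiveFileLookup")
        print("paths=%r, recursiveFileLookup=%r" % (paths, recursiveFileLookup))


DataFrameReaderSketch().parquet("test-parquet", recursiveFileLookup=True)
# paths=('test-parquet',), recursiveFileLookup=True
```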
seems fine.
Test build #114638 has finished for PR 26718 at commit
cc @gatorsmile and @cloud-fan, per the discussion here. For some reason, I can't add you as reviewers using the GitHub interface.
HyukjinKwon left a comment
@nchammas, can we at least update/test in streaming.py too?
Certainly! Oversight on my part.
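For context, a hedged sketch of what the analogous streaming usage might look like once `streaming.py` exposes the same option (the `test-data` path is the same dummy directory used in the PR description below; this assumes an active SparkSession and is not taken from the diff itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mirrors the batch reader change: the option is passed straight to the
# format-specific method on DataStreamReader instead of via .option(...).
stream_df = spark.readStream.text("test-data", recursiveFileLookup=True)
stream_df.printSchema()  # streaming text source exposes a single "value" column
```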
Test build #114673 has finished for PR 26718 at commit
HyukjinKwon left a comment
Haven't closely checked but seems fine. If this doesn't get merged, I will take a look again and get this in.
Merged to master.
Closes #26718 from nchammas/SPARK-27990-recursiveFileLookup-python.
Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

What changes were proposed in this pull request?
As a follow-up to #24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API.

Why are the changes needed?
This PR maintains Python feature parity with Scala.

Does this PR introduce any user-facing change?
Yes.
Before this PR, you'd only be able to use this option as follows:

```python
spark.read.option("recursiveFileLookup", True).text("test-data").show()
```

With this PR, you can reference the option from within the format-specific method:

```python
spark.read.text("test-data", recursiveFileLookup=True).show()
```

This option now also shows up in the Python API docs.

How was this patch tested?
I tested this manually by creating the following directories with dummy data:

```
test-data
├── 1.txt
└── nested
    └── 2.txt

test-parquet
├── nested
│   ├── _SUCCESS
│   ├── part-00000-...-.parquet
├── _SUCCESS
├── part-00000-...-.parquet
```

I then ran the following tests and confirmed the output looked good:

```python
spark.read.parquet("test-parquet", recursiveFileLookup=True).show()
spark.read.text("test-data", recursiveFileLookup=True).show()
spark.read.csv("test-data", recursiveFileLookup=True).show()
```

`python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things.
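If tests do get added to `python/pyspark/sql/tests/test_readwriter.py`, a minimal sketch of what one might look like (the test class name here is hypothetical; it only assumes the existing `ReusedSQLTestCase` test helper):

```python
import os
import shutil
import tempfile

from pyspark.testing.sqlutils import ReusedSQLTestCase  # existing PySpark test base class


class RecursiveFileLookupTests(ReusedSQLTestCase):  # hypothetical test class, not in the current file

    def test_text_recursive_file_lookup(self):
        d = tempfile.mkdtemp()
        try:
            # Recreate the nested layout from the manual test above.
            nested = os.path.join(d, "nested")
            os.mkdir(nested)
            with open(os.path.join(d, "1.txt"), "w") as f:
                f.write("one")
            with open(os.path.join(nested, "2.txt"), "w") as f:
                f.write("two")
            rows = self.spark.read.text(d, recursiveFileLookup=True).collect()
            # Both the top-level and the nested file should be picked up.
            self.assertEqual(sorted(r.value for r in rows), ["one", "two"])
        finally:
            shutil.rmtree(d)
```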