Skip to content

Conversation

@nchammas
Copy link
Contributor

@nchammas nchammas commented Nov 29, 2019

What changes were proposed in this pull request?

As a follow-up to #24830, this PR adds the recursiveFileLookup option to the Python DataFrameReader API.

Why are the changes needed?

This PR maintains Python feature parity with Scala.

Does this PR introduce any user-facing change?

Yes.

Before this PR, you'd only be able to use this option as follows:

spark.read.option("recursiveFileLookup", True).text("test-data").show()

With this PR, you can reference the option from within the format-specific method:

spark.read.text("test-data", recursiveFileLookup=True).show()

This option now also shows up in the Python API docs.

How was this patch tested?

I tested this manually by creating the following directories with dummy data:

test-data
├── 1.txt
└── nested
   └── 2.txt
test-parquet
├── nested
│  ├── _SUCCESS
│  ├── part-00000-...-.parquet
├── _SUCCESS
├── part-00000-...-.parquet

I then ran the following tests and confirmed the output looked good:

spark.read.parquet("test-parquet", recursiveFileLookup=True).show()
spark.read.text("test-data", recursiveFileLookup=True).show()
spark.read.csv("test-data", recursiveFileLookup=True).show()

python/pyspark/sql/tests/test_readwriter.py seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things.

@SparkQA
Copy link

SparkQA commented Nov 29, 2019

Test build #114631 has finished for PR 26718 at commit b3689ee.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@since(1.4)
def parquet(self, *paths):
def parquet(self, *paths, **options):
Copy link
Contributor Author

@nchammas nchammas Nov 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support Python 2, we need to say **options instead of recursiveFileLookup=None because Python 2 doesn't support keyword-only arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine.

@SparkQA
Copy link

SparkQA commented Nov 29, 2019

Test build #114638 has finished for PR 26718 at commit 28d5162.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas nchammas marked this pull request as ready for review November 30, 2019 01:11
@nchammas
Copy link
Contributor Author

cc @gatorsmile and @cloud-fan, per the discussion here.

For some reason, I can't add you as reviewers using the GitHub interface.

@HyukjinKwon HyukjinKwon changed the title [SPARK-27990] [SPARK-29903] Add recursiveFileLookup option to Python DataFrameReader [SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader Nov 30, 2019
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nchammas, can we at least update/test in streaming.py too?

@nchammas
Copy link
Contributor Author

Certainly! Oversight on my part.

@SparkQA
Copy link

SparkQA commented Nov 30, 2019

Test build #114673 has finished for PR 26718 at commit 72f33da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't closely checked but seems fine. If this doesn't get merged, I will take a look again and get this in.

@HyukjinKwon
Copy link
Member

Merged to master.

@nchammas nchammas deleted the SPARK-27990-recursiveFileLookup-python branch December 4, 2019 01:19
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…Python DataFrameReader

### What changes were proposed in this pull request?

As a follow-up to apache#24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API.

### Why are the changes needed?

This PR maintains Python feature parity with Scala.

### Does this PR introduce any user-facing change?

Yes.

Before this PR, you'd only be able to use this option as follows:

```python
spark.read.option("recursiveFileLookup", True).text("test-data").show()
```

With this PR, you can reference the option from within the format-specific method:

```python
spark.read.text("test-data", recursiveFileLookup=True).show()
```

This option now also shows up in the Python API docs.

### How was this patch tested?

I tested this manually by creating the following directories with dummy data:

```
test-data
├── 1.txt
└── nested
   └── 2.txt
test-parquet
├── nested
│  ├── _SUCCESS
│  ├── part-00000-...-.parquet
├── _SUCCESS
├── part-00000-...-.parquet
```

I then ran the following tests and confirmed the output looked good:

```python
spark.read.parquet("test-parquet", recursiveFileLookup=True).show()
spark.read.text("test-data", recursiveFileLookup=True).show()
spark.read.csv("test-data", recursiveFileLookup=True).show()
```

`python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things.

Closes apache#26718 from nchammas/SPARK-27990-recursiveFileLookup-python.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants