Conversation

@WeichenXu123
Contributor

@WeichenXu123 WeichenXu123 commented Jun 10, 2019

What changes were proposed in this pull request?

Provide a way to recursively load data from a datasource.
This PR adds a "recursiveFileLookup" option.

When the "recursiveFileLookup" option is turned on, partition inference is disabled and all files under the given directory are loaded recursively.

If a datasource explicitly specifies a partitionSpec and the user turns on the "recursiveFileLookup" option, an exception is thrown.
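
For illustration, a minimal sketch of how a user would enable the option (the path and format here are hypothetical, not from this PR):

```scala
// With recursiveFileLookup enabled, files in nested subdirectories of
// /data/logs are loaded too, and no partition columns are inferred
// from the directory layout.
val df = spark.read
  .format("text")
  .option("recursiveFileLookup", "true")
  .load("/data/logs")
```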

How was this patch tested?

Unit tests.

Please review https://spark.apache.org/contributing.html before opening a pull request.

@SparkQA

SparkQA commented Jun 10, 2019

Test build #106348 has finished for PR 24830 at commit 1003e5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 WeichenXu123 changed the title [SPARK-27990][SQL][ML][WIP] Provide a way to recursively load data from datasource [SPARK-27990][SQL][ML] Provide a way to recursively load data from datasource Jun 17, 2019
@SparkQA

SparkQA commented Jun 17, 2019

Test build #106570 has finished for PR 24830 at commit 784e63b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 17, 2019

Test build #106583 has finished for PR 24830 at commit 784e63b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

cc @gengliangwang @cloud-fan

@mengxr
Contributor

mengxr commented Jun 17, 2019

@cloud-fan @gengliangwang It would be great if you could make a pass. The PartitioningAwareFileIndex APIs do not seem to me to have clearly defined semantics, so I'm not 100% sure the approach @WeichenXu123 took is correct.

val selectedPartitions = if (partitionSpec().partitionColumns.isEmpty) {
  // No partition columns: expose all (non-empty) files as a single partition.
  PartitionDirectory(InternalRow.empty, allFiles().filter(isNonEmptyFile)) :: Nil
} else {
  if (recursiveFileLookup) {
Member

This branch seems unreachable. Should we simply use an assert here?

Contributor Author

Oh, I think it is reachable.
See the class PrunedInMemoryFileIndex, which explicitly sets partitionSpec.

@gengliangwang
Member

Nit: please update the PR description:

  1. "recursive" option => "recursiveFileLookup" option
  2. Update the "How was this patch tested?" section.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106621 has finished for PR 24830 at commit d5090b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

@cloud-fan @gengliangwang PR updated. Thanks!

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106630 has finished for PR 24830 at commit 5251ac7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang left a comment

LGTM except for one comment.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106648 has finished for PR 24830 at commit 0072083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106656 has finished for PR 24830 at commit 6f2e75a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106665 has finished for PR 24830 at commit 6f2e75a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

retest this please.

@SparkQA

SparkQA commented Jun 19, 2019

Test build #106672 has finished for PR 24830 at commit 6f2e75a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b276788 Jun 20, 2019
@WeichenXu123 WeichenXu123 deleted the recursive_ds branch June 20, 2019 05:01
}

// The new per-source option; defaults to false when not specified.
protected lazy val recursiveFileLookup = {
  parameters.getOrElse("recursiveFileLookup", "false").toBoolean
}
Contributor

shall we document the option in DataFrameReader?

Member

@Ngone51 Could you submit a follow-up PR to document this? This affects all the built-in file sources. We need to update the documentation of both the PySpark and Scala APIs.

Contributor

FYI, there is a JIRA about adding this documentation, which you will want to reference: SPARK-29903

Member

@nchammas Could you submit a PR to update readwriter.py to support this new option?

Contributor

Sure, will do. I suppose we'll do that separately from adding the docs, which will get their own PR, correct?

Member

Guys, we should also update DataStreamReader and streaming.py.
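
For illustration, a sketch of what the equivalent streaming usage would presumably look like (path hypothetical, assuming the option is honored by the file-based streaming sources as this comment implies):

```scala
// Hypothetical path; the option should apply to file-based streaming
// sources just as it does to the batch reader.
val stream = spark.readStream
  .option("recursiveFileLookup", "true")
  .text("/data/incoming")
```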

Member

Ok, I'll submit a PR to document it. @gatorsmile

Contributor

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Nov 19, 2019

Sorry for the late revisit and post-hoc review. There's no documentation for this feature, so I feel it's better to ask here: what's the expected behavior if we have a glob in the source path, like /a/b/c/*/e, and turn on this feature? Does Spark read all possible paths starting with /a/b/c/*/e (see the sketch below)?

Btw, I guess this is related to SPARK-20568 (#22952): previously we could bound the possible depth of source files based on the source path (even if it contains a glob) to avoid checking the pattern, and this patch may expand that upper bound on depth to infinity. That's not a problem; I just wanted to confirm my understanding is correct.
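
For concreteness, a sketch of the scenario being asked about (path hypothetical):

```scala
// The question: with the glob below plus recursiveFileLookup, does Spark
// also descend into subdirectories under each matched .../e directory,
// rather than reading only the files directly inside it?
val df = spark.read
  .option("recursiveFileLookup", "true")
  .text("/a/b/c/*/e")
```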

HyukjinKwon pushed a commit that referenced this pull request Dec 4, 2019
…Python DataFrameReader

### What changes were proposed in this pull request?

As a follow-up to #24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API.

### Why are the changes needed?

This PR maintains Python feature parity with Scala.

### Does this PR introduce any user-facing change?

Yes.

Before this PR, you'd only be able to use this option as follows:

```python
spark.read.option("recursiveFileLookup", True).text("test-data").show()
```

With this PR, you can reference the option from within the format-specific method:

```python
spark.read.text("test-data", recursiveFileLookup=True).show()
```

This option now also shows up in the Python API docs.

### How was this patch tested?

I tested this manually by creating the following directories with dummy data:

```
test-data
├── 1.txt
└── nested
    └── 2.txt
test-parquet
├── nested
│   ├── _SUCCESS
│   └── part-00000-...-.parquet
├── _SUCCESS
└── part-00000-...-.parquet
```

I then ran the following tests and confirmed the output looked good:

```python
spark.read.parquet("test-parquet", recursiveFileLookup=True).show()
spark.read.text("test-data", recursiveFileLookup=True).show()
spark.read.csv("test-data", recursiveFileLookup=True).show()
```

`python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things.

Closes #26718 from nchammas/SPARK-27990-recursiveFileLookup-python.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
HyukjinKwon added a commit that referenced this pull request Dec 23, 2019
…p' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC

### What changes were proposed in this pull request?

This PR adds the options 'recursiveFileLookup' and 'pathGlobFilter' for file sources, and 'mergeSchema' for ORC, to the documentation:

- `recursiveFileLookup` at file sources: #24830 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `pathGlobFilter` at file sources: #24518 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `mergeSchema` at ORC: #24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))

**Note that** the `timeZone` option was not moved from `DataFrameReader.options`, as I assume it will likely affect other datasources as well once DSv2 is complete.

### Why are the changes needed?

To document available options in sources properly.

### Does this PR introduce any user-facing change?

In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.

### How was this patch tested?

Manually built the docs and checked the output. The option setting in PySpark is rather a logical change; I manually tested only one:

```bash
$ ls -al tmp
...
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 aa
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 ab
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 ac
-rw-r--r--   1 hyukjin.kwon  staff     3 Dec 20 12:19 cc
```

```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```

```
+-----+
|value|
+-----+
|   ac|
|   cc|
+-----+
```

Closes #26958 from HyukjinKwon/doc-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
lwwmanning pushed a commit to palantir/spark that referenced this pull request Jan 9, 2020