Conversation

@Yaohua628
Contributor

What changes were proposed in this pull request?

Follow-up PR of #34575. Filter files when metadata columns are present in the data filter.

Why are the changes needed?

Performance improvements.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs and a new UT.

@github-actions github-actions bot added the SQL label Dec 29, 2021
@Yaohua628
Contributor Author

Hi @cloud-fan, here's the metadata filtering PR, please take a look whenever you have a chance. Also, a question: what about df.inputFiles? Do you have any idea how to make it work with metadata filtering as well? Thanks!

)
}

metadataColumnsTest("filter on metadata and user data", schema) { (df, _, f1) =>
Contributor

Does this test fail before?

Contributor

I think we should also check the physical plan and make sure the final file list is pruned.

Contributor Author

No, it won't fail before. I added this test to make sure we only get metadata filters in listFiles(...) and do the correct filtering there.

Thanks for the suggestion! Updated the tests to check the file list!
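The test idea discussed above (list files, apply a metadata-only predicate, assert the final file list is pruned) can be sketched in plain Scala. `FileEntry`, `listFiles`, and the predicate here are simplified stand-ins, not Spark's actual `FileStatus`/`PartitioningAwareFileIndex` API:

```scala
// Sketch: pruning a listed file set with an optional metadata predicate,
// then asserting the result is actually pruned. Simplified stand-ins for
// Spark's FileStatus and a bound metadata filter.
case class FileEntry(path: String, length: Long, modificationTime: Long)

object PrunedListingSketch {
  // If no metadata filter is present, keep every file (Option.forall is
  // true for None); otherwise keep only files matching the predicate.
  def listFiles(files: Seq[FileEntry],
                metadataFilter: Option[FileEntry => Boolean]): Seq[FileEntry] =
    files.filter(f => metadataFilter.forall(p => p(f)))

  def main(args: Array[String]): Unit = {
    val files = Seq(
      FileEntry("/t/f0.parquet", 10L, 1L),
      FileEntry("/t/f1.parquet", 500L, 2L))
    // Filter on the file name, as a metadata-column filter would.
    val pruned = listFiles(files, Some((f: FileEntry) => f.path.endsWith("f1.parquet")))
    assert(pruned.map(_.path) == Seq("/t/f1.parquet"))
    println(pruned.size) // 1
  }
}
```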

@AmplabJenkins

Can one of the admins verify this patch?

@Yaohua628 Yaohua628 requested a review from cloud-fan December 30, 2021 00:22
def matchFileMetadataPredicate(f: FileStatus): Boolean = {
  // use Option.forall, so if there is no filter, return true
  boundedFilterOpt.forall(_.eval(
    InternalRow.fromSeq(Seq(InternalRow.fromSeq(Seq(
Contributor

Can we have a util method to create InternalRow from Path, length: Long and modificationTime: Long? To share code with FileScanRDD

Contributor Author

Could we add this util method and share the common code along with the metadata schema pruning PR?

I am thinking after pruning, we don't need to always create InternalRow with all fields in both PartitioningAwareFileIndex.listFiles(...) and FileScanRDD. WDYT?

Contributor

SGTM
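The utility method agreed on here can be sketched in plain Scala. This is an illustration only: the `Vector[Any]` row, the helper name, and the field set (path, name, length, modification time) are assumptions, while Spark's real version builds an `InternalRow` shared by `FileScanRDD` and `PartitioningAwareFileIndex`:

```scala
// Sketch of a shared helper building a file-metadata row from the three
// fields both listFiles(...) and FileScanRDD need. The Vector[Any] row
// and the exact field set are simplified assumptions for illustration.
object MetadataRowSketch {
  def createMetadataRow(path: String, length: Long, modificationTime: Long): Vector[Any] = {
    // Derive the file name from the path, mirroring a file_name column.
    val name = path.substring(path.lastIndexOf('/') + 1)
    Vector(path, name, length, modificationTime)
  }

  def main(args: Array[String]): Unit = {
    val row = createMetadataRow("/t/part-0001.parquet", 42L, 1000L)
    println(row(1)) // part-0001.parquet
  }
}
```

Both call sites can then build their rows from this one helper, so a later schema-pruning change only has to touch one place.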

@Yaohua628
Contributor Author

@cloud-fan Hi Wenchen, updated this PR on top of the schema pruning.

Also, added 2 utility methods and tried to share more code between FileScanRDD and PartitioningAwareFileIndex to create/update the metadata internal row. Please let me know what you think, thanks!!

@cloud-fan
Contributor

LGTM if tests pass

@Yaohua628
Contributor Author

> LGTM if tests pass

Thanks, passed!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 817d1d7 Jan 19, 2022
@Yaohua628
Contributor Author

thanks, Wenchen!

HyukjinKwon added a commit that referenced this pull request Jan 19, 2022
### What changes were proposed in this pull request?
This PR fixes a missing import. Logical conflict between #35068 and #35055.

### Why are the changes needed?

To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out via compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Feb 17, 2022
…ot contain any references

### What changes were proposed in this pull request?

Skip non-reference filters when binding the metadata-based filter.

### Why are the changes needed?

This issue was introduced by #35055.

Reproduce:
```sql
CREATE TABLE t (c1 int) USING PARQUET;

SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.BooleanSimplification;

SELECT * FROM t WHERE c1 = 1 AND 2 > 1;
```

And the error message:
```
java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.mutable.LinkedHashSet$$anon$1.next(LinkedHashSet.scala:89)
	at scala.collection.IterableLike.head(IterableLike.scala:109)
	at scala.collection.IterableLike.head$(IterableLike.scala:108)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet.head(AttributeSet.scala:69)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.$anonfun$listFiles$3(PartitioningAwareFileIndex.scala:85)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listFiles(PartitioningAwareFileIndex.scala:84)
	at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:249)
```

### Does this PR introduce _any_ user-facing change?

Yes, a bug fix.

### How was this patch tested?

Added a new test.

Closes #35487 from ulysses-you/SPARK-38182.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>