Conversation

@Yaohua628
Contributor

What changes were proposed in this pull request?

Follow-up PR of #34575. Filter files when metadata columns are present in the data filter.

Why are the changes needed?

Performance improvements.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs and a new UT.

@github-actions github-actions bot added the SQL label Dec 29, 2021
@Yaohua628
Contributor Author

Hi @cloud-fan, here's the metadata filtering PR, please take a look whenever you have a chance. Also, a question: what about df.inputFiles? Do you have any idea how to make it work with metadata filtering as well? Thanks!

)
}

metadataColumnsTest("filter on metadata and user data", schema) { (df, _, f1) =>
Contributor

Does this test fail before?

Contributor

I think we should also check the physical plan and make sure the final file list is pruned.

Contributor Author

No, it won't fail before. I added this test to make sure we only get metadata filters in listFiles(...) and do the correct filtering there.

Thanks for the suggestion! Updated the tests to check the file list!
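The test idea discussed above (list files, apply a metadata-only predicate, assert the final file list is pruned) can be sketched in plain Scala. `FileEntry`, `listFiles`, and the predicate here are simplified stand-ins, not Spark's actual `FileStatus`/`PartitioningAwareFileIndex` API:

```scala
// Sketch: pruning a listed file set with an optional metadata predicate,
// then asserting the result is actually pruned. Simplified stand-ins for
// Spark's FileStatus and a bound metadata filter.
case class FileEntry(path: String, length: Long, modificationTime: Long)

object PrunedListingSketch {
  // If no metadata filter is present, keep every file (Option.forall is
  // true for None); otherwise keep only files matching the predicate.
  def listFiles(files: Seq[FileEntry],
                metadataFilter: Option[FileEntry => Boolean]): Seq[FileEntry] =
    files.filter(f => metadataFilter.forall(p => p(f)))

  def main(args: Array[String]): Unit = {
    val files = Seq(
      FileEntry("/t/f0.parquet", 10L, 1L),
      FileEntry("/t/f1.parquet", 500L, 2L))
    // Filter on the file name, as a metadata-column filter would.
    val pruned = listFiles(files, Some((f: FileEntry) => f.path.endsWith("f1.parquet")))
    assert(pruned.map(_.path) == Seq("/t/f1.parquet"))
    println(pruned.size) // 1
  }
}
```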

@AmplabJenkins

Can one of the admins verify this patch?

@Yaohua628 Yaohua628 requested a review from cloud-fan December 30, 2021 00:22
def matchFileMetadataPredicate(f: FileStatus): Boolean = {
  // use Option.forall, so if there is no filter, return true
  boundedFilterOpt.forall(_.eval(
    InternalRow.fromSeq(Seq(InternalRow.fromSeq(Seq(
Contributor

Can we have a util method to create InternalRow from Path, length: Long and modificationTime: Long? To share code with FileScanRDD

Contributor Author

Could we add this util method and share the common code along with the metadata schema pruning PR?

I am thinking after pruning, we don't need to always create InternalRow with all fields in both PartitioningAwareFileIndex.listFiles(...) and FileScanRDD. WDYT?

Contributor

SGTM
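The utility method agreed on here can be sketched in plain Scala. This is an illustration only: the `Vector[Any]` row, the helper name, and the field set (path, name, length, modification time) are assumptions, while Spark's real version builds an `InternalRow` shared by `FileScanRDD` and `PartitioningAwareFileIndex`:

```scala
// Sketch of a shared helper building a file-metadata row from the three
// fields both listFiles(...) and FileScanRDD need. The Vector[Any] row
// and the exact field set are simplified assumptions for illustration.
object MetadataRowSketch {
  def createMetadataRow(path: String, length: Long, modificationTime: Long): Vector[Any] = {
    // Derive the file name from the path, mirroring a file_name column.
    val name = path.substring(path.lastIndexOf('/') + 1)
    Vector(path, name, length, modificationTime)
  }

  def main(args: Array[String]): Unit = {
    val row = createMetadataRow("/t/part-0001.parquet", 42L, 1000L)
    println(row(1)) // part-0001.parquet
  }
}
```

Both call sites can then build their rows from this one helper, so a later schema-pruning change only has to touch one place.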

@Yaohua628
Contributor Author

@cloud-fan Hi Wenchen, updated this PR on top of the schema pruning.

Also, added 2 utility methods and tried to share more code between FileScanRDD and PartitioningAwareFileIndex to create/update the metadata internal row. Please let me know what you think, thanks!!

@cloud-fan
Contributor

LGTM if tests pass

@Yaohua628
Contributor Author

> LGTM if tests pass

Thanks, passed!

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 817d1d7 Jan 19, 2022
@Yaohua628
Contributor Author

thanks, Wenchen!

HyukjinKwon added a commit that referenced this pull request Jan 19, 2022
### What changes were proposed in this pull request?
This PR fixes a missing import. Logical conflict between #35068 and #35055.

### Why are the changes needed?

To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out via compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Feb 17, 2022
…ot contain any references

### What changes were proposed in this pull request?

Skip non-reference filters when binding the metadata-based filter.

### Why are the changes needed?

This issue was introduced by #35055.

Reproduce:
```sql
CREATE TABLE t (c1 int) USING PARQUET;

SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.BooleanSimplification;

SELECT * FROM t WHERE c1 = 1 AND 2 > 1;
```

And the error message:
```
java.util.NoSuchElementException: next on empty iterator
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
	at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
	at scala.collection.mutable.LinkedHashSet$$anon$1.next(LinkedHashSet.scala:89)
	at scala.collection.IterableLike.head(IterableLike.scala:109)
	at scala.collection.IterableLike.head$(IterableLike.scala:108)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet.head(AttributeSet.scala:69)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.$anonfun$listFiles$3(PartitioningAwareFileIndex.scala:85)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listFiles(PartitioningAwareFileIndex.scala:84)
	at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:249)
```

### Does this PR introduce _any_ user-facing change?

Yes, a bug fix.

### How was this patch tested?

Added a new test.

Closes #35487 from ulysses-you/SPARK-38182.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>