[SPARK-37769][SQL][FOLLOWUP] Filtering files if metadata columns are present in the data filter #35055
Conversation
Hi @cloud-fan, here's the metadata filtering PR, please take a look whenever you have a chance. Also, got a question: what about …
```scala
metadataColumnsTest("filter on metadata and user data", schema) { (df, _, f1) =>
```
Does this test fail before?
I think we should also check the physical plan and make sure the final file list is pruned.
No, it doesn't fail before this change. I added this test just to make sure we only get metadata filters in listFiles(...) and do the correct filtering there.
Thanks for the suggestion! Updated the tests to check the selected files, along the lines of the sketch below.
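A minimal sketch of that kind of file-pruning check, reusing the `metadataColumnsTest` harness from the quoted diff; the data column `id`, the assumption that `f1` is a file name, and the assertion itself are illustrative, not the test the PR actually added:

```scala
import org.apache.spark.sql.execution.FileSourceScanExec

metadataColumnsTest("filter on metadata and user data", schema) { (df, _, f1) =>
  // Mix a metadata predicate with an ordinary data predicate; `id` is an
  // illustrative data column, `f1` a file name supplied by the harness.
  val filtered = df.where(s"_metadata.file_name = '$f1' AND id > 0")
  // Inspect the physical plan: only files matching the metadata predicate
  // should survive file listing.
  val selectedFiles = filtered.queryExecution.executedPlan.collect {
    case scan: FileSourceScanExec => scan.selectedPartitions.flatMap(_.files)
  }.flatten
  assert(selectedFiles.forall(_.getPath.getName == f1))
}
```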
Can one of the admins verify this patch?
```scala
def matchFileMetadataPredicate(f: FileStatus): Boolean = {
  // use Option.forall, so if there is no filter, return true
  boundedFilterOpt.forall(_.eval(
    InternalRow.fromSeq(Seq(InternalRow.fromSeq(Seq(
```
(The nested `InternalRow` is the struct value for the single `_metadata` column the filter is bound against.)
Can we have a util method to create an InternalRow from a Path, length: Long, and modificationTime: Long? That would share code with FileScanRDD.
Could we add this util method and share the common code, maybe along with the metadata schema pruning PR?
I'm thinking that after pruning, we don't always need to create an InternalRow with all fields in both PartitioningAwareFileIndex.listFiles(...) and FileScanRDD. WDYT?
SGTM
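For reference, a hedged sketch of the kind of helper being discussed, assuming the `_metadata` struct fields from SPARK-37273 (`file_path`, `file_name`, `file_size`, `file_modification_time`); the object name and the millis-to-micros conversion are assumptions here, not necessarily the code the PR merged:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

object FileMetadataUtils { // hypothetical name
  // Build the row backing the _metadata struct from the three file properties
  // that both PartitioningAwareFileIndex.listFiles(...) and FileScanRDD need.
  def createMetadataInternalRow(
      path: Path,
      length: Long,
      modificationTime: Long): InternalRow = {
    InternalRow.fromSeq(Seq(
      UTF8String.fromString(path.toString), // file_path
      UTF8String.fromString(path.getName),  // file_name
      length,                               // file_size
      modificationTime * 1000L              // file_modification_time, assuming millis -> micros
    ))
  }
}
```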
@cloud-fan Hi Wenchen, updated this PR on top of the schema pruning. Also, added 2 utility methods and tried to share more code between PartitioningAwareFileIndex.listFiles(...) and FileScanRDD.
LGTM if tests pass
Thanks, passed!
thanks, merging to master!
thanks, Wenchen!
### What changes were proposed in this pull request?
This PR fixes the missing import. Logical conflict between #35068 and #35055.

### Why are the changes needed?
To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out in compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
[SPARK-38182][SQL] Fix NoSuchElementException if pushed filter does not contain any references

### What changes were proposed in this pull request?
Skip non-reference filters when binding the metadata-based filter.

### Why are the changes needed?
This issue comes from #35055. Reproduce:

```sql
CREATE TABLE t (c1 int) USING PARQUET;
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.BooleanSimplification;
SELECT * FROM t WHERE c1 = 1 AND 2 > 1;
```

and the error message:

```
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.mutable.LinkedHashSet$$anon$1.next(LinkedHashSet.scala:89)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at org.apache.spark.sql.catalyst.expressions.AttributeSet.head(AttributeSet.scala:69)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.$anonfun$listFiles$3(PartitioningAwareFileIndex.scala:85)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listFiles(PartitioningAwareFileIndex.scala:84)
at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:249)
```

### Does this PR introduce _any_ user-facing change?
Yes, a bug fix.

### How was this patch tested?
Added a new test.

Closes #35487 from ulysses-you/SPARK-38182.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
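The stack trace shows the binding code calling `.head` on an empty `AttributeSet`: the constant-only predicate `2 > 1` survives into the pushed data filters once `BooleanSimplification` is disabled. A minimal sketch of the guard the fix describes, with `dataFilters` and `metadataAttrs` as assumed names for the pushed filters and the bound metadata attributes:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Sketch: keep only filters that reference at least one column, all of which
// are metadata columns. A literal-only predicate like `2 > 1` has an empty
// reference set and is now skipped instead of crashing on references.head.
def metadataFilters(
    dataFilters: Seq[Expression],
    metadataAttrs: Seq[Attribute]): Seq[Expression] = {
  dataFilters.filter { f =>
    f.references.nonEmpty && f.references.forall(metadataAttrs.contains)
  }
}
```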
### What changes were proposed in this pull request?
Follow-up PR of #34575. Filtering files if metadata columns are present in the data filter.

### Why are the changes needed?
Performance improvements.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs and a new UT.
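To illustrate the win, a hedged example of a query shape that benefits, using the `_metadata.file_name` field from SPARK-37273; the path and file name here are made up:

```scala
// With this change, the _metadata predicate is evaluated while listing files,
// so non-matching files are skipped before a single row is read.
// `spark` is an existing SparkSession; the path is hypothetical.
spark.read.format("parquet").load("/tmp/events")
  .select("*", "_metadata")
  .where("_metadata.file_name = 'part-00000.parquet'")
  .show()
```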