[SPARK-37769][SQL][FOLLOWUP] Filtering files if metadata columns are present in the data filter #35055
```diff
@@ -288,6 +288,20 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession {
     )
   }
 
+  metadataColumnsTest("filter on metadata and user data", schema) { (df, _, f1) =>
```
Contributor
Does this test fail before?
Contributor
I think we should also check the physical plan and make sure the final file list is pruned.
Contributor
Author
No, it wouldn't fail before. I added this test just to make sure we only get metadata filters in
Thanks for the suggestion! I updated the tests to check the files.
```diff
+    checkAnswer(
+      df.select("name", "age", "info",
+        METADATA_FILE_NAME, METADATA_FILE_PATH,
+        METADATA_FILE_SIZE, METADATA_FILE_MODIFICATION_TIME)
+        .where(Column(METADATA_FILE_NAME) === f1(METADATA_FILE_NAME) and Column("name") === "lily")
+        .where(Column(METADATA_FILE_PATH) === f1(METADATA_FILE_PATH))
+        .where("age == 31"),
+      Seq(Row("lily", 31, Row(54321L, "ucb"),
+        f1(METADATA_FILE_NAME), f1(METADATA_FILE_PATH),
+        f1(METADATA_FILE_SIZE), f1(METADATA_FILE_MODIFICATION_TIME)))
+    )
+  }
+
   Seq(true, false).foreach { caseSensitive =>
     metadataColumnsTest(s"upper/lower case when case " +
       s"sensitive is $caseSensitive", schemaWithNameConflicts) { (df, f0, f1) =>
```
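One of the review comments on this test asks that the final file list be checked for pruning. The idea can be pictured with a small Spark-free model. This is only a sketch under stated assumptions: `FileEntry`, `MetadataFilter`, and `pruneFiles` are hypothetical names invented for illustration, not Spark APIs, and the actual pruning in this PR happens inside Spark's file listing code.

```scala
// Hypothetical stand-in for one listed file; this is NOT a Spark class.
final case class FileEntry(path: String, length: Long, modificationTime: Long) {
  // The file name is the last path segment (the whole string if there is no '/').
  def name: String = path.substring(path.lastIndexOf('/') + 1)
}

object FilePruningSketch {
  // A metadata filter depends only on per-file attributes, so it can be
  // evaluated once per file before any rows are read.
  type MetadataFilter = FileEntry => Boolean

  // Keep a file only if every metadata filter accepts it.
  // With no filters, every file is kept.
  def pruneFiles(files: Seq[FileEntry], filters: Seq[MetadataFilter]): Seq[FileEntry] =
    files.filter(file => filters.forall(accepts => accepts(file)))
}
```

In this model, predicates on user columns such as `name` or `age` cannot be decided per file, so they would still have to be applied row by row after the surviving files are read.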
Can we have a util method to create `InternalRow` from `Path`, `length: Long` and `modificationTime: Long`? To share code with `FileScanRDD`.
Could we add this util method and share the common code, maybe along with the metadata schema pruning PR?
I am thinking that after pruning, we don't need to always create `InternalRow` with all fields in both `PartitioningAwareFileIndex.listFiles(...)` and `FileScanRDD`. WDYT?
SGTM
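A rough sketch of what such a shared helper might look like once schema pruning is in place is given below. It is an assumption-laden illustration, not the code that was merged: `FileMetadataRowSketch` and `createMetadataValues` are hypothetical names, a plain `String` stands in for Hadoop's `Path`, and a `Seq[Any]` stands in for `InternalRow` so the example runs without Spark on the classpath.

```scala
object FileMetadataRowSketch {
  // Names of the fields in the file metadata struct.
  val FilePath = "file_path"
  val FileName = "file_name"
  val FileSize = "file_size"
  val FileModificationTime = "file_modification_time"

  // Hypothetical helper: builds only the requested metadata values, in the
  // requested order, from the three inputs named in the review comment.
  // A real implementation would return an InternalRow instead of Seq[Any].
  def createMetadataValues(
      requested: Seq[String],
      path: String,
      length: Long,
      modificationTime: Long): Seq[Any] =
    requested.map {
      case FilePath => path
      case FileName => path.substring(path.lastIndexOf('/') + 1)
      case FileSize => length
      case FileModificationTime => modificationTime
      case other =>
        throw new IllegalArgumentException(s"Unknown metadata field: $other")
    }
}
```

Keeping the field-to-value mapping in one place means a file listing path and a scan path could both call it, and a pruned request such as `Seq("file_name")` yields a single value rather than all four.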