[SPARK-25558][SQL] Pushdown predicates for nested fields in DataSource Strategy #22573
Conversation
53165b8 to d59cb55 (force-pushed)
Test build #96709 has finished for PR 22573.
Test build #96710 has finished for PR 22573.
Test build #96711 has finished for PR 22573.
Will this cause a regression for data sources that support dots in column names?
Do we have any data source that currently supports dots in column names with pushdown? The worst case would be no pushdown for those data sources.
I know ORC doesn't work for now. We can have another follow-up PR to address this.
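To make the ambiguity concrete, here is a minimal REPL sketch (a hypothetical example, not from this PR): with a dot-separated encoding, a pushed filter referencing the string "a.b" cannot distinguish the two schemas below.

scala> val nested = spark.sql("SELECT named_struct('b', 1) AS a")  // struct column a with field b
scala> val dotted = spark.range(1).selectExpr("id AS `a.b`")       // top-level column literally named "a.b"
scala> // a pushed-down filter on "a.b" matches both layouts, so the source can't tell them apart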
The JDBC data source seems to have no such restriction, so I worry that this change could cause some regressions.
Yes, @dbtsai. This PR has a regression on ORC at least. The following is the ORC result in Spark 2.3.2; with this change it will slow down at least 5x, like Parquet.
> I know ORC doesn't work for now. We can have another follow-up PR to address this.
scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
scala> df.write.mode("overwrite").orc("/tmp/orc")
scala> df.write.mode("overwrite").parquet("/tmp/parquet")
scala> spark.sql("set spark.sql.orc.impl=native")
scala> spark.sql("set spark.sql.orc.filterPushdown=true")
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` = 50000").count)
Time taken: 803 ms
scala> spark.time(spark.read.parquet("/tmp/parquet").where("`col.with.dot` = 50000").count)
Time taken: 5573 ms
scala> spark.version
res6: String = 2.3.2
Apache Spark 2.4.0 RC2 has a regression on this case. So, for now, this PR doesn't have a regression on the master branch.
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` = 50000").count)
Time taken: 2405 ms
Probably a dumb question: is it possible to store a column with `.` in its name in Parquet?
More holistically, I think it would be better to create an abstraction for a multipart identifier in a filter, as opposed to encoding it using a `.`.
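A minimal sketch of what such a multipart-identifier abstraction could look like (a hypothetical API with invented names, not part of Spark's `sources.Filter`):

case class FieldReference(parts: Seq[String]) {
  // Quote only the parts that contain a dot:
  // FieldReference(Seq("column.1", "attribute.b")) renders as `column.1`.`attribute.b`
  override def toString: String =
    parts.map(p => if (p.contains(".")) s"`$p`" else p).mkString(".")
}

With the parts carried separately, a data source never has to parse dots back out of a single string.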
I think the problem is the current public `Filter` API, which identifies columns by plain strings. Ideally we should extend the API and create a new interface for columns and nested columns, instead of strings. But this PR proposes to encode nested columns as strings. This works, but we should think carefully about how to encode them, so that column names with dots are still supported.
I was thinking about changing the APIs as well. Without changing the interface of `Filter`, we could quote each part of a column name that contains dots with backticks, for example:
`column.1`.`attribute.b`
It's also easier for people to understand when they are reading the pushdown plans in text format.
What do you think?
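For illustration, a minimal sketch of how a data source could split such a quoted name back into its parts (a hypothetical helper; it does not handle escaped backticks):

def parseParts(name: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val sb = new StringBuilder
  var inQuotes = false
  name.foreach {
    case '`' => inQuotes = !inQuotes                            // toggle at backtick boundaries
    case '.' if !inQuotes => parts += sb.result(); sb.clear()   // unquoted dot separates parts
    case c => sb += c
  }
  parts += sb.result()
  parts.toSeq
}
// parseParts("`column.1`.`attribute.b`") == Seq("column.1", "attribute.b")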
Can we update the public `Filter` API?
Updating the public `Filter` API …
That's great!
The approach we've taken in Iceberg is to allow `.` in a filter's column name to reference nested fields, binding names against an index of full field names. The reason why we do this is to avoid problem areas: …
For example, the schema … Binding filters like … The only drawback to this approach is that you can't have duplicates in the index: each full field name must be unique. In the example above, the top-level field and the nested field cannot share the same full name.
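A minimal sketch of the index idea under those assumptions (hypothetical types, not Iceberg's actual code):

case class Field(fullName: String, id: Int)

// Build a lookup of full dotted field names; the require enforces the
// uniqueness constraint described above (no duplicate full names in the index).
def buildIndex(fields: Seq[Field]): Map[String, Field] = {
  val index = fields.map(f => f.fullName -> f).toMap
  require(index.size == fields.size, "each full field name must be unique")
  index
}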
Thank you, @rdblue. BTW, in general, indexing might be unsafe in Apache Spark when the metastore schema is different from the file schema. Does it assume the schema evolution feature in Iceberg?
@dongjoon-hyun, Iceberg schema evolution is based on field IDs, not on names. The current table schema's names are the runtime names for columns in that table, and all reads happen by first translating those names to IDs and projecting the IDs from the data files. That way, renames can never cause you to get incorrect data. You're mostly right that Spark has a problem with schema evolution for HadoopFS tables. That wouldn't affect my suggestion here, though. If you're filtering or projecting a field …
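A tiny sketch of that ID-based resolution (invented names and data, for illustration only):

// The table schema maps current names to permanent field IDs...
val nameToId = Map("location" -> 1, "location.lat" -> 2)
// ...and each data file is read by ID, whatever the field was called when written.
val fileFieldById = Map(1 -> "loc", 2 -> "loc.latitude")
def resolve(name: String): Option[String] =
  nameToId.get(name).flatMap(fileFieldById.get)
// resolve("location.lat") still finds the right file column after a rename.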
Test build #97748 has finished for PR 22573.
Test build #97765 has finished for PR 22573.
Test build #97805 has finished for PR 22573.
d59cb55 to a996547 (force-pushed)
Test build #101202 has finished for PR 22573.
@dbtsai Thanks for this PR! As I understand, this effort is to add support for struct type columns. Wondering if there's another effort to support Maps and Arrays.
@prodeezy This is for struct type, and @dongjoon-hyun and I are working on extending it to Maps and Arrays.
@dbtsai Can you point me to the JIRA that tracks the maps/arrays support? Thanks!
a996547 to ae88eeb (force-pushed)
Test build #104311 has finished for PR 22573.
retest this please.
Test build #104315 has finished for PR 22573 at commit
|
|
@dbtsai what is the status of this PR?
case a: Attribute if !a.name.contains(".") =>
  Some(a.name)
case s: GetStructField if !s.childSchema(s.ordinal).name.contains(".") =>
  attrName(s.child).map(_ + s".${s.childSchema(s.ordinal).name}")
_ + "." + s.childSchema(s.ordinal).name?
@dbtsai Thanks again for your work on this feature. We've recently merged a PR to support struct filtering in Iceberg [1]. This still requires Spark to push down the filters to the data source. It would be great to have this work merged as well so that we can leverage it downstream. Can you tell us what's currently blocking this PR (if any)?
[1] apache/iceberg#123
@prodeezy I am working on making it happen in Spark 3.0. The challenge is that in the DSv1 filter API there is no easy way to express a nested column, so we just put the nested column name into the filter as a dot-separated string.
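For example (using Spark's existing DSv1 filter class `EqualTo`; the nested path shown is illustrative):

import org.apache.spark.sql.sources.EqualTo
// The only way to reference a nested field is to flatten the path into one
// string, which a source cannot tell apart from a top-level column named "a.b".
val pushed = EqualTo("a.b", 5)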
@dbtsai, +1 for a better public filter API! Let me know what you need and we can work toward getting it in.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
@dbtsai Are we targeting a new API to handle nested filtering as part of a different PR, or would that be done here? If so, can you point me to it?
What changes were proposed in this pull request?
This PR allows Spark to translate a Catalyst `Expression` on a nested field into a data source `Filter`. It is a building block for Parquet, ORC, and other data sources to support nested predicate pushdown.
How was this patch tested?
Tests were added.
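As an illustration of the intended behavior, a minimal REPL sketch (the path is arbitrary, and the exact `PushedFilters` text is illustrative, assuming a source that supports nested pushdown):

scala> val df = spark.sql("SELECT named_struct('b', id) AS a FROM range(10)")
scala> df.write.mode("overwrite").parquet("/tmp/nested")
scala> spark.read.parquet("/tmp/nested").where("a.b = 5").explain
// With this change, the scan should report something like:
//   PushedFilters: [IsNotNull(a.b), EqualTo(a.b,5)]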