@dbtsai (Member) commented Sep 27, 2018

What changes were proposed in this pull request?

This PR allows Spark to translate a Catalyst Expression on a nested field into a data source Filter. It is a building block for letting Parquet, ORC, and other data sources support nested predicate pushdown.
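
For illustration, here is a self-contained sketch (toy types, not Spark's actual classes) of the translation this enables: a nested field reference is flattened into a dot-separated column name inside a source Filter.

sealed trait Expr
case class AttrRef(name: String) extends Expr                  // top-level column
case class GetField(child: Expr, field: String) extends Expr   // nested struct access
case class EqExpr(left: Expr, value: Any) extends Expr         // predicate: left = value

// Stand-in for org.apache.spark.sql.sources.EqualTo.
case class EqualTo(attribute: String, value: Any)

// Flatten a (possibly nested) column reference into a dot-separated name,
// giving up (None) when any component itself contains a dot.
def attrName(e: Expr): Option[String] = e match {
  case AttrRef(name) if !name.contains(".") => Some(name)
  case GetField(child, field) if !field.contains(".") =>
    attrName(child).map(_ + "." + field)
  case _ => None
}

def translateFilter(e: Expr): Option[EqualTo] = e match {
  case EqExpr(left, v) => attrName(left).map(EqualTo(_, v))
  case _ => None
}

// translateFilter(EqExpr(GetField(AttrRef("address"), "city"), "SF"))
//   == Some(EqualTo("address.city", "SF"))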

How was this patch tested?

Tests added

@dbtsai force-pushed the dataSourcePredicate branch 2 times, most recently from 53165b8 to d59cb55 on September 27, 2018 20:17
@SparkQA commented Sep 27, 2018

Test build #96709 has finished for PR 22573 at commit 2f21842.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96710 has finished for PR 22573 at commit 53165b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96711 has finished for PR 22573 at commit d59cb55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member:

Will this cause a regression for data sources that support dots in column names?

Member Author:

Do we have any data source that currently supports dots in column names with pushdown? The worst case would be no pushdown for those data sources.

I know ORC doesn't work for now. We can have a follow-up PR to address this.

Member:

The JDBC data source doesn't seem to have such a restriction, so I worry that this change could cause regressions.

Member:

Yes, @dbtsai, this PR has a regression on ORC at least. The following is the ORC result in Spark 2.3.2; with this change it would slow down at least 5x, like Parquet.

I know ORC doesn't work for now. We can have a follow-up PR to address this.

scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
scala> df.write.mode("overwrite").orc("/tmp/orc")
scala> df.write.mode("overwrite").parquet("/tmp/parquet")
scala> spark.sql("set spark.sql.orc.impl=native")
scala> spark.sql("set spark.sql.orc.filterPushdown=true")
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` = 50000").count)
Time taken: 803 ms

scala> spark.time(spark.read.parquet("/tmp/parquet").where("`col.with.dot` = 50000").count)
Time taken: 5573 ms

scala> spark.version
res6: String = 2.3.2

Member:

Apache Spark 2.4.0 RC2 already has a regression on this case, so, for now, this PR doesn't introduce a regression against the master branch.

scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` = 50000").count)
Time taken: 2405 ms

Contributor:

Probably a dumb question: is it possible to store a column with a . in its name in Parquet?

More holistically, I think it would be better to create an abstraction for a multipart identifier in a filter as opposed to encoding it using a '.'.

@cloud-fan (Contributor):

I think the problem is that the current public Filter API uses a string as the attribute type, which makes nested fields hard to represent.

Ideally we would extend the API and create a new interface for columns and nested columns instead of a string, but Filter is a public API, so this is hard to do.

This PR proposes encoding nested columns as strings. That works, but we should think carefully about the encoding so that column names containing dots are still supported.

@dbtsai (Member Author) commented Sep 28, 2018

I was thinking of changing the APIs in Filter so we could represent nested fields more easily, but I also realized that it's a stable public interface.

Without changing the interface of Filter, we have the following options (a minimal sketch of both follows the list):

  1. Use backticks to wrap any column name or struct field name that contains dots, for example `column.1`.`attribute.b`. This also makes the pushdown plans easier to read when printed in text format.

  2. Use ASCII-delimited text to avoid delimiter collision; for example, \31 (the ASCII unit separator) is conventionally used between fields of a record or members of a row. This simplifies parsing significantly, but the downside is that it isn't human-readable, so when we print the plan we would need to add the backticks for visualization.
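
A minimal sketch of both encodings (illustrative only; the helper names here are made up):

// Option 1: backtick-quote every path component, so dots inside a
// component stay unambiguous.
def quoted(parts: Seq[String]): String =
  parts.map(p => "`" + p + "`").mkString(".")

// Option 2: join components with the ASCII unit separator (0x1F), which is
// very unlikely to appear in a real column name, so splitting is trivial.
val UnitSep = "\u001F"
def encode(parts: Seq[String]): String = parts.mkString(UnitSep)
def decode(s: String): Seq[String] = s.split(UnitSep).toSeq

// quoted(Seq("column.1", "attribute.b"))         == "`column.1`.`attribute.b`"
// decode(encode(Seq("column.1", "attribute.b"))) == Seq("column.1", "attribute.b")
// For display, re-quote the decoded parts: quoted(decode(...)).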

What do you think?

@dongjoon-hyun (Member):

Can we update the public Filter API in Spark 3.0.0? @cloud-fan and @gatorsmile.

@gatorsmile (Member):

Updating Filter APIs sounds reasonable to me. This should be part of our data source API v2. cc @cloud-fan @rxin @rdblue

@dongjoon-hyun (Member):

That's great!

@rdblue (Contributor) commented Oct 1, 2018

The approach we've taken in Iceberg is to allow . in names by using an index in the top-level schema. The full path of every leaf in the schema is produced and added to a map from the full field name to the field's ID.

We do this to avoid two problem areas:

  • Parsing the name using . as a delimiter
  • Traversing the schema structure

For example, the schema 0: a struct<2: x int, 3: y int>, 1: a.z int produces this index: Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1).

Binding filters like a.x > 3 or a.z < 5 is done using the index instead of parsing the field name and traversing, so you get the right result without needing to decide whether "a.x" is nested or if it is the actual name. So the lookup is quick and correctly produces id(2) > 3 and id(1) < 5. This is also used for projection because users want to be able to select nested columns by name using dotted field names.

The only drawback to this approach is that you can't have duplicates in the index: each full field name must be unique. In the example above, the top-level a.z field could not be named a.x or else it would collide with x nested in a.
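
A rough sketch of such an index (hypothetical types, not Iceberg's actual classes):

// Every field carries a stable ID; nested fields are children.
case class Field(id: Int, name: String, children: Seq[Field] = Nil)

// Map every full dotted path to its field ID, so binding a filter or a
// projection is a single lookup instead of name parsing plus traversal.
def buildIndex(fields: Seq[Field], prefix: String = ""): Map[String, Int] =
  fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else prefix + "." + f.name
    (path -> f.id) +: buildIndex(f.children, path).toSeq
  }.toMap

// The schema from the example: 0: a struct<2: x int, 3: y int>, 1: `a.z` int
val schema = Seq(
  Field(0, "a", Seq(Field(2, "x"), Field(3, "y"))),
  Field(1, "a.z"))

buildIndex(schema)  // Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)
// Binding a.x > 3 looks up ID 2; a.z < 5 looks up ID 1 -- no parsing needed.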

@dongjoon-hyun (Member):

Thank you, @rdblue. BTW, in general, indexing might be unsafe in Apache Spark when the metastore schema differs from the file schema. Does this assume Iceberg's schema evolution feature?

@rdblue (Contributor) commented Oct 1, 2018

@dongjoon-hyun, Iceberg schema evolution is based on the field IDs, not on names. The current table schema's names are the runtime names for columns in that table, and all reads happen by first translating those names to IDs and projecting the IDs from the data files. That way, renames can never cause you to get incorrect data.

You're mostly right that Spark has a problem with schema evolution for HadoopFS tables. That wouldn't affect my suggestion here, though. If you're filtering or projecting field m.n, then Spark currently handles that by matching columns by name. If you're matching by name, then m.n can't change across versions, or at least you can always project m.n from the data (in the case of Avro).
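
A tiny illustration of that ID-based resolution (hypothetical code, not Iceberg's): reads resolve the current column name to its ID, then find that ID in the data file, so a rename between writes cannot select the wrong column.

// The current table schema maps names to stable field IDs...
val currentSchema = Map("renamed_col" -> 7)   // column renamed after the file was written
// ...while each data file records the name it used for each ID.
val fileNamesById = Map(7 -> "orig_col")

def resolve(currentName: String): Option[String] =
  currentSchema.get(currentName).flatMap(fileNamesById.get)

// resolve("renamed_col") == Some("orig_col"): the read follows the ID,
// not the name, and still reaches the right data.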

@SparkQA commented Oct 22, 2018

Test build #97748 has finished for PR 22573 at commit d59cb55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2018

Test build #97765 has finished for PR 22573 at commit d59cb55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 22, 2018

Test build #97805 has finished for PR 22573 at commit d59cb55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 14, 2019

Test build #101202 has finished for PR 22573 at commit a996547.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@prodeezy commented Feb 28, 2019

@dbtsai Thanks for this PR! As I understand it, this effort adds support for struct-type columns. Is there another effort to support maps and arrays?

@dbtsai (Member Author) commented Mar 1, 2019

@prodeezy This is for struct types; @dongjoon-hyun and I are working on extending it to maps and arrays.

@prodeezy commented Mar 7, 2019

@dbtsai Can you point me to the JIRA that tracks the maps/arrays support? Thanks!

@dbtsai force-pushed the dataSourcePredicate branch from a996547 to ae88eeb on April 5, 2019 05:36
@SparkQA commented Apr 5, 2019

Test build #104311 has finished for PR 22573 at commit ae88eeb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Apr 5, 2019

retest this please.

@SparkQA commented Apr 5, 2019

Test build #104315 has finished for PR 22573 at commit ae88eeb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor):

@dbtsai what is the status of this PR?

// attrName builds a dot-separated path for a (possibly nested) column,
// returning None when any name component itself contains a dot.
case a: Attribute if !a.name.contains(".") =>
  Some(a.name)
case s: GetStructField if !s.childSchema(s.ordinal).name.contains(".") =>
  attrName(s.child).map(_ + s".${s.childSchema(s.ordinal).name}")
Contributor:

_ + "." + s.childSchema(s.ordinal).name?

@prodeezy:

@dbtsai Thanks again for your work on this feature. We've recently merged a PR to support struct filtering in Iceberg [1]. This still requires Spark to push down the filters to the data source. It would be great to have this work merged as well so that we can leverage it downstream. Can you tell us what's currently blocking this PR (if anything)?

[1] - apache/iceberg#123

@dbtsai (Member Author) commented Jun 12, 2019

@prodeezy I am working on making it happen in Spark 3.0. The challenge is that in the DSv1 filter API there is no easy way to express a nested column, so we just put dots in the string. The DSv2 API has a better design for handling nested columns, but unfortunately it still uses the DSv1 filter module. We would like to propose a new set of filter APIs in v2 to handle this situation.
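
To make that concrete, here is a purely hypothetical sketch of what a v2 filter reference could look like (none of these names exist yet; this is just the shape under discussion):

// A column is an explicit multipart name instead of a dot-encoded string, so
// nested `a`.`b` and a top-level column literally named "a.b" stay distinct.
case class NamedReference(parts: Seq[String]) {
  override def toString: String = parts.map("`" + _ + "`").mkString(".")
}

sealed trait FilterV2
case class EqualToV2(ref: NamedReference, value: Any) extends FilterV2

val nested = EqualToV2(NamedReference(Seq("a", "b")), 1)   // `a`.`b` = 1
val dotted = EqualToV2(NamedReference(Seq("a.b")), 1)      // `a.b` = 1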

@rdblue (Contributor) commented Jun 12, 2019

@dbtsai, +1 for a better public filter API! Let me know what you need and we can work toward getting it in.

@github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions bot added the Stale label Jan 6, 2020
@prodeezy commented Jan 6, 2020

@dbtsai Are we targeting a new API to handle nested filtering as part of a different PR, or would that be done here? If so, can you point me to it?

@github-actions bot closed this Jan 7, 2020