
Conversation

@prodeezy (Contributor) commented Mar 7, 2019

Addresses Issue#122

Update:
This PR should be ready for full review. I've added tests and comments where possible and addressed most PR comments.

A basic test run of this change on pre-generated Iceberg-formatted data containing struct-level metrics (run with Spark patched to push struct filters down to Iceberg):

val schema = new StructType()
  .add("age", IntegerType)
  .add("name", StringType)
  .add("friends", MapType(StringType, IntegerType))
  .add("location", new StructType().add("lat", DoubleType).add("lon", DoubleType))

val iceDf = spark.read.format("iceberg").load("iceberg-people-nestedfield-metrics")
iceDf.createOrReplaceTempView("iceberg_people_nestedfield_metrics")


// Struct filter pushed down by Spark to Iceberg Scan
scala> spark.sql("select * from iceberg_people_nestedfield_metrics where location.lat = 101.123").explain()
== Physical Plan ==
*(1) Project [age#0, name#1, friends#2, location#3]
+- *(1) Filter (isnotnull(location#3) && (location#3.lat = 101.123))
   +- *(1) ScanV2 iceberg[age#0, name#1, friends#2, location#3] (Filters: [isnotnull(location#3), (location#3.lat = 101.123)], Options: [path=iceberg-people-nestedfield-metrics,paths=[]])

// Without this PR, the following query would fail with the exception listed in Issue#122
scala> spark.sql("select * from iceberg_people_nestedfield_metrics where location.lat = 101.123").show()
+---+----+--------------------+-----------------+
|age|name|             friends|         location|
+---+----+--------------------+-----------------+
| 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
+---+----+--------------------+-----------------+


scala> spark.sql("select * from iceberg_people_nestedfield_metrics where location.lat <= 101.123").show()
+---+----+--------------------+-----------------+
|age|name|             friends|         location|
+---+----+--------------------+-----------------+
| 30|Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
+---+----+--------------------+-----------------+


scala> spark.sql("select * from iceberg_people_nestedfield_metrics where location.lat >= 101.123").show()
+---+------+--------------------+-----------------+
|age|  name|             friends|         location|
+---+------+--------------------+-----------------+
| 30|  Andy|[Josh -> 10, Bisw...|[101.123, 50.324]|
| 19|Justin|[Kannan -> 75, Sa...|[175.926, 20.524]|
+---+------+--------------------+-----------------+


scala> spark.sql("select * from iceberg_people_nestedfield_metrics where location.lat > 101.123").show()

+---+------+--------------------+-----------------+
|age|  name|             friends|         location|
+---+------+--------------------+-----------------+
| 19|Justin|[Kannan -> 75, Sa...|[175.926, 20.524]|
+---+------+--------------------+-----------------+

Gist to create the data: https://gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac

/cc @rdblue @aokolnychyi @xabriel @fbocse

}
}
}
return null;
Contributor

Should this method be returning a null value in case none of the nested fields of this struct match by field id?
Instead of an instance of BoundReference with a null private type attribute, would it be acceptable to instead throw new ValidationException("Cannot find nested field id %d in struct: %s", fieldId, struct); ?

Contributor Author

done.

@fbocse (Contributor) commented Mar 8, 2019

Thank you @prodeezy for contributing this, it looks really helpful!

@fbocse (Contributor) commented Mar 8, 2019

A general API-design question: how do we bubble up in the API layer the fact that we support struct-field-based filtering but not filtering on other nested field types? I am not referring to the implementation details of exposing this in the API; I am just curious what a client's expectations should be and whether the API exposes this new capability in a coherent manner.

@prodeezy (Contributor, Author) commented Mar 11, 2019

A general API-design question: how do we bubble up in the API layer the fact that we support struct-field-based filtering but not filtering on other nested field types? ... I am just curious what a client's expectations should be and whether the API exposes this new capability in a coherent manner.

@fbocse Currently it would be implied. The only feedback the client gets right now is in the physical plan, which shows that the filter is pushed down to the Iceberg scan level; if the client inspects the scan (using iceTable.newScan().filter(structFilterExp).planFiles()), they should be able to see the appropriate files being skipped. AFAIK, this is consistent with how top-level field filter expressions currently communicate with the client.
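
For illustration, inspecting the planned files with a nested-field filter could look roughly like the sketch below. The table location is a placeholder and the package names are assumed to match the incubator repo layout at the time, so treat this as a sketch rather than code from the PR.

import com.netflix.iceberg.FileScanTask;
import com.netflix.iceberg.Table;
import com.netflix.iceberg.expressions.Expressions;
import com.netflix.iceberg.hadoop.HadoopTables;

public class NestedFilterScanSketch {
  public static void main(String[] args) {
    // hypothetical table location; any table with struct-level column metrics works
    Table table = new HadoopTables().load("iceberg-people-nestedfield-metrics");

    // a nested-field predicate pushed into the scan; files whose struct metrics
    // rule out the value should not appear in the planned tasks
    for (FileScanTask task : table.newScan()
        .filter(Expressions.equal("location.lat", 101.123))
        .planFiles()) {
      System.out.println(task.file().path());
    }
  }
}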

boolean isNestedFieldExp = expressionFieldPath.indexOf('.') > -1;

field = isNestedFieldExp ? findNestedField(struct, expressionFieldPath, caseSensitive) :
caseSensitive ? struct.field(ref().name()) : struct.caseInsensitiveField(ref().name());
Contributor

minor: can we reuse expressionFieldPath instead of obtaining ref().name() again?

Contributor Author

done

this.fieldId = fieldId;
this.pos = find(fieldId, struct);
this.type = struct.fields().get(pos).type();
this.pos = findTopFieldPos(fieldId, struct);
Contributor

What does it mean to have a pos = -1?

I ask because although this field is private, we do expose it via pos() and toString() methods, so we may need to populate this field with the actual position of the matched inner struct.

Contributor Author

Yeah, good point. I'm going to write some unit tests to evaluate the impact and will handle it accordingly.

Contributor Author

handled this using accessors


return subField.field(lastFieldInPath);

}
Contributor

Now that we need to recurse, I wonder if it makes sense to reuse the indexes available in TypeUtil: https://github.com/apache/incubator-iceberg/blob/0c9c63140e838875dc8cc52a57be2c8f24ad9975/api/src/main/java/com/netflix/iceberg/types/TypeUtil.java#L78-L84 so that this code simplifies to:

Integer idx = indexByName.get(expressionFieldPath);
Types.NestedField field = indexById.get(idx);

That code doesn't check for non-struct parents; we'd perhaps have to create a custom Indexer. Also, we would need to calculate the index lazily and perhaps cache it (as in the Schema, see https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/com/netflix/iceberg/Schema.java#L59-L71).

Contributor Author

Doesn't look like TypeUtil.indexByName & TypeUtil.indexById are used when reading data. They're used by the manifest reader to index manifest file statistics data.

Contributor Author

I guess you are suggesting we now start using this. I don't know what impact that would have, though. Do you see a major benefit to it? I can look into it.

Contributor

I guess you are suggesting we now start using this.

Right.

do you see a major benefit to it?

Code reuse. But it would only make sense if findNestedField() is called many times. The first call to findNestedField() would build the indexes and consult them. Then any subsequent call to findNestedField() would just consult the cached Maps.

Contributor Author

Right. It seems like it's being called many times, but by different expression evaluators. A cache to look up field name to Types.NestedField would help. Good call; will add one in.

Contributor Author

@xabriel I'm now creating an index that maps ids to accessors and then using the accessors to reach types and fields. Access to this index is lazy as well.
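
For illustration only, here is a minimal, self-contained sketch of such a lazily built id-to-accessor index; the class and method names are hypothetical and not the actual implementation in this PR.

import java.util.Map;
import java.util.function.Supplier;

// Illustrative lazily built id -> accessor index; names are made up for this sketch.
class LazyAccessorIndex<A> {
  private final Supplier<Map<Integer, A>> builder;
  private Map<Integer, A> index = null;

  LazyAccessorIndex(Supplier<Map<Integer, A>> builder) {
    this.builder = builder;
  }

  A accessorFor(int fieldId) {
    if (index == null) {
      // built once on the first lookup, then reused by every evaluator
      // that binds expressions against the same schema
      index = builder.get();
    }
    return index.get(fieldId);
  }
}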

@prodeezy (Contributor, Author)

Added struct-field-based unit tests to TestDictionaryRowGroupFilter & TestMetricsRowGroupFilter.

@prodeezy (Contributor, Author) commented Mar 18, 2019

@rdblue thoughts?

This PR should be ready for full review. I've added tests and comments where possible and addressed most PR comments.

@rdblue (Contributor) commented Mar 18, 2019

@prodeezy, thanks for working on this! I'll review it as soon as I get time this week.

@prodeezy (Contributor, Author) commented Mar 20, 2019

@aokolnychyi would be good to get your thoughts on this since you did all the upstream work to enable this :-)

@prodeezy (Contributor, Author) commented Mar 21, 2019

Looks like pull#138 introduced conflicts. Will rebase with latest on master and push.

@prodeezy (Contributor, Author)

Looks like pull#138 introduced conflicts. Will rebase with latest on master and push.

done.

@rdblue (Contributor) commented Mar 23, 2019

@prodeezy, this is looking good. I think you have the right approach, but some things are missing.

With this patch, BoundReference has the correct type and field ID. That works for InclusiveMetricsEvaluator and the Parquet filters that look up data based on ID. But it isn't enough for other evaluators that require binding:

  • Evaluator calls BoundReference.get to get a value from a StructLike. That method needs to be updated to handle struct nesting, preferably using an Accessor (that's an unfinished branch of mine working on this problem). As it is now, evaluators would fail because get will try to access position -1 for any nested field.
  • InclusiveManifestEvaluator uses BoundReference.pos to get the correct PartitionFieldSummary from a ManifestFile instance. The partition field summaries are stored in the order of partition fields, so binding to the partition struct gives the correct position in the field summaries list. I think this would actually work because all fields are top-level, but I think there should be a better signal than returning -1.

This should also bind differently. We never parse identifiers in Iceberg. Instead of parsing, we build indexes using the possible field names. That way, we never have to worry about quoting.

For example, if a user passes in an expression for "a.b.c", that could mean multiple paths: ["a", "b", "c"] or ["a.b", "c"] or ["a", "b.c"] or ["a.b.c"]. But the important thing is that there can be only one column that flattens to "a.b.c" because the columns are ambiguous otherwise. Rather than searching, we keep a map in the schema from "a.b.c" to the right field ID.

Expression binding should use an index like the ones used by Schema and then build accessors using the actual path of field names.
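
As a rough illustration of the name-index idea (package names assume the incubator layout at the time; this is not the code in the PR):

import com.netflix.iceberg.Schema;
import com.netflix.iceberg.types.Types;

public class NameIndexBindingSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.optional(1, "location", Types.StructType.of(
            Types.NestedField.optional(2, "lat", Types.DoubleType.get()),
            Types.NestedField.optional(3, "lon", Types.DoubleType.get()))));

    // The schema's name index maps the flattened name "location.lat" to exactly
    // one field, so binding never has to split or quote the identifier itself.
    Types.NestedField lat = schema.findField("location.lat");
    System.out.println(lat.fieldId() + " -> " + lat.type());
  }
}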

field = isNestedFieldExp ? findNestedField(struct, expressionFieldPath, caseSensitive) :
caseSensitive ? struct.field(expressionFieldPath) : struct.caseInsensitiveField(ref().name());
Schema schema = new Schema(struct.fields());
Types.NestedField field = schema.findField(caseSensitive? ref().name(): ref().name().toLowerCase());
Contributor

This should probably use a case insensitive version of findField instead of assuming that passing lower-case in will work.

Contributor Author

will do.

@prodeezy (Contributor, Author)

@rdblue Is there a way to test the Evaluator end to end? I haven't been able to test that part properly with my conventional gist examples.

@rdblue (Contributor) commented Mar 25, 2019

@prodeezy, IcebergGenerics is a way to read a table directly and will use Evaluator to filter records. Just make sure this also updates the unit tests for the evaluators as well as adding end-to-end tests.
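
For example, such an end-to-end check might look roughly like the sketch below; the table location is a placeholder and the package names assume the incubator layout at the time.

import com.netflix.iceberg.Table;
import com.netflix.iceberg.data.IcebergGenerics;
import com.netflix.iceberg.data.Record;
import com.netflix.iceberg.expressions.Expressions;
import com.netflix.iceberg.hadoop.HadoopTables;

public class NestedFilterGenericsSketch {
  public static void main(String[] args) {
    // hypothetical table location
    Table table = new HadoopTables().load("iceberg-people-nestedfield-metrics");

    // IcebergGenerics filters rows with Evaluator, so a nested-field predicate
    // exercises the new binding and accessor code end to end
    for (Record record : IcebergGenerics.read(table)
        .where(Expressions.equal("location.lat", 101.123))
        .build()) {
      System.out.println(record);
    }
  }
}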

@prodeezy (Contributor, Author)

Added tests for Evaluator and addressed other review comments.

@prodeezy (Contributor, Author)

@rdblue Thanks for taking a detailed look! I think I've addressed the review comments and added the pending tests.

}

public int pos() {
if (pos == -1) {
Contributor

It seems strange to me to throw a ValidationException in some cases. It is likely that code paths will not know whether the bound expression is for a or a.b, but the latter will fail here, during evaluation instead of during binding.

Could the evaluator that uses pos be updated so that this accessor could be used? Maybe PositionAccessor could support accessing summaries from a List of PartitionFieldSummary. That would be cleaner.

@prodeezy (Contributor, Author) — May 13, 2019

Need some clarification here... The evaluator in question is the InclusiveManifestEvaluator, which evaluates partition fields for matching manifests, so ref.pos() is going to be on partition fields. AFAIK, partition stats are kept separately in snapshot files and regular fields won't show up in this partition summary list. Is this an issue when filtering on a nested field in the schema? If so, is the partition source id always used to reference that field in the schema? Can I assume this for logical partitions as well?

E.g., this is a partition summary in a snapshot, which is used by the said evaluator; it is separate from the stats kept on data schema fields.

"partitions": {
    "array": [
      {
        "contains_null": true,
        "lower_bound": {
          "bytes": "\u0013\u0000\u0000\u0000"
        },
        "upper_bound": {
          "bytes": "\u001e\u0000\u0000\u0000"
        }
      }
    ]
  }

Contributor

I think you're saying that this is technically safe, and that's correct. The only code path that calls this is that evaluator, and it is always binding to a flat partition structure. That's why it can use the position: it knows that the array of partition summaries is in the same order as a tuple of partition values.

My point here is that it is brittle to bind to a struct type and use the position for something else, and also that it is a bad API to expose the position when no normal path uses the position directly. Instead, maybe that evaluator should get this position from the first accessor. That way, it validates that the partition field is not nested (should be a single position accessor). My original thought was to add a method to the accessor that can return one of the partition summaries from a list. That would work, too, but requires another accessor method so it isn't a great idea.

@prodeezy (Contributor, Author) — May 14, 2019

Thanks for clarifying. I'm now using PositionAccessor to fetch the position during manifest evaluation, throwing an error if it's a different kind of accessor. Took out the pos() API from BoundReference.
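
For illustration, a minimal, self-contained sketch of that check; PositionAccessor here is a stand-in class, not the real Iceberg accessor type.

public class ManifestPositionSketch {
  interface Accessor {}

  // stand-in for a simple accessor that reads a single top-level position
  static final class PositionAccessor implements Accessor {
    final int pos;
    PositionAccessor(int pos) { this.pos = pos; }
  }

  // Partition summaries are ordered like the partition tuple, so a top-level
  // position doubles as the index into the summaries list; anything else means
  // a nested field, which manifests do not summarize, so it is rejected.
  static int summaryPosition(Accessor accessor) {
    if (accessor instanceof PositionAccessor) {
      return ((PositionAccessor) accessor).pos;
    }
    throw new IllegalArgumentException("Cannot filter manifests by a nested field");
  }

  public static void main(String[] args) {
    System.out.println(summaryPosition(new PositionAccessor(0))); // prints 0
  }
}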

@prodeezy (Contributor, Author) — May 14, 2019

Done. This also cleared out the usage of schema in BoundReference.

@rdblue (Contributor) commented May 7, 2019

@prodeezy, are you still working on this? I'd like to get it in before we release.

@prodeezy (Contributor, Author) commented May 7, 2019

@rdblue sorry about the delay. I'll have the pending comments addressed in a day or two.

@rdblue (Contributor) commented May 7, 2019

No problem! Just let me know if you'd like me to help out.

@prodeezy (Contributor, Author) commented May 13, 2019

@rdblue addressed all but one comment. Need some clarification/guidance on evaluating the position for the bound expression reference.

@prodeezy (Contributor, Author)

I took another glance. I think I've addressed all pending comments. @rdblue

@prodeezy (Contributor, Author)

@rdblue gentle reminder. Let me know if this looks OK to you.

@rdblue (Contributor) commented May 22, 2019

Thanks for the reminder. I should have time to review it this week. Sorry for the delay.

@rdblue (Contributor) commented May 25, 2019

@prodeezy, I thought that it would be easier if I made a few minor changes since it would take longer to ask you to do them than to just move a few things around. I opened a PR against your branch: prodeezy#1

If you agree with those changes, just merge the PR and push and I'll commit this. Thanks!

@rdblue (Contributor) commented May 26, 2019

@prodeezy, the test failures are because I removed the map column from TestMetricsRowGroupFilter. I thought that wasn't used but I guess it was. If you add that column back, the tests should pass.

@prodeezy (Contributor, Author)

Fixed test.

@rdblue merged commit 81f29e2 into apache:master on May 28, 2019
@rdblue (Contributor) commented May 28, 2019

Merged! Thanks for fixing this @prodeezy!
