Spark 3.3: Discard filters that can be pushed down completely #6524
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -29,9 +29,9 @@ | |
| import org.apache.iceberg.Snapshot; | ||
| import org.apache.iceberg.Table; | ||
| import org.apache.iceberg.TableProperties; | ||
| import org.apache.iceberg.exceptions.ValidationException; | ||
| import org.apache.iceberg.expressions.Binder; | ||
| import org.apache.iceberg.expressions.Expression; | ||
| import org.apache.iceberg.expressions.ExpressionUtil; | ||
| import org.apache.iceberg.expressions.Expressions; | ||
| import org.apache.iceberg.relocated.com.google.common.base.Preconditions; | ||
| import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
|
|
@@ -106,41 +106,41 @@ public SparkScanBuilder caseSensitive(boolean isCaseSensitive) { | |
| @Override | ||
| public Filter[] pushFilters(Filter[] filters) { | ||
| List<Expression> expressions = Lists.newArrayListWithExpectedSize(filters.length); | ||
| List<Filter> pushed = Lists.newArrayListWithExpectedSize(filters.length); | ||
| List<Filter> pushableFilters = Lists.newArrayListWithExpectedSize(filters.length); | ||
| List<Filter> postScanFilters = Lists.newArrayListWithExpectedSize(filters.length); | ||
|
Member
This may just be me, but I would mark this as "SparkFilters".
Contributor
Author
I agree it is hard to navigate without context. To make it a bit clearer, I added a comment above. Could you check if it's any better?
Member
I think that's fine. I don't really like the Spark nomenclature here, but I think your comment does a good job defining it. I think I would add that (1) and (2) are placed in "pushableFilters" and (2) and (3) are returned to Spark in "postScanFilters". Really I just think |
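The bucket assignment described in this thread can be sketched in isolation. This is a minimal illustration, not the PR's code: the `Kind` enum, the `Buckets` holder, and the `assign` method are hypothetical stand-ins, and the meaning of (1), (2), and (3) is inferred from the diff (fully pushed down, pushed down but still needing record-level filtering, and not convertible).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these types exist in Spark or Iceberg.
class FilterBucketsSketch {
  enum Kind {
    FULLY_PUSHED,     // (1) converted, bound, and needs no record-level filtering
    PARTIALLY_PUSHED, // (2) converted and bound, but rows must still be re-checked
    UNSUPPORTED       // (3) could not be converted or bound
  }

  static final class Buckets {
    final List<String> pushableFilters = new ArrayList<>();
    final List<String> postScanFilters = new ArrayList<>();
  }

  // Assigns each filter (represented here as a plain string) to one or both lists.
  static Buckets assign(List<String> filters, List<Kind> kinds) {
    Buckets buckets = new Buckets();
    for (int i = 0; i < filters.size(); i++) {
      String filter = filters.get(i);
      Kind kind = kinds.get(i);
      if (kind != Kind.UNSUPPORTED) {
        buckets.pushableFilters.add(filter); // categories (1) and (2)
      }
      if (kind != Kind.FULLY_PUSHED) {
        buckets.postScanFilters.add(filter); // categories (2) and (3)
      }
    }
    return buckets;
  }

  public static void main(String[] args) {
    Buckets b = assign(
        List.of("dt = '2023-01-01'", "id > 5", "udf(x)"),
        List.of(Kind.FULLY_PUSHED, Kind.PARTIALLY_PUSHED, Kind.UNSUPPORTED));
    System.out.println("pushable: " + b.pushableFilters);
    System.out.println("postScan: " + b.postScanFilters);
  }
}
```

Only category (2) lands in both lists, which is why the filter is applied once by the scan and once more by Spark.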
||
|
|
||
| for (Filter filter : filters) { | ||
| Expression expr = null; | ||
| try { | ||
| expr = SparkFilters.convert(filter); | ||
|
Member
I think this is a little hard for me to follow since everything is in the try-catch here. Maybe instead we keep the older pattern of … But I'm pretty sure this is correct either way.
Contributor
Author
We also have to wrap the code that converts the filter. It is unlikely to throw an exception, but we have to make sure it does not fail the query. There are 3 calls that can throw an exception now. I did not want to have nested try-catch blocks because it looked too complicated with the added logic. GitHub renders it in a way that's really hard to read. It does not seem to be that bad in an IDE.
Member
For me it was just thinking about the different failure positions. I didn't know SparkFilters.convert could also throw; I thought it was just the binding.
Contributor
Author
We had some bugs when |
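The single try-catch discussed in this thread can be sketched without Spark or Iceberg on the classpath. This is only an approximation under stated assumptions: filters and expressions are plain strings, the hypothetical `convert` function stands in for both conversion and binding, and `requiresRecordLevelFiltering` is passed in as a predicate. The point it illustrates is that a failure at any step sends the filter back to Spark instead of failing the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch; the names mirror the diff but the types are simplified.
class PushDownFallbackSketch {
  static final class Result {
    final List<String> pushableFilters = new ArrayList<>();
    final List<String> postScanFilters = new ArrayList<>();
  }

  static Result pushFilters(
      List<String> filters,
      Function<String, String> convert, // assumed: returns null if unsupported, may throw
      Predicate<String> requiresRecordLevelFiltering) { // assumed: may throw
    Result result = new Result();
    for (String filter : filters) {
      try {
        String expr = convert.apply(filter);
        if (expr != null) {
          result.pushableFilters.add(filter);
        }
        if (expr == null || requiresRecordLevelFiltering.test(expr)) {
          result.postScanFilters.add(filter);
        }
      } catch (RuntimeException e) {
        // Any failure means the filter must still be evaluated by Spark.
        System.err.println("Failed to check if " + filter + " can be pushed down: " + e.getMessage());
        result.postScanFilters.add(filter);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Result r = pushFilters(
        List.of("ok", "rows", "unknown", "boom"),
        f -> {
          if (f.equals("boom")) {
            throw new IllegalStateException("conversion bug");
          }
          return f.equals("unknown") ? null : f;
        },
        expr -> expr.equals("rows"));
    System.out.println("pushable: " + r.pushableFilters);
    System.out.println("postScan: " + r.postScanFilters);
  }
}
```

Because the post-scan addition is the last statement in the try block, a filter is never added to `postScanFilters` twice, even when the final check throws.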
||
| } catch (IllegalArgumentException e) { | ||
| // converting to Iceberg Expression failed, so this expression cannot be pushed down | ||
| LOG.info( | ||
| "Failed to convert filter to Iceberg expression, skipping push down for this expression: {}. {}", | ||
| filter, | ||
| e.getMessage()); | ||
| } | ||
| Expression expr = SparkFilters.convert(filter); | ||
|
|
||
| if (expr != null) { | ||
| try { | ||
| if (expr != null) { | ||
| // try binding the expression to ensure it can be pushed down | ||
|
RussellSpitzer marked this conversation as resolved.
|
||
| Binder.bind(schema.asStruct(), expr, caseSensitive); | ||
|
|
||
| expressions.add(expr); | ||
| pushed.add(filter); | ||
| } catch (ValidationException e) { | ||
| // binding to the table schema failed, so this expression cannot be pushed down | ||
| LOG.info( | ||
| "Failed to bind expression to table schema, skipping push down for this expression: {}. {}", | ||
| filter, | ||
| e.getMessage()); | ||
| pushableFilters.add(filter); | ||
| } | ||
|
|
||
| if (expr == null || requiresRecordLevelFiltering(expr)) { | ||
| postScanFilters.add(filter); | ||
| } | ||
| } catch (Exception e) { | ||
| LOG.warn("Failed to check if {} can be pushed down: {}", filter, e.getMessage()); | ||
|
Member
This is now a WARN instead of an INFO.
Contributor
Author
I was not sure about this. If there is an exception anywhere in this path, it indicates something went wrong. Seems like something we should warn the user about? I can revert it too. What do you think?
Member
I'm not sure the user can really do anything about the exceptions in this path. It's really only something a dev can fix when working on the Iceberg library, correct?
Contributor
Author
That's true.
Contributor
I would prefer to be more specific about the error here. Logging the |
||
| postScanFilters.add(filter); | ||
| } | ||
| } | ||
|
|
||
| this.filterExpressions = expressions; | ||
| this.pushedFilters = pushed.toArray(new Filter[0]); | ||
| this.pushedFilters = pushableFilters.toArray(new Filter[0]); | ||
|
|
||
| // all unsupported filters and filters that require record-level filtering | ||
| // must be reported back and handled on the Spark side | ||
| return postScanFilters.toArray(new Filter[0]); | ||
| } | ||
|
|
||
| // Spark doesn't support residuals per task, so return all filters | ||
| // to get Spark to handle record-level filtering | ||
| return filters; | ||
| private boolean requiresRecordLevelFiltering(Expression expr) { | ||
|
aokolnychyi marked this conversation as resolved.
Outdated
|
||
| return table.specs().values().stream() | ||
|
Contributor
Author
For now, we simply check each spec in the table. In the future, we may optimize this to only look at selected specs, but that won't be trivial. I think it is a reasonable start.
Member
I think another optimization that may be worth doing would be to group all expressions based on bound column and operation. In the bad case we are considering, we would end up checking whether or not we can filter a "column = literal" for a ton of different literal values. |
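The "check each spec" rule from this thread can be illustrated with a toy model. This is a rough sketch under heavy simplification: a spec is modelled as the set of columns it partitions by identity, a filter as the single column it references, and the local `selectsPartitions` helper is a hypothetical stand-in that does not reproduce what `ExpressionUtil.selectsPartitions` does for real transforms.

```java
import java.util.List;
import java.util.Set;

// Toy model only; real partition specs and expressions are far richer.
class AllSpecsCheckSketch {
  // Assumed simplification: a filter on a column selects whole partitions
  // only if the spec partitions by that column directly.
  static boolean selectsPartitions(String filterColumn, Set<String> spec) {
    return spec.contains(filterColumn);
  }

  // Mirrors the shape of the diff: one spec that cannot guarantee whole
  // partitions is enough to require record-level filtering in Spark.
  static boolean requiresRecordLevelFiltering(String filterColumn, List<Set<String>> specs) {
    return specs.stream().anyMatch(spec -> !selectsPartitions(filterColumn, spec));
  }

  public static void main(String[] args) {
    // Two specs, as after a hypothetical partition evolution that added "region".
    List<Set<String>> specs = List.of(Set.of("day"), Set.of("day", "region"));
    System.out.println(requiresRecordLevelFiltering("day", specs));    // false
    System.out.println(requiresRecordLevelFiltering("region", specs)); // true
  }
}
```

In this model a filter on `region` still needs a post-scan check because files written under the older spec are not partitioned by it.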
||
| .anyMatch(spec -> !ExpressionUtil.selectsPartitions(expr, spec, caseSensitive)); | ||
|
RussellSpitzer marked this conversation as resolved.
|
||
| } | ||
|
|
||
| @Override | ||
|
|
||