Push down partition filter to Spark when Importing File Based Tables #3745
Conversation
.sameType(DataTypes.CalendarIntervalType)) {
  filterExpressions.add(new EqualTo(ref,
      org.apache.spark.sql.catalyst.expressions.Literal.create(entry.getValue(),
      DataTypes.CalendarIntervalType)));
I only tested the Integer and String data types, using the existing test suite TestAddFilesProcedure. If this PR looks OK, I will also test all the other data types.
For this large if-else-if chain, you might want to look into this lookup-map pattern used here:
iceberg/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkFilters.java
Lines 80 to 220 in 466073b
private static final Map<Class<? extends Filter>, Operation> FILTERS = ImmutableMap
    .<Class<? extends Filter>, Operation>builder()
    .put(AlwaysTrue.class, Operation.TRUE)
    .put(AlwaysTrue$.class, Operation.TRUE)
    .put(AlwaysFalse$.class, Operation.FALSE)
    .put(AlwaysFalse.class, Operation.FALSE)
    .put(EqualTo.class, Operation.EQ)
    .put(EqualNullSafe.class, Operation.EQ)
    .put(GreaterThan.class, Operation.GT)
    .put(GreaterThanOrEqual.class, Operation.GT_EQ)
    .put(LessThan.class, Operation.LT)
    .put(LessThanOrEqual.class, Operation.LT_EQ)
    .put(In.class, Operation.IN)
    .put(IsNull.class, Operation.IS_NULL)
    .put(IsNotNull.class, Operation.NOT_NULL)
    .put(And.class, Operation.AND)
    .put(Or.class, Operation.OR)
    .put(Not.class, Operation.NOT)
    .put(StringStartsWith.class, Operation.STARTS_WITH)
    .build();

public static Expression convert(Filter[] filters) {
  Expression expression = Expressions.alwaysTrue();
  for (Filter filter : filters) {
    Expression converted = convert(filter);
    Preconditions.checkArgument(converted != null, "Cannot convert filter to Iceberg: %s", filter);
    expression = Expressions.and(expression, converted);
  }
  return expression;
}

public static Expression convert(Filter filter) {
  // avoid using a chain of if instanceof statements by mapping to the expression enum.
  Operation op = FILTERS.get(filter.getClass());
  if (op != null) {
    switch (op) {
      case TRUE:
        return Expressions.alwaysTrue();
      case FALSE:
        return Expressions.alwaysFalse();
      case IS_NULL:
        IsNull isNullFilter = (IsNull) filter;
        return isNull(unquote(isNullFilter.attribute()));
      case NOT_NULL:
        IsNotNull notNullFilter = (IsNotNull) filter;
        return notNull(unquote(notNullFilter.attribute()));
      case LT:
        LessThan lt = (LessThan) filter;
        return lessThan(unquote(lt.attribute()), convertLiteral(lt.value()));
      case LT_EQ:
        LessThanOrEqual ltEq = (LessThanOrEqual) filter;
        return lessThanOrEqual(unquote(ltEq.attribute()), convertLiteral(ltEq.value()));
      case GT:
        GreaterThan gt = (GreaterThan) filter;
        return greaterThan(unquote(gt.attribute()), convertLiteral(gt.value()));
      case GT_EQ:
        GreaterThanOrEqual gtEq = (GreaterThanOrEqual) filter;
        return greaterThanOrEqual(unquote(gtEq.attribute()), convertLiteral(gtEq.value()));
      case EQ: // used for both eq and null-safe-eq
        if (filter instanceof EqualTo) {
          EqualTo eq = (EqualTo) filter;
          // comparison with null in normal equality is always null. this is probably a mistake.
          Preconditions.checkNotNull(eq.value(),
              "Expression is always false (eq is not null-safe): %s", filter);
          return handleEqual(unquote(eq.attribute()), eq.value());
        } else {
          EqualNullSafe eq = (EqualNullSafe) filter;
          if (eq.value() == null) {
            return isNull(unquote(eq.attribute()));
          } else {
            return handleEqual(unquote(eq.attribute()), eq.value());
          }
        }
      case IN:
        In inFilter = (In) filter;
        return in(unquote(inFilter.attribute()),
            Stream.of(inFilter.values())
                .filter(Objects::nonNull)
                .map(SparkFilters::convertLiteral)
                .collect(Collectors.toList()));
      case NOT:
        Not notFilter = (Not) filter;
        Filter childFilter = notFilter.child();
        Operation childOp = FILTERS.get(childFilter.getClass());
        if (childOp == Operation.IN) {
          // infer an extra notNull predicate for Spark NOT IN filters
          // as Iceberg expressions don't follow the 3-value SQL boolean logic
          // col NOT IN (1, 2) in Spark is equivalent to notNull(col) && notIn(col, 1, 2) in Iceberg
          In childInFilter = (In) childFilter;
          Expression notIn = notIn(unquote(childInFilter.attribute()),
              Stream.of(childInFilter.values())
                  .map(SparkFilters::convertLiteral)
                  .collect(Collectors.toList()));
          return and(notNull(childInFilter.attribute()), notIn);
        } else if (hasNoInFilter(childFilter)) {
          Expression child = convert(childFilter);
          if (child != null) {
            return not(child);
          }
        }
        return null;
      case AND: {
        And andFilter = (And) filter;
        Expression left = convert(andFilter.left());
        Expression right = convert(andFilter.right());
        if (left != null && right != null) {
          return and(left, right);
        }
        return null;
      }
      case OR: {
        Or orFilter = (Or) filter;
        Expression left = convert(orFilter.left());
        Expression right = convert(orFilter.right());
        if (left != null && right != null) {
          return or(left, right);
        }
        return null;
      }
      case STARTS_WITH: {
        StringStartsWith stringStartsWith = (StringStartsWith) filter;
        return startsWith(unquote(stringStartsWith.attribute()), stringStartsWith.value());
      }
    }
  }
  return null;
}
I'm not sure if a lookup map can be used here because of the use of the sameType function, but it might be worth looking into 😄
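If it helps, here is a minimal sketch of what such a lookup map could look like for this case, keyed by the partition column's DataType rather than by Filter class. The PartitionValueParsers name, the parser map, and the set of handled types are hypothetical illustrations, not code from this PR:

import java.util.Map;
import java.util.function.Function;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.spark.sql.catalyst.expressions.BoundReference;
import org.apache.spark.sql.catalyst.expressions.EqualTo;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.catalyst.expressions.Literal;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

public class PartitionValueParsers {
  // map each supported partition column type to a parser for its string value,
  // replacing the if-else-if chain over DataTypes
  private static final Map<DataType, Function<String, Object>> PARSERS =
      ImmutableMap.<DataType, Function<String, Object>>builder()
          .put(DataTypes.IntegerType, Integer::parseInt)
          .put(DataTypes.LongType, Long::parseLong)
          .put(DataTypes.StringType, value -> value)
          .build();

  static Expression equalTo(BoundReference ref, DataType type, String value) {
    Function<String, Object> parser = PARSERS.get(type);
    if (parser == null) {
      throw new IllegalArgumentException("Unsupported partition column type: " + type);
    }
    // EqualTo(column reference, typed literal) is the partition filter pushed down to Spark
    return new EqualTo(ref, Literal.create(parser.apply(value), type));
  }
}

Keying on the concrete DataType singletons sidesteps sameType for simple types, though it would not cover parameterized types such as decimals.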
List<org.apache.spark.sql.catalyst.expressions.Expression> filterExpressions = new java.util.ArrayList<>();
for (Map.Entry<String, String> entry : partitionFilter.entrySet()) {
  try {
    // IllegalArgumentException is thrown if schema doesn't contain this entry,
Just wondering, how does Spark parse these? Is it just in the ParseExpressions code and then fed directly into here? I know in RewriteDataFilesProcedure we end up using the parser to do a parseExpression, since then we don't have to deal with all of these internal transforms and such. I am wondering if a similar approach could work here?
Spark doesn't wrap this with its internal exceptions. It throws this java.lang.IllegalArgumentException directly if the schema doesn't contain the column.
Sorry, I meant this whole conversion into Spark Expression classes.
I know in Spark this probably happens in the parser, and I was wondering if we could do the same thing here, like
spark.parser.parseExpression("x = y")
I guess somewhere it does some conversion into strings, or from strings into the proper types? Or maybe it doesn't...
I tried spark.parser.parseExpression("dept = hr"), and we get an EqualTo filter whose left side is the UnresolvedAttribute dept and whose right side is the UnresolvedAttribute hr. I think in RewriteDataFilesProcedure we execute the query, so the UnresolvedAttribute gets resolved. But in the case of listFiles, it doesn't go through the name resolution code, so the UnresolvedAttribute remains unresolved and causes problems.
I think I will have to manually construct this EqualTo filter. I will check more to see if I can find a better solution.
I tried spark.parser.parseExpression("value"), hoping Spark could turn this into the right type, but it didn't work. I think I will have to construct the Literal and Filter Expression myself.
@huaxingao: As someone who recently worked in a similar area: in my experience, it is better to avoid a new implementation, as it needs a lot of testing (all data types × all expressions) and the chance of introducing issues is high.
For the unresolved expression issue with spark.parser.parseExpression("value"), I recently found a way to get the resolved expression by adding a prefix and collecting the resolved expression from the plan in #3757.
Maybe you can see if it can help you as well?
@RussellSpitzer
As @ajantha-bhat suggested, we can put the filter "x = y" inside a simple query, let Spark execute the query, and then collect the resolved filter Expression. This solution sounds good to me. WDYT?
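Roughly, the idea would be something like the sketch below — illustrative only, with hypothetical names; it is not the helper this PR ends up calling, just the "run the filter through the analyzer and pull the resolved condition back out" shape:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.catalyst.plans.logical.Filter;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import scala.collection.JavaConverters;

public class ResolvedFilterSketch {
  // Wrap the filter string in a query on the table, let the analyzer resolve it,
  // then pull the resolved condition back out of the plan's Filter node.
  static Expression resolveFilter(SparkSession spark, String tableName, String filter) {
    LogicalPlan analyzed = spark.sql("SELECT * FROM " + tableName + " WHERE " + filter)
        .queryExecution()
        .analyzed();
    return findFilterCondition(analyzed);
  }

  private static Expression findFilterCondition(LogicalPlan plan) {
    if (plan instanceof Filter) {
      return ((Filter) plan).condition();
    }
    for (LogicalPlan child : JavaConverters.seqAsJavaListConverter(plan.children()).asJava()) {
      Expression condition = findFilterCondition(child);
      if (condition != null) {
        return condition;
      }
    }
    return null;
  }
}

Since each filter goes through a full analysis pass, this is heavier than constructing the expressions by hand.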
I like it, but do we have to change all the types then? I know Spark gets upset if our types don't match (column vs. literal), and we have all string literals here :/
Sorry for the late reply. I was off last Friday and yesterday. We don't need to change the types because Spark takes care of the casting.
kbendick left a comment
Thanks for this patch @huaxingao! Some minor comments on code style conventions and things to look into.
I'm still digesting this PR but thought I'd give you those in the interim.
org.apache.spark.sql.execution.datasources.PartitionSpec spec = fileIndex.partitionSpec();
StructType schema = spec.partitionColumns();
if (schema.isEmpty()) {
  return new ArrayList<>();
Nit: Can you use Lists.newArrayList() or ImmutableList.empty() here?
For Lists, we use the repackaged internal guava version from org.apache.iceberg.relocated.com.google.common.collect.Lists. The same is true for ImmutableList, which is already imported.
You could also use Collections.emptyList() to be similar to the emptyMap above.
Also, these can be fixed later once it's determined this PR looks OK. I noticed a comment below about holding off on things until the PR is determined to look OK, and this can definitely be done then 😄
List<org.apache.spark.sql.catalyst.expressions.Expression> filterExpressions =
    getPartitionFilterExpressions(schema, partitionFilter);

List<org.apache.spark.sql.catalyst.expressions.Expression> dataFilters = new java.util.ArrayList<>();
Nit: Same note about avoiding new ArrayList<> in favor of one of the ones mentioned above 👍
@kbendick Thanks a lot for reviewing! I will address your comments later because I might need to change other parts of the code too. I will do all the changes in one commit.
}).collect(Collectors.toList());
}

private static List getPartitionFilterExpressions(SparkSession spark, String tableName,
I think we should use List<org.apache.spark.sql.catalyst.expressions.Expression> instead of List to avoid IDE warnings.
    SparkExpressionConverter.collectResolvedSparkExpression(spark, tableName, filter);
filterExpressions.add(expression);
} catch (AnalysisException e) {
  // ignore if filter cannot be converted to Spark expression
I think we can also add "PartitionFilter map is already validated in the caller", and we should log the error?
I guess maybe follow what you have here https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteDataFilesProcedure.java#L124 and throw IllegalArgumentException?
Yeah, we can do that.
    SparkSession spark, String tableName, Map<String, String> partitionFilter) {
  List<org.apache.spark.sql.catalyst.expressions.Expression> filterExpressions = Lists.newArrayList();
  for (Map.Entry<String, String> entry : partitionFilter.entrySet()) {
    String filter = entry.getKey() + " = '" + entry.getValue() + "'";
Why do we have quotes ('') for the value?
Example: if id is an integer column, then we need id = 3 in the query instead of id = '3'?
Have we tested both string and non-string partition columns?
I was also thinking: instead of a map, can we expose the where clause in the call procedure (similar to rewrite_data_files), so the user can give filters other than equals as well?
I was also thinking: instead of a map, can we expose the where clause in the call procedure (similar to rewrite_data_files), so the user can give filters other than equals as well?
But that would be a breaking API change, I guess. Let's see what others think about this too.
If the filter is on a String column, e.g. a dept column with value hr, we want the filter to be dept = 'hr'.
For non-String columns, such as id = 3, the filter id = '3' still works OK because Spark starts with a Literal holding the String value, and then casts it to a Literal with the Int value once it knows the column type.
TestAddFilesProcedure has both String and int partition columns, so these two are tested. We probably should test other types, e.g. Timestamp, just to make sure.
@RussellSpitzer @kbendick
kbendick left a comment
This LGTM.
One question / corner case about what happens when a partitionFilter doesn't match any partitions, but it's more of a user-experience issue about the error message and not a correctness concern.
Preconditions.checkArgument(!partitions.isEmpty(),
    "Cannot find any partitions in table %s", partitions);
Nit / corner case:
Should we update the precondition message to indicate that it's possible that the filter didn't match any partitions? The current error message might be kind of confusing to users if the file-based table is partitioned. Maybe just Cannot find any matching partitions in table %s?
Fixed. Thanks!
Seq<org.apache.spark.sql.catalyst.expressions.Expression> scalaPartitionFilters =
    JavaConverters.asScalaBufferConverter(filterExpressions).asScala().toSeq();
Seq<org.apache.spark.sql.catalyst.expressions.Expression> scalaDataFilters =
    JavaConverters.asScalaBufferConverter(dataFilters).asScala().toSeq();
Is there an easier way to construct an empty sequence? Also, since this is always empty, can you put the dataFilters definition and this line next to one another? The line to create scalaPartitionFilters can be next to the line above that creates filterExpressions.
});
return new SparkPartition(values, partition.path().toString(), format);
FileStatus fileStatus =
    scala.collection.JavaConverters.seqAsJavaListConverter(partition.files()).asJava().get(0);
scala.collection.JavaConverters is imported. Can you remove the fully-qualified name?
}).collect(Collectors.toList());
}

private static List<org.apache.spark.sql.catalyst.expressions.Expression> getPartitionFilterExpressions(
Is it possible to move this to a separate util class to avoid the conflict with the connector Expression? Maybe SparkPartitionUtil or something?
    throw new IllegalArgumentException("filter " + filter + " cannot be converted to Spark expression");
  }
}
return filterExpressions;
Nit: missing whitespace between the control flow block and the following statement.
    SparkExpressionConverter.collectResolvedSparkExpression(spark, tableName, filter);
filterExpressions.add(expression);
} catch (AnalysisException e) {
  throw new IllegalArgumentException("filter " + filter + " cannot be converted to Spark expression");
Minor: The exception message should follow the conventions for errors:
- Use sentence case. That is, capitalize the first word of the message.
- State what went wrong first: "Cannot convert filter to Spark"
- Next, give context after a :, which in this case is the filter
- Never swallow cause exceptions

This should be throw new IllegalArgumentException("Cannot convert filter to Spark: " + filter, e)
String filter = entry.getKey() + " = '" + entry.getValue() + "'";
try {
  org.apache.spark.sql.catalyst.expressions.Expression expression =
      SparkExpressionConverter.collectResolvedSparkExpression(spark, tableName, filter);
collectResolvedSparkExpression is really expensive. Why does this need to call it?
Can't this produce Expression instances directly rather than building strings and converting with fake plans?
@rdblue Thank you very much for reviewing my PR on the weekend!
Do you mean constructing a filter Expression directly instead of letting Spark generate the Expression? I initially generated the Expression like this:
BoundReference ref = new BoundReference(index, dataType, true);
switch (dataType.typeName()) {
  case "integer":
    filterExpressions.add(new EqualTo(ref,
        org.apache.spark.sql.catalyst.expressions.Literal.create(Integer.parseInt(entry.getValue()),
        DataTypes.IntegerType)));
    break;
There were some concerns from the reviewers because we would need to test each of the data types, so I changed the code to call collectResolvedSparkExpression instead.
I will address all the other comments after I find out what to do for this one, so I can fix all the problems in one commit.
Yes, I think you should construct the filter expression directly rather than calling collectResolvedSparkExpression.
@rdblue I changed the code to construct the filter expression directly. Could you please take one more look? Thank you very much!
I have manually checked that the filter expression is created correctly for all the numeric types, Date, and Timestamp. There are tests for String and int partition filters in TestAddFilesProcedure, and I added a test for a Date partition filter. There is a bug with the Timestamp partition filter in the current code; I will fix it in a separate PR.
The build failure doesn't seem to be related to my changes.
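For reference, the direct-construction approach amounts to something like the sketch below. The helper name, the handled types, and the parsing are illustrative assumptions, not the exact code merged in this PR:

import java.util.List;
import java.util.Map;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.spark.sql.catalyst.expressions.BoundReference;
import org.apache.spark.sql.catalyst.expressions.EqualTo;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.catalyst.expressions.Literal;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

public class PartitionFilterSketch {
  // Build one EqualTo(partition column, typed literal) per entry of the
  // partition filter map, parsing the string value into the column's type first.
  static List<Expression> buildPartitionFilters(StructType schema, Map<String, String> partitionFilter) {
    List<Expression> expressions = Lists.newArrayList();
    for (Map.Entry<String, String> entry : partitionFilter.entrySet()) {
      int index = schema.fieldIndex(entry.getKey());
      DataType dataType = schema.fields()[index].dataType();
      BoundReference ref = new BoundReference(index, dataType, true);
      Object value;
      switch (dataType.typeName()) {
        case "integer":
          value = Integer.parseInt(entry.getValue());
          break;
        case "long":
          value = Long.parseLong(entry.getValue());
          break;
        case "date":
          value = java.sql.Date.valueOf(entry.getValue());
          break;
        default:
          // string partition values pass through unchanged; other types omitted in this sketch
          value = entry.getValue();
      }
      expressions.add(new EqualTo(ref, Literal.create(value, dataType)));
    }
    return expressions;
  }
}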
return new SparkPartition(values, partition.path().toString(), format);

FileStatus fileStatus =
    JavaConverters.seqAsJavaListConverter(partition.files()).asJava().get(0);
Why does this use partition.files() instead of partition.path()?
Because here partition is a PartitionDirectory:
case class PartitionDirectory(values: InternalRow, files: Seq[FileStatus])
listFiles returns a Seq of PartitionDirectory:
def listFiles(
    partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory]
Before my change, partition was a PartitionPath:
case class PartitionPath(values: InternalRow, path: Path)
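(For completeness, a small sketch of how a partition's location can be recovered from a PartitionDirectory, by taking the parent directory of its first file. This is an assumption about how the surrounding code uses it, not a quote of the PR:)

import org.apache.hadoop.fs.FileStatus;
import org.apache.spark.sql.execution.datasources.PartitionDirectory;
import scala.collection.JavaConverters;

public class PartitionPathSketch {
  // PartitionDirectory carries the partition's files rather than its path, so the
  // partition location is derived from the parent directory of the first file.
  static String partitionUri(PartitionDirectory partition) {
    FileStatus firstFile =
        JavaConverters.seqAsJavaListConverter(partition.files()).asJava().get(0);
    return firstFile.getPath().getParent().toString();
  }
}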
Great, thanks for the context! I assumed that it would use the same values.
 * and value is the specific value to be filtered on the column.
 * @return a List of filters in the format of Spark Expression.
 */
public static List getSparkFilterExpressions(StructType schema,
Iceberg doesn't use get in method names because it tends to either be filler or prevent us from having more specific methods. Here, I think a more specific name is partitionMapToExpression.
Fixed. Thanks!
 * and value is the specific value to be filtered on the column.
 * @return a List of filters in the format of Spark Expression.
 */
public static List getSparkFilterExpressions(StructType schema,
List is not parameterized. Could you fix types?
Fixed. Thanks!
Looks great. Thanks, @huaxingao!
Thank you all very much for helping me on this PR!
When getting files from Spark, we want to push down partition filters to Spark, so only the partitions that match the filters will be returned.
Basically, I am using this Spark API to prune partitions
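For reviewers, a minimal sketch of how that pruning call fits together, using the listFiles signature quoted earlier in the review thread; the wrapper class and variable names are mine, and the exact wiring in the PR may differ:

import java.util.List;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.spark.sql.catalyst.expressions.Expression;
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex;
import org.apache.spark.sql.execution.datasources.PartitionDirectory;
import scala.collection.JavaConverters;
import scala.collection.Seq;

public class ListFilesSketch {
  // Convert the partition filter expressions to Scala Seqs and ask the file index
  // to list only the partition directories that match them.
  static List<PartitionDirectory> prunedPartitions(
      InMemoryFileIndex fileIndex, List<Expression> partitionFilters) {
    Seq<Expression> scalaPartitionFilters =
        JavaConverters.asScalaBufferConverter(partitionFilters).asScala().toSeq();
    Seq<Expression> scalaDataFilters =
        JavaConverters.asScalaBufferConverter(Lists.<Expression>newArrayList()).asScala().toSeq();
    return Lists.newArrayList(
        JavaConverters.seqAsJavaListConverter(
            fileIndex.listFiles(scalaPartitionFilters, scalaDataFilters)).asJava());
  }
}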