Draft: Fix for Spark dataframe count on entries metadata table throws an IllegalArgumentException for partitioned tables #2395
Quick fix for the problem reported in: Spark dataframe count on entries metadata table throws an IllegalArgumentException for partitioned tables #1378
As Russell mentioned, he debugged the same issue in Fix Avro Pruning Bugs with ManifestEntries Table #1744, which attempts a more complete fix. This PR is focused on fixing the 'entries' and 'all_entries' tables.
Background: When running a Spark aggregation query (e.g. count) on the "entries" metadata table, an empty projection is passed in.
However, data_file is a required field per the manifest entry schema in the spec, so this projection triggers java.lang.IllegalArgumentException: Missing required field: data_file in BuildAvroProjection.record (repro sketch below).
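For reference, a minimal repro sketch from spark-shell (the table name db.tbl is illustrative; any partitioned Iceberg table exhibits the failure):

```scala
// db.tbl is assumed to be an existing *partitioned* Iceberg table.
// count() pushes an empty projection down to the entries metadata table.
val entries = spark.read.format("iceberg").load("db.tbl.entries")
entries.count() // java.lang.IllegalArgumentException: Missing required field: data_file
```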
Fix NPE when counting entries #1077 fixed this only for non-partitioned tables.
The difference comes down to peculiar behavior in PruneColumns: empty structs are not pruned away, so for a non-partitioned table (where data_file.partition is an empty struct) data_file is kept in the final projection. In contrast, a non-empty struct with no fields matching the projection is pruned away, so for a partitioned table data_file is dropped from the final projection. A sketch of the two shapes follows.
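To illustrate the two cases, a sketch using Iceberg's public Types API (the field ID and partition column name are illustrative, not part of this PR):

```scala
import org.apache.iceberg.types.Types
import org.apache.iceberg.types.Types.NestedField.required

// Non-partitioned table: data_file.partition is an empty struct.
// PruneColumns keeps empty structs, so data_file survives the empty projection.
val unpartitionedPartitionType = Types.StructType.of()

// Partitioned table: partition has real fields, none of which match the empty
// projection, so the data_file struct is pruned away and BuildAvroProjection
// later fails to find the required data_file field.
val partitionedPartitionType = Types.StructType.of(
  required(1000, "event_date", Types.DateType.get()))
```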
Full exception stack for reference:
Caused by: java.lang.IllegalArgumentException: Missing required field: data_file
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:217)
at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:98)
at org.apache.iceberg.avro.BuildAvroProjection.record(BuildAvroProjection.java:42)
at org.apache.iceberg.avro.AvroCustomOrderSchemaVisitor.visit(AvroCustomOrderSchemaVisitor.java:51)
at org.apache.iceberg.avro.AvroSchemaUtil.buildAvroProjection(AvroSchemaUtil.java:105)
at org.apache.iceberg.avro.ProjectionDatumReader.setSchema(ProjectionDatumReader.java:68)
at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:132)
at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:106)
at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:98)
at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:66)
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:100)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:95)
at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:86)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:897)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:897)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:483)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)