[SPARK-34960][SQL] Aggregate push down for ORC #34298
Conversation
cc @huaxingao, @sunchao, @viirya, @cloud-fan and @dongjoon-hyun could you help take a look when you have time? Thanks!
| private lazy val fieldNamesSet: Set[String] = fieldNames.toSet
| private[sql] lazy val nameToField: Map[String, StructField] = fields.map(f => f.name -> f).toMap
| private lazy val nameToField: Map[String, StructField] = fields.map(f => f.name -> f).toMap
This reverts the change in #33639, as we don't need to make it more public.
...test/scala/org/apache/spark/sql/execution/datasources/FileSourceAggregatePushDownSuite.scala
| // Not push down Timestamp because INT96 sort order is undefined,
| // Parquet doesn't return statistics for INT96.
| // Not push down Binary type as Parquet can truncate the statistics.
| case StructType(_) | ArrayType(_, _) | MapType(_, _, _) | TimestampType | BinaryType =>
Adding this in Parquet as well, to disallow BinaryType here. This keeps the unit tests shared between Parquet and ORC simpler, and we are already discussing disallowing it. cc @huaxingao feel free to let me revert the change if it does not make sense. Thanks.
Looks good. Thanks for adding this.
We should put StringType here too
@sunchao - updated.
| * this class is used to covert ORC {@link ColumnStatistics}s from array to nested tree structure,
| * according to data types. This is used for aggregate push down in ORC.
| */
| public class OrcColumnsStatistics {
I'm curious why this is in Java? This doesn't look like an API open to others.
No fundamental reason actually. I was following OrcColumnVector, which converts an ORC object to its Spark counterpart. I can move it if needed.
OrcColumnsStatistics -> OrcColumnStatistics?
@sunchao - updated with new name.
| // are combined with filter or group by
| // e.g. SELECT COUNT(col1) FROM t WHERE col2 = 8
| // SELECT COUNT(col1) FROM t GROUP BY col2
| // Todo: 1. add support if groupby column is partition col
@huaxingao Didn't you already add this support?
@viirya - rebased.
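The limitation quoted above (no push down when the aggregate is combined with a data filter or a group by on a data column) can be illustrated with a few hedged examples; `spark` is assumed to be an active SparkSession, and the table and column names are made up:

```scala
// Eligible for push down: a plain aggregate over an ORC table, no data filter, no group by.
spark.sql("SELECT MIN(c1), MAX(c1), COUNT(c1) FROM t").collect()

// Not pushed down: the aggregate is combined with a data-column filter or group by,
// so the normal scan + aggregate path is used instead.
spark.sql("SELECT COUNT(c1) FROM t WHERE c2 = 8").collect()
spark.sql("SELECT COUNT(c1) FROM t GROUP BY c2").collect()
```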
| return None
| }
|
| aggregation.groupByColumns.foreach { col =>
Hmm, shouldn't aggregation.groupByColumns be empty here?
Good call. I didn't change this when moving the logic from 128168d to here. Will update.
@viirya - updated, thanks.
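A minimal sketch of the invariant discussed in this thread, assuming the DSv2 `Aggregation` API quoted above; the assertion is illustrative, not necessarily the exact code that landed:

```scala
// By the time the pushed-down aggregate schema is built here, any group-by columns have
// either been rejected or handled elsewhere, so the list is expected to be empty.
assert(aggregation.groupByColumns.isEmpty,
  "aggregate push down should not reach this point with group-by columns")
```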
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144317 has finished for PR 34298 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
| @@ -123,36 +126,11 @@ abstract class ParquetAggregatePushDownSuite
| }
| }
|
| test("aggregate push down - Count(partition Col): push down") {
We still support push down Count(partition Col), right? Do we still need this test?
@huaxingao - yes, we still support it. Sorry, I removed it by mistake when copying the file. Let me add it back.
@huaxingao - added back.
@c21 Thanks for working on this! I took a quick look, overall it is good. I will find time to take a closer look.
Test build #144326 has finished for PR 34298 at commit
sunchao
left a comment
Thanks @c21 ! left some comments
| public static OrcColumnsStatistics readStatistics(Reader orcReader) {
| TypeDescription orcSchema = orcReader.getSchema();
| ColumnStatistics[] orcStatistics = orcReader.getStatistics();
| StructType dataType = OrcUtils.toCatalystSchema(orcSchema);
nit: maybe rename dataType to sparkSchema.
@sunchao - updated.
| /**
| * Columns statistics interface wrapping ORC {@link ColumnStatistics}s.
| *
| * Because ORC {@link ColumnStatistics}s are stored as an flatten array in ORC file footer,
Is it in pre-order, and does it flatten all the nested types? Might be worth mentioning here.
Yes, added comment.
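To make the pre-order layout concrete, here is a minimal sketch, not Spark's actual implementation: `StatsNode` is a hypothetical stand-in for `OrcColumnStatistics`, and the conversion simply walks the ORC schema in the same pre-order in which the flat statistics array was written.

```scala
import org.apache.orc.{ColumnStatistics, TypeDescription}
import scala.jdk.CollectionConverters._

// Hypothetical nested node: one column's ColumnStatistics plus its sub-columns' nodes.
case class StatsNode(stats: ColumnStatistics, children: Seq[StatsNode])

def toTree(schema: TypeDescription, flat: Iterator[ColumnStatistics]): StatsNode = {
  val self = flat.next()                      // this column's statistics come first (pre-order)
  val children = Option(schema.getChildren)   // null for primitive types
    .map(_.asScala.toSeq)
    .getOrElse(Seq.empty)
    .map(child => toTree(child, flat))        // then each child subtree, in schema order
  StatsNode(self, children)
}

// Usage sketch: toTree(orcReader.getSchema, orcReader.getStatistics.iterator)
```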
| partitionNameSet: Set[String],
| dataFilters: Seq[Expression],
| isAllowedTypeForMinMaxAggregate: DataType => Boolean,
| sparkSession: SparkSession): Option[StructType] = {
nit: sparkSession is unused.
@sunchao - sorry, removed.
| def getSchemaForPushedAggregation(
| aggregation: Aggregation,
| schema: StructType,
| partitionNameSet: Set[String],
nit: maybe partitionNames?
@sunchao - updated.
| val nonNullRowsCount = if (isPartitionColumn) {
| val topLevelStatistics = columnsStatistics.getStatistics
| if (topLevelStatistics.hasNull) {
| throw new SparkException(s"Illegal ORC top-level column statistics with NULL " +
hm, does this mean we have an invalid ORC file here, or is it a valid file that Spark can't handle?
I think we should also give an informative error message to the users so they know how to fallback
@sunchao - here it means the ORC file is invalid. Actually we don't need this check, since ORC guarantees it, and the error message is quite confusing. Removed.
| // Count(*) includes both null and non-null values.
| val topLevelStatistics = columnsStatistics.getStatistics
| if (topLevelStatistics.hasNull) {
| throw new SparkException(s"Illegal ORC top-level column statistics with NULL " +
not sure why we should throw an exception here - doesn't count(*) include NULLs?
@sunchao - yes same as above, this error message is quite confusing. Removed.
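A minimal sketch of the distinction behind these two threads, assuming ORC's `ColumnStatistics` semantics (the helper names are illustrative, not Spark's exact code): `getNumberOfValues` on a data column's statistics counts only non-null values, while the top-level (root struct) statistics reflect every row.

```scala
import org.apache.orc.ColumnStatistics

// COUNT(col) for a data column: nulls are excluded, so use the column's own statistics.
def countColumn(columnStats: ColumnStatistics): Long = columnStats.getNumberOfValues

// COUNT(*) and COUNT(partitionCol): every row counts, so use the top-level statistics
// (the partition column is not stored in the file and applies to every row).
def countAllRows(rootStats: ColumnStatistics): Long = rootStats.getNumberOfValues
```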
| override def isSplitable(path: Path): Boolean = {
| // If aggregate is pushed down, only the file footer will be read once,
| // so file should be not split across multiple tasks.
| pushedAggregate.isEmpty
Seems this is a better approach than we are doing on Parquet side, cc @huaxingao . Also maybe we should change how we measure file weight when combining tasks for aggregate pushdown, since we can combine multiple large files into a single task as computing stats is much cheaper.
@sunchao - agreed, that's why I diverged from the Parquet code path here. We should make sure each file is processed by only one task; splitting the file across multiple tasks is wasteful and useless. I can make the change on the Parquet side after this PR is merged.
Also maybe we should change how we measure file weight when combining tasks for aggregate pushdown, since we can combine multiple large files into a single task as computing stats is much cheaper.
Yes, I thought about this as well. It's not trivial though, as we need to come up with another heuristic to decide how to combine files when an aggregate is pushed down.
Yes, I agree this is a better approach, and Parquet should do it this way too.
@huaxingao - cool, then I can address this for Parquet in a follow-up PR; it's not urgent anyway.
| dataType match {
| // Not push down complex and Timestamp type.
| // Not push down Binary type as ORC does not write min/max statistics for it.
| case StructType(_) | ArrayType(_, _) | MapType(_, _, _) | TimestampType | BinaryType =>
hm, should we add StringType here too? How does ORC store stats for long strings?
@sunchao - yes, updated.
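A sketch of the resulting kind of check for ORC MIN/MAX push down, assuming the types discussed above; it mirrors the quoted match with StringType added, and is not necessarily the exact final code:

```scala
import org.apache.spark.sql.types._

def isAllowedTypeForMinMaxAggregate(dataType: DataType): Boolean = dataType match {
  // Complex types have no meaningful single min/max statistic.
  case _: StructType | _: ArrayType | _: MapType => false
  // Timestamp/Binary/String: statistics may be missing or truncated, so don't rely on them.
  case TimestampType | BinaryType | StringType => false
  case _ => true
}
```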
Thank you @sunchao for review! Will update shortly.
c21
left a comment
Addressed all comments and the PR is ready for review again. Thanks @sunchao, @viirya and @huaxingao.
Test build #144547 has finished for PR 34298 at commit
When reading this code again, I realized that we may not need isCaseSensitive, because the column name has already been normalized. We probably don't need to pass isCaseSensitive down at all. I will double-check this.
I did a quick test. We don't need this isCaseSensitive. I can clean this up if you don't have time for this.
@huaxingao - thanks for checking. Removed for ORC. I can do another PR for Parquet to keep this PR's review moving faster, but if you are already on it for the Parquet code path, feel free to go ahead.
@c21 Did you have a chance to test large ORC files with multiple partitions? For Parquet, I did some testing using a customer's data, but I still doubt my testing is sufficient. Please do more testing if possible. Thanks!
sunchao
left a comment
Thanks @c21 . Overall looks pretty good.
nit: does it mean COUNT for complex types can be pushed down? maybe make it more explicit.
@sunchao - yes, updated the doc.
nit: add a space at the end
@sunchao - added.
hm, why can we include both null and non-null values when the column is a partition column?
@sunchao - because the partition column is never NULL for any row (similar reasoning for Parquet in #33639 (comment)), so for a partition column every row should be counted. Also updated the unit test FileSourceAggregatePushDownSuite."Count(partition column): push down" to test null values.
I see, thanks!
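A hedged illustration of the point above (`spark` is assumed to be an active SparkSession; table and column names are made up):

```scala
// COUNT(part_col) can be answered from footer row counts: the partition value comes from
// the directory path and applies to every row in the partition, so no per-row null check
// against the file data is needed.
spark.sql("SELECT COUNT(part_col) FROM t").show()
```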
in case we are using off-heap memory, we might want to check taskContext.isDefined since otherwise the task completion listener may not be triggered to free up the memory?
@sunchao - makes sense to me, this is also existing behavior of ParquetPartitionReaderFactory.createParquetVectorizedReader(). Updated.
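A minimal sketch of the pattern referenced here, assuming the usual Spark reader setup; the helper names and the `AutoCloseable` reader are illustrative, not the exact identifiers in this PR:

```scala
import org.apache.spark.TaskContext

// Only use off-heap column vectors when a TaskContext exists, so that a completion
// listener can be registered to free the off-heap memory when the task finishes.
def shouldUseOffHeap(offHeapEnabled: Boolean): Boolean =
  offHeapEnabled && TaskContext.get() != null

def registerCleanup(reader: AutoCloseable): Unit =
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => reader.close())
  }
```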
hmm, does an ORC file always have stats?
@sunchao - normally it should. Added code to throw an actionable exception in case the file's statistics are not valid.
I'm just curious, since from https://github.com/apache/orc/blob/main/proto/orc_proto.proto, min/max are optional fields, and ORC's ColumnStatisticsImpl also doesn't set minimum or maximum if the fields from protobuf are not defined.
Taking the Spark write code path as an example here:
Spark uses OrcOutputWriter to write ORC files, and internally it depends on ORC's OrcMapreduceRecordWriter to do the actual write.
File statistics are written during OrcOutputWriter.close() -> OrcMapreduceRecordWriter.close() -> WriterImpl.close() -> WriterImpl.writeFooter() -> TreeWriter.writeFileStatistics(). So writing file statistics is part of writing the file footer, and an exception is thrown if they cannot be written.
TreeWriter contains an individual writer per column. Let's take IntegerTreeWriter as an example for writing an int column.
TreeWriterBase (the superclass of IntegerTreeWriter) maintains real-time per-row/batch statistics (TreeWriterBase.indexStatistics), per-stripe statistics (TreeWriterBase.stripeColStatistics) and per-file statistics (TreeWriterBase.fileStatistics). TreeWriterBase.writeBatch() updates the count statistics, and IntegerTreeWriter.writeBatch() updates the min/max statistics for the int column as each row is written.
I see. Thanks @c21 !
what if the column has 0 values - will min/max still be defined?
@sunchao - great catch! Added handling for an empty file (0 values/rows): we return null instead. Also added a unit test for the empty-file case in FileSourceAggregatePushDownSuite/"aggregate push down - different data types", thanks.
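A minimal sketch of the empty-file guard discussed here, assuming ORC's `ColumnStatistics` API; the helper name is illustrative:

```scala
import org.apache.orc.ColumnStatistics

// With zero values (empty file, or a column that is entirely null), ORC has no min/max,
// so the pushed-down MIN/MAX result must be null rather than some default value.
def minMaxOrNull[T](stats: ColumnStatistics)(extract: ColumnStatistics => T): Any =
  if (stats.getNumberOfValues == 0) null else extract(stats)
```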
Can you add a few comments about how we store OrcColumnStatistics, especially for map and array types? Although it is understandable by reading convertStatistics, it is better to let readers/callers quickly know the format.
@viirya - sure, added some comments and an example.
thank you.
Do we need this? Can't we use sameResult?
Oh, I see.
getMinMaxFromColumnStatistics
@viirya - sorry, fixed.
ditto
Why not use DateColumnStatistics instead of ${statistics.getClass.getName}?
Why not use DateColumnStatistics instead of ${statistics.getClass.getName}?
Sorry if it's not clear, but this is the code path for case _, not case s: DateColumnStatistics. I want to print out the class name of the statistics we do not handle.
Addressed all comments, and the PR is ready for review again, thanks @viirya, @sunchao and @huaxingao.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144632 has finished for PR 34298 at commit
Rebased to latest master
Kubernetes integration test starting
Kubernetes integration test status failure
| * c3: map<key: int, value: string>
| * c4: array<int>
| *
| * OrcColumnStatistics
👍
sunchao
left a comment
LGTM
Test build #144676 has finished for PR 34298 at commit
| val ORC_AGGREGATE_PUSHDOWN_ENABLED = buildConf("spark.sql.orc.aggregatePushdown")
| .doc("If true, aggregates will be pushed down to ORC for optimization. Support MIN, MAX and " +
| "COUNT as aggregate expression. For MIN/MAX, support boolean, integer, float and date " +
| "type. For COUNT, support all data types.")
We support byte, short and double for MIN/MAX too?
I thought to just use integer to represent all integral types (byte, short, int, long) and float to represent all floating-point types (float and double), to be less verbose. We will update the Spark docs on the website with a more detailed explanation of this aggregate push down feature anyway (ideally a table).
Sounds good. Let's have a detailed doc later on.
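A hedged usage sketch of the new config (`spark` is assumed to be an active SparkSession; the table name is made up, and the exact explain text is an assumption about how DSv2 scans report pushed operators):

```scala
// Enable ORC aggregate push down (disabled by default).
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")

// MIN/MAX over supported types and COUNT can then be answered from ORC file footers.
val df = spark.sql("SELECT MIN(id), MAX(id), COUNT(*) FROM orc_table")
df.explain() // the scan node is expected to show the pushed aggregation
df.show()
```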
| * tree pre-ordering. This is used for aggregate push down in ORC.
| *
| * For nested data types (array, map and struct), the sub-field statistics are stored recursively
| * inside parent column's `children` field. Here is an example of `OrcColumnStatistics`:
nit: I think in Javadoc we are supposed to use {@code} instead of back quotes. There are a couple of other places that use back quotes too.
@huaxingao - thanks, updated all class-level comments. For method comments or comments inside a method body, {@code} does not apply, so those are unchanged. There are also back quotes used in comments inside method bodies elsewhere in our code base.
| throw new SparkException(
| s"Cannot read columns statistics in file: $filePath. Please consider disabling " +
| s"ORC aggregate push down by setting 'spark.sql.orc.aggregatePushdown' to false.", e)
| }
I think I need to do the same thing for Parquet too. When the column statistics can't be read, is this guaranteed to be a RuntimeException, or could it be another Exception or Error too?
I think the only case is when the statistics are missing and OrcFooterReader.convertStatistics(): orcStatistics.remove() throws a RuntimeException. But just to be safe, I changed it to catch all Exceptions here.
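A minimal sketch of the resulting error handling, combining the quoted code with the broadened catch discussed here; `orcReader` and `filePath` are assumed to come from the surrounding reader code:

```scala
import org.apache.spark.SparkException

val statistics = try {
  OrcFooterReader.readStatistics(orcReader)
} catch {
  case e: Exception => // broadened from RuntimeException, just to be safe
    throw new SparkException(
      s"Cannot read columns statistics in file: $filePath. Please consider disabling " +
        "ORC aggregate push down by setting 'spark.sql.orc.aggregatePushdown' to false.", e)
}
```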
LGTM
Addressed all comments from @huaxingao. @viirya do you wanna take another look? Thanks.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144724 has finished for PR 34298 at commit
Thanks! Merging to master.
Thank you @viirya, @sunchao and @huaxingao for review!
What changes were proposed in this pull request?
This PR adds the aggregate push down feature for the ORC data source v2 reader.
At a high level, the PR adds MIN/MAX/COUNT aggregate push down for ORC, gated behind a new configuration (see below).
At the code level, the PR does the following:
- `OrcScanBuilder`: `pushAggregation()` checks whether the aggregation can be pushed down. Most of the checking logic is shared between Parquet and ORC and is extracted into `AggregatePushDownUtils.getSchemaForPushedAggregation()`. `OrcScanBuilder` creates an `OrcScan` with the aggregation and the aggregation data schema.
- `OrcScan`: `createReaderFactory` creates an ORC reader factory with the aggregation and schema. Similar to the change in `ParquetScan`.
- `OrcPartitionReaderFactory`: `buildReaderWithAggregates` creates an ORC reader with aggregate push down (i.e. it reads the ORC file footer to process column statistics instead of reading the actual data in the file). `buildColumnarReaderWithAggregates` creates a columnar ORC reader similarly. Both delegate the real footer-reading work to `OrcUtils.createAggInternalRowFromFooter`.
- `OrcUtils.createAggInternalRowFromFooter`: reads the ORC file footer and processes the column statistics (the real heavy lifting happens here). Similar to `ParquetUtils.createAggInternalRowFromFooter`. Leverages utility methods such as `OrcFooterReader.readStatistics`.
- `OrcFooterReader`: `readStatistics` reads the ORC `ColumnStatistics[]` into Spark's `OrcColumnStatistics`. The transformation is needed because the ORC `ColumnStatistics[]` stores all column statistics in a flat array, which is hard to process, while Spark's `OrcColumnStatistics` stores the statistics in a nested tree structure (similar to `StructType`). This is used by `OrcUtils.createAggInternalRowFromFooter`.
- `OrcColumnStatistics`: the easy-to-manipulate structure for ORC `ColumnStatistics`. This is used by `OrcFooterReader.readStatistics`.
Why are the changes needed?
To improve the performance of queries with aggregates.
Does this PR introduce any user-facing change?
Yes. A user-facing config `spark.sql.orc.aggregatePushdown` is added to enable/disable aggregate push down for ORC. The feature is disabled by default.
Added unit tests in `FileSourceAggregatePushDownSuite.scala`. Refactored all the unit tests from #33639; they now run for both Parquet and ORC.