ORC metrics #199
Conversation
    Literal<?> min = null;
    if (columnStats.hasIntStatistics()) {
      if (column.type().typeId() == Type.TypeID.INTEGER) {
        min = Literal.of((int) columnStats.getIntStatistics().getMinimum());
Why convert to byte buffers in a separate function? It seems that if we do it here directly, we save traversing the columns again and possibly also simplify the code a little. E.g., instead of

    Literal.of(columnStats.getDoubleStatistics().getMinimum());

we could do

    Conversions.toByteBuffer(column.type(), columnStats.getDoubleStatistics().getMinimum());
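A minimal sketch of that direct conversion (the helper and class names here are hypothetical; Conversions.toByteBuffer serializes a value into Iceberg's single-value binary representation for the given type):

    import java.nio.ByteBuffer;

    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    class DirectBoundsSketch {
      // Hypothetical helper: serialize an ORC double minimum straight into the
      // ByteBuffer bound representation, skipping the intermediate Literal.
      static ByteBuffer doubleMinAsBound(Types.NestedField column, double orcMin) {
        return Conversions.toByteBuffer(column.type(), orcMin);
      }
    }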
Yeah, I think that should work. I'll try that. Thanks.
    Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();

    for (int i = 0; i < colStats.length; i++) {
      columSizes.put(i, colStats[i].getBytesOnDisk());
It seems we are not explicitly using the fieldId from the schema as the key in these stats maps. I wonder whether the assumption that the i used to traverse colStats is also the right field ID for a column would always hold true.
For example, when we add a new column in the middle, during schema evolution?
Yeah, this is a good point. I've simplified the code and removed this assumption by iterating the schema columns directly.
+1, these maps should definitely be keyed by fieldId.
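If it helps, a rough sketch of fieldId-keyed maps (illustrative names only; it assumes some fieldId -> ORC column index mapping exists, which is exactly what the column-mapping discussion further down is about):

    import java.util.Map;

    import com.google.common.collect.Maps;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;
    import org.apache.orc.ColumnStatistics;

    class StatsByFieldIdSketch {
      // Key the metrics maps by Iceberg fieldId rather than by array position.
      // fieldIdToOrcIndex is a hypothetical mapping; building it correctly is
      // the subject of the column-mapping thread below.
      static Map<Integer, Long> valueCounts(Schema schema, ColumnStatistics[] colStats,
                                            Map<Integer, Integer> fieldIdToOrcIndex) {
        Map<Integer, Long> valueCounts = Maps.newHashMap();
        for (Types.NestedField col : schema.columns()) {
          Integer orcIndex = fieldIdToOrcIndex.get(col.fieldId());
          if (orcIndex != null) {
            valueCounts.put(col.fieldId(), colStats[orcIndex].getNumberOfValues());
          }
        }
        return valueCounts;
      }
    }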
    }

    private static Optional<ByteBuffer> fromOrcMin(Types.NestedField column,
                                                   OrcProto.ColumnStatistics columnStats) {
Is OrcProto.ColumnStatistics a public API for ORC? I see that proper column statistics interfaces are also defined, e.g. see DecimalColumnStatistics.
I guess the Proto classes are used to avoid a typecast?
Found the right interfaces! Thanks for the pointers.
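For reference, a sketch of what using ORC's public typed statistics interfaces could look like; an instanceof check replaces the proto-level hasIntStatistics lookup (class and method names here are illustrative):

    import java.util.Optional;

    import org.apache.orc.ColumnStatistics;
    import org.apache.orc.IntegerColumnStatistics;

    class TypedStatsSketch {
      // The typed interfaces (IntegerColumnStatistics, DoubleColumnStatistics,
      // DecimalColumnStatistics, ...) are ORC's public API; the proto classes
      // are the internal representation of the file footer.
      static Optional<Long> intMin(ColumnStatistics stats) {
        if (stats instanceof IntegerColumnStatistics && stats.getNumberOfValues() > 0) {
          return Optional.of(((IntegerColumnStatistics) stats).getMinimum());
        }
        return Optional.empty();
      }
    }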
      return Optional.ofNullable(max);
    }

    static Map<Integer, ?> fromBufferMap(Schema schema, Map<Integer, ByteBuffer> map) {
Seems this is only used in the test. Should this then be in the test class? Or do you see this being used in other places?
I'll make it for testing only since I don't see any other use yet.
    for (Types.NestedField col : schema.columns()) {
      final int i = col.fieldId();
      columSizes.put(i, colStats[i].getBytesOnDisk());
I think this can still cause issues, e.g. the col having fieldId i may not be the same column whose stats are at colStats[i].
E.g. if we assume the initial schema is {0 a int, 1 b int} and we add a column in the middle so the new schema is {0 a int, 2 c int, 1 b int}, then col c has fieldId 2 but is at index 1. Note that I'm making an assumption about ORC here, about col c being at index 1.
I see Parquet does it by field name; see org.apache.iceberg.parquet.ParquetUtil#footerMetrics.
+1. This needs to use the mapping in the ORC file to get the correct column for each field ID.
I think this assumption of id -> index in column statistics is made directly in ORC itself. I don't think there's anything else to map a column to its stats object other than the index, which afaict is an assumption made in the Reader interface; the implementation follows this assumption.

> E.g. if we assume the initial schema is {0 a int, 1 b int} and we add a column in the middle so the new schema is {0 a int, 2 c int, 1 b int}, then col c has fieldId 2 but is at index 1. Note that I'm making an assumption about ORC here, about col c being at index 1.

In this example, if we follow the assumption that the array index (in column stats) maps to the column id (preserved in fieldId), then this is still correct. I don't think ColumnStatistics in the ProtoBuf object (in the file footer) gets rearranged to match the column order; it seems the indices are preserved.

> I see Parquet does it by field name; see org.apache.iceberg.parquet.ParquetUtil#footerMetrics.

This seems to be possible because the column object obtained from the iteration contains the related stats information; in ORC, these two pieces of information are detached.
I took a better look at this issue and I've created #213 - I think it'd be best to work on a fix for column mapping in a separate PR.
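For context, one possible shape for such a mapping (a sketch only; it assumes the ORC schema carries the iceberg.id attribute shown later in this diff, and TypeDescription.getAttributeValue requires a recent ORC release):

    import java.util.Map;

    import com.google.common.collect.Maps;
    import org.apache.orc.TypeDescription;

    class FieldIdMappingSketch {
      // Build a fieldId -> ORC column id map from the "iceberg.id" attribute on
      // each ORC field. Assumes orcStruct is the root struct; for a full fix,
      // nested fields would need the same treatment. TypeDescription.getId() is
      // the column's pre-order position in the ORC schema, which is also its
      // index into the file's ColumnStatistics array.
      static Map<Integer, Integer> icebergToOrcIds(TypeDescription orcStruct) {
        Map<Integer, Integer> mapping = Maps.newHashMap();
        for (TypeDescription child : orcStruct.getChildren()) {
          String icebergId = child.getAttributeValue("iceberg.id");
          if (icebergId != null) {
            mapping.put(Integer.parseInt(icebergId), child.getId());
          }
        }
        return mapping;
      }
    }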
@edgarRd, is it possible to separate out the listPartition support? It seems like that could go in earlier while we review the metrics support.
    for (Types.NestedField col : schema.columns()) {
      final int i = col.fieldId();
      columSizes.put(i, colStats[i].getBytesOnDisk());
      valueCounts.put(i, colStats[i].getNumberOfValues());
The metrics filters also require null value counts to determine whether any or all of the values in a column are null. Is that available from ORC?
No. According to the spec, ORC does not record counts of null values; it just stores whether a column has null values or not. From the spec:

> From Hive 1.1.0 onwards, the column statistics will also record if there are any null values within the row group by setting the hasNull flag. The hasNull flag is used by ORC's predicate pushdown to better answer 'IS NULL' queries.
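Given that, roughly the most Iceberg could derive from ORC metadata is a zero-or-unknown null count; a minimal sketch (helper name hypothetical):

    import org.apache.orc.ColumnStatistics;

    class NullCountSketch {
      // ORC exposes only a boolean hasNull flag, so the count can be pinned to
      // exactly 0 when the flag is false; otherwise it is some unknown value > 0.
      static Long approximateNullCount(ColumnStatistics stats) {
        return stats.hasNull() ? null : 0L;  // null here means "unknown"
      }
    }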
    Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();

    for (Types.NestedField col : schema.columns()) {
      final int i = col.fieldId();
The name should be fieldId, not i, because i implies it is an index.
      TypeDescription type = fieldTypes.get(c);
    - fields.add(Types.NestedField.optional(columnIds.get(type), name,
    -     convertOrcToType(type, columnIds)));
    + fields.add(Types.NestedField.optional(type.getId(), name,
Did ORC add field IDs? Why wasn't getId used before?
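For reference, TypeDescription.getId() returns the column's pre-order position in the ORC schema rather than an application-assigned field ID; a small illustration:

    import org.apache.orc.TypeDescription;

    class GetIdDemo {
      public static void main(String[] args) {
        // getId() is assigned in pre-order over the schema tree: 0 is the root
        // struct, then each field and nested field in declaration order.
        TypeDescription schema =
            TypeDescription.fromString("struct<a:int,b:struct<c:string>>");
        System.out.println(schema.getId());                      // 0 (root struct)
        System.out.println(schema.getChildren().get(0).getId()); // 1 (a)
        System.out.println(schema.getChildren().get(1).getId()); // 2 (b)
        System.out.println(schema.getChildren().get(1)
            .getChildren().get(0).getId());                      // 3 (b.c)
      }
    }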
    public OrcMetricsTest() {
      orcSchema = TypeDescription.fromString("struct<w:int,x:bigint,y:int,z:double>");
      icebergSchema = TypeConversion.fromOrc(orcSchema);
I think that this test should create an Iceberg schema with non-sequential field IDs and convert that to ORC. That would validate that the problems that @rdsr pointed out -- using type ID as an index or index as the type ID -- are fixed.
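A sketch of what such a schema could look like (field IDs chosen arbitrarily to be non-sequential; the w/x/y/z column names are reused from this test):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    import static org.apache.iceberg.types.Types.NestedField.optional;

    class NonSequentialIdsSketch {
      // A schema whose field IDs are neither 0-based nor contiguous, so any
      // code that treats fieldId as an index into ColumnStatistics will fail.
      static Schema nonSequentialSchema() {
        return new Schema(
            optional(10, "w", Types.IntegerType.get()),
            optional(3, "x", Types.LongType.get()),
            optional(20, "y", Types.IntegerType.get()),
            optional(7, "z", Types.DoubleType.get()));
      }
    }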
    }

    @Test
    public void testOrcMetricsPrimitive() throws IOException {
@aokolnychyi wrote a thorough test for Parquet. Could this reuse that code instead of writing a new one?
Shall we also follow different metrics modes introduced in PR #263?
@edgarRd what's remaining in this module? Something we can help with?
Thanks for the follow-up on this, @rdsr. I'm finishing the tests for this PR with the recently added column mapping changes. I'll push as soon as I get those working.
Thanks, @edgarRd! I'm looking forward to getting this working for the 0.8.0 release!
@edgarRd, thanks for updating this! Is it ready for review?
    #
    #Wed Jan 29 07:37:53 PST 2020
    distributionUrl=https\://services.gradle.org/distributions/gradle-5.4.1-all.zip
Is this change needed?
    - private static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
    - private static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
    + static final String ICEBERG_ID_ATTRIBUTE = "iceberg.id";
    + static final String ICEBERG_REQUIRED_ATTRIBUTE = "iceberg.required";
I don't see where these are used outside of this class. Why make them package-private instead of private?
    private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescription orcSchema,
                                           final ColumnStatistics[] colStats) {
      final Schema schema = ORCSchemaUtil.convert(orcSchema);
      Map<Integer, Long> columSizes = Maps.newHashMapWithExpectedSize(colStats.length);
Typo: columSizes -> columnSizes
Merged! Thanks for all the work to get this in, @edgarRd!
Thanks @rdblue, @rdsr and @shardulm94 for the review!
    Optional<ByteBuffer> orcMin = (colStat.getNumberOfValues() > 0) ?
        fromOrcMin(icebergCol, colStat) : Optional.empty();
    orcMin.ifPresent(byteBuffer -> lowerBounds.put(icebergCol.fieldId(), byteBuffer));
    Optional<ByteBuffer> orcMax = (colStat.getNumberOfValues() > 0) ?
        fromOrcMax(icebergCol, colStat) : Optional.empty();
    orcMax.ifPresent(byteBuffer -> upperBounds.put(icebergCol.fieldId(), byteBuffer));
@edgarRd @rdblue @rdsr @shardulm94 In ORC, the column stats will have min/max values even when there are null values within the same file (see here). Is this okay for Iceberg?
The only time there should not be a min/max value is when there are no non-null values.
Got it! Thanks for the clarification!
This PR adds ORC metrics support: column sizes, value counts, and lower/upper bounds collected from ORC column statistics.
Thanks.