Avro metrics support: track metrics in Avro value writers #1963
Conversation
// ---------------------------------- Helpers ---------------------------------------------

private Deque<String> fieldNames = Lists.newLinkedList();
private Deque<Schema> parentSchemas = Lists.newLinkedList();
Instead of updating all visitors, why not add extra callbacks like the visitors for Iceberg schemas? I think that supporting beforeField and afterField would be a better way to handle this than passing the parent and name around. The implementation to get a field's ID from its parent and name seems a bit awkward compared to adding an ID stack in one visitor.
I did consider that pattern; from my notes, the main reason I didn't use it is that it wouldn't be as clean as the existing before/afterField pattern in other visitors, because different data structures store field ID information in different places. For example, for struct fields the field ID is stored in the field's own Schema.Field, so we can pass the field directly to before/afterField while looping through the fields; but for a map value, the ID is stored in the map's schema rather than the value's own schema, so beforeMapValue would have to be passed the parent schema. Requiring different visitors to implement before/after with all of these different parameters, and duplicating the per-type ID retrieval logic across visitor implementations, could be messy to reason about, so keeping the logic here might actually be cleaner.
What about creating a fake Schema.Field to pass to the before/after method instead? Another alternative is to pass the field information to the method, like this:

public void beforeField(int fieldId, String name, Schema type);
public void afterField(int fieldId, String name, Schema type);
public void beforeListElement(int elementId, Schema elementType);
public void beforeMapKey(int keyId, Schema keyType);
public void beforeMapValue(int valueId, Schema valueType);

I think some variation on this would be better. We want to avoid keeping additional state in all visitors.
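A minimal sketch of that stack-based variant, assuming the proposed callbacks and keeping the ID stack only in the one visitor that needs it (class and method names here are illustrative, not the actual API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.avro.Schema;

// Illustrative sketch: this visitor tracks field IDs itself via the proposed
// callbacks, so no parent schema or field name needs to be passed around.
public abstract class AvroWithFieldIdVisitor<T> {
  private final Deque<Integer> fieldIds = new ArrayDeque<>();

  public void beforeField(int fieldId, String name, Schema type) {
    fieldIds.push(fieldId);
  }

  public void afterField(int fieldId, String name, Schema type) {
    fieldIds.pop();
  }

  // writer-construction methods would call this to get the ID in scope
  protected int currentFieldId() {
    return fieldIds.peek();
  }
}
```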
@Override
public void write(Void ignored, Encoder encoder) throws IOException {
-   encoder.writeNull();
+   throw new IllegalStateException("[BUG] NullWriter shouldn't be used for writing nulls for Avro");
Is this change necessary? Do you think that it would actually cause a problem if it were used?
I don't think it would; this was just me trying to make it fail loudly instead of silently and to avoid confusion about its usage in the code. I can revert it if you think it's unnecessary.
I think we should revert this so that writeNull is correctly called. Nulls shouldn't be written as anything, but the fact that the encoder has a writeNull method makes me think that it should be called. We don't know what the encoder might be using it for.
private long nullValueCount;

- private OptionWriter(int nullIndex, ValueWriter<T> valueWriter) {
+ private OptionWriter(int nullIndex, ValueWriter<T> valueWriter, Schema.Type type) {
It looks like the only place that type is used is in an error message. I'd prefer not to change the signature here just to print a type.
Oh, this type is also used in the isMetricSupportedType/supportsMetrics check, so we still need it.
I think that we should not pass the type here. Instead, let's have a MetricsOptionWriter and a non-metrics version and choose which one to use ahead of time. That way, we don't keep a count that won't be used. There's no need to check whether the null count should be used when metrics is called. That can be determined ahead of time.
If we do that, then there is no need to pass the type here or into the option method. We can either add a boolean (collectMetrics) or have a separate factory method. I think that would be a bit cleaner.
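A rough sketch of that split, choosing the variant up front so the plain writer never carries an unused count (MetricsOptionWriter is a hypothetical name; ValueWriter and Encoder are the existing types):

```java
import java.io.IOException;
import org.apache.avro.io.Encoder;

// Sketch: a metrics-tracking option writer; the plain OptionWriter would be
// identical minus the nullValueCount field and increment.
class MetricsOptionWriter<T> implements ValueWriter<T> {
  private final int nullIndex;
  private final int valueIndex;
  private final ValueWriter<T> valueWriter;
  private long nullValueCount = 0;

  MetricsOptionWriter(int nullIndex, ValueWriter<T> valueWriter) {
    this.nullIndex = nullIndex;
    this.valueIndex = nullIndex == 0 ? 1 : 0;
    this.valueWriter = valueWriter;
  }

  @Override
  public void write(T option, Encoder encoder) throws IOException {
    if (option == null) {
      nullValueCount += 1;           // the only difference from the plain writer
      encoder.writeIndex(nullIndex);
      encoder.writeNull();
    } else {
      encoder.writeIndex(valueIndex);
      valueWriter.write(option, encoder);
    }
  }
}
```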
Force-pushed 8914589 to 39efa9f (compare)
yyanyy left a comment
Thank you @rdblue for reviewing this PR! I have rebased the changes and responded to or addressed the feedback in this PR and #1946, except for the comment about including rowWriter in metrics() for the positional delete writer; I'll do that in a separate PR, since I want the code and its test coverage to land together.
    CodecFactory codec, Map<String, String> metadata) throws IOException {
  DataFileWriter<D> writer = new DataFileWriter<>(
-     (DatumWriter<D>) metricsAwareDatumWriter);
+     (DatumWriter<D>) datumWriter);
Nit: no need for the newline any more.
  return schema.getObjectProp(propertyName) != null;
}

+ public static int fieldId(Schema currentSchema, Schema parentSchema, Supplier<String> fieldNameGetter) {
I'd like to avoid needing this method because of the strange input types. I think those show that the way we're traversing the schema isn't quite right, which is why this is complicated.
private static final BooleanWriter INSTANCE = new BooleanWriter();

@Override
public Stream<FieldMetrics> metrics() {
  throw new IllegalStateException("[BUG] NullWriter shouldn't be used for writing nulls for Avro");
Let's default this to Stream.empty() so that it won't fail.
 * @param <T2> Type after transformation
 */
@SuppressWarnings("checkstyle:VisibilityModifier")
public abstract static class MetricsAwareTransformWriter<T1, T2> implements ValueWriter<T1> {
We typically use S and T for input and output types in other transform classes.
@Override
public void write(T1 datum, Encoder encoder) throws IOException {
  valueCount++;
  if (datum == null) {
Is this necessary? Before, all of the type-specific writers assumed that the input value was non-null because null isn't allowed unless the type is optional. If it is optional, the writer will be wrapped in an option writer, so there is no need to handle null.
I think this should similarly assume that the option writer tracks null values and that all values will be non-null. That simplifies this class quite a bit.
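Under that assumption, the write path could collapse to something like this (a sketch; updateMinMax is a hypothetical helper standing in for the PR's min/max bookkeeping):

```java
// Sketch: nulls never reach this writer because the option writer handles
// and counts them, so no null branch is needed here.
@Override
public void write(T datum, Encoder encoder) throws IOException {
  valueCount++;
  updateMinMax(datum);          // hypothetical: update min/max for the non-null datum
  writeVal(datum, encoder);
}
```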
    min = transformedDatum;
  }
  writeVal(transformedDatum, encoder);

Nit: no newline after if blocks and unnecessary newline before closing curly.
I see this in a couple other places, too.
 * exceptions when they are accessed.
 */
- public class FloatFieldMetrics extends FieldMetrics {
+ public class FloatFieldMetrics extends FieldMetrics<Number> {
After reviewing #2464, I think I understand why Number is used here instead of Float or Double, but I think it would be better to make each FieldMetrics class specific to a value type. For float this will probably happen when #2464 is merged and this is rebased, but in the meantime you may want to update this for any other type metrics you're introducing in this PR.
    CodecFactory codec, Map<String, String> metadata) throws IOException {
- DataFileWriter<D> writer = new DataFileWriter<>(
-     (DatumWriter<D>) metricsAwareDatumWriter);
+ DataFileWriter<D> writer = new DataFileWriter<>((DatumWriter<D>) datumWriter);
Is this rename needed? While I support cleaning up names, I would generally opt to leave these as-is to have smaller commits that are less likely to cause conflicts.
- // TODO will populate in following PRs if datum writer is a MetricsAwareDatumWriter
- return new Metrics(numRecords, null, null, null);
+ if (!(datumWriter instanceof MetricsAwareDatumWriter)) {
+   return new Metrics(numRecords, null, null, null, null);
Is numRecords correct? What if this is for a field nested in a map or list?
metricsAwareDatumWriter.metrics().forEach(metrics -> {
  String columnName = schema.findColumnName(metrics.id());
  MetricsModes.MetricsMode metricsMode = metricsConfig.columnMode(columnName);
Should we add a method to look up metrics mode by ID?
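Something like this hypothetical overload could keep the lookup in one place (a sketch, not the actual MetricsConfig API):

```java
// Hypothetical convenience method on MetricsConfig; delegates to the
// existing name-based lookup via the Iceberg Schema's findColumnName.
public MetricsModes.MetricsMode columnMode(org.apache.iceberg.Schema schema, int fieldId) {
  return columnMode(schema.findColumnName(fieldId));
}
```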
switch (type.typeId()) {
  case STRING:
    lowerBound = UnicodeUtil.truncateStringMin(
        Literal.of((CharSequence) metrics.lowerBound()), truncateLength).value();
Why not use UnicodeUtil.truncateStringMin(CharSequence, int) instead of the Literal method?
Ah, I see that the min and max methods take literals. I think it would be better to refactor those methods to expose non-Literal implementations instead of creating the literals and unwrapping the result.
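A sketch of what the non-Literal entry point could look like, assuming the min truncation is essentially a code-point prefix (the existing Literal overload would then delegate to it):

```java
// Sketch only: truncate a lower bound to at most `length` code points.
// A prefix of the original string compares <= the original, so it remains
// a valid lower bound.
public static CharSequence truncateStringMin(CharSequence input, int length) {
  String value = input.toString();
  if (value.codePointCount(0, value.length()) <= length) {
    return input;
  }
  return value.substring(0, value.offsetByCodePoints(0, length));
}
```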
}

updateLowerBound(metrics, type, metricsMode).ifPresent(lowerBound -> lowerBounds.put(metrics.id(), lowerBound));
updateUpperBound(metrics, type, metricsMode).ifPresent(upperBound -> upperBounds.put(metrics.id(), upperBound));
I don't quite understand the decision to return an option instead of just passing lowerBounds or upperBounds into the update method and having the put happen there. Wouldn't it be simpler if the put happened at the end of updateLowerBound or updateUpperBound?
  metricsConfig = inputMetricsConfig;
}

Map<Integer, Long> valueCounts = new HashMap<>();
Nit: we typically prefer Maps.newHashMap(), which will add null checking, and no put value should be null here.
Map<Integer, Long> nullValueCounts = new HashMap<>();
Map<Integer, Long> nanValueCounts = new HashMap<>();
Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
Map<Integer, ByteBuffer> upperBounds = new HashMap<>();
One thing to consider is that quite a bit of this method and the helper methods below is generic and could be written for Stream<FieldMetrics>. I don't think it needs to happen right now, but I think it would be good to have this separated into the Avro-specific part and a metrics part that lives in MetricsUtil.
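For instance, the count aggregation could become a format-agnostic helper along these lines (a sketch; the method name and its placement in MetricsUtil are assumptions):

```java
import java.util.Map;
import java.util.stream.Stream;

// Sketch: aggregate per-field counts from any writer that exposes
// Stream<FieldMetrics>, independent of the file format.
static void collectCounts(Stream<FieldMetrics<?>> fieldMetrics,
                          Map<Integer, Long> valueCounts,
                          Map<Integer, Long> nullValueCounts,
                          Map<Integer, Long> nanValueCounts) {
  fieldMetrics.forEach(metrics -> {
    valueCounts.put(metrics.id(), metrics.valueCount());
    nullValueCounts.put(metrics.id(), metrics.nullValueCount());
    nanValueCounts.put(metrics.id(), metrics.nanValueCount());
  });
}
```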
if (schema.getLogicalType() instanceof LogicalMap) {
- return visitor.array(schema, visit(schema.getElementType(), visitor));
+ T result = visit(schema.getElementType(), visitor);
+ return visitor.array(schema, result);
It doesn't look like this change is needed?
| "Invalid map: %s is not a string", keyType); | ||
| return visitor.map(partner, schema, visit(visitor.mapValueType(partner), schema.getValueType(), visitor)); | ||
|
|
||
| visitor.beforeMapValue("value", schema.getValueType(), schema); |
Should this make an artificial call to the map key callbacks using a generic String schema as well?
 * This util class helps Avro DatumWriter builders to retrieve the correct field Id when building Avro DatumWriters
 * with visitor pattern.
 */
public class AvroWriterBuilderFieldIdUtil {
Minor: I think it would be a bit shorter to make this an abstract class that can be extended to get the current field ID via a protected method, but this is a good solution already.
public abstract static class StoredAsIntWriter<T> implements ValueWriter<T> {
  protected final int id;
  protected long valueCount;
  protected Integer max;
Because this is only used for non-null values, I think you can simplify the min and max updates by removing the null check. Just use protected int min = Integer.MAX_VALUE here and remove the null check. Then when creating FieldMetrics, you can check whether valueCount is non-zero to know whether min and max are valid.
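That simplification could look like this (sketch; assumes the option writer guarantees only non-null values reach this writer):

```java
// Sketch: sentinel initial values instead of nullable boxed min/max.
protected long valueCount = 0;
protected int min = Integer.MAX_VALUE;
protected int max = Integer.MIN_VALUE;

protected void update(int value) {
  valueCount += 1;
  if (value < min) {
    min = value;
  }
  if (value > max) {
    max = value;
  }
}

// When building FieldMetrics, treat min/max as valid only when valueCount > 0.
```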
  }
}

private abstract static class FloatingPointWriter<T extends Comparable<T>>
After #2464, would this be needed? Or would you just use the floating point classes from that PR?
private static final LongWriter INSTANCE = new LongWriter();

- private LongWriter() {
+ private static class LongWriter extends ComparableWriter<Long> {
I would expect LongWriter to extend StoredAsLongWriter instead of ComparableWriter. Same for IntegerWriter. Could you update those?
- default Stream<FieldMetrics> metrics() {
-   return Stream.empty(); // TODO will populate in following PRs
- }
+ Stream<FieldMetrics> metrics();
FieldMetrics is parameterized, but this is a bare reference. Could you update it? I think it should be FieldMetrics<?> since the metrics are not necessarily for the written value type, D.
  writeVal(datum, encoder);
}

protected abstract void writeVal(T datum, Encoder encoder) throws IOException;
I would probably name this encode rather than writeVal so that there is less confusion with the write method.
protected abstract void writeVal(T datum, Encoder encoder) throws IOException;

@Override
public Stream<FieldMetrics> metrics() {
I'd like to fix all of the references that don't parameterize FieldMetrics. I think they should be FieldMetrics<?>.
    ByteBuffer.wrap("A".getBytes()), ByteBuffer.wrap("A".getBytes()), metrics);
assertCounts(7, 1L, 0L, 1L, metrics);
assertBounds(7, DoubleType.get(), Double.NaN, Double.NaN, metrics);
+ if (fileFormat() == FileFormat.AVRO) {
Is this needed if #2464 goes in first?
private void assertNonNullColumnSizes(Metrics metrics) {
  if (fileFormat() != FileFormat.AVRO) {
    Assert.assertTrue(metrics.columnSizes().values().stream().allMatch(Objects::nonNull));
  }
Can you add an assertion for Avro? I think that the column sizes map should be null, is that correct?
public class TestGenericAvroMetrics extends TestAvroMetrics {

  protected Metrics getMetrics(Schema schema, OutputFile file, Map<String, String> properties,
                                MetricsConfig metricsConfig, Record... records) throws IOException {
Nit: indentation is off.
- public static <T> ValueWriter<T> option(int nullIndex, ValueWriter<T> writer) {
-   return new OptionWriter<>(nullIndex, writer);
+ public static <T> ValueWriter<T> option(int nullIndex, ValueWriter<T> writer, Schema.Type type) {
+   if (AvroSchemaUtil.supportsMetrics(type)) {
Rather than introducing supportsMetrics, why not just check the value writer to see whether it is a MetricsAwareWriter? I know that not all of the writers extend that class, but you could either introduce a MetricsWriter interface to signal that the inner writer supports metrics that all of the implementations extend, or maybe you could alter the hierarchy a little so that StoredAsIntWriter and StoredAsLongWriter actually do extend MetricsAwareWriter. Then you wouldn't need changes to AvroSchemaUtil or so many changes to option handling.
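A sketch of that selection, assuming MetricsAwareWriter (or a new MetricsWriter interface) marks writers that collect metrics and MetricsOptionWriter is the hypothetical null-counting variant:

```java
// Sketch: choose the option writer by inspecting the wrapped writer,
// so no Schema.Type parameter is needed.
public static <T> ValueWriter<T> option(int nullIndex, ValueWriter<T> writer) {
  if (writer instanceof MetricsAwareWriter) {
    return new MetricsOptionWriter<>(nullIndex, writer);
  }
  return new OptionWriter<>(nullIndex, writer);
}
```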
@Override
public Stream<FieldMetrics> metrics() {
  return metrics(DecimalData::toBigDecimal);
Does this return a Java BigDecimal? I know we've had problems with this method in Spark because it actually produces a Scala BigDecimal.
@Override
public Stream<FieldMetrics> metrics() {
  return metrics(Decimal::toJavaBigDecimal);
Good to see you got the right one.
Any update on this one, @yyanyy?
Apologies, I didn't find a chance to update this. I'll make sure to allocate time to address the comments in the coming two weeks!
Thanks, @yyanyy! No rush, I just wanted to check in.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This change is a smaller PR broken down from #1935.

This change adds field IDs to the constructors of Avro primitive value writers, makes those writers track stats such as value count, min, and max, and exposes a metrics method that can be called to collect FieldMetrics. However, nothing calls these methods yet. This change doesn't include any tests; tests will be added in the next PR once end-to-end integration is set up.

Please note: regarding the change to the signature of FieldMetrics, the alternative would be to keep ByteBuffer as the return value for the lower/upper bound of FieldMetrics and pass each field's metrics mode into each leaf value writer during construction, so that truncation and conversion to a byte buffer could happen while collecting metrics from the writers. I think that's doable, but it would touch many method signatures, including adding the metrics mode to the constructor of every leaf writer and the metrics config to every datum writer (e.g. DataWriter, GenericAppenderFactory); it would, however, avoid computing min/max for fields that don't need them. Please let me know if you are interested, and I'll post a new commit to this PR so that the differences between the two implementations can be compared.