Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112

lxynov · 2020-06-13T01:58:03Z

Reduce compute complexity from O(N*R) to O(N), where N is the number of
columns (including all child columns in LIST, MAP and STRUCT) in an ORC
schema, R is the average number of LIST-or-MAP ancestors per node.

Test done: TestOrcMetrics passes

Reduce compute complexity from O(N*R) to O(N), where N is the number of columns (including all child columns in LIST, MAP and STRUCT) in an ORC schema, R is the average number of LIST-or-MAP ancestors per node.

rdblue · 2020-06-15T22:43:45Z

orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java

-      return flatTypes;
+  private static void findColumnsInContainers(TypeDescription column,
+                                              Set<TypeDescription> columnsInContainers,
+                                              boolean isInContainers) {


My main reservation about this is that it doesn't use the ORC type visitor.

The visitors keep code cleaner. Recursive code is kept to a single method that is reused, so all of the tree traversals are similar. Visitors also allow us to go find all of the places that traverse a given structure to make sure they are up to date, so updates and maintenance are easier.

I'm not sure about this change, if the concern is compute complexity on (really) large values of N we should not be using recursion either in as it is used here due to potential stack overflows, and instead an iterative version should be used.

I found myself using the visitor to be simpler for maintenance and readability since it's a common pattern used across the codebase. Usually, schemas would not be that large (possibly there are some exceptions), but even in the 100s of columns I'm not sure if the complexity impact is too high. I guess this was a trade-off on complexity vs using a common pattern in the codebase.

I agree about the complexity of traversing a schema not being a huge concern. I think the main value of this is to simplify the code to make it easier to maintain.

Here's the version I came up with, which is a bit simpler:

private static Set<Integer> statsColumns(TypeDescription schema) { return OrcSchemaWithTypeVisitor.visit((Type) null, schema, new StatsColumnsVisitor()); } private static class StatsColumnsVisitor extends OrcSchemaWithTypeVisitor<Set<Integer>> { @Override public Set<Integer> record(Types.StructType s, TypeDescription record, List<String> names, List<Set<Integer>> fields) { ImmutableSet.Builder<Integer> result = ImmutableSet.builder(); fields.stream().filter(Objects::nonNull).forEach(result::addAll); record.getChildren().stream().map(ORCSchemaUtil::fieldId).forEach(result::add); return result.build(); } @Override public Set<Integer> list(Types.ListType l, TypeDescription array, Set<Integer> element) { return null; } @Override public Set<Integer> map(Types.MapType m, TypeDescription map, Set<Integer> key, Set<Integer> value) { return null; } @Override public Set<Integer> primitive(Type.PrimitiveType p, TypeDescription primitive) { return null; } }

Using this would avoid the need to negate the check, so it would be if (statsColumns.contains(icebergId)) {...}.

@rdblue @edgarRd I think these are good points. @rdblue Your code snippet is indeed a neat and logical solution. Please feel free to close this PR and supersede it with yours.

@lxynov, I opened #1167 to fix this. Please have a look and review it if you have time. I'll close this one.

rdblue · 2020-06-15T22:44:10Z

I thought about asking for a similar change when merging the original PR, but I thought someone would submit a patch for it. Thanks for taking the time to keep the code clean.

Reduce compute complexity in OrcMetrics::findColumnsInContainers

3ecb21c

Reduce compute complexity from O(N*R) to O(N), where N is the number of columns (including all child columns in LIST, MAP and STRUCT) in an ORC schema, R is the average number of LIST-or-MAP ancestors per node.

rdblue reviewed Jun 15, 2020

View reviewed changes

rdblue mentioned this pull request Jul 4, 2020

ORC: Simplify logic to determine which columns have stats #1167

Merged

rdblue closed this Jul 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112

Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112

Uh oh!

lxynov commented Jun 13, 2020 •

edited

Loading

Uh oh!

rdblue Jun 15, 2020

Uh oh!

edgarRd Jun 17, 2020 •

edited

Loading

Uh oh!

rdblue Jun 17, 2020 •

edited

Loading

Uh oh!

lxynov Jun 17, 2020

Uh oh!

rdblue Jul 4, 2020

Uh oh!

rdblue commented Jun 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112

Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112

Uh oh!

Conversation

lxynov commented Jun 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rdblue Jun 15, 2020

Choose a reason for hiding this comment

Uh oh!

edgarRd Jun 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Jun 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lxynov Jun 17, 2020

Choose a reason for hiding this comment

Uh oh!

rdblue Jul 4, 2020

Choose a reason for hiding this comment

Uh oh!

rdblue commented Jun 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lxynov commented Jun 13, 2020 •

edited

Loading

edgarRd Jun 17, 2020 •

edited

Loading

rdblue Jun 17, 2020 •

edited

Loading