Skip to content

Conversation

@lxynov
Copy link
Contributor

@lxynov lxynov commented Jun 13, 2020

Reduce compute complexity from O(N*R) to O(N), where N is the number of
columns (including all child columns in LIST, MAP and STRUCT) in an ORC
schema, R is the average number of LIST-or-MAP ancestors per node.

Test done: TestOrcMetrics passes

Reduce compute complexity from O(N*R) to O(N), where N is the number of
columns (including all child columns in LIST, MAP and STRUCT) in an ORC
schema, R is the average number of LIST-or-MAP ancestors per node.
return flatTypes;
private static void findColumnsInContainers(TypeDescription column,
Set<TypeDescription> columnsInContainers,
boolean isInContainers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main reservation about this is that it doesn't use the ORC type visitor.

The visitors keep code cleaner. Recursive code is kept to a single method that is reused, so all of the tree traversals are similar. Visitors also allow us to go find all of the places that traverse a given structure to make sure they are up to date, so updates and maintenance are easier.

Copy link
Contributor

@edgarRd edgarRd Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this change, if the concern is compute complexity on (really) large values of N we should not be using recursion either in as it is used here due to potential stack overflows, and instead an iterative version should be used.

I found myself using the visitor to be simpler for maintenance and readability since it's a common pattern used across the codebase. Usually, schemas would not be that large (possibly there are some exceptions), but even in the 100s of columns I'm not sure if the complexity impact is too high. I guess this was a trade-off on complexity vs using a common pattern in the codebase.

Copy link
Contributor

@rdblue rdblue Jun 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about the complexity of traversing a schema not being a huge concern. I think the main value of this is to simplify the code to make it easier to maintain.

Here's the version I came up with, which is a bit simpler:

  private static Set<Integer> statsColumns(TypeDescription schema) {
    return OrcSchemaWithTypeVisitor.visit((Type) null, schema, new StatsColumnsVisitor());
  }

  private static class StatsColumnsVisitor extends OrcSchemaWithTypeVisitor<Set<Integer>> {
    @Override
    public Set<Integer> record(Types.StructType s, TypeDescription record, List<String> names, List<Set<Integer>> fields) {
      ImmutableSet.Builder<Integer> result = ImmutableSet.builder();
      fields.stream().filter(Objects::nonNull).forEach(result::addAll);
      record.getChildren().stream().map(ORCSchemaUtil::fieldId).forEach(result::add);
      return result.build();
    }

    @Override
    public Set<Integer> list(Types.ListType l, TypeDescription array, Set<Integer> element) {
      return null;
    }

    @Override
    public Set<Integer> map(Types.MapType m, TypeDescription map, Set<Integer> key, Set<Integer> value) {
      return null;
    }

    @Override
    public Set<Integer> primitive(Type.PrimitiveType p, TypeDescription primitive) {
      return null;
    }
  }

Using this would avoid the need to negate the check, so it would be if (statsColumns.contains(icebergId)) {...}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue @edgarRd I think these are good points. @rdblue Your code snippet is indeed a neat and logical solution. Please feel free to close this PR and supersede it with yours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lxynov, I opened #1167 to fix this. Please have a look and review it if you have time. I'll close this one.

@rdblue
Copy link
Contributor

rdblue commented Jun 15, 2020

I thought about asking for a similar change when merging the original PR, but I thought someone would submit a patch for it. Thanks for taking the time to keep the code clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants