Reduce compute complexity in OrcMetrics::findColumnsInContainers #1112
Reduce compute complexity from O(N*R) to O(N), where N is the number of columns (including all child columns in LIST, MAP and STRUCT) in an ORC schema and R is the average number of LIST-or-MAP ancestors per node.
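A minimal sketch of how a single pass can reach O(N): each node is visited once, and a boolean flag carries "some ancestor is a LIST or MAP" down the tree, so no node ever has to walk back up its ancestors. The `Node` and `Kind` types are hypothetical stand-ins for ORC's `TypeDescription`, and the result set holds integer ids rather than `TypeDescription` objects to keep the example self-contained; this is not the code in the PR.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ContainerColumnsSketch {
  enum Kind { STRUCT, LIST, MAP, PRIMITIVE }

  // Hypothetical stand-in for an ORC TypeDescription node.
  static final class Node {
    final int id;
    final Kind kind;
    final List<Node> children;

    Node(int id, Kind kind, Node... children) {
      this.id = id;
      this.kind = kind;
      this.children = Arrays.asList(children);
    }
  }

  // Visits every node exactly once. The flag is true when any ancestor
  // of `column` is a LIST or MAP.
  static void findColumnsInContainers(Node column,
                                      Set<Integer> columnsInContainers,
                                      boolean isInContainers) {
    if (isInContainers) {
      columnsInContainers.add(column.id);
    }
    boolean childrenInContainers =
        isInContainers || column.kind == Kind.LIST || column.kind == Kind.MAP;
    for (Node child : column.children) {
      findColumnsInContainers(child, columnsInContainers, childrenInContainers);
    }
  }

  static Set<Integer> columnsInContainers(Node root) {
    Set<Integer> result = new LinkedHashSet<>();
    findColumnsInContainers(root, result, false);
    return result;
  }
}
```

The LIST or MAP column itself is not added by this sketch, only its descendants; whether the real method treats the container column the same way is an assumption here.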
    return flatTypes;
  private static void findColumnsInContainers(TypeDescription column,
                                              Set<TypeDescription> columnsInContainers,
                                              boolean isInContainers) {
My main reservation about this is that it doesn't use the ORC type visitor.
The visitors keep code cleaner. Recursive code is kept to a single method that is reused, so all of the tree traversals are similar. Visitors also allow us to go find all of the places that traverse a given structure to make sure they are up to date, so updates and maintenance are easier.
I'm not sure about this change. If the concern is compute complexity for (really) large values of N, we should not be using recursion as it is used here either, because of potential stack overflows; an iterative version should be used instead.
I found the visitor simpler for maintenance and readability, since it's a common pattern used across the codebase. Schemas are usually not that large (there may be some exceptions), and even with hundreds of columns I'm not sure the complexity impact is significant. I guess this was a trade-off between compute complexity and using a common pattern in the codebase.
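For reference, the iterative alternative mentioned in this comment could look roughly like the sketch below, which replaces the call stack with an explicit `ArrayDeque`. The `Node`, `Kind`, and `Frame` types are hypothetical stand-ins (the real code works on ORC's `TypeDescription`), and ids are plain integers for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IterativeContainerColumns {
  enum Kind { STRUCT, LIST, MAP, PRIMITIVE }

  // Hypothetical stand-in for an ORC TypeDescription node.
  static final class Node {
    final int id;
    final Kind kind;
    final List<Node> children;

    Node(int id, Kind kind, Node... children) {
      this.id = id;
      this.kind = kind;
      this.children = Arrays.asList(children);
    }
  }

  // A pending node plus whether any of its ancestors is a LIST or MAP.
  private static final class Frame {
    final Node node;
    final boolean inContainer;

    Frame(Node node, boolean inContainer) {
      this.node = node;
      this.inContainer = inContainer;
    }
  }

  static Set<Integer> columnsInContainers(Node root) {
    Set<Integer> result = new HashSet<>();
    Deque<Frame> stack = new ArrayDeque<>();
    stack.push(new Frame(root, false));
    while (!stack.isEmpty()) {
      Frame frame = stack.pop();
      if (frame.inContainer) {
        result.add(frame.node.id);
      }
      boolean childFlag = frame.inContainer
          || frame.node.kind == Kind.LIST
          || frame.node.kind == Kind.MAP;
      for (Node child : frame.node.children) {
        stack.push(new Frame(child, childFlag));
      }
    }
    return result;
  }
}
```

Each node is still pushed and popped exactly once, so the work stays O(N); the only difference from the recursive form is that nesting depth is bounded by heap space rather than by the thread's stack size.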
I agree about the complexity of traversing a schema not being a huge concern. I think the main value of this is to simplify the code to make it easier to maintain.
Here's the version I came up with, which is a bit simpler:
  private static Set<Integer> statsColumns(TypeDescription schema) {
    return OrcSchemaWithTypeVisitor.visit((Type) null, schema, new StatsColumnsVisitor());
  }

  private static class StatsColumnsVisitor extends OrcSchemaWithTypeVisitor<Set<Integer>> {
    @Override
    public Set<Integer> record(Types.StructType s, TypeDescription record, List<String> names, List<Set<Integer>> fields) {
      ImmutableSet.Builder<Integer> result = ImmutableSet.builder();
      fields.stream().filter(Objects::nonNull).forEach(result::addAll);
      record.getChildren().stream().map(ORCSchemaUtil::fieldId).forEach(result::add);
      return result.build();
    }

    @Override
    public Set<Integer> list(Types.ListType l, TypeDescription array, Set<Integer> element) {
      return null;
    }

    @Override
    public Set<Integer> map(Types.MapType m, TypeDescription map, Set<Integer> key, Set<Integer> value) {
      return null;
    }

    @Override
    public Set<Integer> primitive(Type.PrimitiveType p, TypeDescription primitive) {
      return null;
    }
  }

Using this would avoid the need to negate the check, so it would be if (statsColumns.contains(icebergId)) {...}.
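To make the null-return convention in the visitor concrete, here is a rough, dependency-free sketch of the same bottom-up idea. It assumes the visitor processes children before their parent, and it uses hypothetical `Node` and `Kind` stand-ins with integer ids instead of `TypeDescription` and `ORCSchemaUtil.fieldId`. A struct collects its direct children's ids plus whatever its struct children collected; lists, maps, and primitives return null, which discards everything nested beneath a list or map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class StatsColumnsSketch {
  enum Kind { STRUCT, LIST, MAP, PRIMITIVE }

  // Hypothetical stand-in for an ORC TypeDescription node.
  static final class Node {
    final int id;
    final Kind kind;
    final List<Node> children;

    Node(int id, Kind kind, Node... children) {
      this.id = id;
      this.kind = kind;
      this.children = Arrays.asList(children);
    }
  }

  // Post-order traversal: children are processed before their parent.
  // Returns the ids eligible for stats under a struct, or null for
  // lists, maps, and primitives.
  static Set<Integer> statsColumns(Node node) {
    List<Set<Integer>> fields = new ArrayList<>();
    for (Node child : node.children) {
      fields.add(statsColumns(child));
    }
    if (node.kind != Kind.STRUCT) {
      return null; // drops anything collected below a list or map
    }
    Set<Integer> result = new TreeSet<>();
    for (Set<Integer> field : fields) {
      if (field != null) {
        result.addAll(field);
      }
    }
    for (Node child : node.children) {
      result.add(child.id);
    }
    return result;
  }
}
```

In this sketch a list or map column that is a direct struct field is itself included, while its element, key, and value columns are not. A caller can then test membership directly with `contains`, without negating the check.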
I thought about asking for a similar change when merging the original PR, but I thought someone would submit a patch for it. Thanks for taking the time to keep the code clean.
Test done: TestOrcMetrics passes