Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Jul 4, 2020

This simplifies the visitor used to determine which ORC columns should have stats stored in table metadata. It uses the new visitor introduced in #1140. This is an alternative implementation to #1112.

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@rdblue rdblue merged commit 2438f21 into apache:master Jul 5, 2020
@rdblue
Copy link
Contributor Author

rdblue commented Jul 5, 2020

Merged. Thanks for reviewing, @rdsr!

public Set<Integer> record(TypeDescription record, List<String> names, List<Set<Integer>> fields) {
ImmutableSet.Builder<Integer> result = ImmutableSet.builder();
fields.stream().filter(Objects::nonNull).forEach(result::addAll);
record.getChildren().stream().map(ORCSchemaUtil::fieldId).forEach(result::add);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to verify, this would fail if there's any column that does not have an Iceberg ID. Is that preferred to skipping the metrics instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. This is called when Iceberg wrote the file, so we should be able to assume the IDs are present, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ORC files are written by Iceberg it should be fine. I was thinking for the case of importing existing ORC files although we'd need to implement name mapping fallback strategy.

Copy link
Contributor

@edgarRd edgarRd Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this breaks pretty much any import of non-Iceberg ORC tables via SparkTableUtil.

cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants