Skip to content

Conversation

@RussellSpitzer
Copy link
Member

Previously we only had TypeUtil.select which would select all nested
elements of selected structs. This made it difficult to project empty
structs since they would have no selected leaf elements. The new
TypeUtil.project instead takes only the elements which are specifically
requested with no automatic selection of nested elements.

@RussellSpitzer RussellSpitzer requested a review from rdblue August 7, 2021 18:57
@github-actions github-actions bot added the API label Aug 7, 2021
PruneColumns(Set<Integer> selected) {
Preconditions.checkNotNull(selected, "Selected field ids cannot be null");
this.selected = selected;
this.onlySelected = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this(selected, false);?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep of course

/**
* Visits a schema and returns only those element's whose id's have been passed as selected
* @param selected ids of elements to return
* @param onlySelected whether to return only elements which have been selected, no sub elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording seems a bit unclear to me since the purpose of this class is to return only the selected fields. I think that it misses the context that this affects the fields that are selected when a struct itself is selected.

What about renaming this to selectFullStructs or selectEmptyStructs? Then the javadoc would be "whether explicitly selected structs should contain all fields or no fields".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sgtm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this and reading more of the PR, I think that the boolean should be selectFullTypes so that it is clear how maps and lists are handled. Additionally, when selectFullTypes is false, only struct IDs should be supported. Maps and lists should not be explicitly selected or else there is a problem.

} else {
selectedFields.add(
Types.NestedField.required(field.fieldId(), field.name(), projectedType, field.doc()));
boolean includeField = !onlySelected || selected.contains(field.fieldId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right. I think that the only thing we need to change is the logic for selecting a struct itself, in field. If a field makes it to this point because it is in fieldResults then it (or one of its children) was selected.

For example, if the schema is 1: id bigint, 2: location struct<3: lat float, 4: long float, 5: alt float> and selected includes {3, 4} then the schema returned must include both lat and long fields, even if location is explicitly selected and empty structs should be returned.

The behavior difference is in the case of selected={2, 3, 4}:

  • When selecting empty structs, the result should be 2: location struct<3: lat float, 4: long float>
  • When selecting full structs, the result should be 2: location struct<3: lat float, 4: long float, 5: alt float>

Similarly, if selected={2} the results should be:

  • When selecting empty structs, the result should be 2: location struct<>
  • When selecting full structs, the result should be 2: location struct<3: lat float, 4: long float, 5: alt float>

So I don't think we need to check onlySelected when considering fields. This should not be a way to prune explicitly selected fields.

this.selectFullTypes = selectFullTypes;
}

PruneColumns(Set<Integer> selected) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is package-private, can we remove it and always explicitly pass selectFullTypes?


PruneColumns(Set<Integer> selected) {
/**
* Visits a schema and returns only those element's whose id's have been passed as selected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may want to rephrase this because the part about "only those elements whose ids have been passed as selected" is a bit unclear to me. I would say "Visits a schema and returns only the fields selected by the ID set." Then in the description you could get more specific about selectFullTypes: "when true, selecting a nested type selects all subfields. otherwise, select produces empty structs." And you can also say that selecting maps and lists directly when selectFullTypes is enabled is ambiguous and therefore not allowed

return projectSelectedStruct(fieldResult);
} else {
Preconditions.checkArgument(selectFullTypes || !field.type().isNestedType(),
"Cannot select partial List or Map types explicitly");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this include the ID and field that was selected to give more context?

}

private ListType projectList(ListType list, Type elementResult) {
Preconditions.checkArgument(elementResult != null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add friendly error messages to all of the preconditions?

* Project extracts particular fields from a schema. Unlike
* {@link TypeUtil#select(Schema, Set)}, project will pick out only
* the fields enumerated, this means no sub fields will be selected
* unless they are explicitly requested.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: line wrapping here is odd. Could you wrap at 120 chars?

}

/**
* Project extracts particular fields from a schema. Unlike
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically have a short high-level description (hopefully on a single line) and then a paragraph break followed by other relevant information.

* Project extracts particular fields from a schema. Unlike
* {@link TypeUtil#select(Schema, Set)}, project will pick out only
* the fields enumerated, this means no sub fields will be selected
* unless they are explicitly requested.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might add "... unless they are explicitly requested. Structs that are explicitly projected are empty unless sub-fields are explicitly projected. Maps and lists cannot be explicitly selected in fieldIds."

return new Schema(result.fields());
}
}
return new Schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no fields are selected, but the schema had aliases, this will drop the aliases. Can you ensure we always retain the aliases?

Preconditions.checkNotNull(schema, "Schema cannot be null");

Types.StructType result = project(schema.asStruct(), fieldIds);
if (Objects.equals(schema.asStruct(), result)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: If schema is non-null from the check above, then schema.asStruct() must also be non-null. So you should be able to use equals instead if you want.

Preconditions.checkNotNull(fieldIds, "Field ids cannot be null");

Type result = visit(struct, new PruneColumns(fieldIds, false));
if (struct == result) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be .equals? We used to use identity, but there are build warnings so now we just use equals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this wasn't updated.

}

@Test
public void testProjectList() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.


@Test
public void testProjectMap() {
// We can't partially project keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that we don't because that changes key equality.

optional(100, "x", Types.IntegerType.get()),
optional(101, "y", Types.IntegerType.get())),
Types.StructType.of()))));
Schema actualDepthOne = TypeUtil.project(schema, Sets.newHashSet(10, 13, 14, 100, 101));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100 and 101 should be implicitly projected because any part of the key is projected.

}

@Test
public void testSelect() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

)))));

Schema actualDepthTwo = TypeUtil.project(schema, Sets.newHashSet(11, 12, 15, 17));
Schema actualDepthTwoChildren = TypeUtil.project(schema, Sets.newHashSet(11, 17));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another "Child only" projection here

optional(101, "y", Types.IntegerType.get())),
Types.StructType.of()))));
Schema actualDepthOne = TypeUtil.project(schema, Sets.newHashSet(10, 13, 14, 100, 101));
Schema actualDepthOneNoKeys = TypeUtil.project(schema, Sets.newHashSet(10, 13, 14));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit projection of keys here

* Visits a schema and returns only the fields selected by the id set.
* @param selected ids of elements to return
* @param selectFullTypes when true, selecting a nested type selects all subfields. When false selecting list or
* map types is undefined and forbidden.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I normally opt to explain behavior in the method description and not in argument descriptions. Argument descriptions should state what the argument controls, but not all the implications. Otherwise, the argument descriptions are very long and reading the description doesn't cover the full behavior of the method. I'd probably change this to just "when true, selecting a nested type selects all subfields". And add a paragraph above that explains the behavior for structs, maps, and lists.

// fields selected.
return fieldResult;
// This field wasn't selected but a subfield was so include that
return fieldResult;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this line didn't change except for an accidental new space.

return list;
} else if (list.isElementOptional()) {
return Types.ListType.ofOptional(list.elementId(), elementResult);
if (list.elementType().isStructType() && !selectFullTypes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be slightly different:

if (selectFullTypes) {
  return list;
} else if (list.elementType().isStructType()) {
  StructType projectedStruct = projectSelectedStruct(elementResult);
  return projectList(list, projectedStruct);
} else {
  Precondtions.checkArgument(list.elementType().isPrimitive(), "Cannot directly project list or map");
  return list;
}

I think that makes it more understandable because selectFullTypes is handled first and so we know that the remaining cases are not selecting full types. It also handles lists of lists because of the argument check. I think that this was allowing an inner list to be selected directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably add test cases for list of lists or map of maps.

return map;
} else if (map.isValueOptional()) {
return Types.MapType.ofOptional(map.keyId(), map.valueId(), map.keyType(), valueResult);
if (map.valueType().isStructType() && !selectFullTypes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I think this should handle selectFullTypes and then go into more cases.

public Type field(Types.NestedField field, Type fieldResult) {
if (selected.contains(field.fieldId())) {
return field.type();
if (field.type().isStructType() && !selectFullTypes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to below, it is probably going to be more readable later to handle selectFullTypes first and then move to other cases.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nearly there, but I think that a list of lists is going to allow lists to be directly selected.

@RussellSpitzer
Copy link
Member Author

@rdblue Added more tests and changed around the list/map/field if statement ordering, should be good to go now

Previously we only had TypeUtil.select which would select all nested
elements of selected structs. This made it difficult to project empty
structs since they would have no selected leaf elements. The new
TypeUtil.project instead takes only the elements which are specifically
requested with no automatic selection of nested elements.
} else {
return Types.ListType.ofRequired(list.elementId(), elementResult);
Preconditions.checkArgument(list.elementType().isPrimitiveType(),
"Cannot explicitly project List or Map types, Field %s of type %s was selected",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a "field" so much as a "list element"

} else {
return Types.MapType.ofRequired(map.keyId(), map.valueId(), map.keyType(), valueResult);
Preconditions.checkArgument(map.valueType().isPrimitiveType(),
"Cannot explicitly project List or Map types, Field %s of type %s was selected",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd replace "field" with "map value"

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I had a couple of nits in error messages, but this is ready to go.

@RussellSpitzer
Copy link
Member Author

Tests had the timeout issue again, it has to be unrelated to this patch since this patch doesn't actually change any other used apis so I'm going to merge. The suite also passed before I changed the error messages.

@RussellSpitzer RussellSpitzer merged commit 599ee22 into apache:master Sep 3, 2021
@RussellSpitzer RussellSpitzer deleted the TypeUtilProject branch September 3, 2021 02:08
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Nov 2, 2021
* Core: Add TypeUtil.project() for Explicit Projection

Previously we only had TypeUtil.select which would select all nested
elements of selected structs. This made it difficult to project empty
structs since they would have no selected leaf elements. The new
TypeUtil.project instead takes only the elements which are specifically
requested with no automatic selection of nested elements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants