-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: Change GetProjectedIds to Return all Ids #2953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e2b003a
ec0fc4b
6cb5b46
b85ba90
4f4e35a
2b7da89
38b627d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,7 +42,7 @@ public class StructProjection implements StructLike { | |
| */ | ||
| public static StructProjection create(Schema schema, Set<Integer> ids) { | ||
| StructType structType = schema.asStruct(); | ||
| return new StructProjection(structType, TypeUtil.select(structType, ids)); | ||
| return new StructProjection(structType, TypeUtil.project(structType, ids)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like there aren't any uses of this call, which is good. I agree that we probably want this to use |
||
| } | ||
|
|
||
| /** | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -296,7 +296,7 @@ private Schema lazyColumnProjection() { | |
| } | ||
| requiredFieldIds.addAll(selectedIds); | ||
|
|
||
| return TypeUtil.select(schema, requiredFieldIds); | ||
| return TypeUtil.project(schema, requiredFieldIds); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with this because it is the opposite of |
||
|
|
||
| } else if (context.projectedSchema() != null) { | ||
| return context.projectedSchema(); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,12 +19,14 @@ | |
|
|
||
| package org.apache.iceberg.avro; | ||
|
|
||
| import java.util.Collections; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Objects; | ||
| import java.util.Set; | ||
| import org.apache.avro.JsonProperties; | ||
| import org.apache.avro.Schema; | ||
| import org.apache.avro.Schema.Type; | ||
| import org.apache.avro.SchemaNormalization; | ||
| import org.apache.iceberg.mapping.NameMapping; | ||
| import org.apache.iceberg.relocated.com.google.common.base.Preconditions; | ||
|
|
@@ -81,15 +83,26 @@ public Schema record(Schema record, List<String> names, List<Schema> fields) { | |
|
|
||
| Schema fieldSchema = fields.get(field.pos()); | ||
| // All primitives are selected by selecting the field, but map and list | ||
| // types can be selected by projecting the keys, values, or elements. | ||
| // types can be selected by projecting the keys, values, or elements. Empty | ||
| // Structs can be selected by selecting the record itself instead of its children. | ||
| // This creates two conditions where the field should be selected: if the | ||
| // id is selected or if the result of the field is non-null. The only | ||
| // case where the converted field is non-null is when a map or list is | ||
| // selected by lower IDs. | ||
| if (selectedIds.contains(fieldId)) { | ||
| filteredFields.add(copyField(field, field.schema(), fieldId)); | ||
| if (fieldSchema != null) { | ||
| hasChange = true; // Sub-fields may be different | ||
| filteredFields.add(copyField(field, fieldSchema, fieldId)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure that I understand the reason for this change. Is this implementing the same change as the previous PR, but in the Avro It looks like if a struct field is selected and a sub-field is selected, then the selection for the struct isn't a full selection. But if a sub-field is not selected then the selection for the struct is a full selection. That doesn't make sense to me.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I'm thinking about this more, I think that the behavior in this class should always match
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to set
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we are actually fine here unless every field is selected because the logic for has change is a bit confusing. You either
Currently we have tests hitting 1 and 3 but not 2 :/
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks good now. |
||
| } else { | ||
| if (isRecord(field.schema())) { | ||
| hasChange = true; // Sub-fields are now empty | ||
| filteredFields.add(copyField(field, makeEmptyCopy(field.schema()), fieldId)); | ||
| } else { | ||
| filteredFields.add(copyField(field, field.schema(), fieldId)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, so in this case the field is a map, list, or primitive. Then we just follow the old behavior. Looks good to me since the Iceberg schema selection would fail. |
||
| } | ||
| } | ||
| } else if (fieldSchema != null) { | ||
| hasChange = true; | ||
| hasChange = true; // Sub-fields may be different | ||
| filteredFields.add(copyField(field, fieldSchema, fieldId)); | ||
| } | ||
| } | ||
|
|
@@ -259,6 +272,26 @@ private static Schema copyRecord(Schema record, List<Schema.Field> newFields) { | |
| return copy; | ||
| } | ||
|
|
||
| private boolean isRecord(Schema field) { | ||
| if (AvroSchemaUtil.isOptionSchema(field)) { | ||
| return AvroSchemaUtil.fromOption(field).getType().equals(Type.RECORD); | ||
| } else { | ||
| return field.getType().equals(Type.RECORD); | ||
| } | ||
| } | ||
|
|
||
| private static Schema makeEmptyCopy(Schema field) { | ||
| if (AvroSchemaUtil.isOptionSchema(field)) { | ||
| Schema innerSchema = AvroSchemaUtil.fromOption(field); | ||
| Schema emptyRecord = Schema.createRecord(innerSchema.getName(), innerSchema.getDoc(), innerSchema.getNamespace(), | ||
| innerSchema.isError(), Collections.emptyList()); | ||
| return AvroSchemaUtil.toOption(emptyRecord); | ||
| } else { | ||
| return Schema.createRecord(field.getName(), field.getDoc(), field.getNamespace(), field.isError(), | ||
| Collections.emptyList()); | ||
| } | ||
| } | ||
|
|
||
| private static Schema.Field copyField(Schema.Field field, Schema newSchema, Integer fieldId) { | ||
| Schema newSchemaReordered; | ||
| // if the newSchema is an optional schema, make sure the NULL option is always the first | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue here is selectNot doest not actually deselect children when a parent ID is not selected. Previously this is because getProjectedIDs (behind getIdsInternal) would not return parent struct ids, so removing it from the set of projectedIds would not do anything.
Now It will not work because removing a parentID still leaves all child IDs. We could fix this but it would be a change in behavior from the previous code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with the decision to not change the behavior of this method, even though the opposite of "select" behavior would be to fully remove a struct when its ID is passed in
fieldIds.But I don't think that
projectis quite correct either. Consider the example schema1: id bigint, 2: location struct<3: lat double, 4: long double>. Previously,selectNot(t, set(3, 4))would produce1: id bigintand omit the location entirely. Using project with the updatedGetProjectedIds, the projected ID set will be {1, 2, 3, 4} and not{1, 3, 4}. That would result in the same call producing1: id bigint, 2: location struct<>, which introduces a new bug because now there is an unexpected extra field.To clean this up, I think we need a version of
GetProjectedIdsthat doesn't select structs and uses the old behavior.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like the right behavior to me? Shouldn't you be required to explicitly omit the parent if you don't want the that element? Otherwise there would be no way to "selectNot" and only get back the empty struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote up these test cases, i'll run the full test suite to make sure this works with our other usages