-
Notifications
You must be signed in to change notification settings - Fork 2.9k
API: Change GetProjectedIds to Return all Ids #2953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2d6b3c3 to
9f0e776
Compare
| if (fieldResult == null) { | ||
| fieldIds.add(field.fieldId()); | ||
| } | ||
| fieldIds.add(field.fieldId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to my comment on the other PR. I don't think that we should return inner IDs other than struct IDs or else it isn't clear how lists and maps should be handled.
Do you know if anything needs to be done for ORC @RussellSpitzer? I'm helping out with ORC more going forward and if you're aware of anything that needs to be updated there, if you don't have time to update it, if you make an issue I'll see if I can grab it (or at the least we'll have the issue to track). If you're not sure then possibly you can add a unit test for ORC as well? If it fails, open an issue (or I will) and then we can follow up on it after. =) |
9f0e776 to
8be465d
Compare
| Set<Integer> projectedIds = getIdsInternal(struct); | ||
| projectedIds.removeAll(fieldIds); | ||
| return select(struct, projectedIds); | ||
| return project(struct, projectedIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue here is selectNot doest not actually deselect children when a parent ID is not selected. Previously this is because getProjectedIDs (behind getIdsInternal) would not return parent struct ids, so removing it from the set of projectedIds would not do anything.
Now It will not work because removing a parentID still leaves all child IDs. We could fix this but it would be a change in behavior from the previous code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with the decision to not change the behavior of this method, even though the opposite of "select" behavior would be to fully remove a struct when its ID is passed in fieldIds.
But I don't think that project is quite correct either. Consider the example schema 1: id bigint, 2: location struct<3: lat double, 4: long double>. Previously, selectNot(t, set(3, 4)) would produce 1: id bigint and omit the location entirely. Using project with the updated GetProjectedIds, the projected ID set will be {1, 2, 3, 4} and not {1, 3, 4}. That would result in the same call producing 1: id bigint, 2: location struct<>, which introduces a new bug because now there is an unexpected extra field.
To clean this up, I think we need a version of GetProjectedIds that doesn't select structs and uses the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like the right behavior to me? Shouldn't you be required to explicitly omit the parent if you don't want the that element? Otherwise there would be no way to "selectNot" and only get back the empty struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote up these test cases, i'll run the full test suite to make sure this works with our other usages
Schema schema = new Schema(
Lists.newArrayList(
required(1, "id", Types.LongType.get()),
required(2, "location", Types.StructType.of(
required(3, "lat", Types.DoubleType.get()),
required(4, "long", Types.DoubleType.get())
))));
Schema expectedNoPrimitive = new Schema(
Lists.newArrayList(
required(2, "location", Types.StructType.of(
required(3, "lat", Types.DoubleType.get()),
required(4, "long", Types.DoubleType.get())
))));
Schema actualNoPrimitve = TypeUtil.selectNot(schema, Sets.newHashSet(1));
Assert.assertEquals(expectedNoPrimitive.asStruct(), actualNoPrimitve.asStruct());
// Expected legacy behavior is to completely remove structs if their elements are removed
Schema expectedNoStructElements = new Schema(required(1, "id", Types.LongType.get()));
Schema actualNoStructElements = TypeUtil.selectNot(schema, Sets.newHashSet(3, 4));
Assert.assertEquals(expectedNoStructElements.asStruct(), actualNoStructElements.asStruct());
// Expected legacy behavior is to ignore selectNot on struct elements.
Schema actualNoStruct = TypeUtil.selectNot(schema, Sets.newHashSet(2));
Assert.assertEquals(schema.asStruct(), actualNoStruct.asStruct());
```
I don't think so, we don't have any custom ORC projection code as far as I can tell. It just uses the output of a TypeUtil.selectNot() iceberg/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java Lines 158 to 159 in 6809103
against the ORC schema to determine what to read and then uses that directly here : iceberg/orc/src/main/java/org/apache/iceberg/orc/ORC.java Lines 315 to 319 in 970e8aa
|
Previously getProjectedIds would only return leaf nodes and primitives that were selected. This made it impossible to return empty structs. To fix this we change the behavior to return all id's of required fields including structs. This in turn requires fixing the alternate PruneColumn methods for Avro and Parquet to respect that they will now have selected field ID's for non primtiive nodes. Previous use cases of TypeUtil.select are converted to TypeUtil.project, which inverses this new getProjecetedIds code.
I was working on this too late at night.
The changed behavior of getProjectedIds and "select" means that selectNot needs to be implemented with Project.
63d96e3 to
b85ba90
Compare
| public static StructProjection create(Schema schema, Set<Integer> ids) { | ||
| StructType structType = schema.asStruct(); | ||
| return new StructProjection(structType, TypeUtil.select(structType, ids)); | ||
| return new StructProjection(structType, TypeUtil.project(structType, ids)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there aren't any uses of this call, which is good. I agree that we probably want this to use project instead of select.
| requiredFieldIds.addAll(selectedIds); | ||
|
|
||
| return TypeUtil.select(schema, requiredFieldIds); | ||
| return TypeUtil.project(schema, requiredFieldIds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this because it is the opposite of GetProjectedIds used above.
| if (selectedIds.contains(fieldId)) { | ||
| filteredFields.add(copyField(field, field.schema(), fieldId)); | ||
| if (fieldSchema != null) { | ||
| filteredFields.add(copyField(field, fieldSchema, fieldId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I understand the reason for this change. Is this implementing the same change as the previous PR, but in the Avro PruneColumns?
It looks like if a struct field is selected and a sub-field is selected, then the selection for the struct isn't a full selection. But if a sub-field is not selected then the selection for the struct is a full selection. That doesn't make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm thinking about this more, I think that the behavior in this class should always match project. I doubt there's a case where we want select behavior, right? In that case, shouldn't the else case check whether the type is a record and create an empty record?
core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java
Outdated
Show resolved
Hide resolved
| public void testPartitionsTable() { | ||
| TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test"); | ||
| Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build()); | ||
| Table table = createTable(tableIdentifier, SCHEMA, SPEC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes needed? I don't have a problem with them, but I don't think they are required right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just a clean up I had in the original PR, I can split this into another pull request if needed
Have PruneColumns Avro mimic PruneColumns Iceberg Adjust TypeUtil.selectNot to better mimic the old behavior and added tests
|
@rdblue All fixed up |
| hasChange = true; | ||
| builder.addField(field); | ||
| } else { | ||
| builder.addField(originalField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I do the empty message only thing here as well? where we copy the struct to be empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean?
|
|
||
| private static Set<Integer> getIdsInternal(Type type) { | ||
| return visit(type, new GetProjectedIds()); | ||
| return getIdsInternal(type, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a lot of changes to remove this method? I wouldn't expect it since this is private. I'd probably prefer removing it, but up to you if it touches a bunch of unrelated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are pretty safe here, removing it. Only two changes in this file
| if (isRecord(field.schema())) { | ||
| filteredFields.add(copyField(field, makeEmptyCopy(field.schema()), fieldId)); | ||
| } else { | ||
| filteredFields.add(copyField(field, field.schema(), fieldId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so in this case the field is a map, list, or primitive. Then we just follow the old behavior. Looks good to me since the Iceberg schema selection would fail.
| if (selectedIds.contains(fieldId)) { | ||
| filteredFields.add(copyField(field, field.schema(), fieldId)); | ||
| if (fieldSchema != null) { | ||
| filteredFields.add(copyField(field, fieldSchema, fieldId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to set hasChange in the cases where we don't return field.schema() for the field, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are actually fine here unless every field is selected because the logic for has change is a bit confusing.
You either
- Have a change (Make a new record using the filtered fields) ( Return Changed Records)
- Have no change and filtered fields size is the same as the original number of fields ( Return Original Record)
- Have no change and filtered field size is not empty (Make a new record using the filtered fields) (Return changed record)
Currently we have tests hitting 1 and 3 but not 2 :/
I'll add the "hasChange" flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good now.
| builder.addField(field); | ||
| } else { | ||
| if (isStruct(originalField)) { | ||
| builder.addField(originalField.asGroupType().withNewFields(Collections.emptyList())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should hasChange be true here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the ordering of the statements under (field != null) so that they match the usages here
|
@RussellSpitzer, looking good! My one remaining concern is the |
|
@rdblue patched up, I think I got that covered. Should be set now. I also removed that private function as recommended |
| record.put("id", 34L); | ||
| Record location = new Record(record.getSchema().getField("location").schema()); | ||
| location.put("lat", 52.995143f); | ||
| location.put("long", -1.539054f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Odd location to choose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was already in the test suite and came in the Netflix original commit :) So you'll have to ask whoever wrote the first version.
| } | ||
|
|
||
| @Test | ||
| public void testEmptyStructRequiredProjection() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this identical to the test case above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see that the write schema has the struct as optional.
|
Thanks, @RussellSpitzer! |
|
Thanks @rdblue I know this one was a bit of a pain, but we should hopefully have all of our projection issues fixed now 🤞 |
|
🎉 Great work ! |
Previously getProjectedIds would only return leaf nodes and primitives that
were selected. This made it impossible to return empty structs. To fix this
we change the behavior to return all id's of required fields including structs.
This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes. Previous use cases of TypeUtil.select are converted to
TypeUtil.project, which inverses this new getProjecetedIds code.