API: Change GetProjectedIds to Return all Ids #2953

RussellSpitzer · 2021-08-07T18:58:40Z

Previously, getProjectedIds returned only the IDs of selected leaf nodes and
primitives. This made it impossible to return empty structs. To fix this,
we change the behavior to return the IDs of all required fields, including structs.
This in turn requires fixing the alternate PruneColumns methods for Avro
and Parquet so that they respect selected field IDs on non-primitive
nodes. Previous uses of TypeUtil.select are converted to
TypeUtil.project, which is the inverse of the new getProjectedIds code.
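
To make the difference concrete, here is a minimal sketch of the two ID-collection behaviors on a toy schema tree. It is not the Iceberg implementation: the `Field` class, the `projectedIds` helper, and the `includeStructIds` flag are hypothetical names used only for illustration, and lists and maps are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy schema model for illustration; this is NOT the Iceberg Types API.
public class ProjectedIdsSketch {

  static final class Field {
    final int id;
    final String name;
    final List<Field> children; // null for primitives, a (possibly empty) list for structs

    Field(int id, String name, List<Field> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }

    boolean isStruct() {
      return children != null;
    }
  }

  static Field primitive(int id, String name) {
    return new Field(id, name, null);
  }

  static Field struct(int id, String name, Field... children) {
    return new Field(id, name, new ArrayList<>(Arrays.asList(children)));
  }

  // includeStructIds = false mimics the old leaf-only result;
  // includeStructIds = true mimics the new result that also reports struct IDs.
  static Set<Integer> projectedIds(List<Field> fields, boolean includeStructIds) {
    Set<Integer> ids = new TreeSet<>();
    collect(fields, includeStructIds, ids);
    return ids;
  }

  private static void collect(List<Field> fields, boolean includeStructIds, Set<Integer> ids) {
    for (Field field : fields) {
      if (field.isStruct()) {
        if (includeStructIds) {
          ids.add(field.id);
        }
        collect(field.children, includeStructIds, ids);
      } else {
        ids.add(field.id);
      }
    }
  }

  static List<Field> exampleSchema() {
    return Arrays.asList(
        primitive(1, "id"),
        struct(2, "location", primitive(3, "lat"), primitive(4, "long")),
        struct(5, "empty"));
  }

  public static void main(String[] args) {
    System.out.println(projectedIds(exampleSchema(), false)); // [1, 3, 4]
    System.out.println(projectedIds(exampleSchema(), true));  // [1, 2, 3, 4, 5]
  }
}
```

In the leaf-only result the empty struct (ID 5) never appears, so a projection built from those IDs has no way to ask for it.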

rdblue · 2021-08-09T17:16:50Z

api/src/main/java/org/apache/iceberg/types/GetProjectedIds.java

-    if (fieldResult == null) {
-      fieldIds.add(field.fieldId());
-    }
+    fieldIds.add(field.fieldId());


This is related to my comment on the other PR. I don't think that we should return inner IDs other than struct IDs or else it isn't clear how lists and maps should be handled.

kbendick · 2021-08-10T06:57:36Z

This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes.

Do you know if anything needs to be done for ORC, @RussellSpitzer? I'm helping out with ORC more going forward. If you're aware of anything that needs to be updated there but don't have time to update it, please open an issue and I'll see if I can grab it (or at the least we'll have the issue to track).

If you're not sure then possibly you can add a unit test for ORC as well? If it fails, open an issue (or I will) and then we can follow up on it after. =)

RussellSpitzer · 2021-08-10T22:04:34Z

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

    Set<Integer> projectedIds = getIdsInternal(struct);
    projectedIds.removeAll(fieldIds);
-    return select(struct, projectedIds);
+    return project(struct, projectedIds);


One issue here is that selectNot does not actually deselect children when a parent ID is not selected. Previously, this was because getProjectedIds (behind getIdsInternal) would not return parent struct IDs, so removing a parent ID from the set of projectedIds would not do anything.

Now it will not work because removing a parent ID still leaves all of the child IDs. We could fix this, but it would be a change in behavior from the previous code.

I think I agree with the decision to not change the behavior of this method, even though the opposite of "select" behavior would be to fully remove a struct when its ID is passed in fieldIds.

But I don't think that project is quite correct either. Consider the example schema 1: id bigint, 2: location struct<3: lat double, 4: long double>. Previously, selectNot(t, set(3, 4)) would produce 1: id bigint and omit the location entirely. Using project with the updated GetProjectedIds, the projected ID set will be {1, 2, 3, 4} and not {1, 3, 4}. That would result in the same call producing 1: id bigint, 2: location struct<>, which introduces a new bug because now there is an unexpected extra field.

To clean this up, I think we need a version of GetProjectedIds that doesn't select structs and uses the old behavior.

That seems like the right behavior to me? Shouldn't you be required to explicitly omit the parent if you don't want that element? Otherwise there would be no way to "selectNot" and only get back the empty struct.

Wrote up these test cases; I'll run the full test suite to make sure this works with our other usages.

```java
Schema schema = new Schema(
    Lists.newArrayList(
        required(1, "id", Types.LongType.get()),
        required(2, "location", Types.StructType.of(
            required(3, "lat", Types.DoubleType.get()),
            required(4, "long", Types.DoubleType.get())
        ))));

Schema expectedNoPrimitive = new Schema(
    Lists.newArrayList(
        required(2, "location", Types.StructType.of(
            required(3, "lat", Types.DoubleType.get()),
            required(4, "long", Types.DoubleType.get())
        ))));

Schema actualNoPrimitve = TypeUtil.selectNot(schema, Sets.newHashSet(1));
Assert.assertEquals(expectedNoPrimitive.asStruct(), actualNoPrimitve.asStruct());

// Expected legacy behavior is to completely remove structs if their elements are removed
Schema expectedNoStructElements = new Schema(required(1, "id", Types.LongType.get()));
Schema actualNoStructElements = TypeUtil.selectNot(schema, Sets.newHashSet(3, 4));
Assert.assertEquals(expectedNoStructElements.asStruct(), actualNoStructElements.asStruct());

// Expected legacy behavior is to ignore selectNot on struct elements.
Schema actualNoStruct = TypeUtil.selectNot(schema, Sets.newHashSet(2));
Assert.assertEquals(schema.asStruct(), actualNoStruct.asStruct());
```
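
The disagreement in this thread can be reproduced without Iceberg on a toy model. The sketch below is only an illustration under assumptions: `Field`, `collectIds`, `project`, and the three-argument `selectNot` are hypothetical stand-ins, not the TypeUtil methods, and `project` here only follows the semantics described in the comments above (a struct whose own ID is in the set survives, possibly empty).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model for illustration only; these are NOT the Iceberg TypeUtil methods.
public class SelectNotSketch {

  static final class Field {
    final int id;
    final String name;
    final List<Field> children; // null for primitives, a (possibly empty) list for structs

    Field(int id, String name, List<Field> children) {
      this.id = id;
      this.name = name;
      this.children = children;
    }

    boolean isStruct() {
      return children != null;
    }

    @Override
    public String toString() {
      return id + ": " + name + (isStruct() ? " struct<" + render(children) + ">" : "");
    }
  }

  static String render(List<Field> fields) {
    List<String> parts = new ArrayList<>();
    for (Field field : fields) {
      parts.add(field.toString());
    }
    return String.join(", ", parts);
  }

  // Collects IDs; struct IDs are included only when includeStructIds is true.
  static void collectIds(List<Field> fields, boolean includeStructIds, Set<Integer> ids) {
    for (Field field : fields) {
      if (!field.isStruct() || includeStructIds) {
        ids.add(field.id);
      }
      if (field.isStruct()) {
        collectIds(field.children, includeStructIds, ids);
      }
    }
  }

  // "project" semantics: a struct keeps only its projected children and
  // survives (possibly empty) when its own ID is in the set.
  static List<Field> project(List<Field> fields, Set<Integer> ids) {
    List<Field> result = new ArrayList<>();
    for (Field field : fields) {
      if (!field.isStruct()) {
        if (ids.contains(field.id)) {
          result.add(field);
        }
      } else {
        List<Field> kept = project(field.children, ids);
        if (ids.contains(field.id) || !kept.isEmpty()) {
          result.add(new Field(field.id, field.name, kept));
        }
      }
    }
    return result;
  }

  // selectNot built as "collected IDs minus the removed ones, then project".
  static String selectNot(List<Field> schema, Set<Integer> removed, boolean includeStructIds) {
    Set<Integer> ids = new TreeSet<>();
    collectIds(schema, includeStructIds, ids);
    ids.removeAll(removed);
    return render(project(schema, ids));
  }

  static List<Field> exampleSchema() {
    Field lat = new Field(3, "lat", null);
    Field lon = new Field(4, "long", null);
    return Arrays.asList(
        new Field(1, "id", null),
        new Field(2, "location", Arrays.asList(lat, lon)));
  }

  public static void main(String[] args) {
    Set<Integer> latLong = new TreeSet<>(Arrays.asList(3, 4));
    System.out.println(selectNot(exampleSchema(), latLong, false)); // 1: id
    System.out.println(selectNot(exampleSchema(), latLong, true));  // 1: id, 2: location struct<>
  }
}
```

With leaf-only IDs the toy `selectNot` reproduces the legacy expectations in the test cases above; with struct IDs included, removing 3 and 4 leaves ID 2 behind and an empty `location` struct appears.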

RussellSpitzer · 2021-08-11T15:40:27Z

This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes.

Do you know if anything needs to be done for ORC @RussellSpitzer? I'm helping out with ORC more going forward and if you're aware of anything that needs to be updated there, if you don't have time to update it, if you make an issue I'll see if I can grab it (or at the least we'll have the issue to track).

If you're not sure then possibly you can add a unit test for ORC as well? If it fails, open an issue (or I will) and then we can follow up on it after. =)

I don't think so; we don't have any custom ORC projection code as far as I can tell. It just uses the output of a TypeUtil.selectNot()

iceberg/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java

Lines 158 to 159 in 6809103

    
    Schema readSchemaWithoutConstantAndMetadataFields = TypeUtil.selectNot(readSchema,
        Sets.union(idToConstant.keySet(), MetadataColumns.metadataFieldIds()));

against the ORC schema to determine what to read and then uses that directly here:

iceberg/orc/src/main/java/org/apache/iceberg/orc/ORC.java

Lines 315 to 319 in 970e8aa

    
    public <D> CloseableIterable<D> build() {
      Preconditions.checkNotNull(schema, "Schema is required");
      return new OrcIterable<>(file, conf, schema, nameMapping, start, length, readerFunc, caseSensitive, filter,
          batchedReaderFunc, recordsPerBatch);
    }

rdblue · 2021-09-13T19:23:43Z

api/src/main/java/org/apache/iceberg/util/StructProjection.java

  public static StructProjection create(Schema schema, Set<Integer> ids) {
    StructType structType = schema.asStruct();
-    return new StructProjection(structType, TypeUtil.select(structType, ids));
+    return new StructProjection(structType, TypeUtil.project(structType, ids));


Looks like there aren't any uses of this call, which is good. I agree that we probably want this to use project instead of select.

rdblue · 2021-09-13T19:26:26Z

core/src/main/java/org/apache/iceberg/BaseTableScan.java

      requiredFieldIds.addAll(selectedIds);

-      return TypeUtil.select(schema, requiredFieldIds);
+      return TypeUtil.project(schema, requiredFieldIds);


I agree with this because it is the opposite of GetProjectedIds used above.

rdblue · 2021-09-13T19:32:15Z

core/src/main/java/org/apache/iceberg/avro/PruneColumns.java

      if (selectedIds.contains(fieldId)) {
-        filteredFields.add(copyField(field, field.schema(), fieldId));
+        if (fieldSchema != null) {
+          filteredFields.add(copyField(field, fieldSchema, fieldId));


I'm not sure that I understand the reason for this change. Is this implementing the same change as the previous PR, but in the Avro PruneColumns?

It looks like if a struct field is selected and a sub-field is selected, then the selection for the struct isn't a full selection. But if a sub-field is not selected then the selection for the struct is a full selection. That doesn't make sense to me.

As I'm thinking about this more, I think that the behavior in this class should always match project. I doubt there's a case where we want select behavior, right? In that case, shouldn't the else case check whether the type is a record and create an empty record?

core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java

core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java

rdblue · 2021-09-13T19:34:14Z

spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java

  public void testPartitionsTable() {
    TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
-    Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build());
+    Table table = createTable(tableIdentifier, SCHEMA, SPEC);


Are these changes needed? I don't have a problem with them, but I don't think they are required right?

No, just a cleanup I had in the original PR. I can split this into another pull request if needed.

RussellSpitzer · 2021-09-16T19:37:02Z

@rdblue All fixed up

RussellSpitzer · 2021-09-16T19:38:22Z

parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java

+          hasChange = true;
+          builder.addField(field);
+        } else {
+          builder.addField(originalField);


Should I do the empty-message-only thing here as well, where we copy the struct to be empty?

What do you mean?

rdblue · 2021-09-19T19:35:40Z

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

+
  private static Set<Integer> getIdsInternal(Type type) {
-    return visit(type, new GetProjectedIds());
+    return getIdsInternal(type, true);


Is it a lot of changes to remove this method? I wouldn't expect it since this is private. I'd probably prefer removing it, but up to you if it touches a bunch of unrelated code.

We are pretty safe removing it here; it takes only two changes in this file.

rdblue · 2021-09-19T19:38:40Z

core/src/main/java/org/apache/iceberg/avro/PruneColumns.java

+          if (isRecord(field.schema())) {
+            filteredFields.add(copyField(field, makeEmptyCopy(field.schema()), fieldId));
+          } else {
+            filteredFields.add(copyField(field, field.schema(), fieldId));


Okay, so in this case the field is a map, list, or primitive. Then we just follow the old behavior. Looks good to me since the Iceberg schema selection would fail.
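
The branch discussed here can be sketched on a toy record tree. This is a simplified illustration under assumptions, not the Avro `Schema` API or Iceberg's PruneColumns: `Node`, `Kind`, `leaf`, and `prune` are hypothetical names, and lists are treated as opaque leaves with no nested selection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model for illustration; this is NOT the Avro Schema API or Iceberg's PruneColumns.
public class AvroPruneSketch {

  enum Kind { PRIMITIVE, LIST, RECORD }

  static final class Node {
    final int id;
    final Kind kind;
    final List<Node> fields; // record fields; empty for primitives and lists in this toy

    Node(int id, Kind kind, List<Node> fields) {
      this.id = id;
      this.kind = kind;
      this.fields = fields;
    }

    @Override
    public String toString() {
      if (kind != Kind.RECORD) {
        return String.valueOf(id);
      }
      List<String> parts = new ArrayList<>();
      for (Node field : fields) {
        parts.add(field.toString());
      }
      return id + "{" + String.join(",", parts) + "}";
    }
  }

  static Node leaf(int id, Kind kind) {
    return new Node(id, kind, Collections.<Node>emptyList());
  }

  // Returns the pruned record, or null when nothing beneath it is selected.
  static Node prune(Node node, Set<Integer> selectedIds) {
    if (node.kind != Kind.RECORD) {
      return null; // simplification: no nested selection inside primitives or lists
    }
    List<Node> filtered = new ArrayList<>();
    for (Node field : node.fields) {
      Node pruned = prune(field, selectedIds);
      if (selectedIds.contains(field.id)) {
        if (pruned != null) {
          filtered.add(pruned); // selected struct with selected sub-fields
        } else if (field.kind == Kind.RECORD) {
          // selected struct with no selected sub-fields: keep an empty copy
          filtered.add(new Node(field.id, Kind.RECORD, Collections.<Node>emptyList()));
        } else {
          filtered.add(field); // map, list, or primitive: keep the old behavior
        }
      } else if (pruned != null) {
        filtered.add(pruned); // unselected struct that has selected sub-fields
      }
    }
    return filtered.isEmpty() ? null : new Node(node.id, Kind.RECORD, filtered);
  }

  static Node exampleRecord() {
    Node location = new Node(2, Kind.RECORD,
        Arrays.asList(leaf(3, Kind.PRIMITIVE), leaf(4, Kind.PRIMITIVE)));
    return new Node(0, Kind.RECORD,
        Arrays.asList(leaf(1, Kind.PRIMITIVE), location, leaf(5, Kind.LIST)));
  }

  public static void main(String[] args) {
    Set<Integer> structOnly = new HashSet<>(Arrays.asList(1, 2, 5));
    System.out.println(prune(exampleRecord(), structOnly)); // 0{1,2{},5}
    Set<Integer> structAndLat = new HashSet<>(Arrays.asList(2, 3));
    System.out.println(prune(exampleRecord(), structAndLat)); // 0{2{3}}
  }
}
```

Selecting only the struct ID yields an empty record for `location`, while the list field is kept whole, which matches the "follow the old behavior" case for non-record types.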

rdblue · 2021-09-19T19:40:09Z

core/src/main/java/org/apache/iceberg/avro/PruneColumns.java

      if (selectedIds.contains(fieldId)) {
-        filteredFields.add(copyField(field, field.schema(), fieldId));
+        if (fieldSchema != null) {
+          filteredFields.add(copyField(field, fieldSchema, fieldId));


I think we need to set hasChange in the cases where we don't return field.schema() for the field, right?

I think we are actually fine here unless every field is selected, because the logic for hasChange is a bit confusing.

You either:

1. Have a change: make a new record using the filtered fields (return the changed record).
2. Have no change and the filtered fields size is the same as the original number of fields (return the original record).
3. Have no change and the filtered fields size is not empty: make a new record using the filtered fields (return the changed record).

Currently we have tests hitting 1 and 3 but not 2 :/
I'll add the "hasChange" flag.

This looks good now.
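
The three cases above, and why the flag matters when every field is kept, can be sketched as follows. This is a toy illustration only: `finish` is a hypothetical helper that models record fields as plain strings, not the actual PruneColumns code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the final "which record do we return?" decision; not the actual PruneColumns code.
public class HasChangeSketch {

  // Fields are plain strings here; returns null when nothing was kept.
  static List<String> finish(List<String> original, List<String> filtered, boolean hasChange) {
    if (hasChange) {
      return new ArrayList<>(filtered); // case 1: build a new record from the filtered fields
    } else if (filtered.size() == original.size()) {
      return original;                  // case 2: same field count, return the original record
    } else if (!filtered.isEmpty()) {
      return new ArrayList<>(filtered); // case 3: fewer fields, build a new record
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> original = Arrays.asList("id", "location struct<lat, long>");
    // Every field is kept, but the struct was replaced by an empty copy.
    List<String> filtered = Arrays.asList("id", "location struct<>");

    // Without the flag, case 2 returns the original record and the empty struct is lost.
    System.out.println(finish(original, filtered, false));
    // With the flag set, case 1 returns the record with the empty struct.
    System.out.println(finish(original, filtered, true));
  }
}
```

In this toy, forgetting to set the flag only goes wrong when the field count is unchanged, which is consistent with the observation that tests hitting cases 1 and 3 would not catch it.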

rdblue · 2021-09-19T19:41:25Z

parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java

+          builder.addField(field);
+        } else {
+          if (isStruct(originalField)) {
+            builder.addField(originalField.asGroupType().withNewFields(Collections.emptyList()));


Should hasChange be true here as well?

Changing the ordering of the statements under (field != null) so that they match the usages here

rdblue · 2021-09-19T19:42:59Z

@RussellSpitzer, looking good! My one remaining concern is the hasChange behavior. That may also be causing tests to not do what you expect.

RussellSpitzer · 2021-09-20T18:23:06Z

@rdblue patched up, I think I got that covered. Should be set now. I also removed that private function as recommended

rdblue · 2021-10-03T23:19:09Z

core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java

+    record.put("id", 34L);
+    Record location = new Record(record.getSchema().getField("location").schema());
+    location.put("lat", 52.995143f);
+    location.put("long", -1.539054f);


Odd location to choose.

It was already in the test suite and came in the Netflix original commit :) So you'll have to ask whoever wrote the first version.

https://github.com/apache/iceberg/blame/master/core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java#L205-L206

rdblue · 2021-10-03T23:20:25Z

core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java

+  }
+
+  @Test
+  public void testEmptyStructRequiredProjection() throws Exception {


Isn't this identical to the test case above?

Never mind, I see that the write schema has the struct as optional.

rdblue · 2021-10-03T23:24:06Z

Thanks, @RussellSpitzer!

RussellSpitzer · 2021-10-03T23:30:42Z

Thanks @rdblue I know this one was a bit of a pain, but we should hopefully have all of our projection issues fixed now 🤞

szehon-ho · 2021-10-05T17:20:26Z

🎉 Great work !

RussellSpitzer requested a review from rdblue August 7, 2021 18:58

github-actions bot added API core parquet spark labels Aug 7, 2021

RussellSpitzer force-pushed the GetAllProjectedIds branch from 2d6b3c3 to 9f0e776 Compare August 9, 2021 16:21

rdblue reviewed Aug 9, 2021

RussellSpitzer force-pushed the GetAllProjectedIds branch from 9f0e776 to 8be465d Compare August 10, 2021 20:38

RussellSpitzer commented Aug 10, 2021

RussellSpitzer added 4 commits September 7, 2021 11:10

Remove unceccessary checks

ec0fc4b

I was working on this too late at night.

Fix SelectNot

6cb5b46

The changed behavior of getProjectedIds and "select" means that selectNot needs to be implemented with Project.

Rebase Fix

b85ba90

RussellSpitzer force-pushed the GetAllProjectedIds branch from 63d96e3 to b85ba90 Compare September 7, 2021 16:59

rdblue reviewed Sep 13, 2021

core/src/test/java/org/apache/iceberg/TestSchemaUpdate.java

rdblue reviewed Sep 13, 2021

core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java

rdblue reviewed Sep 13, 2021

Reviewer Comments

4f4e35a

Have PruneColumns Avro mimic PruneColumns Iceberg
Adjust TypeUtil.selectNot to better mimic the old behavior and added tests

RussellSpitzer commented Sep 16, 2021

Copy Avro/Iceberg Prune behavior in Parquet

2b7da89

rdblue reviewed Sep 19, 2021

Fix HasChange behavior

38b627d

RussellSpitzer mentioned this pull request Sep 20, 2021

Fix Avro Pruning Bugs with ManifestEntries Table #1744

Closed

rdblue reviewed Oct 3, 2021

rdblue approved these changes Oct 3, 2021

rdblue merged commit fafe33a into apache:master Oct 3, 2021

RussellSpitzer deleted the GetAllProjectedIds branch October 3, 2021 23:57

kbendick pushed a commit to kbendick/iceberg that referenced this pull request Nov 2, 2021

API: Update GetProjectedIds to optionally include empty structs (apac…

1a83ac3

…he#2953)

kbendick mentioned this pull request Nov 2, 2021

Investigate amount of work needed to backport #3240 to 0.12.1 #3443

Closed

RussellSpitzer mentioned this pull request Nov 24, 2021

Spark dataframe count on entries metadata table throws an IllegalArgumentException for partitioned tables #1378

Closed

ajantha-bhat mentioned this pull request Jul 19, 2023

Parquet: Remove duplicate test code #8098

Merged
