-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix Avro Pruning Bugs with ManifestEntries Table #1744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
8c6392b
f62148e
e0951cd
e4df112
02c4a69
6a74c4a
3b4689a
8352782
1b8f300
50a3edd
9e345c5
f2dd098
3168630
49da4d2
b0ba4f7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,6 +32,7 @@ | |
| import org.apache.iceberg.types.Comparators; | ||
| import org.apache.iceberg.types.Types; | ||
| import org.junit.Assert; | ||
| import org.junit.Assume; | ||
| import org.junit.Rule; | ||
| import org.junit.Test; | ||
| import org.junit.rules.TemporaryFolder; | ||
|
|
@@ -526,4 +527,228 @@ public void testListOfStructsProjection() throws IOException { | |
| AssertHelpers.assertEmptyAvroField(projectedP2, "y"); | ||
| Assert.assertNull("Should project null z", projectedP2.get("z")); | ||
| } | ||
|
|
||
| @Test | ||
| public void testEmptyStructProjection() throws Exception { | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.optional(3, "location", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.required(2, "long", Types.FloatType.get()) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record location = new Record( | ||
| AvroSchemaUtil.fromOption(record.getSchema().getField("location").schema())); | ||
| location.put("lat", 52.995143f); | ||
| location.put("long", -1.539054f); | ||
| record.put("location", location); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(3, "location", Types.StructType.of()) | ||
| ); | ||
|
|
||
| Record projected = writeAndRead("empty_proj", writeSchema, emptyStruct, record); | ||
| AssertHelpers.assertEmptyAvroField(projected, "id"); | ||
| Record result = (Record) projected.get("location"); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should access the fields that are present by position as well. The expected position for this is 0, so asserting that the same record is returned would be a good check.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wouldn't this be the same as checking if "id" is not projected? Maybe that's a better check? I can add that in too |
||
| Assert.assertEquals("location should be in the 0th position", result, projected.get(0)); | ||
| Assert.assertNotNull("Should contain an empty record", result); | ||
| AssertHelpers.assertEmptyAvroField(result, "lat"); | ||
| AssertHelpers.assertEmptyAvroField(result, "long"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testEmptyStructRequiredProjection() throws Exception { | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.required(3, "location", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.required(2, "long", Types.FloatType.get()) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record location = new Record(record.getSchema().getField("location").schema()); | ||
| location.put("lat", 52.995143f); | ||
| location.put("long", -1.539054f); | ||
| record.put("location", location); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(3, "location", Types.StructType.of()) | ||
| ); | ||
|
|
||
| Record projected = writeAndRead("empty_req_proj", writeSchema, emptyStruct, record); | ||
| AssertHelpers.assertEmptyAvroField(projected, "id"); | ||
| Record result = (Record) projected.get("location"); | ||
| Assert.assertEquals("location should be in the 0th position", result, projected.get(0)); | ||
| Assert.assertNotNull("Should contain an empty record", result); | ||
| AssertHelpers.assertEmptyAvroField(result, "lat"); | ||
| AssertHelpers.assertEmptyAvroField(result, "long"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testRequiredEmptyStructInRequiredStruct() throws Exception { | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.required(3, "location", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.required(2, "long", Types.FloatType.get()), | ||
| Types.NestedField.required(4, "empty", Types.StructType.of()) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record location = new Record(record.getSchema().getField("location").schema()); | ||
| location.put("lat", 52.995143f); | ||
| location.put("long", -1.539054f); | ||
| record.put("location", location); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.required(3, "location", Types.StructType.of( | ||
| Types.NestedField.required(4, "empty", Types.StructType.of()) | ||
| )) | ||
| ); | ||
|
|
||
| Record projected = writeAndRead("req_empty_req_proj", writeSchema, emptyStruct, record); | ||
| Assert.assertEquals("Should project id", 34L, projected.get("id")); | ||
| Record result = (Record) projected.get("location"); | ||
| Assert.assertEquals("location should be in the 1st position", result, projected.get(1)); | ||
| Assert.assertNotNull("Should contain an empty record", result); | ||
| AssertHelpers.assertEmptyAvroField(result, "lat"); | ||
| AssertHelpers.assertEmptyAvroField(result, "long"); | ||
| Assert.assertNotNull("Should project empty", result.getSchema().getField("empty")); | ||
| Assert.assertNotNull("Empty should not be null", result.get("empty")); | ||
| Assert.assertEquals("Empty should be empty", 0, | ||
| ((Record) result.get("empty")).getSchema().getFields().size()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testEmptyNestedStructProjection() throws Exception { | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.optional(3, "outer", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.optional(2, "inner", Types.StructType.of( | ||
| Types.NestedField.required(5, "lon", Types.FloatType.get()) | ||
| ) | ||
| ) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record outer = new Record( | ||
| AvroSchemaUtil.fromOption(record.getSchema().getField("outer").schema())); | ||
| Record inner = new Record(AvroSchemaUtil.fromOption(outer.getSchema().getField("inner").schema())); | ||
| inner.put("lon", 32.14f); | ||
| outer.put("lat", 52.995143f); | ||
| outer.put("inner", inner); | ||
| record.put("outer", outer); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(3, "outer", Types.StructType.of( | ||
| Types.NestedField.required(2, "inner", Types.StructType.of()) | ||
| ))); | ||
|
|
||
| Record projected = writeAndRead("nested_empty_proj", writeSchema, emptyStruct, record); | ||
| AssertHelpers.assertEmptyAvroField(projected, "id"); | ||
| Record outerResult = (Record) projected.get("outer"); | ||
| Assert.assertEquals("Outer should be in the 0th position", outerResult, projected.get(0)); | ||
| Assert.assertNotNull("Should contain the outer record", outerResult); | ||
| AssertHelpers.assertEmptyAvroField(outerResult, "lat"); | ||
| Record innerResult = (Record) outerResult.get("inner"); | ||
| Assert.assertEquals("Inner should be in the 0th position", innerResult, outerResult.get(0)); | ||
| Assert.assertNotNull("Should contain the inner record", innerResult); | ||
| AssertHelpers.assertEmptyAvroField(innerResult, "lon"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testEmptyNestedStructRequiredProjection() throws Exception { | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.required(3, "outer", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.required(2, "inner", Types.StructType.of( | ||
| Types.NestedField.required(5, "lon", Types.FloatType.get()) | ||
| ) | ||
| ) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record outer = new Record(record.getSchema().getField("outer").schema()); | ||
| Record inner = new Record(outer.getSchema().getField("inner").schema()); | ||
| inner.put("lon", 32.14f); | ||
| outer.put("lat", 52.995143f); | ||
| outer.put("inner", inner); | ||
| record.put("outer", outer); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(3, "outer", Types.StructType.of( | ||
| Types.NestedField.required(2, "inner", Types.StructType.of()) | ||
| ))); | ||
|
|
||
| Record projected = writeAndRead("nested_empty_req_proj", writeSchema, emptyStruct, record); | ||
| AssertHelpers.assertEmptyAvroField(projected, "id"); | ||
| Record outerResult = (Record) projected.get("outer"); | ||
| Assert.assertEquals("Outer should be in the 0th position", outerResult, projected.get(0)); | ||
| Assert.assertNotNull("Should contain the outer record", outerResult); | ||
| AssertHelpers.assertEmptyAvroField(outerResult, "lat"); | ||
| Record innerResult = (Record) outerResult.get("inner"); | ||
| Assert.assertEquals("Inner should be in the 0th position", innerResult, outerResult.get(0)); | ||
| Assert.assertNotNull("Should contain the inner record", innerResult); | ||
| AssertHelpers.assertEmptyAvroField(innerResult, "lon"); | ||
| } | ||
|
|
||
| @Test | ||
| public void testNonExistentProjection() throws Exception { | ||
| Assume.assumeFalse("Bug in pruning code will make the names not match when name mapping applied", | ||
| this.getClass().getName().equals(TestAvroNameMapping.class.getName())); | ||
| // TODO Purning code keeps records whose subfields have changed even if those fields are not required, | ||
| // this means BuildAvroProjection builds a r_Named "foo" because "location" is kept in the pruned schema | ||
| // even though it should not be. Otherwise location would be missing and the foo field would be returned | ||
| // "foo" since it is a subfield of required field being built rather than the field being built. | ||
| Schema writeSchema = new Schema( | ||
| Types.NestedField.required(0, "id", Types.LongType.get()), | ||
| Types.NestedField.optional(3, "location", Types.StructType.of( | ||
| Types.NestedField.required(1, "lat", Types.FloatType.get()), | ||
| Types.NestedField.required(2, "long", Types.FloatType.get()) | ||
| )) | ||
| ); | ||
|
|
||
| Record record = new Record(AvroSchemaUtil.convert(writeSchema, "table")); | ||
| record.put("id", 34L); | ||
| Record location = new Record( | ||
| AvroSchemaUtil.fromOption(record.getSchema().getField("location").schema())); | ||
| location.put("lat", 52.995143f); | ||
| location.put("long", -1.539054f); | ||
| record.put("location", location); | ||
|
|
||
| Schema emptyStruct = new Schema( | ||
| Types.NestedField.required(3, "location", Types.StructType.of( | ||
| Types.NestedField.optional(10000, "foo", Types.StructType.of( | ||
| Types.NestedField.optional(10001, "bar", Types.IntegerType.get()) | ||
| )) | ||
| )) | ||
| ); | ||
|
|
||
| Record projected = writeAndRead("non_existant_proj", writeSchema, emptyStruct, record); | ||
| AssertHelpers.assertEmptyAvroField(projected, "id"); | ||
| Record result = (Record) projected.get("location"); | ||
| Assert.assertEquals("location should be in the 0th position", result, projected.get(0)); | ||
| Assert.assertNotNull("Should contain an fake optional record", result); | ||
| AssertHelpers.assertEmptyAvroField(result, "lat"); | ||
| AssertHelpers.assertEmptyAvroField(result, "long"); | ||
| Assert.assertNotNull("Schema should contain foo", result.getSchema().getField("foo")); | ||
| Assert.assertNull("foo should be null since it is not present in the data", result.get("foo")); | ||
| Assert.assertNotNull("Schema should contain foo.bar", | ||
| AvroSchemaUtil.fromOption(result.getSchema().getField("foo").schema()) | ||
| .getField("bar")); | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the second bug mentioned, If you are pruning
X : Struct { Y : Struct<>, Z: Int, ..... }And don't request "X" or "Y"
Y Has no fields so previously
filteredFields.size() = 0 = record.getFields().size()which means we returnY as a required field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what keeps "data_file" from being pruned on Unpartitioned tables (since partitionType would be an empty struct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a different operation that is the opposite of
GetProjectedIds, likeProjectFromIds.TypeUtil.selectuses this class,PruneColumns, but it has behavior like a SQLSELECT. If I have a schemaa int, b struct<x double, y double>, c stringand I selectb, then everything underneathbis selected, which is what you'd expect fromSELECT b FROM table.If we were to update
GetProjectedIdswith the logic above, then projectingb struct<>(which you can't do by naming columns) would actually result in the full struct getting projected because of the logic here that selects all ofb. This class cannot be used to reconstruct a schema using the result ofGetProjectedIds.I think that we also need a
BuildProjectionthat does the opposite ofGetProjectedIdswith the update to add empty structs. Then the datum reader could use that logic to prune the Avro schema and get an exact match with the expected schema.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah I forgot about that :/ yeah we'll need an inverse sort of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to figure out if there was a simpler way to do this and I ended up modifying BuildAvroProjection. In my mind the real problem here was we were telling BuildAvroProjection what columns we wanted, but our pruned schema really only showed us what columns we needed that have data. This only left leaf empty-struct nodes. So during BuildAvroProjection we check for these empty-structs being requested in the expected schema and just add them back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a reasonable argument that
BuildAvroProjectionshould do what your describe, since it is creating the final schema that will be requested from Avro.If that's adding back the empty structs, then why change the behavior of
PruneColumnshere?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is the change to fix bug 2,
Example of Bug 2
The issue being that a struct which contains no elements naturally will always be included because
records.filedsize == 0 == filteredFields.size == 0Because the field is included, all parent structs which include it are also included in the projection. For example
I think that behavior is incorrect, and this should be pruned out if it isn't needed in the projection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant to ask this on the block above.