Fix Avro Pruning Bugs with ManifestEntries Table #1744
```java
@Test
public void testFilesTable() throws Exception {
  TableIdentifier tableIdentifier = TableIdentifier.of("db", "files_test");
  Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build());
```
Minor test cleanup here to use a constant partition spec like other tests
```diff
  if (hasChange) {
    return copyRecord(record, filteredFields);
- } else if (filteredFields.size() == record.getFields().size()) {
+ } else if (record.getFields().size() != 0 && filteredFields.size() == record.getFields().size()) {
```
This is the second bug mentioned. If you are pruning

```
X : Struct { Y : Struct<>, Z : Int, ... }
```

and don't request "X" or "Y": Y has no fields, so previously `filteredFields.size() == 0 == record.getFields().size()`, which meant we returned Y as a required field.
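To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual `PruneColumns` code) of why an empty struct always passed the "nothing was filtered out" check, and how the changed condition excludes it:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the keep/drop decision in PruneColumns: a record
// is returned unmodified when the filtered field list matches the original
// field list in size.
class PruneCheckSketch {
    // Old check: an empty struct (0 fields) trivially satisfies
    // filteredFields.size() == fields.size(), so it is always kept.
    static boolean keptByOldCheck(List<?> fields, List<?> filteredFields) {
        return filteredFields.size() == fields.size();
    }

    // Fixed check: an empty struct no longer matches, so it survives only
    // when something else (an explicit request) keeps it.
    static boolean keptByNewCheck(List<?> fields, List<?> filteredFields) {
        return fields.size() != 0 && filteredFields.size() == fields.size();
    }

    public static void main(String[] args) {
        List<?> empty = Collections.emptyList();
        // Y : Struct<> was never requested, yet the old check keeps it.
        System.out.println(keptByOldCheck(empty, empty)); // true
        System.out.println(keptByNewCheck(empty, empty)); // false
    }
}
```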
This is what keeps "data_file" from being pruned on unpartitioned tables (since partitionType would be an empty struct).
I think we need a different operation that is the opposite of GetProjectedIds, like ProjectFromIds.
TypeUtil.select uses this class, PruneColumns, but it has behavior like a SQL SELECT. If I have a schema a int, b struct<x double, y double>, c string and I select b, then everything underneath b is selected, which is what you'd expect from SELECT b FROM table.
If we were to update GetProjectedIds with the logic above, then projecting b struct<> (which you can't do by naming columns) would actually result in the full struct getting projected because of the logic here that selects all of b. This class cannot be used to reconstruct a schema using the result of GetProjectedIds.
I think that we also need a BuildProjection that does the opposite of GetProjectedIds with the update to add empty structs. Then the datum reader could use that logic to prune the Avro schema and get an exact match with the expected schema.
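A rough, self-contained model of that SELECT-like behavior (a toy mini-schema with hypothetical names, not the real `TypeUtil.select` API): selecting a struct's id pulls in everything beneath it, so a set of ids has no way to express "b, but with no children":

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy schema node: an id plus children. Selecting an id keeps that node and,
// like SQL's SELECT b, everything underneath it.
class SelectSketch {
    record Field(int id, List<Field> children) {}

    // Returns the ids that survive projecting `selected` from `fields`.
    static Set<Integer> project(List<Field> fields, Set<Integer> selected) {
        Set<Integer> kept = new LinkedHashSet<>();
        for (Field f : fields) {
            if (selected.contains(f.id())) {
                keepAll(f, kept);              // whole subtree comes along
            } else {
                Set<Integer> childKept = project(f.children(), selected);
                if (!childKept.isEmpty()) {
                    kept.add(f.id());          // parent kept for its children
                    kept.addAll(childKept);
                }
            }
        }
        return kept;
    }

    private static void keepAll(Field f, Set<Integer> kept) {
        kept.add(f.id());
        f.children().forEach(c -> keepAll(c, kept));
    }

    public static void main(String[] args) {
        // a int (1), b struct<x (3), y (4)> (2), c string (5)
        List<Field> schema = List.of(
            new Field(1, List.of()),
            new Field(2, List.of(new Field(3, List.of()), new Field(4, List.of()))),
            new Field(5, List.of()));
        // Selecting just {2} still yields [2, 3, 4]: "b struct<>" is unrepresentable.
        System.out.println(project(schema, Set.of(2)));
    }
}
```

This is why the projected-id set alone cannot round-trip a schema that contains an explicitly empty struct.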
Oh yeah I forgot about that :/ yeah we'll need an inverse sort of thing.
I tried to figure out if there was a simpler way to do this, and I ended up modifying BuildAvroProjection. In my mind the real problem here was that we were telling BuildAvroProjection what columns we wanted, but our pruned schema really only showed us what columns we needed that have data. This left out only leaf empty-struct nodes. So during BuildAvroProjection we check for these empty structs being requested in the expected schema and just add them back.
I think there is a reasonable argument that BuildAvroProjection should do what you describe, since it is creating the final schema that will be requested from Avro.
If that's adding back the empty structs, then why change the behavior of PruneColumns here?
This line is the change to fix bug 2.

Example of Bug 2:

```
From    (A, B, C, D { foo, bar, baz {} })  // Starting with D containing a field baz which is empty
Project (A)
Returns (A, D { baz {} })                  // Includes D even though it was not requested
```

The issue is that a struct which contains no elements will naturally always be included, because `record.getFields().size() == 0 == filteredFields.size()`. Because the field is included, all parent structs which contain it are also included in the projection. For example:

```
From    (A, B { C { D { E { F {}, G } } } })
Project (A)
Returns (A, B { C { D { E { F {} } } } })
```

I think that behavior is incorrect, and this should be pruned out if it isn't needed in the projection.
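The bubbling-up can be sketched with a tiny recursive model (hypothetical names; the real logic lives in `PruneColumns`): the empty struct F{} survives the old size check, and each ancestor is then kept because it has a surviving child:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class BubbleUpSketch {
    record Field(int id, List<Field> children, boolean isStruct) {}

    // Old rule: a struct whose filtered children match its children in size
    // is kept as-is. An empty struct (0 == 0) therefore always survives, and
    // every ancestor is kept because it has a surviving child.
    static Optional<Field> pruneOld(Field f, Set<Integer> requested) {
        if (requested.contains(f.id())) {
            return Optional.of(f);
        }
        if (f.isStruct()) {
            List<Field> kept = new ArrayList<>();
            for (Field c : f.children()) {
                pruneOld(c, requested).ifPresent(kept::add);
            }
            if (kept.size() == f.children().size()) {
                return Optional.of(f);              // includes the empty-struct case
            }
            if (!kept.isEmpty()) {
                return Optional.of(new Field(f.id(), kept, true));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // B { C { F{}, G } } with nothing requested: F{} survives, so C and B do too.
        Field g = new Field(5, List.of(), false);
        Field fEmpty = new Field(4, List.of(), true);
        Field c = new Field(3, List.of(fEmpty, g), true);
        Field b = new Field(2, List.of(c), true);
        System.out.println(pruneOld(b, Set.of()).isPresent()); // true: B wrongly kept
    }
}
```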
Sorry, I meant to ask this on the block above.
```diff
  @Override
  public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) {
-   if (fieldResult == null) {
+   if (fieldResult == null || isEmptyStruct(field)) {
```
The error here previously is that an empty struct has no fields; since it has no fields, we would never add an ID representing it (normally this happens by referring to subfields).
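In other words (a self-contained toy model with hypothetical names, not the real GetProjectedIds visitor): struct fields are represented only through their sub-field ids, so a struct with no fields contributes nothing and silently disappears:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ProjectedIdsSketch {
    record Field(int id, List<Field> children, boolean isStruct) {}

    // Ids are gathered from sub-fields; a struct's own id is never added, so a
    // field-less struct vanishes from the projected-id set entirely.
    static Set<Integer> projectedIds(List<Field> fields) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (Field f : fields) {
            if (f.isStruct()) {
                ids.addAll(projectedIds(f.children())); // recurse; no id for the struct itself
            } else {
                ids.add(f.id());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // A data_file-like struct (id 2) holding an empty partition struct (id 3):
        Field partition = new Field(3, List.of(), true);
        Field dataFile = new Field(2, List.of(partition), true);
        System.out.println(projectedIds(List.of(dataFile))); // [] - both structs vanish
    }
}
```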
Removed this change, leaving the API behavior exactly the same; now instead we add the empty projections when doing the projection itself.
For reference later, the problem here is that Avro's

Also for reference, the other issue we are dealing with is that "PruneColumns" in the current implementation will always select any record whose sub-records are empty structs, regardless of the pruned IDs.
The automated tests failed on the last run, though they work locally on my MacBook :/ but oddly enough not the direct schema projection tests, only the metadata table test ... maybe it won't break this time?

It was the merge, something in master has broken my tests :/
```java
  hasChange = true;
  filteredFields.add(copyField(field, fieldSchema, fieldId));
} else if (emptyStructIds.contains(fieldId)) {
  // This field does not require any known sub-fields but is required in the projection so keep it without
```
This is the pruning-side fix to Bug 1, basically making sure we add the parents of any fields in the projection even if none of their real (in the file) fields are projected.

For example, without this fix:

```
From    (A, B { C })
Project (A, B { D.Optional })
Returns (A)
```
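A small self-contained sketch of that parent-retention rule (hypothetical structures, not the real visitor): a field with no surviving sub-fields is still kept, emptied, when its id is in the explicitly requested empty-struct set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class ParentRetentionSketch {
    record Field(int id, List<Field> children) {}

    static Optional<Field> prune(Field field, Set<Integer> projectedIds, Set<Integer> emptyStructIds) {
        List<Field> kept = new ArrayList<>();
        for (Field child : field.children()) {
            prune(child, projectedIds, emptyStructIds).ifPresent(kept::add);
        }
        if (!kept.isEmpty() || projectedIds.contains(field.id())) {
            return Optional.of(new Field(field.id(), kept));
        }
        if (emptyStructIds.contains(field.id())) {
            // Requested, but none of its known sub-fields survive:
            // keep it as an empty struct instead of dropping it.
            return Optional.of(new Field(field.id(), List.of()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // B (id 2) contains only C (id 3); the projection wants B for a field
        // that is not in the file, so B's id is in emptyStructIds.
        Field b = new Field(2, List.of(new Field(3, List.of())));
        System.out.println(prune(b, Set.of(), Set.of(2)).isPresent()); // true: B kept as B {}
    }
}
```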
```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

class GetEmptyStructIds extends TypeUtil.SchemaVisitor<Set<Integer>> {
```
This Visitor is used to find the FieldIDs of any fields which must remain in the projection but, if they have no sub-fields, should be included as empty structs.

There are 3 main cases of this:

1. The struct is empty.
2. The struct contains required MetadataColumns; these are not real, so we cannot properly include their parents in the normal path. Currently the only one of these allowed is MetadataColumns.ROW_POSITION, see core/src/main/java/org/apache/iceberg/avro/BuildAvroProjection.java, lines 98 to 102 in 0743232:

```java
Preconditions.checkArgument(
    field.isOptional() || field.fieldId() == MetadataColumns.ROW_POSITION.fieldId(),
    "Missing required field: %s", field.name());
// Create a field that will be defaulted to null. We assign a unique suffix to the field
// to make sure that even if records in the file have the field it is not projected.
```

3. The struct contains optional columns which aren't part of the schema; these are allowed by the same code block mentioned above, but if their parents are pruned out we will have an error when we attempt to add them back to the projection.
```java
  return false;
}
List<Types.NestedField> fields = ((Types.StructType) field.type()).fields();
return fields.stream().allMatch(f -> f.fieldId() > Integer.MAX_VALUE - 201 || f.isOptional());
```
The magic number here is our block of reserved and metadata field IDs.
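For context, a standalone sketch of that predicate (hypothetical `Field` type; the cutoff mirrors the `fieldId() > Integer.MAX_VALUE - 201` check in the diff above): a struct counts as "potentially empty" when every field is either in the reserved metadata-id block or optional, since all of them may be absent from the file schema:

```java
import java.util.List;

class PotentiallyEmptySketch {
    record Field(int id, boolean optional) {}

    // A struct is potentially empty when every field could be pruned away:
    // reserved/metadata ids (above the cutoff described in the review) or
    // optional columns that may not exist in the file.
    static boolean isPotentiallyEmpty(List<Field> fields) {
        return fields.stream().allMatch(f -> f.id() > Integer.MAX_VALUE - 201 || f.optional());
    }

    public static void main(String[] args) {
        List<Field> fields = List.of(
            new Field(1, false),                      // required, real column: must exist
            new Field(Integer.MAX_VALUE - 1, false)); // reserved metadata id: may be absent
        System.out.println(isPotentiallyEmpty(fields)); // false - field 1 pins the struct
    }
}
```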
```java
Record projected = writeAndRead("empty_proj", writeSchema, emptyStruct, record);
Assert.assertNull("Should not project data", projected.get("data"));
Record result = (Record) projected.get("location");
```
I think we should access the fields that are present by position as well. The expected position for this is 0, so asserting that the same record is returned would be a good check.
Wouldn't this be the same as checking if "id" is not projected? Maybe that's a better check? I can add that in too.
```java
}

@Test
public void testEmptyNestedStructRequiredProjection() throws Exception {
```
This case looks right to me.
```java
  // This field does not require any known sub-fields but is required in the projection so keep it without
  // any of it's subfields
  hasChange = true;
  Schema empty = AvroSchemaUtil.removeFields(field);
```
Why use removeFields when this class already has copyRecord? Couldn't this just call copyRecord(field.schema(), ImmutableList.of())?
CopyRecord does not work on UnionTypes :(
```java
@Override
public Set<Integer> field(Types.NestedField field, Set<Integer> fieldResult) {
  if (isEmptyStruct(field) || isPotentiallyEmpty(field)) {
```
The potentially empty handling here seems strange to me. I think it would make more sense for the field to be added or fixed up in BuildAvroProjection instead of catching the case so that an empty struct is retained by PruneColumns.
I think that's another alternative. We could pass the original Avro schema through to BuildAvroProjection and just fall back to copying the field and removing its subfields if a record is missing but required.

In my mind "pruning" means removing everything that is unnecessary, but these fields are necessary, so fixing pruning seemed more natural to me than building them back in in BuildAvroProjection. Although we already do add more fields in the projection code for fields that don't exist at all.
Ok so I did a brief look at the BuildAvroProjection code to see any possible issues. The main difficulty is that we would need to simultaneously traverse the real Avro schema along with the one passed as "pruned". If we don't do this, we have no way of determining whether a required field is actually in the Avro schema or not, so we can't do proper correctness checks for whether an expected required field should actually be there.

Other than that, I think the implementation should be fairly straightforward: if we find a record that should be there but isn't, we build a new record based on the actual field, then call "visit" on that field. We will end up breaking the

Let me see how difficult this would be.
```java
  return "_x" + Integer.toHexString(character).toUpperCase();
}

public static Schema removeFields(Schema.Field field) {
```
The implementation here seems very specific to a case and not useful as a generic method. Maybe we should move it into the PruneColumns class.
Yep sure, it's not very common
@szehon-ho Let's work together to figure out a solution to this :)

Yes, as we talked offline, I ended up debugging the same thing as @RussellSpitzer while hitting #1378. My thought was that changing the PruneColumns or BuildAvroProjection behaviour would be a bit involved, so I took a different approach, which I put up as reference: #2395. It's a poor workaround of the problem, adding a non-empty struct when reading the metadata (manifest) Entries and All-Entries tables, though it seems to work for most cases. (The bug hits nested non-empty required structs that are pruned away; I can look for those if there are more.) I just put it up as an option in case there is no other good fix. Of course, if there is a proper fix, I'd be interested to follow/help wherever I can, though I heard from Russell that it is not very easy :) In any case, I would love to see a solution to the bug. We were trying to write an analytics job that calculated how much data lands in a table per time period, and thought to aggregate the entries table and join with the snapshots table (which has a timestamp), but it seems this bug prevents any aggregates on the entries table without ugly workarounds.
Previously we would crash whenever attempting to project only non-data_file columns from a partitioned Iceberg table. This would occur because our projection in ManifestReader would always require the "data_file" field, even if it was an empty struct. This was an issue because GetProjectedIDs can only retrieve ids from struct fields; if a struct is empty, it ignores the field, leading to a projection without the struct. This worked on unpartitioned tables because of a second bug in the column pruning, which would always include empty structs regardless of whether they were requested or not. This, in concert with the above bug, would allow unpartitioned tables to work: their data_file schema could never be empty because it contained a field (partition) which was an empty struct, so it would always be included. Here we fix both bugs by properly ignoring fields that were not requested, and correctly projecting empty structs when they are explicitly requested.

In order to fix all variations of empty-struct projections, we introduce a second traversal of the Avro schema which just identifies fields which need to be read as empty structs. We use these ids in parallel with those fields which are being actively included in the pruned schema, but instead of including them as-is, we remove all of their sub-fields before adding them. Since we are still following the normal PruneColumns logic, all required parents for the pruned column are added automatically. In addition, new tests are added to make sure that empty and nested empty projections work as expected.

Previously, if a user attempted to project a column which did not exist in the file schema, we would throw an error if it was required, or allow it if it was optional. But if this optional column was the only column in the parent struct, the parent struct would be ignored, so we could not add the child field. To fix this, we add structs which contain only metadata columns or optional columns to our list of structs to preserve, because they may be needed for child fields which don't actually exist.

Was previously replacing fields which had known subfields with empty versions. We should only do that if there are no known subfields.
…ma instead of preserved during Pruning
@rdblue + @szehon-ho - @karuppayya reminded me to fix this, with our hope that it would fix #2783. While I did fix this PR, it doesn't seem to have fixed the other issue. That said, I at least fixed these two issues :) I would appreciate a review so we can at least fix this portion.
If I remember correctly, the main problem here was a mismatch between the expected schema and the actual schema when reading from Avro files. That was caused by some unexpected behaviors across a few helpers:

Hopefully that captures the issues here. First, I think that To fix this, I propose the following:

I think that should fix this. To me, that seems like a couple of PRs: one to add Does that sound reasonable?
Yep, seems close to the original plan we had here. I'll keep the new tests and work on rebuilding the original GetIds/Project methods.
Fixing with #2953
Example of Bug 1

This bug also applies if D is projected containing optional columns which are not in the write schema, for example `D { Optional : otherColumn }`, or metadata columns, `D { MetadataColumn.FILE_POSITION }`.

Example of Bug 2

In combination these sometimes negate each other ~ kinda

Example of interaction of bugs