Fixes read metadata failed after dropped partition for V1 format #3411
Conversation
Hi @RussellSpitzer @rdblue @openinx @jackye1995, could you help to review this? Thanks a lot.
OK, I think I understand the issue now. This is specifically an error where the attempt to build a common partition spec builds a spec using void transforms in V1 tables; the void column has a different return type than the original partition transform and throws an error. Is that correct? I would recommend renaming this issue/PR to match the underlying issue. I am also guessing that, because this is the issue, there is no problem with V2 tables. @aokolnychyi should probably also take a look.
I may be able to take a look within the next few days.
Table manifestsTable = new ManifestsTable(table.ops(), table);
TableScan scan = manifestsTable.newScan()
    .filter(Expressions.lessThan("length", 10000L));
Why are we filtering?
It could be removed. The test is adjusted from testManifestsTableAlwaysIgnoresResiduals.
Force-pushed ec22082 to 1905c5e
Thanks @RussellSpitzer for the review.
Yes, that's right.
Any suggestion for this? What about
Force-pushed 1905c5e to c8aab84
@RussellSpitzer, I am sorry for the late update. Thanks for the review. Comments have been addressed. Please take another look when you are free.
Sorry for the delay on my side. This PR is the next one I'll look into. I took a quick look and I think I remember the context.
Okay, I think I got the issue. Essentially, we don't know the correct type produced by the void transforms that we assign in v1 tables when a partition field is dropped. We assume the output type matches the source field type, but that is not always the case. Let me take a look at the solution on Monday.
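The type mismatch described here can be illustrated with a minimal, self-contained sketch (plain Java with hypothetical names, not the Iceberg API): a bucket transform always produces int partition values, while a void replacement that naively reuses the source column's type reports string.

```java
// Plain-Java illustration only; the real Iceberg Transform API differs.
// It shows why reusing the source column type for a void transform breaks
// the combined partition type in v1 tables.
public class VoidTypeMismatch {
  interface Transform {
    String resultType(String sourceType);
  }

  // bucket(N, col) always produces an int partition value
  static final Transform BUCKET = sourceType -> "int";

  // a naive void replacement that just reports the source column's type
  static final Transform NAIVE_VOID = sourceType -> sourceType;

  public static void main(String[] args) {
    // before the drop: bucket(2, string_col) partitions with int values
    System.out.println(BUCKET.resultType("string"));      // int
    // after the drop in a v1 table: the void replacement claims string,
    // which disagrees with the int values already written in metadata
    System.out.println(NAIVE_VOID.resultType("string"));  // string
  }
}
```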
Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
List<NestedField> structFields = Lists.newArrayList();
Map<Integer, Type> typeMap = Maps.newHashMap();
Map<Integer, String> nameMap = Maps.newHashMap();
I have run into similar issues, and I think this will help resolve the type change of columns in older specs.
Another thing I have seen, which is probably still a problem, is that this method may return a column name multiple times. Consider the following:
Table schema: a int, b date, c date
spec0: year(b), a
spec1: a
spec2: year(b), year(c)
then the result is something like this: 1000: b_year int, 1001: a int, 1002: b_year int, 1003: c_year int
Further down the line when we construct a Schema object, we will have a failure due to b_year name being present in two fields (1000, 1002).
How should this case be handled? Maybe appending _r [fieldId] to each column name?
cc: @RussellSpitzer , @aokolnychyi
@aokolnychyi I think you're referring to what's happening in V1 tables. For those, the spec is ever-growing in the sense that no partition fields/transforms are removed; rather, they are converted to void.
The rename logic is there: https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FBaseUpdatePartitionSpec.java#L179
but I think this is not used for V2 tables, as per https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FBaseUpdatePartitionSpec.java#L261-L268
Since in V2, specs don't retain old deleted partition fields this rename is not required for normal operations.
The problem I'm describing only affects metadata table queries: for V2, due to the lack of the above renames, Partitioning.partitionType() collects all partition fields from all previous specs too. This can result in the same field name being present multiple times, causing construction of the PartitionsTable's (or DataFilesTable's) schema to fail.
I guess you are right, @szlta. Here is an example to reproduce.
PartitionSpec initialSpec = PartitionSpec.builderFor(SCHEMA)
    .identity("data")
    .build();
TestTables.TestTable table = TestTables.create(tableDir, "test", SCHEMA, initialSpec, V2_FORMAT_VERSION);

table.updateSpec()
    .removeField("data")
    .commit();

table.updateSpec()
    .addField("data")
    .addField("id")
    .commit();

This produces the combined partition type:
struct<1000: data: optional string, 1001: data: optional string, 1002: id: optional int>
While it would be great to update the logic that evolves the spec, I think we have to adapt the method that builds a common representation too. Otherwise, existing tables may be broken. Maintaining a set of used names and appending a suffix of the field ID sounds like a reasonable approach.
Thoughts, @rdblue @RussellSpitzer @szehon-ho @flyrain?
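The suffixing idea above can be sketched as follows (a hypothetical helper in plain Java, not the Iceberg code): maintain a set of used names and append the field ID only when a collision occurs.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of de-duplicating partition field names across specs
// by appending the field ID on collision, as suggested above.
public class UniquePartitionNames {
  static Map<Integer, String> uniquify(Map<Integer, String> fieldIdToName) {
    Set<String> used = new HashSet<>();
    Map<Integer, String> result = new LinkedHashMap<>();
    fieldIdToName.forEach((fieldId, name) -> {
      // the first occurrence keeps its name; later duplicates get a suffix
      String unique = used.add(name) ? name : name + "_" + fieldId;
      used.add(unique);
      result.put(fieldId, unique);
    });
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, String> fields = new LinkedHashMap<>();
    fields.put(1000, "b_year");
    fields.put(1001, "a");
    fields.put(1002, "b_year");  // collides with field 1000
    fields.put(1003, "c_year");
    System.out.println(uniquify(fields));
    // {1000=b_year, 1001=a, 1002=b_year_1002, 1003=c_year}
  }
}
```

Projections would still be resolved by field ID, so the exact suffix chosen here does not matter for correctness.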
Looks like we are using field IDs while projecting values from the common representation in Spark.
At least, that part is working.
protected Map<Integer, StructProjection> buildPartitionProjections(Types.StructType partitionType,
                                                                   Map<Integer, PartitionSpec> specs) {
  Map<Integer, StructProjection> partitionProjections = Maps.newHashMap();
  specs.forEach((specID, spec) ->
      partitionProjections.put(specID, StructProjection.create(partitionType, spec.partitionType())));
  return partitionProjections;
}
I think we have two cases to solve this problem. The combined partition struct type is used in both metadata tables and when we need to project the _partition field for certain queries. For queries that use the _partition field, I think adding a suffix to make the names unique is the right approach. That's internal so it doesn't really matter what we rename to, as long as we get the projections that produce the partition tuple for a given spec right.
For metadata tables, we expect the field names to match the partition names. A simple example is that we use identity partitions using the original column name. So it would be weird to partition by category and need to query category_r1003 in the metadata table. I think that the right thing to do for metadata tables is to have a separate way to produce the combined struct that produces only one partition column per name.
In @aokolnychyi's example, 1001: data should be present and 1000: data should not be used for metadata tables.
This is also an area where we may want to bring back compatible partition columns. There's no reason why Anton's example couldn't detect that 1000: data was in an old spec and reuse the ID to avoid this problem.
What about partition transforms where the transform itself could change, but the field name remains the same?
E.g. bucket(data, 10) would be data_bucket; after dropping this from the spec and re-adding it in a new spec bucket(data, 8), id, we would have 1000: data_bucket, 1001: data_bucket, 1002: id, where the two data_bucket fields describe different things. I guess this is similar for truncate.
A partition in the old spec where data_bucket = 0 would be different from a partition in the new spec where data_bucket = 0
I agree with @rdblue that it's weird to have category_r1003 in queries, but if we want to avoid it, I see two ways of proceeding from metadata query perspective:
A metadata table should..
- either give back partition information as per the latest spec only,
- or combine the data returned for similarly named partition fields into just one field, but extend such tables with spec_id information in these cases.
So as per above example use case the partitions table would look like this for the two cases (with 1 partition in each spec):
1:
+---------------------------------+--------------------+------------------+
| test.partition | test.record_count | test.file_count |
+---------------------------------+--------------------+------------------+
| {data_bucket : 0, id : 1} | 1 | 1 |
+---------------------------------+--------------------+------------------+
2:
+---------------------------------+--------------------+------------------+--------------+
| test.partition | test.record_count | test.file_count | test.spec_id |
+---------------------------------+--------------------+------------------+--------------+
| {data_bucket : 0, id : null} | 1 | 1 | 0 |
| {data_bucket : 0, id : 1} | 1 | 1 | 1 |
+---------------------------------+--------------------+------------------+--------------+
Note this is how it would be for V2 tables. For V1, due to the renaming we're already doing, the renamed fields would be present too (unless we aim at changing that too):
1:
+---------------------------------+--------------------+------------------+
| test.partition | test.record_count | test.file_count |
+---------------------------------+--------------------+------------------+
| {data_bucket : 0, id : 1} | 1 | 1 |
+---------------------------------+--------------------+------------------+
2:
+----------------------------------------------------------+--------------------+------------------+--------------+
| test.partition | test.record_count | test.file_count | test.spec_id |
+----------------------------------------------------------+--------------------+------------------+--------------+
| {data_bucket_1000 : 0, data_bucket : null, id : null} | 1 | 1 | 0 |
| {data_bucket_1000 : null, data_bucket: 0, id : 1} | 1 | 1 | 1 |
+----------------------------------------------------------+--------------------+------------------+--------------+
I'm personally more for solution 2, where we don't have to omit the old partitions but at the same time we get a nice and coherent partition info result. It's kind of in league with what I'm proposing in issue #4292 (we could also continue the discussion of this problem there; I didn't mean to hijack @ConeyLiu's PR like this).
I'm also up for solution 2, where we rename just the columns from older specs. That sounds like a reasonable solution.
I'll catch up today. Sorry for the delay!
I think that the fixes in this PR look good other than a couple of minor updates that are needed. We should definitely follow up with a PR that fixes the names as suggested by @szlta.
@ConeyLiu, can you update the description? It is really helpful to not only say what you're trying to fix, but also to describe the approach and why the changes are needed. Thanks!

Updated.
.sorted(Comparator.comparingInt(NestedField::fieldId))

List<NestedField> sortedStructFields = fieldMap.keySet().stream()
    .sorted(Comparator.naturalOrder())
    .map(fieldId -> NestedField.optional(fieldId, nameMap.get(fieldId), typeMap.get(fieldId)))
I think the changes in this file are correct and fix the first issue, where the type may be incorrect for void transform fields in v1 tables.
Force-pushed 0e52d2f to 159a0d5
Rebased, thanks for the reminder. @aokolnychyi

Thanks @ConeyLiu!

Also thanks to all of the reviewers who took their time to check on this PR. I know this was a tricky issue and I'm glad we had so many eyes on it.

Thanks all!

Should this fix be included in 0.13.2? @rdblue @RussellSpitzer

@szehon-ho, that sounds fine to me.

👍 Looks like it has popped up in a few places. Thanks, @ConeyLiu. Let me know if you want to make a PR against the 0.13.x branch and I can review it; otherwise I will take a look. (Not sure exactly if that's the process.)

Thanks @szehon-ho, will do it.
…r V1 format (apache#3411) V1 tables replace dropped partition transforms with a void transform, which always returns null. The type of this void transform always matches the column's original type. This causes issues when the column's type differs from the result type of the transform previously used to partition the column. Here the issue is patched by retaining the correct type for the transform values when those values are read after the transform has been dropped.
…r V1 format (apache#3411) V1 tables replace dropped partition transforms with a void transform, which always returns null. The type of this void transform always matches the column's original type. This causes issues when the column's type differs from the result type of the transform previously used to partition the column. Here the issue is patched by retaining the correct type for the transform values when those values are read after the transform has been dropped. (cherry picked from commit 6441d6e)
This patch fixes two problems:

1. For the V1 format, a removed partition field is replaced with a VoidTransform, but the result type of the VoidTransform is the same as the type of the source field. For example, the result type of Bucket(2, string_type_field) is int, while it is string for VoidTransform(string_type_field). So we should use the original partition field type (int instead of string) to build the common partitioning type.
2. We should use the correct PartitionSpec, not the current table spec, to convert the PartitionFieldSummary to a human-readable string.

Closes #3374
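As a rough illustration of the first fix (a hypothetical toy model in plain Java, not Iceberg's actual classes), combining specs so that the earliest-recorded type for a field ID wins keeps the original transform's result type rather than the void replacement's source type:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: each spec maps field ID -> result type; in a v1 table
// a dropped field shows up in later specs as void with the source column type.
public class CommonPartitionType {
  static Map<Integer, String> commonType(Map<Integer, String>... specs) {
    Map<Integer, String> combined = new LinkedHashMap<>();
    for (Map<Integer, String> spec : specs) {
      // keep the type from the earliest spec that defined the field, so the
      // original transform's result type wins over the void replacement's
      // source type
      spec.forEach(combined::putIfAbsent);
    }
    return combined;
  }

  public static void main(String[] args) {
    Map<Integer, String> spec0 = new LinkedHashMap<>();
    spec0.put(1000, "int");      // bucket(2, string_field) -> int
    Map<Integer, String> spec1 = new LinkedHashMap<>();
    spec1.put(1000, "string");   // void(string_field) reports string
    System.out.println(commonType(spec0, spec1));  // {1000=int}
  }
}
```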