Add data_file.spec_id to metadata tables #3015
```diff
@@ -235,45 +235,48 @@ public void put(int i, Object value) {
         this.format = FileFormat.valueOf(value.toString());
         return;
       case 3:
-        this.partitionData = (PartitionData) value;
+        this.partitionSpecId = (value != null) ? (Integer) value : -1;
```
Review thread on the added `partitionSpecId` line:

Contributor
@szehon-ho @RussellSpitzer @rdblue @openinx FYI, shifting the field order is a breaking change that caused Flink to fail to restore from a checkpoint. It is not a big deal for us this time, since we are still in the testing phase, but I want to call out that we need to be more careful in the future.

Member
Well, I did warn against it :)

Contributor
@stevenzwu, it's good that we caught this. How did it happen? Was Flink relying on a specific position?

Contributor
The Flink Iceberg sink checkpoints the manifest file. After upgrading to the latest Iceberg master branch, the Flink job can't restore the checkpoint due to this Avro schema position change.

Contributor
@stevenzwu, I understood the failure scenario. I'm wondering where Flink is relying on position. That sounds like a Flink bug, and we should make sure we fix it.

Contributor
@rdblue

Member (Author)
Hi guys, sorry, I'm out of town and may reply slowly. @stevenzwu, I am curious: did you find a fix? I'm not sure I understand the complete problem, but this change should not modify the serialized form of the metadata file saved in a Flink checkpoint, since specId is a derived field (if that is the concern). See VXMetadata.java, which controls the serialization format and is not changed by this PR.

Contributor
I think this is definitely a Flink bug. Avro can handle schema evolution; you just need to keep track of the write schema and the read schema. My guess is that Flink is not correctly tracking the write schema that was used, so you get incorrect results at read time. Where is the write schema tracked for Flink state?
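To make the last point concrete, here is a minimal, self-contained sketch of Avro schema resolution (illustrative only: the record and field names are invented, and this is not Flink's actual restore path). When the reader supplies both the write schema and the read schema, a field inserted in the middle of the record is matched by name and filled from its default, so the position shift is harmless:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroPositionShiftDemo {
  public static void main(String[] args) throws Exception {
    // Schema in effect when the bytes were written (e.g. at checkpoint time).
    Schema writeSchema = SchemaBuilder.record("data_file").fields()
        .requiredString("file_path")
        .requiredLong("record_count")
        .endRecord();

    // Newer schema with a field inserted in the middle, shifting positions.
    Schema readSchema = SchemaBuilder.record("data_file").fields()
        .requiredString("file_path")
        .name("spec_id").type().intType().intDefault(0)
        .requiredLong("record_count")
        .endRecord();

    // Serialize a record with the write schema.
    GenericRecord record = new GenericData.Record(writeSchema);
    record.put("file_path", "s3://bucket/data/00000.parquet");
    record.put("record_count", 42L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writeSchema).write(record, encoder);
    encoder.flush();

    // Deserialize with BOTH schemas: Avro resolves fields by name and fills
    // spec_id from its default. Decoding old bytes with only the new schema
    // is the failure mode described above.
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord restored =
        new GenericDatumReader<GenericRecord>(writeSchema, readSchema).read(null, decoder);
    System.out.println(restored);
  }
}
```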
```diff
         return;
       case 4:
-        this.recordCount = (Long) value;
+        this.partitionData = (PartitionData) value;
         return;
       case 5:
-        this.fileSizeInBytes = (Long) value;
+        this.recordCount = (Long) value;
         return;
       case 6:
-        this.columnSizes = (Map<Integer, Long>) value;
+        this.fileSizeInBytes = (Long) value;
         return;
       case 7:
-        this.valueCounts = (Map<Integer, Long>) value;
+        this.columnSizes = (Map<Integer, Long>) value;
         return;
       case 8:
-        this.nullValueCounts = (Map<Integer, Long>) value;
+        this.valueCounts = (Map<Integer, Long>) value;
         return;
       case 9:
-        this.nanValueCounts = (Map<Integer, Long>) value;
+        this.nullValueCounts = (Map<Integer, Long>) value;
         return;
       case 10:
-        this.lowerBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
+        this.nanValueCounts = (Map<Integer, Long>) value;
         return;
       case 11:
-        this.upperBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
+        this.lowerBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
         return;
       case 12:
-        this.keyMetadata = ByteBuffers.toByteArray((ByteBuffer) value);
+        this.upperBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
         return;
       case 13:
-        this.splitOffsets = ArrayUtil.toLongArray((List<Long>) value);
+        this.keyMetadata = ByteBuffers.toByteArray((ByteBuffer) value);
         return;
       case 14:
-        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
+        this.splitOffsets = ArrayUtil.toLongArray((List<Long>) value);
         return;
       case 15:
-        this.sortOrderId = (Integer) value;
+        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
         return;
       case 16:
+        this.sortOrderId = (Integer) value;
+        return;
+      case 17:
         this.fileOrdinal = (long) value;
         return;
       default:
```
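For readers unfamiliar with why shifting positions is risky here at all: `put` and `get` implement Avro's positional record contract, so within these methods a field's identity is its ordinal, not its name. A stripped-down illustration of that contract follows (this is not Iceberg's `BaseFile`, just the shape of the interface it implements):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

// Minimal positional record: Avro reads and writes fields purely by index,
// which is why inserting spec_id at position 3 shifts every later ordinal.
class PositionalRecord implements IndexedRecord {
  private final Schema schema;
  private final Object[] values;

  PositionalRecord(Schema schema) {
    this.schema = schema;
    this.values = new Object[schema.getFields().size()];
  }

  @Override
  public void put(int i, Object v) {
    values[i] = v;  // field identity is the ordinal, not the name
  }

  @Override
  public Object get(int i) {
    return values[i];
  }

  @Override
  public Schema getSchema() {
    return schema;
  }
}
```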
```diff
@@ -301,32 +304,34 @@ public Object get(int i) {
       case 2:
         return format != null ? format.toString() : null;
       case 3:
-        return partitionData;
+        return partitionSpecId;
       case 4:
-        return recordCount;
+        return partitionData;
       case 5:
-        return fileSizeInBytes;
+        return recordCount;
       case 6:
-        return columnSizes;
+        return fileSizeInBytes;
       case 7:
-        return valueCounts;
+        return columnSizes;
       case 8:
-        return nullValueCounts;
+        return valueCounts;
       case 9:
-        return nanValueCounts;
+        return nullValueCounts;
       case 10:
-        return lowerBounds;
+        return nanValueCounts;
       case 11:
-        return upperBounds;
+        return lowerBounds;
       case 12:
-        return keyMetadata();
+        return upperBounds;
       case 13:
-        return splitOffsets();
+        return keyMetadata();
       case 14:
-        return equalityFieldIds();
+        return splitOffsets();
       case 15:
-        return sortOrderId;
+        return equalityFieldIds();
       case 16:
+        return sortOrderId;
+      case 17:
```
Review thread on the `return pos;` line:

Contributor
It looks like this is a bug from https://github.com/apache/iceberg/pull/1723/files. I think this should return

Member (Author)
Yeah, you are right. I can file an issue and take a look at it after this change is in.
```diff
         return pos;
       default:
         throw new UnsupportedOperationException("Unknown field ordinal: " + pos);
```
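Given that `put` assigns `this.fileOrdinal` at ordinal 17, the `get` case flagged in the thread above would presumably return that field instead of the local `pos` index. A sketch of the suspected follow-up fix (an assumption based on this thread, not part of this PR):

```java
      case 17:
        // suspected fix: return the stored file ordinal, not the switch index
        return fileOrdinal;
```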
```diff
@@ -442,6 +447,7 @@ public String toString() {
         .add("content", content.toString().toLowerCase(Locale.ROOT))
         .add("file_path", filePath)
         .add("file_format", format)
+        .add("spec_id", specId())
         .add("partition", partitionData)
         .add("record_count", recordCount)
         .add("file_size_in_bytes", fileSizeInBytes)
```
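Once the column is exposed, it can be read from the metadata tables like any other field. A minimal usage sketch with Spark (the table name is illustrative, and an Iceberg catalog is assumed to be configured):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SpecIdQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("spec-id-demo")
        .getOrCreate();

    // Read the table's `files` metadata table and project the new column.
    Dataset<Row> files = spark.read().format("iceberg").load("db.sample.files");
    files.select("file_path", "spec_id", "partition").show(false);
  }
}
```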
Review comment:
I think we need to update the spec since we are assigning a new field ID here. We should at least note that it is reserved, even though we don't write it into data files. We can do that in a follow-up.