-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: Support first-row-id for manifests and manifest lists #12672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| import org.junit.jupiter.api.Test; | ||
| import org.junit.jupiter.api.io.TempDir; | ||
|
|
||
| public class TestManifestEncryption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duplicated the manifest tests in TestManifestWriterVersions, so I updated that suite to allow this one to override the plaintext EncryptionManager with the test EM.
| * | ||
| * @deprecated will be removed in 1.10.0; use {@link ManifestWriter#toManifestFile()} instead. | ||
| */ | ||
| @Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to remove a few more references to this older constructor. I'm not sure why there was a public constructor and I think we should avoid having one. That's why it is now deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This touches more files so I'll take care of it in a follow up.
| private final long snapshotId; | ||
| private final long sequenceNumber; | ||
| private final String manifestLocation; | ||
| private Long nextRowId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't thought through all the usages but should we have this be atomic? I think we may need handle multiple threads using this class at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we create a separate InheritableMetadata for each manifest reader and the readers make no guarantees about thread safety.
| return wrapped.keyMetadata(); | ||
| case 15: | ||
| if (wrappedFirstRowId != null) { | ||
| // if first-row-id is assigned, ensure that it is valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is basically making sure that if we inherited a firstRowID it is not also written in the underlying entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first-row-id isn't an entry field is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, maybe I didn't understand. Why are we checking that wrapped.firstRowID() is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose is to make sure that this isn't used to replace an already assigned first-row-id. In order to assign one by calling wrap(file, firstRowId), the file must be a data file and the row id can't already be assigned.
core/src/test/java/org/apache/iceberg/TestManifestListVersions.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestManifestListVersions.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestManifestListVersions.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| @Test | ||
| public void testV3WriteWithInheritance() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are pretty hard for me to follow. I probably just haven't been in this code in a while but I can't tell what each one is trying to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a bit weird because the write methods do more than one thing (write and also assert some things). I was trying to avoid refactoring the entire suite for this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll spend some more time reading through them to see if I can follow then.
core/src/test/java/org/apache/iceberg/rest/responses/TestLoadTableResponseParser.java
Outdated
Show resolved
Hide resolved
154494f to
c5b867c
Compare
| Schema entriesTableSchema = | ||
| TypeUtil.selectNot( | ||
| Spark3Util.loadIcebergTable(spark, tableName + ".entries").schema(), | ||
| Set.of(DataFile.FIRST_ROW_ID.fieldId())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions in this suite were previously reading the manifest directly as Avro and matching against the rows returned by Spark. That always produces null for first_row_id but the manifest reader sets the field using inheritance so the tests were failing.
The solution is to remove first_row_id from both the entries table schema (used to read manifests) and from the Spark datasets (via selectNonDerived).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be worth adding a
private Schema entriesTableSchema() {
return TypeUtil.selectNot(
Spark3Util.loadIcebergTable(spark, tableName + ".entries").schema(),
Set.of(DataFile.FIRST_ROW_ID.fieldId()));
}To reduce the redundancy
| assertThat(file.pos()).as("Position should match").isEqualTo(expectedPos); | ||
| assertThat(((BaseFile) file).get(20)) | ||
| .as("Position from field index should match") | ||
| .isEqualTo(expectedPos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this because the line above tests that file.pos() is correct. This assertion is incorrect because it expects the field order not to change.
| manifest.path()); | ||
| } | ||
|
|
||
| /** Returns {@link InheritableMetadata} for rewriting a manifest before it is committed. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no longer changing this class, but I think it's reasonable to keep the additional javadoc.
| } | ||
| } | ||
|
|
||
| private static <F extends ContentFile<F>> Function<ManifestEntry<F>, ManifestEntry<F>> idAssigner( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping state in InheritableMetadata doesn't work because it is reused each time the reader produces an iterator. That causes incorrect row ID assignment (caught by the tests I'm working on). Instead, I've introduced this assigner function that is called where InheritableMetadata is used. InheritableMetadata is used for constants, this is used for state-based ID assignment.
Record count is always projected.
5b3432d to
03bada9
Compare
| table.newRewrite().deleteFile(filePart1).deleteFile(filePart2).addFile(fileCompacted).commit(); | ||
|
|
||
| // Rewrites are currently just treated as appends. In the future we could treat these as no-ops | ||
| // rewrites produce new manifests without first-row-id or any information about how many rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // rewrites produce new manifests without first-row-id or any information about how many rows | |
| // Rewrites produce new manifests without first-row-id or any information about how many rows |
|
|
||
| // Rewrites are currently just treated as appends. In the future we could treat these as no-ops | ||
| // rewrites produce new manifests without first-row-id or any information about how many rows | ||
| // are new. without tracking a new metric for a manifest (e.g., assigned-rows) or assuming that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // are new. without tracking a new metric for a manifest (e.g., assigned-rows) or assuming that | |
| // are new. Without tracking a new metric for a manifest (e.g., assigned-rows) or assuming that |
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
| file.put(3, 0); // specId | ||
| } | ||
|
|
||
| // suppress the readable metrics and first-row-id that are not in manifest files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // suppress the readable metrics and first-row-id that are not in manifest files | |
| // Suppress the readable metrics and first-row-id fields that are not in manifest files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we didn't use sentence case in non-javadoc comments because you get weird capitalized fragments and unnecessary PRs/commits to "fix" the case. Are these strong nits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what our standard on this is really
// [A-Z]
Has 5422 Hits
// [a-z]
Has 10896 Hits
So we definitely are leaning towards no case
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I have a few nits on tests but I'm onboard with the implementation
danielcweeks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 pending checks. (I think the failures were unrelated, so I restarted them).
|
Thanks for the reviews, @RussellSpitzer and @danielcweeks! I'm going to merge this so that we can get working on the next set of changes, including #12836. |
This adds support for first-row-id in manifests and manifest lists.
Manifests are updated so that data files inherit/assign a
first-row-idwhen the field is null, based on the record counts of previous data files. Currently, thefirst-row-idfor a data file is based on the manifest'sfirst-row-idand the number of records in files without an assignedfirst-row-id. I think that this matches the expected behavior, which is based on therecord_countof allADDEDdata files (The spec states: "When reading, the first_row_id is assigned by replacing null with the manifest's first_row_id plus the sum of record_count for all added data files that preceded the file in the manifest.")Manifest lists are updated so that
first-row-idfor a data manifest is always written, either because the manifest has an already assignedfirst-row-idor by assigning a new one. The number of added records in a manifest assigned a newfirst-row-idis used to update anext-row-idthat is either used for the next manifest or is used asnext-row-idin table metadata. This strategy updates thenext-row-idby the number of added records in all new data manifests. This is not what the spec currently says but I think is what we meant at the time.The v3 spec states:
I think this language would require allocating the number of rows in added data files in the whole table for every commit, not just the added rows in the new manifests. What I've implemented is allocating the number of added rows in the manifests that are assigned a new
first-row-id, which is the same as for each new manifest.