Use schema at the time of the snapshot when reading a snapshot. #1508
Conversation
This is for #1501.
Thanks for working on this, @wypoon! This was just recently pointed out by another contributor as well. We're currently tracking this in #1029, which has a bit more detail about how this should be built. While I like that your solution found a way to recover the old schema, we don't want to need to read old metadata files to recover it. Iceberg should track the schema that was used for each snapshot. That means a few steps are needed:
Then we would need to add a …
@rdblue, thanks for the feedback and the pointer to #1029.
Yes, we would add …
Should we read the old metadata files as a fallback for data written before this addition?
We could, but it doesn't seem worth the effort to maintain it to me. If other people disagree, I'm fine adding it.
The question is what to do when we access a historical snapshot that predates schema tracking. It seems we have a few choices:
Using the current schema seems like the worst choice, as that results in inconsistent behavior for reasons that are invisible to the user. Failing the query would break behavior that works today. While I agree that having more code to handle older data is annoying, the complexity here seems low (~20 lines of code) relative to the benefit. We have much higher complexity in other areas, such as supporting non-projected identity columns for converted Hive tables (and that's not even covered by the specification).
I agree with that. And since we plan to make schema tracking required in v2, we can easily remove this work-around.
@wypoon, are you interested in working on both? We could update this PR to use the work-around you implemented and to update the …
 *
 * @return this table's schema
 */
Schema schemaForSnapshotAsOfTime(long timestampMillis);
Instead of changing the table API in this commit, I think we only need to add a schema method to Snapshot. That will be used later when we have better tracking. Not adding too many methods to the public API in this commit will give us more flexibility later and make this simpler to get in.
I can avoid adding the schemaForSnapshot methods to the Table API. I think only BaseTable needs to have the methods. However, the places where I need to call schemaForSnapshot only have references to Table; I'd have to check whether the Table is an instance of BaseTable.
I also suggested in the general comments that Snapshot could simply have a schemaId method in the future. That would then be used in a revised implementation of schemaForSnapshot in BaseTable.
Yeah, I agree with you. We have some flexibility here with how we look up the schema for a given snapshot. The main thing is that we don't want to expose any methods through the Table API right now. And since we don't have a schema id yet, we can't really add that method to Snapshot.
Eventually, I think we would want Snapshot to return its own schema rather than requiring a lookup, which is why I suggested that option. But we might be able to avoid the problem for now.
Like I said, I just don't think we want to add anything to Table.
I have removed the methods from Table, adding them only to BaseTable. Thus the Table API is unchanged.
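For illustration, callers that only hold a Table reference then need an instanceof check to reach the BaseTable-only method, along the lines of this sketch (schemaForSnapshot is the method discussed above; the wrapper class is hypothetical):

    import org.apache.iceberg.BaseTable;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;

    class SnapshotSchemas {
      // Resolve the schema to use for a snapshot when only a Table reference is in hand.
      static Schema schemaFor(Table table, long snapshotId) {
        if (table instanceof BaseTable) {
          // BaseTable carries the schemaForSnapshot method added in this PR
          return ((BaseTable) table).schemaForSnapshot(snapshotId);
        }
        // other Table implementations fall back to the current schema
        return table.schema();
      }
    }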
this.schema = SparkSchemaUtil.prune(getTableSchema(), requestedSchema, filterExpression(), caseSensitive);
} else {
-  this.schema = table.schema();
+  this.schema = getTableSchema();
In addition to updates here, we will need to update schema handling in BaseTableScan so that the projection returned by the scan is based on the schema of the Snapshot that will be scanned.
This is the only point I don't understand. As far as I can tell, BaseTableScan is constructed with the user's requested schema. If the user wants to read from an old snapshot, they should supply a requested schema compatible with the snapshot's schema. I couldn't see anything I needed to change in BaseTableScan. If I'm missing something here, can you please point out the parts I need to change?
I added a new test case where a requested schema is used, and that revealed a bug in the spark2 IcebergSource#createReader. In that method, there is a call to SparkSchemaUtil.convert(Schema, StructType) whose only purpose seems to be to check that the requested StructType does not contain fields missing from the Schema (for then the call would throw an exception), and here I needed to use the snapshot Schema rather than the table Schema. Other than that, the projection returned by the scan appears to be correct.
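For concreteness, a minimal sketch of that change, assuming a snapshotSchema() helper like the one discussed below resolves the schema of the snapshot being read (the class and method names here are illustrative, not the exact patch):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.spark.SparkSchemaUtil;
    import org.apache.spark.sql.types.StructType;

    class RequestedSchemaCheck {
      // Per the discussion above, convert(...) throws if the requested Spark type references
      // columns that are missing from the base schema, so the base schema must be the
      // snapshot's schema rather than the current table schema when time travelling.
      static Schema validate(Schema snapshotSchema, StructType requestedType) {
        return SparkSchemaUtil.convert(snapshotSchema, requestedType);
      }
    }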
    TableProperties.PARQUET_BATCH_SIZE, TableProperties.PARQUET_BATCH_SIZE_DEFAULT));
}

private Schema getTableSchema() {
I think this method name could be more descriptive. The verb "get" doesn't add anything so we don't normally use it. And this is also not the table schema, it is a snapshot's schema. How about renaming this to snapshotSchema()?
Renamed to snapshotSchema as you suggest.
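Roughly, the renamed helper resolves the snapshot selected by the read options and returns that snapshot's schema. A sketch under the assumption that SnapshotUtil helpers like snapshotIdAsOfTime and schemaFor are available (these are named in the commit notes further down; the wrapper class is illustrative):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.util.SnapshotUtil;

    class SnapshotSchemaResolver {
      // Resolve the schema of the snapshot selected by snapshot-id or as-of-timestamp.
      static Schema snapshotSchema(Table table, Long snapshotId, Long asOfTimestamp) {
        if (snapshotId == null && asOfTimestamp == null) {
          return table.schema();  // no time travel requested: use the current schema
        }
        long id = snapshotId != null
            ? snapshotId
            : SnapshotUtil.snapshotIdAsOfTime(table, asOfTimestamp);
        return SnapshotUtil.schemaFor(table, id);
      }
    }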
// Build Spark table based on Iceberg table, and return it
-  return new SparkTable(icebergTable, schema);
+  return new SparkTable(icebergTable, schema, options);
I think that SparkTable should include a snapshotId so that it returns the correct snapshot's schema to the analyzer, when possible. I don't think that it needs to include these options, when no other options are used by the table.
Ok, I can do that.
I extracted snapshotId and asOfTimestamp from the options here, if they are present, and passed the values to SparkTable in its constructor instead of having SparkTable do that. I honestly think it's six of one and half a dozen of the other.
I think you need to check … Also, from what I can tell, the metadata is always refreshed after the snapshot has been created (e.g., …). As a sanity check, the code could check the expected snapshot-id against the current-snapshot-id in the metadata file.
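A hedged sketch of the lookup being suggested here, scanning previous metadata files for the one whose current-snapshot-id matches the target snapshot (file handling is simplified and names are illustrative, not the exact patch):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.TableMetadata;
    import org.apache.iceberg.TableMetadataParser;
    import org.apache.iceberg.io.FileIO;

    class MetadataSchemaLookup {
      // Find the schema that was current when targetSnapshotId was the table's current snapshot.
      static Schema schemaAtSnapshot(FileIO io, TableMetadata current, long targetSnapshotId) {
        Snapshot currentSnapshot = current.currentSnapshot();
        if (currentSnapshot != null && currentSnapshot.snapshotId() == targetSnapshotId) {
          return current.schema();
        }
        // walk the metadata log and read each previous metadata file
        for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
          TableMetadata previous = TableMetadataParser.read(io, entry.file());
          Snapshot snap = previous.currentSnapshot();
          if (snap != null && snap.snapshotId() == targetSnapshotId) {
            return previous.schema();
          }
        }
        return current.schema();  // not found: fall back to the current schema
      }
    }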
Thanks everyone for your interest in this work!
Good observations, @edwinchoi. I'm not sure why the snapshot log is cleared out when a table is replaced. That log should track the snapshot that was current at a given time, which seems like the right one to use. We use it for time travel queries, for example. But since the log is not there, we can use the metadata file log instead, as you suggested, to find the snapshot that was current at some time. We should not use the list of snapshots because there is no guarantee that any given snapshot was ever the current version (like a branch in git that is never the main branch).
Yes, we will need to have a way to get the right schema from table metadata by ID. I think what might make sense for now is to pass a …
It seems to me that it'd be simpler if in the future, …
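For reference, the point-in-time lookup described above, using the snapshot log exposed as the table's history, can be sketched as follows; this mirrors what the SnapshotUtil.snapshotIdAsOfTime helper mentioned later in this PR provides (the wrapper class is illustrative):

    import org.apache.iceberg.HistoryEntry;
    import org.apache.iceberg.Table;

    class TimeTravelLookup {
      // Return the id of the snapshot that was current as of the given timestamp,
      // walking the snapshot log (table history) in order.
      static long snapshotIdAsOfTime(Table table, long timestampMillis) {
        Long snapshotId = null;
        for (HistoryEntry entry : table.history()) {
          if (entry.timestampMillis() <= timestampMillis) {
            snapshotId = entry.snapshotId();  // the latest entry at or before the timestamp wins
          }
        }
        if (snapshotId == null) {
          throw new IllegalArgumentException("Cannot find a snapshot older than " + timestampMillis);
        }
        return snapshotId;
      }
    }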
I posted in a comment above, but I'll echo it here: I don't think we should be adding anything to the Table API.
@rdblue, how would a snapshot be recorded in the metadata if it weren't at some point current? Intermediate snapshots are removed on commit, so I would imagine the only case where the metadata could refer to snapshots that were never current is if the commit were rejected. In that case, the snapshots wouldn't be reachable from the current commit. What am I missing? Example:

    -- S1
    CREATE TABLE test.ns.tbl USING iceberg AS
    SELECT * FROM VALUES (1, "Alice"), (2, "Bob") AS (id, fname);
    -- S2
    CREATE OR REPLACE TABLE test.ns.tbl USING iceberg AS
    SELECT * FROM VALUES (1, 5, "alice"), (2, 3, "bob") AS (id, name_len, name);
    -- S3
    INSERT INTO test.ns.tbl VALUES (3, 5, "carol");

The table … You can rewind to …
There are a few cases where a snapshot might not have been the current state:
Also, the table state can be rolled back to a previous snapshot and new commits will form a new history afterwards.
Thanks for elaborating. My thoughts...
I don't think WAP is working as expected under RTAS.

    spark.sql("""
      CREATE TABLE test.ns.tbl
      USING iceberg
      TBLPROPERTIES ('write.wap.enabled'='true')
      AS SELECT * FROM VALUES (1, "Alice"), (2, "Bob") AS (id, fname)
    """)
    spark.conf.set("spark.wap.id", "12345")
    spark.sql("""
      CREATE OR REPLACE TABLE test.ns.tbl
      USING iceberg
      AS SELECT * FROM VALUES (1, 5, "alice"), (2, 3, "bob") AS (id, name_len, name)
    """)
    spark.conf.unset("spark.wap.id")

After running this, the schema from the staged change is showing up but the data that should exist isn't accessible, i.e., …
I don't believe this changes the notion of what was current at some point in time. If you view the timeline as being from the database's point of view, then rolling back doesn't change the fact that at some point in time, a snapshot that is now inaccessible was visible in the database.
You're right. I think we should disable WAP with RTAS until we figure out what the behavior should be.
That's right, but it is an example of not being able to simply use the list of snapshots. This was a side point that supports the idea that point-in-time lookup should be done using the snapshot log, or the metadata file log if it is not available, and snapshot lookup should be done strictly by id.
My understanding from @rdblue's replies is that one should use the snapshot log, so I have stuck with that. Also, from what I see, the metadata timestamp is always the same as the snapshot timestamp when the metadata is written for a new snapshot. I changed the loop to look for the metadata log entry whose timestamp == the snapshot timestamp (and break out of the loop). (I could also use >= as you suggest, but it should make no difference.)
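A sketch of the loop described here (illustrative, not the exact patch): find the metadata log entry whose timestamp equals the snapshot's timestamp and stop at the first match:

    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.TableMetadata;

    class MetadataLogSearch {
      // Return the previous metadata file whose timestamp matches the snapshot's timestamp, if any.
      static String metadataFileFor(TableMetadata current, Snapshot snapshot) {
        for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
          if (entry.timestampMillis() == snapshot.timestampMillis()) {
            return entry.file();  // stop at the first matching entry
          }
        }
        return null;  // no matching entry recorded in the metadata log
      }
    }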
FileSystem fs = tablePath.getFileSystem(spark.sessionState().newHadoopConf());
fs.delete(tablePath, true);
catalog.dropTable(currentIdentifier, false);
if (currentIdentifier != null) {
Why is this change needed?
This code used to be in the spark common module before Anton removed it. On rebase, it was moved to v2.4, and I did not scrutinize what happened as there were no conflicts to resolve and tests passed.
This is actually a fix for a latent bug that was exposed when I skipped tests. Every test in TestIcebergSourceTablesBase calls createTable, which in this subclass sets the static currentIdentifier. At the end of each test, this @After method drops the table identified by currentIdentifier. If the test is skipped, this @After method is still run, and will fail because the table was already dropped after the previous test.
My impetus was to skip the tests I added when testing Spark 3. Now that the code is duplicated and the v3.0 version of TestIcebergSourceTablesBase does not have the added tests, test skipping using Assume is not necessary. Nevertheless, I'd argue that this is still a bug, and I propose to keep the fix here and add it to v3.0 as well.
Your thoughts?
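For clarity, the guarded cleanup ends up looking roughly like this sketch (the class is a stand-in for the test base class; names follow the discussion above, not the exact patch):

    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.junit.After;

    public abstract class ExampleTableCleanup {
      // set by createTable(...) in each test; still points at the previous test's table if a test is skipped
      protected static TableIdentifier currentIdentifier;

      protected abstract Catalog catalog();

      @After
      public void dropTable() {
        if (currentIdentifier != null) {
          // guard against dropping a table that was already dropped after a previous test
          catalog().dropTable(currentIdentifier, false);
          currentIdentifier = null;
        }
      }
    }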
Thanks, that makes sense.
I think this is almost ready. Just a few minor things to fix.
According to Edwin Choi, in order to get the schema for a snapshot, the only safe option is to scan the metadata files to find the one where the current-snapshot-id matches the target snapshot id.
The changes are mostly in spark3. They are necessitated by the catalog support introduced in apache#1783. As the spark3 IcebergSource now implements SupportsCatalogOptions, DataFrameReader#load no longer calls IcebergSource#getTable but calls SparkCatalog#loadTable directly. In order for the SparkTable returned by SparkCatalog#loadTable(Identifier) to be aware of the snapshot, the information about the snapshot needs to be present in the Identifier. For this reason, we introduce a SnapshotAwareIdentifier interface extending Identifier. As SupportsCatalogOptions does not allow a schema to be specified (requested), SparkTable no longer needs a requestedSchema field, so some dead code is removed from it.
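As a rough illustration of the idea in this commit (the accessor names are guesses, not the final interface): an Identifier that also carries the time-travel selection, so SparkCatalog#loadTable can build a snapshot-aware SparkTable:

    import org.apache.spark.sql.connector.catalog.Identifier;

    // Hypothetical shape of the snapshot-aware identifier described above.
    interface SnapshotAwareIdentifier extends Identifier {
      Long snapshotId();      // null if the read options did not set snapshot-id
      Long asOfTimestamp();   // null if the read options did not set as-of-timestamp
    }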
Rebased on master. Use constants from SparkReadOptions. Implement snapshotSchema() in SparkFilesScan as it extends SparkBatchScan.
Avoid introducing new methods to BaseTable. Add helper methods to SnapshotUtil instead. Move recovery of the schema from previous metadata files, in the event that the snapshot does not have an associated schema id, to a new PR. Remove the snapshotSchema method from SparkBatchScan and its subclasses, as it is not needed. Adjust the schema in BaseTableScan when useSnapshot is called.
Use the existing CatalogAndIdentifier and swap out the Identifier for a snapshot-aware TableIdentifier if snapshotId or asOfTimestamp is set.
Fix a bug in BaseTableScan#useSnapshot. Some clean up in SnapshotUtil. Some streamlining in added unit tests. Refactor spark2 Reader to configure the TableScan on construction, and let the TableScan get the schema for the snapshot. Rename new TableIdentifier to SparkTableIdentifier to avoid confusion with existing TableIdentifier (in different package). Add convenience constructor to PathIdentifier to avoid modifying tests for it.
…ableScan. Use SnapshotUtil.snapshotIdAsOfTime in BaseTableScan#asOfTime. Move formatTimestampMillis from BaseTableScan to SnapshotUtil in order to use it there (BaseTableScan is package-private, not a public class). Fix some error messages.
Remove some unused code. Sync v3.0 with changed code in v2.4.
| .option("snapshot-id", String.valueOf(firstCommitId)) | ||
| .option(SparkReadOptions.SNAPSHOT_ID, String.valueOf(firstCommitId)) |
Keep v3.0 in sync with v2.4 to the extent applicable.
@rdblue, are we gating changes to spark/v3.0 until v3.2 is added? If so, I can remove the v3.0 changes here. If you really want, I can remove the fix for …
Thanks @wypoon! Looks great.
Now that a snapshot has a schema id associated with it, when reading a snapshot we should use the schema referenced by that schema id. We implement this for Spark 2. The implementation for Spark 3 is deferred pending resolution of #3269. Changes for other engines that support time travel need to be implemented separately. In addition, for snapshots written before Iceberg added the schema id to snapshots, we plan to implement recovering the schema by reading previous metadata files in a future change.
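As an example of the resulting behavior in Spark 2 (the table location and snapshot id are placeholders): a time-travel read now reports the schema the table had at that snapshot rather than the current table schema:

    import org.apache.iceberg.spark.SparkReadOptions;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    class TimeTravelReadExample {
      static void readAtSnapshot(SparkSession spark, String tableLocation, long snapshotId) {
        Dataset<Row> df = spark.read()
            .format("iceberg")
            .option(SparkReadOptions.SNAPSHOT_ID, String.valueOf(snapshotId))
            .load(tableLocation);
        df.printSchema();  // now reflects the snapshot's schema, not the current table schema
      }
    }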