Docs: Add stream-from-timestamp in spark-configuration.md #3732
Conversation
kbendick left a comment
Thanks for adding this, @rajarshisarkar.
A small nit below, mostly related to my own inability to comprehend things, so feel free to update it or not; the current description is okay by me too.
But I'm +1 😄
site/docs/spark-configuration.md
Outdated
| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds; start a stream from the snapshot that occurs after this timestamp. |
Nit: Could you reword the description to be a bit clearer about which snapshot is started from? That was a source of confusion for me for a while.
Maybe "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp"?
It's a minor change, but it took me a while to be sure which snapshot was being referred to even after having it explained, so I think it might help others to be more sure of what we mean.
Nit: Also, the rest of the descriptions don't seem to end in periods. I'm not sure if that's because they end with configuration keys, but if other sentences in the same column don't end with periods, can you remove it?
Hi @kbendick
Thanks for reviewing!
`stream-from-timestamp` actually streams the table from the first snapshot whose timestamp is >= the given timestamp. Here's the relevant code: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L109-L117
Hence, I have kept the description as "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp".
Please let me know if we can phrase it better.
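(For readers following along, a minimal sketch of the >= semantics described above, assuming a helper that simply scans the table's snapshots; the class and method names are invented, and this is not the linked `SnapshotUtil` code, which walks the current snapshot's ancestry instead.)

```java
import java.util.Comparator;
import java.util.stream.StreamSupport;

import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

class StreamStartSnapshot {
  // Return the first snapshot whose commit time is at or after the given
  // timestamp, or null if every snapshot was committed before it.
  static Snapshot firstSnapshotAtOrAfter(Table table, long timestampMillis) {
    return StreamSupport.stream(table.snapshots().spliterator(), false)
        .filter(s -> s.timestampMillis() >= timestampMillis)
        .min(Comparator.comparingLong(Snapshot::timestampMillis))
        .orElse(null);
  }
}
```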
Ah ok, cool. I knew I got confused by that somehow, so I appreciate you detailing it for me. If it's greater than or equal, then "the first snapshot that occurs at or after this timestamp" is probably fine. The "strictly after" wording implies strictly greater than; that was my mistake. I apologize.
Makes sense! I have made the changes.
@kbendick Can you please review once you get time?
kbendick left a comment
This looks great. Thank you for the contributions and for the updates, @rajarshisarkar! Sorry for the delayed response.
site/docs/spark-configuration.md
Outdated
| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds, start streaming this table from the first snapshot that occurs at or after this timestamp |
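(As a usage illustration of the option in this row: a hedged sketch of a Spark Structured Streaming read that sets `stream-from-timestamp`. The session setup, table name `db.events`, and the timestamp value are placeholders, not part of this PR.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StreamFromTimestampExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("stream-from-timestamp-example")
        .getOrCreate();

    // Milliseconds since the epoch; 2022-01-01T00:00:00Z here.
    long streamStartTimestamp = 1640995200000L;

    // Start streaming from the first snapshot committed at or after the timestamp.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .option("stream-from-timestamp", Long.toString(streamStartTimestamp))
        .load("db.events");

    // ... attach a sink via stream.writeStream() and start the query ...
  }
}
```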
I think that the default is not set, so the default behavior is not to stream from a timestamp but to stream from the oldest snapshot.
You may also want to update this to "first known ancestor snapshot" and add a note:
!!! Note
    If `stream-from-timestamp` is before the oldest ancestor snapshot in the table, the oldest ancestor will be used.
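(To make the note's fallback concrete, a minimal sketch under the assumption that ancestry is walked via parent links; the names are invented and this is not the actual Iceberg implementation. Because the comparison is at-or-after, a timestamp earlier than the oldest ancestor naturally selects that oldest ancestor.)

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

class StreamStartAncestor {
  // Pick the first known ancestor snapshot committed at or after the timestamp.
  static Snapshot startSnapshot(Table table, long timestampMillis) {
    // Collect the current snapshot's ancestors, oldest first, by following parent ids.
    Deque<Snapshot> ancestors = new ArrayDeque<>();
    for (Snapshot s = table.currentSnapshot(); s != null;
        s = s.parentId() == null ? null : table.snapshot(s.parentId())) {
      ancestors.addFirst(s);
    }

    for (Snapshot s : ancestors) {
      if (s.timestampMillis() >= timestampMillis) {
        // Also covers the fallback: a timestamp before the oldest ancestor
        // matches that oldest ancestor on the first iteration.
        return s;
      }
    }
    return null;  // timestamp is after the newest ancestor: nothing to stream from yet
  }
}
```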
Thanks for the inputs, @rdblue.
The default value `Long.MIN_VALUE` is set here: https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java#L215-L220
I have updated the wording to "first known ancestor snapshot".
I have added the note about the default behavior. Do let me know if I should keep the note in the default column and remove `Long.MIN_VALUE`.
Please remove `Long.MIN_VALUE`. While that's in the code, we want to document behavior, not the specific implementation.
Thanks for the clarification, @rdblue. I have pushed the change.
This PR is likely going to be affected by changes made in this PR: #3775. Removed my approval until that settles.
I think this needs a note about the behavior. Once that's in I can merge.
Thanks, @rajarshisarkar!
Add `stream-from-timestamp` in `spark-configuration.md`. Reference: Code