Conversation

@rajarshisarkar
Contributor

@rajarshisarkar rajarshisarkar commented Dec 13, 2021

Add stream-from-timestamp in spark-configuration.md.

Reference: Code

@github-actions github-actions bot added the docs label Dec 13, 2021
Contributor

@kbendick kbendick left a comment

Thanks for adding this @rajarshisarkar

A small nit related to my own inability to comprehend things, so feel free to update it or not. The current description is OK by me too.

But I'm +1 😄

| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds; start a stream from the snapshot that occurs after this timestamp. |
Contributor

Nit: Could you reword the description to be a bit more clear about which snapshot is started from? That was a source of confusion for a while for me.

Maybe "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp"?

It's a minor change, but it took me a while to be sure which snapshot was being referred to even after it was explained to me, so I think it might help others to be more sure of what we mean.

Contributor

Nit: Also, the rest of the descriptions don't seem to end in periods. I'm not sure if that's because they end with configuration keys, but if the other sentences in the same column don't end with periods, could you remove the trailing period?

Contributor Author

Hi @kbendick

Thanks for reviewing!

stream-from-timestamp would actually stream the table starting from the first snapshot whose timestamp is >= the given timestamp. Here's an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L109-L117

Hence, I have kept the description as "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp."

Please let me know if we can phrase it better.
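
For anyone following along, the ">= timestamp" selection discussed here can be sketched roughly as below. This is a minimal illustration only, not the actual SnapshotUtil implementation; the function name and the `(snapshot_id, timestamp_millis)` data shape are made up for the example.

```python
# Hypothetical sketch of the snapshot-selection behavior discussed above.
# Not the real Iceberg code; see SnapshotUtil.java (linked above) for that.
def first_snapshot_at_or_after(snapshots, start_ts_millis):
    """Return the id of the first snapshot whose timestamp is >= start_ts_millis.

    `snapshots` is a list of (snapshot_id, timestamp_millis) pairs in commit
    order, oldest first. Returns None if every snapshot is older than the
    requested timestamp.
    """
    for snapshot_id, ts_millis in snapshots:
        if ts_millis >= start_ts_millis:
            return snapshot_id
    return None


history = [(101, 1000), (102, 2000), (103, 3000)]
first_snapshot_at_or_after(history, 1500)  # -> 102 (first snapshot at or after 1500)
first_snapshot_at_or_after(history, 500)   # -> 101 (timestamp before the oldest snapshot)
```

Note that a timestamp earlier than the oldest snapshot simply resolves to the oldest snapshot under this comparison, which matches the behavior described later in the thread.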

Contributor

Ah, OK, cool. I knew I got confused by that somehow, so I appreciate you detailing it for me. If it's greater than or equal, then "the first snapshot that occurs at or after this timestamp" is probably fine. "Strictly after" implies strictly greater than. That was my mistake. I apologize.

Contributor Author

Makes sense! I have made the changes.

@rajarshisarkar
Contributor Author

@kbendick Can you please review once you get some time?

Contributor

@kbendick kbendick left a comment

This looks great. Thank you for the contributions and thank you for the updates @rajarshisarkar! Sorry for the delay in response.

| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds, start streaming this table from the first snapshot that occurs at or after this timestamp |
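
As context for how the option in the row above is consumed, here is a minimal usage sketch. It assumes a running SparkSession `spark` with the Iceberg runtime configured and an example table name `db.table`; the timestamp value is illustrative.

```python
# Illustrative sketch only: assumes a SparkSession `spark` with the Iceberg
# runtime on the classpath, and an example table `db.table`.
stream_start_millis = 1640995200000  # example epoch timestamp, in milliseconds

df = (
    spark.readStream
    .format("iceberg")
    .option("stream-from-timestamp", str(stream_start_millis))  # option from the table above
    .load("db.table")
)
```

Streaming then begins from the first snapshot at or after `stream_start_millis`, per the description in the table.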
Contributor

@rdblue rdblue Dec 20, 2021

I think that the default is not set, so the default behavior is not to stream from a timestamp but to stream from the oldest snapshot.

Contributor

You may also want to update this to "first known ancestor snapshot" and add a note:

!!! Note
    If `stream-from-timestamp` is before the oldest ancestor snapshot in the table, the oldest ancestor will be used.

Contributor Author

Thanks for the inputs, @rdblue.

The default value Long.MIN_VALUE is set here: https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java#L215-L220

I have updated the wording to "first known ancestor snapshot".

I have added the note about the default behavior. Do let me know if I should keep the note in the default column and remove Long.MIN_VALUE.

Contributor

Please remove Long.MIN_VALUE. While that's in the code, we want to document behavior, not the specific implementation.

Contributor Author

Thanks for the clarification, @rdblue. I have pushed the change.

@kbendick kbendick self-requested a review December 20, 2021 23:57
@kbendick
Contributor

This PR is likely to be affected by changes made in PR #3775.

Removed my approval until that settles.

@rdblue
Contributor

rdblue commented Dec 21, 2021

I think this needs a note about the behavior. Once that's in I can merge.

@rdblue rdblue merged commit 16e5820 into apache:master Jan 4, 2022
@rdblue
Contributor

rdblue commented Jan 4, 2022

Thanks, @rajarshisarkar!
