Conversation

@rajarshisarkar
Contributor

@rajarshisarkar rajarshisarkar commented Dec 13, 2021

Add stream-from-timestamp in spark-configuration.md.

Reference: Code

@github-actions github-actions bot added the docs label Dec 13, 2021
Contributor

@kbendick kbendick left a comment

Thanks for adding this @rajarshisarkar

A small nit related to my own inability to comprehend things, so feel free to update it or not. The current description is OK by me too.

But I'm +1 😄

| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds; start a stream from the snapshot that occurs after this timestamp. |
Contributor

Nit: Could you reword the description to be a bit more clear about which snapshot is started from? That was a source of confusion for a while for me.

Maybe "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp"?

It's a minor change, but it took me a while to be sure which snapshot was being referred to even after it was explained to me, so I think it might help others to be more sure of what we mean.

Contributor

Nit: Also, the rest of the descriptions don't seem to end in periods. I'm not sure if that's because they end with configuration keys, but if the other sentences in the same column don't end with periods, could you remove the trailing period?

Contributor Author

Hi @kbendick

Thanks for reviewing!

stream-from-timestamp would actually stream the table starting from the first snapshot whose timestamp is >= the given timestamp. Here's an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L109-L117

Hence, I have kept the description as "Timestamp in milliseconds, start streaming this table from the first snapshot that occurs strictly after this timestamp."

Please let me know if we can phrase it better.
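
For anyone following along, the ">= timestamp" selection discussed here can be sketched roughly as below. This is a minimal illustration only, not the actual SnapshotUtil implementation; the function name and the `(snapshot_id, timestamp_millis)` data shape are made up for the example.

```python
# Hypothetical sketch of the snapshot-selection behavior discussed above.
# Not the real Iceberg code; see SnapshotUtil.java (linked above) for that.
def first_snapshot_at_or_after(snapshots, start_ts_millis):
    """Return the id of the first snapshot whose timestamp is >= start_ts_millis.

    `snapshots` is a list of (snapshot_id, timestamp_millis) pairs in commit
    order, oldest first. Returns None if every snapshot is older than the
    requested timestamp.
    """
    for snapshot_id, ts_millis in snapshots:
        if ts_millis >= start_ts_millis:
            return snapshot_id
    return None


history = [(101, 1000), (102, 2000), (103, 3000)]
first_snapshot_at_or_after(history, 1500)  # -> 102 (first snapshot at or after 1500)
first_snapshot_at_or_after(history, 500)   # -> 101 (timestamp before the oldest snapshot)
```

Note that a timestamp earlier than the oldest snapshot simply resolves to the oldest snapshot under this comparison, which matches the behavior described later in the thread.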

Contributor

Ah, OK, cool. I knew I got confused by that somehow, so I appreciate you detailing it for me. If it's greater than or equal, then "the first snapshot that occurs at or after this timestamp" is probably fine. "Strictly after" implies strictly greater than. That was my mistake. I apologize.

Contributor Author

Makes sense! I have made the changes.

@rajarshisarkar
Contributor Author

@kbendick Can you please review once you get some time?

Contributor

@kbendick kbendick left a comment

This looks great. Thank you for the contributions and thank you for the updates @rajarshisarkar! Sorry for the delay in response.

| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost |
| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled |
| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size |
| stream-from-timestamp | Long.MIN_VALUE | Timestamp in milliseconds, start streaming this table from the first snapshot that occurs at or after this timestamp |
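
As context for how the option in the row above is consumed, here is a minimal usage sketch. It assumes a running SparkSession `spark` with the Iceberg runtime configured and an example table name `db.table`; the timestamp value is illustrative.

```python
# Illustrative sketch only: assumes a SparkSession `spark` with the Iceberg
# runtime on the classpath, and an example table `db.table`.
stream_start_millis = 1640995200000  # example epoch timestamp, in milliseconds

df = (
    spark.readStream
    .format("iceberg")
    .option("stream-from-timestamp", str(stream_start_millis))  # option from the table above
    .load("db.table")
)
```

Streaming then begins from the first snapshot at or after `stream_start_millis`, per the description in the table.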
Contributor

@rdblue rdblue Dec 20, 2021

I think that the default is not set, so the default behavior is not to stream from a timestamp but to stream from the oldest snapshot.

Contributor

You may also want to update this to "first known ancestor snapshot" and add a note:

!!! Note
    If `stream-from-timestamp` is before the oldest ancestor snapshot in the table, the oldest ancestor will be used.

Contributor Author

Thanks for the inputs, @rdblue.

The default value Long.MIN_VALUE is set here: https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java#L215-L220

I have updated the wording to "first known ancestor snapshot".

I have added the note about the default behavior. Do let me know if I should keep the note in the default column and remove Long.MIN_VALUE.

Contributor

Please remove Long.MIN_VALUE. While that's in the code, we want to document behavior, not the specific implementation.

Contributor Author

Thanks for the clarification, @rdblue. I have pushed the change.

@kbendick kbendick self-requested a review December 20, 2021 23:57
@kbendick
Contributor

This PR is likely to be affected by changes made in PR #3775.

Removed my approval until that settles.

@rdblue
Contributor

rdblue commented Dec 21, 2021

I think this needs a note about the behavior. Once that's in I can merge.

@rdblue rdblue merged commit 16e5820 into apache:master Jan 4, 2022
@rdblue
Contributor

rdblue commented Jan 4, 2022

Thanks, @rajarshisarkar!
