Conversation

@rymurr (Contributor) commented Nov 25, 2020

Also clean up a few items from #1587

: client.getTreeApi().getReferenceByName(requestedRef);
if (ref instanceof Hash) {
LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp);

Contributor:

I would rather fail when this happens than choose one argument to ignore.

Contributor Author:

👍 fixed

if (ref instanceof Hash) {
LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp);
return new UpdateableReference(ref, client.getTreeApi());

Contributor:

We might want to change the name of UpdateableReference because it is misleading in a context like this. The reference here isn't actually updateable, because it is a hash; I think the class is named that way because it can update some references.

List<CommitMeta> ops = client.getTreeApi().getCommitLog(ref.getName()).getOperations();
for (CommitMeta info : ops) {
if (info.getCommitTime() != null && Instant.ofEpochMilli(info.getCommitTime()).isBefore(timestamp)) {
return new UpdateableReference(ImmutableHash.builder().name(info.getHash()).build(), client.getTreeApi());

Contributor:

Won't this return the first reference? Is there a guarantee that the commit log is in reverse order?

Contributor Author:

The commit log will always be in reverse order (similar to how git shows its log), so this will show the most recent commit first and the base commit last. I have added another test to verify the behaviour.


private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),

Contributor:

These references are all in the JVM's local time zone?

Contributor Author:

These are to parse the given ISO timestamp into an Instant. If the given timestamp has a timezone on it, it will be parsed by DATE_TIME with that timezone and relocated to UTC for the Instant. If it doesn't have a timezone, it will be parsed as a UTC timestamp, and if it doesn't have a time at all, it will be given midnight at UTC for that date. I believe this is what happens in the tests; I reran with a variety of interesting timezones to verify.
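
For context, a minimal sketch of how these options might be tried in order (the driving loop is not shown in the PR excerpt, and the formatter/parser field names here are illustrative):

    // assumes java.time.Instant and java.time.format.DateTimeParseException are imported
    private static Instant parseToInstant(String timestamp) {
      for (FormatOptions option : FormatOptions.values()) {
        try {
          // try each format in declaration order, from most to least specific
          return option.parser.apply(option.formatter.parse(timestamp));
        } catch (DateTimeParseException e) {
          // not this format; fall through to the next one
        }
      }
      throw new IllegalArgumentException("Cannot parse timestamp: " + timestamp);
    }

Under this scheme, "2020-12-01" would parse via LOCAL_DATE to 2020-12-01T00:00:00Z, which is exactly the start-of-day behaviour questioned below.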

private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),
LOCAL_DATE(DateTimeFormatter.ISO_LOCAL_DATE, t -> LocalDate.from(t).atStartOfDay(UTC).toInstant());

Contributor:

This doesn't seem specific enough to include. I would expect the ref on some day to be the ref at the end of that day, but I think this would produce the ref at the start of the day.

Contributor Author:

Hmmmm... I am torn. On one hand, I agree: it makes the user's life easier if we add 23:59:59 to the timestamp they specified. On the other hand, I don't like changing what they asked for; if they ask for a date, I think they should be given what they asked for. They can always specify a time if they want EOD, or create a tag to represent EOD. Thoughts?

Contributor:

I'm fine changing this to 23:59:59, but there is also a good argument that it is ambiguous and should be left out. Up to you guys whether you want to do that or not. Sounds like @jacques-n is in favor of the prefix matching behavior.

Contributor Author:

I think prefix matching is the better way to think about it. The datetime arithmetic is a bit fidgety now but I am happy with the test coverage.

String.format("/data/%s.avro", filename);
try (FileAppender<GenericData.Record> writer = Avro.write(Files.localOutput(fileLocation))
-    .schema(schema)
+    .schema(SCHEMA)

Contributor:

In the future, let's separate style changes from behavior or features. It's okay now because this is unreleased so no one is submitting patches or cherry-picking into a branch. But in the future we prefer to avoid commit conflicts by keeping PR changes small and focused.

Contributor Author:

ack

@rymurr (Contributor Author) commented Dec 1, 2020

Thanks for the code review @rdblue, made some updates based on your comments.

@jacques-n commented:

I agree with @rdblue wrt end of day/period.

Writing '#2019' should mean the last value for that period. This should actually be consistent across the board: give me the most recent commit that matches this declaration. Even if you give seconds, if there are two commits within that second, you should get the most recent one. The key here is that the person is giving a year or day reference, not a time reference. Whatever internal resolution we use should not influence what they mean when they express a particular period.

@jacques-n commented:

To put it another way: the table identifier is expressing a pattern. Pick the most recent commit matching the pattern, using a "startswith" behavior.
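
For illustration only, a hedged sketch of that "startswith" idea (not what this PR implements; the PR rounds timestamps instead, as discussed below):

    // render the commit time as ISO-8601 in UTC and compare against the user's pattern
    static boolean matchesPattern(Instant commitTime, String pattern) {
      String iso = commitTime.atZone(ZoneOffset.UTC).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
      return iso.startsWith(pattern); // e.g. "2019" matches any commit made in 2019
    }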

@rymurr (Contributor Author) commented Dec 1, 2020

End of period wins by a vote of 2-1. ;-) Fixed and added some tests.

@jacques-n commented:

What do you think about actually implementing using prefix matching code rather than time conversions? Basically, convert the commit timestamp to iso and then do prefix matching? Too crazy?

@rdblue (Contributor) commented Dec 1, 2020

> Too crazy?

Too crazy.

I think that would hit issues with timestamp precision, whether the date and time are separated by a space or a T, and zone offsets. How would we interpret 2020-12-1? Or 2020-12-01 -07:00?

I prefer strict conversion into times and then adjusting based on precision. If it was a date, then add 1 day, convert to timestamp, and subtract 1 microsecond.
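
A minimal sketch of that adjustment, assuming java.time in UTC (the method name is illustrative, not from the PR):

    // "add 1 day, convert to timestamp, and subtract 1 microsecond"
    static Instant endOfDay(LocalDate date) {
      return date.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant()
          .minus(1, ChronoUnit.MICROS);
    }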

By the way, what time zone are these supposed to be in? Does this need to respect the SQL session time zone?

@rymurr (Contributor Author) commented Dec 2, 2020

> Too crazy.

Hehehe, thanks Ryan. To be fair, though, the current impl doesn't parse confusing timestamps well. I think the correct impl would be to parse both commit times and passed timestamps into ZonedDateTime, then write them as Strings in ISO format; comparing those with prefix matching would then make sense. However, I think that may be overkill at this time, as the round-up mechanism in this PR will satisfy the vast majority of use cases.

@rymurr (Contributor Author) commented Dec 2, 2020

> By the way, what time zone are these supposed to be in? Does this need to respect the SQL session time zone?

Currently it treats everything as UTC; to be honest, I didn't know about spark.sql.session.timeZone. Want me to change it to respect spark.sql.session.timeZone? I am personally biased against doing anything with data outside of UTC, but if other people are using that setting we should probably not surprise them.

@rymurr (Contributor Author) commented Dec 7, 2020

@jacques-n & @rdblue is this good to go now? I think I have addressed everything but may have missed something.

@rdblue (Contributor) commented Dec 7, 2020

> Currently it treats everything as UTC; to be honest, I didn't know about spark.sql.session.timeZone. Want me to change it to respect spark.sql.session.timeZone? I am personally biased against doing anything with data outside of UTC, but if other people are using that setting we should probably not surprise them.

Since these can be embedded in SQL queries, I think it makes the most sense to match what Spark does.

That should be parsing timestamps without an offset or zone using the SQL session time zone. That may default to spark.sql.session.timeZone, but there is usually a way to set it in a session, so if we can get a live value from Spark then we should try to do that.

@rymurr force-pushed the timestamp-support branch from eebf4f9 to 948728a on December 8, 2020.

@rymurr (Contributor Author) commented Dec 8, 2020

Cool, makes sense @rdblue. I have updated to respect the SQL timezone. It is a bit ugly, as we have to employ reflection to get hold of the Spark conf from Nessie.
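
Roughly, the reflection looks like this (a hedged sketch; sparkTimezoneOrUTC matches the name in the diff below, but the body here is illustrative):

    private static ZoneId sparkTimezoneOrUTC() {
      try {
        // no compile-time Spark dependency; look the active session up reflectively
        Class<?> sessionClass = Class.forName("org.apache.spark.sql.SparkSession");
        Object session = sessionClass.getMethod("active").invoke(null);
        Object conf = sessionClass.getMethod("conf").invoke(session);
        Object tz = conf.getClass().getMethod("get", String.class)
            .invoke(conf, "spark.sql.session.timeZone");
        return ZoneId.of(tz.toString());
      } catch (Exception e) {
        return ZoneOffset.UTC; // Spark not on the classpath, or no active session
      }
    }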

@rymurr force-pushed the timestamp-support branch from cb7fc67 to c30bd23 on January 11, 2021.

@rymurr (Contributor Author) commented Jan 11, 2021

I've just rebased and I think all comments have been addressed. Any chance we can close this one out?

@rdblue (Contributor) commented Jan 11, 2021

I'll try to get time to review this in the next few days. I'm wary of adding code that interprets timestamps from user strings, though. I'm just not sure it is a good idea to do this before Spark SQL can support AS OF TIMESTAMP or AS OF VERSION clauses.

*/
public class NessieCatalog extends BaseMetastoreCatalog implements AutoCloseable, SupportsNamespaces, Configurable {
-  private static final Logger logger = LoggerFactory.getLogger(NessieCatalog.class);
+  private static final Logger LOGGER = LoggerFactory.getLogger(NessieCatalog.class);

Contributor:

The standard is to name it LOG

Contributor Author:

Fixed. Perhaps we should add a checkstyle rule for this? It seems to be enforced sporadically in the codebase atm.

if (ref instanceof Hash) {
throw new IllegalArgumentException(String.format("Cannot specify a hash %s and timestamp %s together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp));
}

Contributor:

nit: add space after if and for

Contributor Author:

I am not 100% sure I take your meaning. I have added a newline after the if and for; is that what you meant?

Contributor:

yes, thank you

return new TableReference(identifier, null, null);
}

private enum FormatOptions {

Contributor:

since this is only dealing with datetime, probably better to call it DateTimeFormatOptions

Contributor Author:

fixed

sparkAvailable = false; // spark not on classpath
}
}
if (sparkAvailable != null && sparkAvailable) {

Contributor:

sparkAvailable != null is redundant because it must be set by L206 or L208. Even if it were not set, if (sparkAvailable) should still work.

Contributor Author:

Agreed. I have split it into two booleans now; I don't like relying on the nullability of a Boolean.
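
Roughly, the shape of that refactor (a sketch; the names are illustrative):

    // two primitives instead of one nullable Boolean, so null-unboxing is impossible
    private static boolean sparkChecked = false;
    private static boolean sparkAvailable = false;

    private static boolean isSparkAvailable() {
      if (!sparkChecked) {
        sparkChecked = true;
        try {
          Class.forName("org.apache.spark.sql.SparkSession");
          sparkAvailable = true;
        } catch (ClassNotFoundException e) {
          sparkAvailable = false; // spark not on classpath
        }
      }
      return sparkAvailable;
    }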


private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, ZonedDateTime::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(sparkTimezoneOrUTC())),

Contributor:

Looks like we are relying heavily on Spark functionality through reflection. I am not very familiar with the Nessie module; is this a common pattern? I feel we should make it more generic for other runtime environments.

Contributor Author:

Hey Jack, I am not such a huge fan either. I would prefer to use UTC only, or in a pinch the system timezone. The Spark check is there to respect the case when a user has set the Spark SQL timezone. I haven't found the equivalent parameter in Flink, but either way it gets a bit sticky to check via reflection for all engines' potential parameters.

I think there are a few options:

  1. leave as is and only support Spark for the time being
  2. remove this and add a parameter/option to set this catalog's timezone
  3. only use UTC or system settings

What do you think?

Contributor:

I am fine with the current implementation with the Spark-specific logic and your refactored code; if that is a primary use case you need to cover, this is probably the best way to have it. But we should definitely take a note of this and think about how interactions and dependencies between different modules, especially with different engines, should be handled in the future.

@rymurr force-pushed the timestamp-support branch from 38fea9b to 27a3521 on January 14, 2021.

@rymurr (Contributor Author) commented Jan 14, 2021

Thanks for the review @jackye1995, I have updated based on your comments. The last open question is what to do about timezones.

@rdblue I agree that handling timestamps on our own is icky; however, we will likely end up having to anyway. We would like to support queries that compare tables across versions/times (e.g. a join between table1@time1 and table1@time2) and be able to do this consistently on different engines. I suppose in time we can delegate to the Spark timestamp parsing logic if it can be found via reflection, but in the end we will still have to parse timestamps at this level.

@rymurr (Contributor Author) commented Jan 14, 2021

Looks like the build failure is a flaky test. Will re-trigger with next round of code review

@jackye1995 (Contributor) left a comment:

Sorry for the late response, missed it in emails.
