Add timestamp to table definition in Nessie Catalog #1825
```java
    : client.getTreeApi().getReferenceByName(requestedRef);
if (ref instanceof Hash) {
  LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
      "The timestamp is redundant and has been ignored", requestedRef, timestamp);
```
I would rather fail when this happens than choose one argument to ignore.
👍 fixed
```java
if (ref instanceof Hash) {
  LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
      "The timestamp is redundant and has been ignored", requestedRef, timestamp);
  return new UpdateableReference(ref, client.getTreeApi());
```
We might want to change the name of UpdateableReference because it is misleading in a context like this. The reference here isn't actually updateable because it is a hash. I think the class is named that way because it can update some references.
```java
List<CommitMeta> ops = client.getTreeApi().getCommitLog(ref.getName()).getOperations();
for (CommitMeta info : ops) {
  if (info.getCommitTime() != null && Instant.ofEpochMilli(info.getCommitTime()).isBefore(timestamp)) {
    return new UpdateableReference(ImmutableHash.builder().name(info.getHash()).build(), client.getTreeApi());
```
Won't this return the first reference? Is there a guarantee that the commit log is in reverse order?
The commit log will always be in reverse order (similar to how git shows its log), so the most recent commit appears first and the base commit last. I have added another test to verify the behaviour.
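For illustration, a check of that ordering assumption might look like the following sketch (a hypothetical test body, assuming the same `client` fixture as the hunk above and non-null commit times):

```java
// Sketch only: the log lists the newest commit first, so the loop above returns
// the most recent commit that is older than the requested timestamp.
List<CommitMeta> log = client.getTreeApi().getCommitLog("main").getOperations();
for (int i = 1; i < log.size(); i++) {
  // commit times are epoch millis; each entry should be no newer than the previous one
  Assert.assertTrue(log.get(i - 1).getCommitTime() >= log.get(i).getCommitTime());
}
```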
```java
private enum FormatOptions {
  DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
  LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),
```
These references are all in the JVM's local time zone?
These are to parse the given ISO timestamp into an Instant. If the given timestamp has a timezone on it, it will be parsed by DATE_TIME with that timezone and relocated to UTC for the Instant. If it doesn't have a timezone, it will be parsed as a UTC timestamp, and if it doesn't have a time at all, it will be given midnight at UTC for that date. I believe this is what happens in the tests; I reran with a variety of interesting timezones to verify.
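A rough sketch of that fallback chain (assuming each `FormatOptions` constant pairs a `DateTimeFormatter` with a converter to `Instant`, as in the hunk above; the `formatter` and `converter` field names are illustrative):

```java
// Try each ISO format in declaration order; the first one that parses wins.
static Instant parseTimestamp(String value) {
  for (FormatOptions option : FormatOptions.values()) {
    try {
      TemporalAccessor parsed = option.formatter.parse(value);
      return option.converter.apply(parsed); // e.g. "2020-06-01" -> 2020-06-01T00:00:00Z
    } catch (DateTimeException e) {
      // this format didn't match; fall through to the next, more lenient one
    }
  }
  throw new IllegalArgumentException("Cannot parse timestamp: " + value);
}
```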
```java
private enum FormatOptions {
  DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
  LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),
  LOCAL_DATE(DateTimeFormatter.ISO_LOCAL_DATE, t -> LocalDate.from(t).atStartOfDay(UTC).toInstant());
```
This doesn't seem specific enough to include. I would expect the ref on some day to be the ref at the end of that day, but I think this would produce the ref at the start of the day.
Hmmmm... I am torn. On one hand I agree: it makes the user's life easier if we add 23:59:59 to the timestamp they specified. However, I don't like changing what they asked for; if they ask for a date, I think they should be given what they asked for. They can always specify a time if they want EOD, or they can create a tag to represent EOD. Thoughts?
I'm fine changing this to 23:59:59, but there is also a good argument that it is ambiguous and should be left out. Up to you guys whether you want to do that or not. Sounds like @jacques-n is in favor of the prefix matching behavior.
I think prefix matching is the better way to think about it. The datetime arithmetic is a bit fidgety now but I am happy with the test coverage.
| String.format("/data/%s.avro", filename); | ||
| try (FileAppender<GenericData.Record> writer = Avro.write(Files.localOutput(fileLocation)) | ||
| .schema(schema) | ||
| .schema(SCHEMA) |
In the future, let's separate style changes from behavior or features. It's okay now because this is unreleased so no one is submitting patches or cherry-picking into a branch. But in the future we prefer to avoid commit conflicts by keeping PR changes small and focused.
ack
Force-pushed 19311d3 to 75b954e.
Thanks for the code review @rdblue, made some updates based on your comments.
I agree with @rdblue wrt end of day/period. Writing '#2019' should mean "what is the last value for that period". This should actually be consistent across the board: give me the most recent commit that matches this declaration. Even if you give seconds, if there are two commits within that second, you should get the most recent one. The key here is that the person is giving a year or day reference, not a time reference. Whatever internal resolution we use should not influence what they mean when they express a particular period.

To put it another way: the table identifier is expressing a pattern. Pick the most recent commit matching the pattern using a "startswith" behavior.
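Concretely, under that reading (an illustration, using the `#` separator from the '#2019' example above):

```java
// table#2019                -> most recent commit in 2019
// table#2019-06             -> most recent commit in June 2019
// table#2019-06-01          -> most recent commit on that day
// table#2019-06-01T12:30:15 -> most recent commit within that second
```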
End of period wins by a vote of 2-1. ;-) Fixed and added some tests.
What do you think about actually implementing this using prefix-matching code rather than time conversions? Basically, convert the commit timestamp to ISO and then do prefix matching? Too crazy?
Too crazy. I think that would hit issues with timestamp precision and with whether date and time are separated by a space or a `T`. I prefer strict conversion into times and then adjusting based on precision: if it was a date, then add 1 day, convert to a timestamp, and subtract 1 microsecond.

By the way, what time zone are these supposed to be in? Does this need to respect the SQL session time zone?
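A minimal sketch of that date adjustment, assuming UTC for the moment:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class EndOfDay {
  // A date-only reference resolves to the last instant of that day.
  static Instant endOfDay(LocalDate date) {
    return date.plusDays(1)              // start of the following day...
        .atStartOfDay(ZoneOffset.UTC)
        .toInstant()
        .minus(1, ChronoUnit.MICROS);    // ...minus 1 microsecond
  }
}
// endOfDay(LocalDate.parse("2019-06-01")) == 2019-06-01T23:59:59.999999Z
```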
Hehehe, thanks Ryan. To be fair, though, the current impl doesn't parse confusing timestamps well. I think the correct impl would be to parse both the commit time and the passed timestamp into `Instant`s.
Currently it treats everything as UTC; to be honest, I didn't know about the SQL session time zone.
@jacques-n & @rdblue is this good to go now? I think I have addressed everything but may have missed something. |
Since these can be embedded in SQL queries, I think it makes the most sense to match what Spark does. That should be parsing timestamps without an offset or zone using the SQL session time zone. That may default to the JVM's default time zone.
Force-pushed eebf4f9 to 948728a.
Cool, makes sense @rdblue. I have updated the code to respect the SQL time zone. It is a bit ugly, as we have to employ reflection to get hold of the Spark conf from Nessie.
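Roughly, the reflection looks like this (a sketch, not the exact code in the PR; it assumes Spark's public `SparkSession.active().conf()` API is present at runtime):

```java
import java.time.ZoneId;
import java.time.ZoneOffset;

// Read spark.sql.session.timeZone without a compile-time dependency on Spark.
static ZoneId sparkTimezoneOrUTC() {
  try {
    Class<?> sessions = Class.forName("org.apache.spark.sql.SparkSession");
    Object session = sessions.getMethod("active").invoke(null);
    Object conf = session.getClass().getMethod("conf").invoke(session);
    Object zone = conf.getClass().getMethod("get", String.class, String.class)
        .invoke(conf, "spark.sql.session.timeZone", "UTC");
    return ZoneId.of(zone.toString());
  } catch (Exception e) {
    return ZoneOffset.UTC; // no Spark on the classpath, or no active session
  }
}
```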
Force-pushed cb7fc67 to c30bd23.
I've just rebased and I think all comments have been addressed. Any chance we can close this one out?
I'll try to get time to review this in the next few days. I'm wary of adding code that interprets timestamps from user strings, though. I'm just not sure it is a good idea to do this before Spark SQL can support `AS OF` time travel.
```diff
  */
 public class NessieCatalog extends BaseMetastoreCatalog implements AutoCloseable, SupportsNamespaces, Configurable {
-  private static final Logger logger = LoggerFactory.getLogger(NessieCatalog.class);
+  private static final Logger LOGGER = LoggerFactory.getLogger(NessieCatalog.class);
```
The standard is to name it LOG
Fixed. Perhaps we should add a checkstyle rule for this? It seems to be enforced sporadically in the codebase atm.
```java
if (ref instanceof Hash) {
  throw new IllegalArgumentException(String.format("Cannot specify a hash %s and timestamp %s together. " +
      "The timestamp is redundant and has been ignored", requestedRef, timestamp));
}
```
nit: add space after if and for
I am not 100% sure I take your meaning. I have added a newline after the if and for; is that what you meant?
yes, thank you
```java
  return new TableReference(identifier, null, null);
}

private enum FormatOptions {
```
since this is only dealing with datetime, probably better to call it DateTimeFormatOptions
fixed
```java
    sparkAvailable = false; // spark not on classpath
  }
}
if (sparkAvailable != null && sparkAvailable) {
```
`sparkAvailable != null` is redundant because `sparkAvailable` must be set by L206 or L208, so plain `if (sparkAvailable)` should still work.
Agreed. I have split it into two booleans now; I don't like relying on the nullability of a Boolean.
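Something like the following sketch (names are illustrative, not the PR's exact code):

```java
// Two primitive booleans instead of one nullable Boolean: no unboxing NPE possible.
private static boolean sparkChecked = false;
private static boolean sparkAvailable = false;

private static boolean isSparkAvailable() {
  if (!sparkChecked) {
    try {
      Class.forName("org.apache.spark.sql.SparkSession");
      sparkAvailable = true;
    } catch (ClassNotFoundException e) {
      sparkAvailable = false; // spark not on classpath
    }
    sparkChecked = true;
  }
  return sparkAvailable;
}
```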
```java
private enum FormatOptions {
  DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, ZonedDateTime::from),
  LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(sparkTimezoneOrUTC())),
```
Looks like we are relying heavily on Spark functionality through reflection. I am not very familiar with the Nessie module; is this a common pattern? I feel we should make it more generic for other runtime environments.
Hey Jack, I am not such a huge fan either. I would prefer to use UTC only, or in a pinch the system timezone. The Spark check is there to respect the case when a user has set the Spark SQL timezone. I haven't found the equivalent parameter in Flink, but either way it gets a bit sticky to check via reflection for every engine's potential parameters.
I think there are a few options:
- leave as is and only support Spark for the time being
- remove this and add a parameter/option to set this catalog's timezone
- only use UTC or system settings
What do you think?
I am fine with the current implementation with Spark-specific logic and your refactored code; if that is a primary use case you need to cover, this is probably the best way to have it. But we should definitely take note of this and think about how interactions and dependencies between different modules, especially with different engines, should be handled in the future.
Force-pushed 38fea9b to 27a3521.
Thanks for the review @jackye1995, I have updated based on your comments. The last question is what to do about timezones. @rdblue, I agree that handling timestamps on our own is icky; however, we will likely end up having to anyway. We would like to be able to support queries that compare tables across versions/times (e.g. a join between the same table at two different timestamps or branches).
Looks like the build failure is a flaky test. Will re-trigger with the next round of code review.
jackye1995 left a comment:
Sorry for the late response, missed it in emails.
Also clean up a few items from #1587