Conversation

@rymurr (Contributor) commented Nov 25, 2020

Also clean up a few items from #1587

: client.getTreeApi().getReferenceByName(requestedRef);
if (ref instanceof Hash) {
LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp);

Contributor:

I would rather fail when this happens than choose one argument to ignore.

Contributor Author:

👍 fixed

if (ref instanceof Hash) {
LOGGER.warn("Cannot specify a hash {} and timestamp {} together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp);
return new UpdateableReference(ref, client.getTreeApi());

Contributor:

We might want to change the name of UpdateableReference because it is misleading in a context like this. The reference here isn't actually updateable, because it is a hash; I think the class is named that way because it can update some references.

List<CommitMeta> ops = client.getTreeApi().getCommitLog(ref.getName()).getOperations();
for (CommitMeta info : ops) {
if (info.getCommitTime() != null && Instant.ofEpochMilli(info.getCommitTime()).isBefore(timestamp)) {
return new UpdateableReference(ImmutableHash.builder().name(info.getHash()).build(), client.getTreeApi());

Contributor:

Won't this return the first reference? Is there a guarantee that the commit log is in reverse order?

Contributor Author:

The commit log will always be in reverse order (similar to how git shows its log), so this will show the most recent commit first and the base commit last. I have added another test to verify the behaviour.


private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),

Contributor:

These references are all in the JVM's local time zone?

Contributor Author:

These are to parse the given ISO timestamp into an Instant. If the given timestamp has a timezone on it, it will be parsed by DATE_TIME with that timezone and relocated to UTC for the Instant. If it doesn't have a timezone, it will be parsed as a UTC timestamp, and if it doesn't have a time at all, it will be given midnight at UTC for that date. I believe this is what happens in the tests; I reran with a variety of interesting timezones to verify.
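
For context, a minimal sketch of how these options might be tried in order (the driving loop is not shown in the PR excerpt, and the formatter/parser field names here are illustrative):

    // assumes java.time.Instant and java.time.format.DateTimeParseException are imported
    private static Instant parseToInstant(String timestamp) {
      for (FormatOptions option : FormatOptions.values()) {
        try {
          // try each format in declaration order, from most to least specific
          return option.parser.apply(option.formatter.parse(timestamp));
        } catch (DateTimeParseException e) {
          // not this format; fall through to the next one
        }
      }
      throw new IllegalArgumentException("Cannot parse timestamp: " + timestamp);
    }

Under this scheme, "2020-12-01" would parse via LOCAL_DATE to 2020-12-01T00:00:00Z, which is exactly the start-of-day behaviour questioned below.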

private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, Instant::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(UTC).toInstant()),
LOCAL_DATE(DateTimeFormatter.ISO_LOCAL_DATE, t -> LocalDate.from(t).atStartOfDay(UTC).toInstant());

Contributor:

This doesn't seem specific enough to include. I would expect the ref on some day to be the ref at the end of that day, but I think this would produce the ref at the start of the day.

Contributor Author:

Hmmmm... I am torn. On one hand, I agree: it makes the user's life easier if we add 23:59:59 to the timestamp they specified. On the other hand, I don't like changing what they asked for; if they ask for a date, I think they should be given what they asked for. They can always specify a time if they want EOD, or create a tag to represent EOD. Thoughts?

Contributor:

I'm fine changing this to 23:59:59, but there is also a good argument that it is ambiguous and should be left out. Up to you guys whether you want to do that or not. Sounds like @jacques-n is in favor of the prefix matching behavior.

Contributor Author:

I think prefix matching is the better way to think about it. The datetime arithmetic is a bit fidgety now but I am happy with the test coverage.

String.format("/data/%s.avro", filename);
try (FileAppender<GenericData.Record> writer = Avro.write(Files.localOutput(fileLocation))
-    .schema(schema)
+    .schema(SCHEMA)

Contributor:

In the future, let's separate style changes from behavior or features. It's okay now because this is unreleased so no one is submitting patches or cherry-picking into a branch. But in the future we prefer to avoid commit conflicts by keeping PR changes small and focused.

Contributor Author:

ack

@rymurr (Contributor Author) commented Dec 1, 2020

Thanks for the code review @rdblue, made some updates based on your comments.

@jacques-n commented:

I agree with @rdblue wrt end of day/period.

Writing '#2019' should mean the last value for that period. This should actually be consistent across the board: give me the most recent commit that matches this declaration. Even if you give seconds, if there are two commits within that second, you should get the most recent one. The key here is that the person is giving a year or day reference, not a time reference. Whatever internal resolution we use should not influence what they mean when they express a particular period.

@jacques-n commented:

To put it another way: the table identifier is expressing a pattern. Pick the most recent commit matching the pattern, using a "startswith" behavior.
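
For illustration only, a hedged sketch of that "startswith" idea (not what this PR implements; the PR rounds timestamps instead, as discussed below):

    // render the commit time as ISO-8601 in UTC and compare against the user's pattern
    static boolean matchesPattern(Instant commitTime, String pattern) {
      String iso = commitTime.atZone(ZoneOffset.UTC).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
      return iso.startsWith(pattern); // e.g. "2019" matches any commit made in 2019
    }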

@rymurr (Contributor Author) commented Dec 1, 2020

End of period wins by a vote of 2-1. ;-) Fixed and added some tests.

@jacques-n commented:

What do you think about actually implementing using prefix matching code rather than time conversions? Basically, convert the commit timestamp to iso and then do prefix matching? Too crazy?

@rdblue (Contributor) commented Dec 1, 2020

> Too crazy?

Too crazy.

I think that would hit issues with timestamp precision, whether the date and time are separated by a space or a T, and zone offsets. How would we interpret 2020-12-1? Or 2020-12-01 -07:00?

I prefer strict conversion into times and then adjusting based on precision. If it was a date, then add 1 day, convert to timestamp, and subtract 1 microsecond.
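
A minimal sketch of that adjustment, assuming java.time in UTC (the method name is illustrative, not from the PR):

    // "add 1 day, convert to timestamp, and subtract 1 microsecond"
    static Instant endOfDay(LocalDate date) {
      return date.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant()
          .minus(1, ChronoUnit.MICROS);
    }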

By the way, what time zone are these supposed to be in? Does this need to respect the SQL session time zone?

@rymurr (Contributor Author) commented Dec 2, 2020

> Too crazy.

Hehehe, thanks Ryan. To be fair, though, the current impl doesn't parse confusing timestamps well. I think the correct impl would be to parse both commit times and passed timestamps into ZonedDateTime, then write them as Strings in ISO format; comparing those with prefix matching would then make sense. However, I think that may be overkill at this time, as the round-up mechanism in this PR will satisfy the vast majority of use cases.

@rymurr (Contributor Author) commented Dec 2, 2020

> By the way, what time zone are these supposed to be in? Does this need to respect the SQL session time zone?

Currently it treats everything as UTC; to be honest, I didn't know about spark.sql.session.timeZone. Want me to change it to respect spark.sql.session.timeZone? I am personally biased against doing anything with data outside of UTC, but if other people are using that setting we should probably not surprise them.

@rymurr (Contributor Author) commented Dec 7, 2020

@jacques-n & @rdblue is this good to go now? I think I have addressed everything but may have missed something.

@rdblue (Contributor) commented Dec 7, 2020

> Currently it treats everything as UTC; to be honest, I didn't know about spark.sql.session.timeZone. Want me to change it to respect spark.sql.session.timeZone? I am personally biased against doing anything with data outside of UTC, but if other people are using that setting we should probably not surprise them.

Since these can be embedded in SQL queries, I think it makes the most sense to match what Spark does.

That should be parsing timestamps without an offset or zone using the SQL session time zone. That may default to spark.sql.session.timeZone, but there is usually a way to set it in a session, so if we can get a live value from Spark then we should try to do that.

@rymurr force-pushed the timestamp-support branch from eebf4f9 to 948728a on December 8, 2020.

@rymurr (Contributor Author) commented Dec 8, 2020

Cool, makes sense @rdblue. I have updated to respect the SQL timezone. It is a bit ugly, as we have to employ reflection to get hold of the Spark conf from Nessie.
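
Roughly, the reflection looks like this (a hedged sketch; sparkTimezoneOrUTC matches the name in the diff below, but the body here is illustrative):

    private static ZoneId sparkTimezoneOrUTC() {
      try {
        // no compile-time Spark dependency; look the active session up reflectively
        Class<?> sessionClass = Class.forName("org.apache.spark.sql.SparkSession");
        Object session = sessionClass.getMethod("active").invoke(null);
        Object conf = sessionClass.getMethod("conf").invoke(session);
        Object tz = conf.getClass().getMethod("get", String.class)
            .invoke(conf, "spark.sql.session.timeZone");
        return ZoneId.of(tz.toString());
      } catch (Exception e) {
        return ZoneOffset.UTC; // Spark not on the classpath, or no active session
      }
    }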

@rymurr force-pushed the timestamp-support branch from cb7fc67 to c30bd23 on January 11, 2021.

@rymurr (Contributor Author) commented Jan 11, 2021

I've just rebased and I think all comments have been addressed. Any chance we can close this one out?

@rdblue (Contributor) commented Jan 11, 2021

I'll try to get time to review this in the next few days. I'm wary of adding code that interprets timestamps from user strings, though. I'm just not sure it is a good idea to do this before Spark SQL can support AS OF TIMESTAMP or AS OF VERSION clauses.

*/
public class NessieCatalog extends BaseMetastoreCatalog implements AutoCloseable, SupportsNamespaces, Configurable {
-  private static final Logger logger = LoggerFactory.getLogger(NessieCatalog.class);
+  private static final Logger LOGGER = LoggerFactory.getLogger(NessieCatalog.class);

Contributor:

The standard is to name it LOG

Contributor Author:

Fixed. Perhaps we should add a checkstyle rule for this? It seems to be enforced sporadically in the codebase atm.

if (ref instanceof Hash) {
throw new IllegalArgumentException(String.format("Cannot specify a hash %s and timestamp %s together. " +
"The timestamp is redundant and has been ignored", requestedRef, timestamp));
}

Contributor:

nit: add space after if and for

Contributor Author:

I am not 100% sure I take your meaning. I have added a newline after the if and for; is that what you meant?

Contributor:

yes, thank you

return new TableReference(identifier, null, null);
}

private enum FormatOptions {

Contributor:

since this is only dealing with datetime, probably better to call it DateTimeFormatOptions

Contributor Author:

fixed

sparkAvailable = false; // spark not on classpath
}
}
if (sparkAvailable != null && sparkAvailable) {

Contributor:

sparkAvailable != null is redundant because it must be set by L206 or L208. Even if it were not set, if (sparkAvailable) should still work.

Contributor Author:

Agreed. I have split it into two booleans now; I don't like relying on the nullability of a Boolean.
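
Roughly, the shape of that refactor (a sketch; the names are illustrative):

    // two primitives instead of one nullable Boolean, so null-unboxing is impossible
    private static boolean sparkChecked = false;
    private static boolean sparkAvailable = false;

    private static boolean isSparkAvailable() {
      if (!sparkChecked) {
        sparkChecked = true;
        try {
          Class.forName("org.apache.spark.sql.SparkSession");
          sparkAvailable = true;
        } catch (ClassNotFoundException e) {
          sparkAvailable = false; // spark not on classpath
        }
      }
      return sparkAvailable;
    }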


private enum FormatOptions {
DATE_TIME(DateTimeFormatter.ISO_DATE_TIME, ZonedDateTime::from),
LOCAL_DATE_TIME(DateTimeFormatter.ISO_LOCAL_DATE_TIME, t -> LocalDateTime.from(t).atZone(sparkTimezoneOrUTC())),

Contributor:

Looks like we are relying heavily on Spark functionality through reflection. I am not very familiar with the Nessie module; is this a common pattern? I feel we should make it more generic for other runtime environments.

Contributor Author:

Hey Jack, I am not such a huge fan either. I would prefer to use UTC only, or in a pinch the system timezone. The Spark check is there to respect the case when a user has set the Spark SQL timezone. I haven't found the equivalent parameter in Flink, but either way it gets a bit sticky to check via reflection for all engines' potential parameters.

I think there are a few options:

  1. leave as is and only support Spark for the time being
  2. remove this and add a parameter/option to set this catalog's timezone
  3. only use UTC or system settings

What do you think?

Contributor:

I am fine with the current implementation with the Spark-specific logic and your refactored code; if that is a primary use case you need to cover, this is probably the best way to have it. But we should definitely take a note of this and think about how interactions and dependencies between different modules, especially with different engines, should be handled in the future.

@rymurr force-pushed the timestamp-support branch from 38fea9b to 27a3521 on January 14, 2021.

@rymurr (Contributor Author) commented Jan 14, 2021

Thanks for the review @jackye1995, I have updated based on your comments. The last open question is what to do about timezones.

@rdblue I agree that handling timestamps on our own is icky; however, we will likely end up having to anyway. We would like to support queries that compare tables across versions/times (e.g. a join between table1@time1 and table1@time2) and be able to do this consistently on different engines. I suppose in time we can delegate to the Spark timestamp parsing logic if it can be found via reflection, but in the end we will still have to parse timestamps at this level.

@rymurr (Contributor Author) commented Jan 14, 2021

Looks like the build failure is a flaky test. Will re-trigger with next round of code review

@jackye1995 (Contributor) left a comment:

Sorry for the late response, missed it in emails.
