-
Notifications
You must be signed in to change notification settings - Fork 3k
Support timestamp in partition path #3933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return Literal.of(asString).to(Types.DateType.get()).value(); | ||
| case TIMESTAMP: | ||
| if (!asString.contains("T")) { | ||
| return java.sql.Timestamp.valueOf(asString).getTime() * 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path is something like this ts=2021-01-01 00:00:00.999. It doesn't contain T or Z.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will time zones be confused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the timezone has already been taken care of. For example, if I have row
1, "John Doe", "hr", toTimestamp("2021-01-01T00:00:00.999999999Z"
The partition path is actually ts=2020-12-31 16:00:00.999.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering here if we can find do the same inversion that's in the hive code for parsing this? Or does hive really allow producing either path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that the test only creates one type of timestamp string so do we only need to cover that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be very strict with what we parse and accept here and delegating this work to java.sql.Timestamp seems like a good way to introduce formats that we don't intend to support. I would rather use what is built into StringLiteral and supply different formats. Can you look at using OffsetDateTime.parse and a format for this instead?
|
+1 for the solution. It should be pretty useful in the following scenario. The table only have a timestamp column, but want to partitioned by year/date. The Iceberg doc gives an example about it, https://iceberg.apache.org/#spark-ddl/#partitioned-by. |
|
@huaxingao to be clear here, this is a fix mostly for MigrationTableUtils right? I was looking for any other consumers of the function but I only found that and test code. |
|
@RussellSpitzer After I took a look at |
Yep makes sense to me! Yufei and I were just discussing another bug which we thought might be related (it was not). I think this is fine to fix. I just want to make sure our string parsing is correct, from what I could tell the string that is generated for "timestamp" should be environment dependent since in the Hive code it just uses "value.toString". If we are confident that this covers most use cases I'm ok with merging. |
|
iceberg/api/src/main/java/org/apache/iceberg/PartitionSpec.java Lines 173 to 186 in 12bf61d
|
| }; | ||
|
|
||
| private static Timestamp toTimestamp(String value) { | ||
| return new Timestamp(DateTime.parse(value).getMillis()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a reliable way to convert to a timestamp because Java Date and SQL Timestamp have internal time zone logic. Instead, please use Literal or parse values directly.
| return Literal.of(asString).to(Types.DateType.get()).value(); | ||
| case TIMESTAMP: | ||
| if (!asString.contains("T")) { | ||
| Instant instant = java.sql.Timestamp.valueOf(asString).toInstant(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to do this conversion without using java.sql.Timestamp?
| new StructField("ts", DataTypes.TimestampType, true, Metadata.empty()) | ||
| }; | ||
|
|
||
| private static Timestamp toTimestamp(String value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I see. I was wrong before. This is actually producing a Timestamp because that's what Spark's public row API uses. Instead, could you use SparkSQL and literal values to avoid doing this conversion manually? I think that will make for a much more reliable test.
|
@huaxingao Hi, I wonder if you help continue on this effort? I saw the same issue and I hope to see this get fixed in upstream. Thanks |
|
@puchengy Feel free to take over this PR if you have time to continue. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Partition path containing timestamp such as
ts=2021-01-01 00:00:00.999is not supported and iceberg throws Exception here. It seems to me that timestamp in partition path should be supported and this PR lifts the restriction.