Support timestamp in partition path #3933

huaxingao · 2022-01-19T23:45:30Z

A partition path containing a timestamp, such as ts=2021-01-01 00:00:00.999, is not supported, and Iceberg throws an exception here. It seems to me that timestamps in partition paths should be supported, and this PR lifts the restriction.

huaxingao · 2022-01-19T23:53:43Z

api/src/main/java/org/apache/iceberg/types/Conversions.java

        return Literal.of(asString).to(Types.DateType.get()).value();
+      case TIMESTAMP:
+        if (!asString.contains("T")) {
+          return java.sql.Timestamp.valueOf(asString).getTime() * 1000;


The path looks something like ts=2021-01-01 00:00:00.999. It doesn't contain T or Z.

hililiwei · 2022-01-20

Will time zones be confused?

huaxingao · 2022-01-20

My understanding is that the time zone has already been taken care of. For example, if I have the row

1, "John Doe", "hr", toTimestamp("2021-01-01T00:00:00.999999999Z")

the partition path is actually ts=2020-12-31 16:00:00.999.

RussellSpitzer · 2022-01-21

I was wondering whether we can do the same inversion that's in the Hive code for parsing this. Or does Hive really allow producing either path?

I notice that the test only creates one type of timestamp string, so do we only need to cover that?

rdblue · 2022-01-24

We need to be very strict about what we parse and accept here, and delegating this work to java.sql.Timestamp seems like a good way to introduce formats that we don't intend to support. I would rather use what is built into StringLiteral and supply different formats. Can you look at using OffsetDateTime.parse and a format for this instead?
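
To make the "strict format" suggestion concrete, here is a minimal sketch of parsing the path value with an explicit java.time formatter instead of java.sql.Timestamp. The class and method names are hypothetical and not part of Iceberg. It assumes the path value is a zone-less local timestamp measured against 1970-01-01T00:00:00, and that digits beyond microseconds may be truncated.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.ResolverStyle;
import java.time.temporal.ChronoField;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration only; not an Iceberg class.
final class PartitionTimestampParser {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  // Accepts exactly "uuuu-MM-dd HH:mm:ss" followed by an optional
  // fraction of 1 to 9 digits. Anything else is rejected.
  private static final DateTimeFormatter PATH_FORMAT = new DateTimeFormatterBuilder()
      .appendPattern("uuuu-MM-dd HH:mm:ss")
      .optionalStart()
      .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
      .optionalEnd()
      .toFormatter()
      .withResolverStyle(ResolverStyle.STRICT);

  // Returns microseconds from the epoch for a zone-less timestamp string.
  // No JVM default time zone is consulted (assumption: the value is
  // treated as a local timestamp, not an instant).
  static long toMicros(String pathValue) {
    LocalDateTime parsed = LocalDateTime.parse(pathValue, PATH_FORMAT);
    return ChronoUnit.MICROS.between(EPOCH, parsed);
  }

  public static void main(String[] args) {
    System.out.println(toMicros("2021-01-01 00:00:00.999"));
  }
}
```

Unlike java.sql.Timestamp.valueOf, whose Javadoc allows the leading zero of the month and day to be omitted, this formatter throws DateTimeParseException for inputs such as "2021-1-1 0:0:0" or an ISO string containing T.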

flyrain · 2022-01-20T21:38:35Z

+1 for the solution. It should be pretty useful in the following scenario: the table only has a timestamp column, but the user wants to partition by year or date.

CREATE TABLE %s (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (years(ts))
CREATE TABLE %s (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (date(ts))

The Iceberg docs give an example of this: https://iceberg.apache.org/#spark-ddl/#partitioned-by.

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

RussellSpitzer · 2022-01-21T05:24:57Z

@huaxingao to be clear here, this is a fix mostly for MigrationTableUtils right? I was looking for any other consumers of the function but I only found that and test code.

huaxingao · 2022-01-22T04:11:32Z

@RussellSpitzer
Sorry for the late reply. Yes, the fix is mostly for MigrationTableUtils. It is for this path: DataFiles.withPartitionPath -> DataFiles.fillFromPath -> Conversions.fromPartitionString.

After taking a look at PartitionSpec, I feel that timestamp is intentionally blocked in Conversions.fromPartitionString, because the PartitionSpec Builder has year, month, day, hour, but no timestamp. I guess timestamp as a partition column is not supported because it doesn't have any real usage. However, creating a table with a timestamp partition column, for example
CREATE TABLE test (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (ts)
succeeds, and users don't know that the partition column ts is not working. So I feel that we should probably support partitioning by timestamp. If not, we should probably block creating tables partitioned by timestamp.

RussellSpitzer · 2022-01-24T13:52:45Z

Yep, makes sense to me! Yufei and I were just discussing another bug which we thought might be related (it was not). I think this is fine to fix. I just want to make sure our string parsing is correct. From what I could tell, the string that is generated for "timestamp" should be environment dependent, since the Hive code just uses "value.toString". If we are confident that this covers most use cases, I'm OK with merging.
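
As a rough illustration of why a toString-based path value can be environment dependent, the sketch below renders the same instant through java.sql.Timestamp.toString under two JVM default time zones. The class name and the choice of zones are assumptions made for the example; this is not the Hive code itself.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

// Hypothetical demo class; it only shows how Timestamp.toString behaves.
final class TimestampToStringDemo {
  // Renders the given epoch milliseconds with the JVM default time zone
  // temporarily set to zoneId, then restores the previous default.
  static String render(long epochMillis, String zoneId) {
    TimeZone saved = TimeZone.getDefault();
    try {
      TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
      return new Timestamp(epochMillis).toString();
    } finally {
      TimeZone.setDefault(saved);
    }
  }

  public static void main(String[] args) {
    long instant = 1609459200999L; // 2021-01-01T00:00:00.999Z
    System.out.println(render(instant, "UTC"));
    System.out.println(render(instant, "America/Los_Angeles"));
  }
}
```

The same stored value produces a different string in each zone, so any parser that reads such a string back needs to know which zone produced it.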

RussellSpitzer · 2022-01-24T19:29:52Z

iceberg/api/src/main/java/org/apache/iceberg/PartitionSpec.java

Lines 173 to 186 in 12bf61d

    
    public String partitionToPath(StructLike data) {
      StringBuilder sb = new StringBuilder();
      Class<?>[] javaClasses = javaClasses();
      for (int i = 0; i < javaClasses.length; i += 1) {
        PartitionField field = fields[i];
        String valueString = field.transform().toHumanString(get(data, i, javaClasses[i]));
        if (i > 0) {
          sb.append("/");
        }
        sb.append(field.name()).append("=").append(escape(valueString));
      }
      return sb.toString();
    }

// Iceberg Code (which escapes string value)

https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java#L[…]6 // Hive Code
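
To show what escaping does to a timestamp value in a path, here is a small standalone sketch of the name=value joining shown above. It assumes URL-style encoding (java.net.URLEncoder) as a stand-in for the escape step; the actual Iceberg and Hive escaping rules may differ, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the Iceberg or Hive implementation.
final class PartitionPathSketch {
  // Assumption: URL encoding stands in for the real escape() method.
  static String escape(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }

  // Joins name=value pairs with "/" in the map's iteration order.
  static String toPath(Map<String, String> partition) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> entry : partition.entrySet()) {
      if (sb.length() > 0) {
        sb.append("/");
      }
      sb.append(entry.getKey()).append("=").append(escape(entry.getValue()));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> partition = new LinkedHashMap<>();
    partition.put("dept", "hr");
    partition.put("ts", "2021-01-01 00:00:00.999");
    System.out.println(toPath(partition));
  }
}
```

Under this assumption the space and colons in the timestamp are encoded, so a reader of the path has to unescape the value before parsing it as a timestamp.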

rdblue · 2022-01-24T19:38:53Z

...park-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

+  };
+
+  private static Timestamp toTimestamp(String value) {
+    return new Timestamp(DateTime.parse(value).getMillis());


This is not a reliable way to convert to a timestamp because Java Date and SQL Timestamp have internal time zone logic. Instead, please use Literal or parse values directly.

rdblue · 2022-01-26T00:32:19Z

api/src/main/java/org/apache/iceberg/types/Conversions.java

        return Literal.of(asString).to(Types.DateType.get()).value();
+      case TIMESTAMP:
+        if (!asString.contains("T")) {
+          Instant instant = java.sql.Timestamp.valueOf(asString).toInstant();


Is it possible to do this conversion without using java.sql.Timestamp?
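
One reason to ask this: java.sql.Timestamp.valueOf interprets the string in the JVM default time zone, so the same path value can map to different microsecond values on different machines. The following sketch demonstrates that effect. The class name and the zones are assumptions chosen for the example.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

// Hypothetical demo class showing the default-zone dependence of valueOf.
final class TimestampValueOfDemo {
  // Parses value with the JVM default time zone temporarily set to zoneId
  // and returns microseconds from the epoch, mirroring the
  // "getTime() * 1000" conversion in the diff above.
  static long microsInZone(String value, String zoneId) {
    TimeZone saved = TimeZone.getDefault();
    try {
      TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
      return Timestamp.valueOf(value).getTime() * 1000;
    } finally {
      TimeZone.setDefault(saved);
    }
  }

  public static void main(String[] args) {
    String value = "2021-01-01 00:00:00.999";
    long utc = microsInZone(value, "UTC");
    long la = microsInZone(value, "America/Los_Angeles");
    // The two results differ by the zone offset (8 hours in January).
    System.out.println(la - utc);
  }
}
```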

rdblue · 2022-01-26T00:33:50Z

...park-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java

+      new StructField("ts", DataTypes.TimestampType, true, Metadata.empty())
+  };
+
+  private static Timestamp toTimestamp(String value) {


Okay, I see. I was wrong before. This is actually producing a Timestamp because that's what Spark's public row API uses. Instead, could you use SparkSQL and literal values to avoid doing this conversion manually? I think that will make for a much more reliable test.

puchengy · 2023-04-06T20:54:08Z

@huaxingao Hi, I wonder if you could help continue this effort? I saw the same issue and I hope to see it fixed upstream. Thanks

huaxingao · 2023-04-07T04:09:53Z

@puchengy Feel free to take over this PR if you have time to continue.

github-actions · 2024-08-07T00:13:24Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-08-14T00:13:26Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Support timestamp in partition path

b678b7e

github-actions bot added API spark labels Jan 19, 2022

huaxingao mentioned this pull request Jan 19, 2022

support timestamp in partition path #3932

Closed

huaxingao commented Jan 19, 2022

flyrain approved these changes Jan 21, 2022

rdblue reviewed Jan 24, 2022

huaxingao added 3 commits January 25, 2022 09:15

use literal to convert timestamp

6888b32

fix silly mistake

a534f4e

remove extra blank line

bbd7df3

rdblue reviewed Jan 26, 2022

ljfgem mentioned this pull request Feb 17, 2022

Support timestamp as partition type linkedin/iceberg#91

Merged

JonasJ-ap mentioned this pull request Apr 6, 2023

Support timestamp type in partition string when importing files #7291

Closed

github-actions bot added the stale label Aug 7, 2024

github-actions bot closed this Aug 14, 2024
