@huaxingao
Contributor

A partition path containing a timestamp, such as ts=2021-01-01 00:00:00.999, is not supported, and Iceberg throws an exception here. It seems to me that timestamps in partition paths should be supported, so this PR lifts the restriction.

return Literal.of(asString).to(Types.DateType.get()).value();
case TIMESTAMP:
if (!asString.contains("T")) {
return java.sql.Timestamp.valueOf(asString).getTime() * 1000;
Contributor Author

@huaxingao huaxingao Jan 19, 2022

The path looks something like ts=2021-01-01 00:00:00.999. It contains neither T nor Z.

Contributor

Will time zones be confused?

Contributor Author

My understanding is that the time zone has already been taken care of. For example, if I have the row

1, "John Doe", "hr", toTimestamp("2021-01-01T00:00:00.999999999Z")

the partition path is actually ts=2020-12-31 16:00:00.999.
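
To make the eight-hour shift in this example concrete, here is a minimal sketch using java.time. The zone America/Los_Angeles is an assumption (the thread only shows a result consistent with a UTC-8 writer), and the class and method names are hypothetical, not part of Iceberg.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionPathZoneDemo {
  // Pattern matching the partition value format quoted in this thread.
  private static final DateTimeFormatter PATH_FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  // Renders a UTC instant as wall-clock time in the given zone (hypothetical helper).
  static String renderInZone(String isoInstant, String zoneId) {
    return Instant.parse(isoInstant).atZone(ZoneId.of(zoneId)).format(PATH_FORMAT);
  }

  public static void main(String[] args) {
    String instant = "2021-01-01T00:00:00.999999999Z";
    // Assumed writer zone (UTC-8 in January): prints ts=2020-12-31 16:00:00.999
    System.out.println("ts=" + renderInZone(instant, "America/Los_Angeles"));
    // Same instant rendered in UTC: prints ts=2021-01-01 00:00:00.999
    System.out.println("ts=" + renderInZone(instant, "UTC"));
  }
}
```

The same instant yields different path strings depending on the zone used for rendering, which is why the reader of the path needs to know which zone the writer used.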

Member

@RussellSpitzer RussellSpitzer Jan 21, 2022

I was wondering whether we can do the same inversion that's in the Hive code for parsing this. Or does Hive really allow producing either path?

Member

I notice that the test only creates one type of timestamp string, so do we only need to cover that?

Contributor

We need to be very strict about what we parse and accept here, and delegating this work to java.sql.Timestamp seems like a good way to introduce formats that we don't intend to support. I would rather use what is built into StringLiteral and supply different formats. Can you look at using OffsetDateTime.parse and a format for this instead?
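
As a rough illustration of the strict, explicit-format approach, the sketch below parses only the `yyyy-MM-dd HH:mm:ss[.fraction]` shape seen in this PR. Two assumptions are worth flagging: it uses LocalDateTime rather than OffsetDateTime because the path value carries no offset, and it interprets the value as UTC when converting to microseconds. The class and method names are hypothetical and this is not the code in StringLiteral.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.ResolverStyle;
import java.time.temporal.ChronoField;

public class StrictTimestampParse {
  // Accepts exactly "uuuu-MM-dd HH:mm:ss" with an optional 1-9 digit fraction.
  private static final DateTimeFormatter STRICT =
      new DateTimeFormatterBuilder()
          .appendPattern("uuuu-MM-dd HH:mm:ss")
          .optionalStart()
          .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
          .optionalEnd()
          .toFormatter()
          .withResolverStyle(ResolverStyle.STRICT);

  // Assumption: the zone-less value is interpreted as UTC.
  static long toMicrosAssumingUtc(String value) {
    LocalDateTime local = LocalDateTime.parse(value, STRICT);
    return local.toEpochSecond(ZoneOffset.UTC) * 1_000_000L + local.getNano() / 1_000;
  }

  public static void main(String[] args) {
    // Prints 1609459200999000
    System.out.println(toMicrosAssumingUtc("2021-01-01 00:00:00.999"));
    // Inputs such as "2021-01-01T00:00:00" or "2021-02-30 00:00:00"
    // are rejected with a DateTimeParseException.
  }
}
```

Because the formatter is built explicitly, any format not listed is rejected instead of being silently accepted.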

@flyrain
Contributor

flyrain commented Jan 20, 2022

+1 for the solution. It should be pretty useful in the following scenario: the table only has a timestamp column, but you want to partition it by year or date.

CREATE TABLE %s (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (years(ts))
CREATE TABLE %s (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (date(ts))

The Iceberg docs give an example of this: https://iceberg.apache.org/#spark-ddl/#partitioned-by.

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

@RussellSpitzer
Member

@huaxingao To be clear, this is mostly a fix for MigrationTableUtils, right? I was looking for any other consumers of the function, but I only found that and test code.

@huaxingao
Contributor Author

@RussellSpitzer
Sorry for the late reply. Yes, the fix is mostly for MigrationTableUtils. It is for this path: DataFiles.withPartitionPath -> DataFiles.fillFromPath -> Conversions.fromPartitionString.

After taking a look at PartitionSpec, I suspect that timestamp is intentionally blocked in Conversions.fromPartitionString, because the PartitionSpec Builder has year, month, day, and hour, but no timestamp. I guess timestamp as a partition column is not supported because it doesn't have any real use cases. However, creating a table with a timestamp partition column,
CREATE TABLE test (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (ts)
succeeds, and users don't know that the partition column ts is not working. So I feel that we should probably support partitioning by timestamp. If not, we should probably block creating tables partitioned by timestamp.

@RussellSpitzer
Member

> @RussellSpitzer Sorry for the late reply. Yes, the fix is mostly for MigrationTableUtils. It is for this path: DataFiles.withPartitionPath -> DataFiles.fillFromPath -> Conversions.fromPartitionString.
>
> After taking a look at PartitionSpec, I suspect that timestamp is intentionally blocked in Conversions.fromPartitionString, because the PartitionSpec Builder has year, month, day, and hour, but no timestamp. I guess timestamp as a partition column is not supported because it doesn't have any real use cases. However, creating a table with a timestamp partition column, CREATE TABLE test (id Integer, name String, dept String, ts Timestamp) USING iceberg PARTITIONED BY (ts), succeeds, and users don't know that the partition column ts is not working. So I feel that we should probably support partitioning by timestamp. If not, we should probably block creating tables partitioned by timestamp.

Yep, makes sense to me! Yufei and I were just discussing another bug which we thought might be related (it was not). I think this is fine to fix. I just want to make sure our string parsing is correct. From what I could tell, the string that is generated for "timestamp" should be environment dependent, since the Hive code just uses "value.toString". If we are confident that this covers most use cases, I'm OK with merging.

@RussellSpitzer
Member

public String partitionToPath(StructLike data) {
  StringBuilder sb = new StringBuilder();
  Class<?>[] javaClasses = javaClasses();
  for (int i = 0; i < javaClasses.length; i += 1) {
    PartitionField field = fields[i];
    String valueString = field.transform().toHumanString(get(data, i, javaClasses[i]));
    if (i > 0) {
      sb.append("/");
    }
    sb.append(field.name()).append("=").append(escape(valueString));
  }
  return sb.toString();
}
// Iceberg Code (which escapes string value)

https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java#L[…]6 // Hive Code
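
For readers who want to experiment with the round trip, here is a simplified, standalone sketch of building and re-parsing a `name=value/name=value` partition path. It is not the Iceberg or Hive implementation: the class and method names are hypothetical, and URL-style escaping via URLEncoder is an assumption that may differ from what `escape` or Hive's Warehouse code does.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathSketch {
  // Joins name=value segments with "/", escaping each value (assumed URL-style escaping).
  static String toPath(Map<String, String> partition) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> entry : partition.entrySet()) {
      if (sb.length() > 0) {
        sb.append("/");
      }
      sb.append(entry.getKey())
          .append("=")
          .append(URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  // Splits a path back into ordered name/value pairs, unescaping each value.
  static Map<String, String> fromPath(String path) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String segment : path.split("/")) {
      int eq = segment.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Not a name=value segment: " + segment);
      }
      String value = URLDecoder.decode(segment.substring(eq + 1), StandardCharsets.UTF_8);
      result.put(segment.substring(0, eq), value);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> partition = new LinkedHashMap<>();
    partition.put("dept", "hr");
    partition.put("ts", "2021-01-01 00:00:00.999");
    String path = toPath(partition);
    System.out.println(path);           // dept=hr/ts=2021-01-01+00%3A00%3A00.999
    System.out.println(fromPath(path)); // {dept=hr, ts=2021-01-01 00:00:00.999}
  }
}
```

Whatever escaping the writer applies, the reader has to invert it before attempting to parse the timestamp string.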

};

private static Timestamp toTimestamp(String value) {
return new Timestamp(DateTime.parse(value).getMillis());
Contributor

This is not a reliable way to convert to a timestamp because Java Date and SQL Timestamp have internal time zone logic. Instead, please use Literal or parse values directly.
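
A small sketch of the time zone dependence mentioned here: java.sql.Timestamp.valueOf interprets a zone-less string in the JVM default time zone, so the same input maps to different epoch values in different environments. The zones and the helper name are chosen for illustration only.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampZoneDependence {
  // Parses the value with the JVM default zone temporarily set to zoneId (hypothetical helper).
  static long millisUnderZone(String value, String zoneId) {
    TimeZone original = TimeZone.getDefault();
    try {
      TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
      return Timestamp.valueOf(value).getTime();
    } finally {
      // Restore the default so the demo does not leak state.
      TimeZone.setDefault(original);
    }
  }

  public static void main(String[] args) {
    String value = "2021-01-01 00:00:00.999";
    long utc = millisUnderZone(value, "UTC");
    long la = millisUnderZone(value, "America/Los_Angeles");
    // The same string differs by eight hours (28800000 ms) between the two zones.
    System.out.println(la - utc);
  }
}
```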

return Literal.of(asString).to(Types.DateType.get()).value();
case TIMESTAMP:
if (!asString.contains("T")) {
Instant instant = java.sql.Timestamp.valueOf(asString).toInstant();
Contributor

Is it possible to do this conversion without using java.sql.Timestamp?

new StructField("ts", DataTypes.TimestampType, true, Metadata.empty())
};

private static Timestamp toTimestamp(String value) {
Contributor

Okay, I see. I was wrong before. This is actually producing a Timestamp because that's what Spark's public row API uses. Instead, could you use SparkSQL and literal values to avoid doing this conversion manually? I think that will make for a much more reliable test.

@puchengy
Contributor

puchengy commented Apr 6, 2023

@huaxingao Hi, I wonder if you could help continue this effort? I ran into the same issue and hope to see it fixed upstream. Thanks.

@huaxingao
Contributor Author

@puchengy Feel free to take over this PR if you have time to continue.

@github-actions

github-actions bot commented Aug 7, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 7, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 14, 2024
6 participants