
Conversation

@sshkvar (Owner) commented Jun 11, 2021

Add the spark.sql.iceberg.store-timestamp-without-zone Spark config to indicate which Iceberg type (Types.TimestampType.withZone() or Types.TimestampType.withoutZone()) is used for Spark's TimestampType.

@github-actions bot added the SPARK label Jun 11, 2021
wangjunyou and others added 23 commits June 12, 2021 16:21
This also adds AssertJ to testCompile in all modules so assertions can be used elsewhere.
* Spec: Add identifier-field-ids to schema.
* Spec: Add section for partition evolution.
* Spec: Add schemas list and current-schema-id to table metadata.
* Spec: Add key_metadata to manifest list.
* Spec: Add schema-id to Snapshot metadata.
`.withFailMessage(..)` was mistakenly used and was therefore overriding
the actual error reporting, making debugging difficult.
…#2689)

* support custom target name in partition spec builder

* address the comments.
The scanAllFiles param is used to check whether all data files should be processed or only added files. Here we should replace scanAllFiles with !scanAllFiles.
public static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE_SESSION_PROPERTY =
    "spark.sql.iceberg.handle-timestamp-without-timezone";

public static final String STORE_TIMESTAMP_WITHOUT_TIMEZONE_SESSION_PROPERTY =

I think it's fine to just use one property for both reading and writing

@sshkvar (Owner, Author) replied:

Actually, this property had an incorrect name; I changed it to READ_TIMESTAMP_AS_TIMESTAMP_WITHOUT_TIMEZONE.

This property controls how Spark's TimestampType is represented in Iceberg. By default, Spark's TimestampType is converted to the Types.TimestampType.withZone() Iceberg type, but if READ_TIMESTAMP_AS_TIMESTAMP_WITHOUT_TIMEZONE is set to true, it is converted to Types.TimestampType.withoutZone() instead.
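To make the mapping concrete, here is a minimal sketch of the behavior described above; the class and method names are made up for illustration and are not the PR's actual code:

import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

// Minimal sketch: which Iceberg type the flag selects for Spark's TimestampType.
class TimestampMappingSketch {
  static Type.PrimitiveType icebergTimestampType(boolean readAsWithoutTimezone) {
    return readAsWithoutTimezone
        ? Types.TimestampType.withoutZone()  // flag set to true
        : Types.TimestampType.withZone();    // default behavior
  }
}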

 */
public static boolean hasTimestampWithoutZone(Schema schema) {
  return TypeUtil.find(schema, t ->
      t.typeId().equals(Type.TypeID.TIMESTAMP) && !((Types.TimestampType) t).shouldAdjustToUTC()


Could just check against TimestampType.withoutZone(), which returns the singleton instance.

@sshkvar (Owner, Author) replied:

Good point, changed the code.
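For reference, the simplified check can be sketched like this, assuming TypeUtil.find returns null when no type in the schema matches the predicate:

public static boolean hasTimestampWithoutZone(Schema schema) {
  // withoutZone() returns a singleton instance, so an equality check is enough
  return TypeUtil.find(schema, t -> Types.TimestampType.withoutZone().equals(t)) != null;
}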

class SparkTypeToType extends SparkTypeVisitor<Type> {
  private final StructType root;
  private int nextId = 0;
  private final boolean useTimestampWithoutZone;


I'm not sure what this flag is for. I think it's probably safer to always return TimestampType.withZone() from here and handle the mismatch outside of this code.

@sshkvar (Owner, Author) replied:

Moved this logic to SparkFixupTimestampType.java
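For readers following along, a rough sketch of what such a fixup pass does; this method is an illustration based on the thread, not the contents of SparkFixupTimestampType.java:

// Illustration: after SparkTypeToType has produced withZone() timestamps,
// a fixup pass rewrites them to withoutZone() when the table expects that.
static Type.PrimitiveType fixupTimestamp(Type.PrimitiveType converted) {
  if (Type.TypeID.TIMESTAMP.equals(converted.typeId())) {
    return Types.TimestampType.withoutZone();  // drop the UTC adjustment
  }
  return converted;  // leave every other primitive unchanged
}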


public class SparkUtil {

  public static final String HANDLE_TIMESTAMP_WITHOUT_TIMEZONE_FLAG = "spark-handle-timestamp-without-timezone";


Actually we could probably just use the same session property here, so we just have one property

spark.sql.iceberg.convert-timestamp-without-timezone

Or something like that

@sshkvar (Owner, Author) replied:

done

      return SparkOrcValueWriters.decimal(primitive.getPrecision(), primitive.getScale());
    case TIMESTAMP_INSTANT:
    case TIMESTAMP:
      return SparkOrcValueWriters.timestampTz();
@RussellSpitzer commented Jun 21, 2021

Do we need corresponding code in the ParquetWriter as well?

@sshkvar (Owner, Author) replied:

I don't think so.


private StructType lazyType() {
  if (type == null) {
    Preconditions.checkArgument(readTimestampWithoutZone || !SparkUtil.hasTimestampWithoutZone(lazySchema()),


Actually, cancel what I said about those other spots; this seems like a great place to check whether we are allowed to do the TZ conversion.

@sshkvar (Owner, Author) replied Jun 29, 2021:

ok, so we don't need any changes here?
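For context, the guard under discussion can be completed roughly like this; the error message and the SparkSchemaUtil.convert call are illustrative assumptions, not quoted from the PR:

private StructType lazyType() {
  if (type == null) {
    // Refuse to read timestamp-without-zone fields unless the session
    // property explicitly allows the conversion.
    Preconditions.checkArgument(
        readTimestampWithoutZone || !SparkUtil.hasTimestampWithoutZone(lazySchema()),
        "Cannot handle timestamp without timezone fields in Spark; " +
            "enable the session property to read them");
    this.type = SparkSchemaUtil.convert(lazySchema());
  }
  return type;
}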

@sshkvar (Owner, Author) commented Jul 14, 2021

@sshkvar It seems like this PR is out of sync with master so I can't merge it, can you rebase it?

I think I need to close this PR; the main PR is apache#2757.

@sshkvar closed this Jul 14, 2021
