
Conversation

@HeartSaVioR
Contributor

I've found that the type compatibility is asymmetric among the cases, and it required me to look into the code to find the conversions (TypeToSparkType, SparkTypeToType).

Given the conversions won't change frequently, it'd be nice to document them so that end users don't feel surprised. This would also be helpful if end users consider using the same Iceberg table across components - they would have a clear view of the types when they create an Iceberg table from Spark, when they read an Iceberg table from Spark, and when they write to the Iceberg table via Spark.

It might even be ideal to have such type compatibility matrices for all supported components, so that end users can consult these matrices and check for type issues on interop.
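To make the asymmetry concrete, here is a minimal sketch (mine, not from the PR) that round-trips a schema through Iceberg's SparkSchemaUtil, which wraps the TypeToSparkType and SparkTypeToType visitors; the uuid column comes back as string:

```java
// Sketch only: demonstrates the asymmetric round trip, assuming
// SparkSchemaUtil.convert delegates to TypeToSparkType / SparkTypeToType.
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.types.StructType;

public class TypeRoundTrip {
  public static void main(String[] args) {
    Schema iceberg = new Schema(
        Types.NestedField.required(1, "id", Types.UUIDType.get()),
        Types.NestedField.required(2, "name", Types.StringType.get()));

    // Iceberg -> Spark (TypeToSparkType): uuid and string both become StringType
    StructType spark = SparkSchemaUtil.convert(iceberg);

    // Spark -> Iceberg (SparkTypeToType): StringType maps back to string only,
    // so the uuid column does not survive the round trip
    Schema roundTripped = SparkSchemaUtil.convert(spark);

    System.out.println(spark);
    System.out.println(roundTripped);
  }
}
```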

| timestamp with timezone | timestamp | |
| timestamp without timezone | <N/A> | |
| string | string | |
| uuid | string | |
Contributor Author


While I simply put uuid -> string based on the implementation of TypeToSparkType, I failed to verify this, as it looks really tricky to write a UUID column.

It doesn't look possible to write a UUID column from Spark. Even adding the UUID type to SUPPORTED_PRIMITIVES in DataTest leads to multiple test failures.

Is there any known way to write a UUID column, or would it be better to simply remove uuid here?

Contributor


We should ignore UUID for now. No engines support it and we are considering whether we should remove it from the spec.

Member


Followed up at trinodb/trino#6663 and on the mailing list.

| string | string | |
| char | string | |
| varchar | string | |
| binary | binary | |
Contributor


Binary can be written to a fixed column. It will be validated at write time.
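For reference, a small sketch (my illustration, not from the PR) of why the check has to happen at write time: Spark has no fixed-length binary type, so an Iceberg fixed(n) column surfaces to Spark as plain BinaryType and the length constraint cannot live in the Spark schema:

```java
// Sketch: an Iceberg fixed(4) column is exposed to Spark as BinaryType,
// so the 4-byte length constraint can only be enforced when rows are written.
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.types.Types;

public class FixedToBinary {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "hash", Types.FixedType.ofLength(4)));

    // Prints a StructType with hash: BinaryType -- the fixed length is not
    // representable in Spark's type system and is validated on write instead.
    System.out.println(SparkSchemaUtil.convert(schema));
  }
}
```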

Contributor


I see, you've created two separate tables. I think I would probably combine them with notes about create vs write instead of having two. Having two would be more confusing because readers would need to choose which one they need.

Contributor Author

@HeartSaVioR Oct 15, 2020


Yeah, I had to split read and write because of UUID, but if we don't mind about UUID then it'll be closely symmetric. I'll try to consolidate the two into one.

I still think we'd be better off having two different matrices, one for create table and one for read/write (consolidated, Iceberg types -> Spark types), because in many cases the type to pivot on when creating a table in Spark is the Spark type (as these types are what end users need to type in), while the type to pivot on when reading/writing a table in any engine is the Iceberg type. I initially thought the type to pivot on when writing to a table is the engine's column type, but I changed my mind since the columns' types are Iceberg types, so end users need to think in terms of those types.

It's arguable that end users need to think about the final type of the column (the Iceberg type) when creating a table, which might end up pivoting on the Iceberg type for create table as well. I don't have a strong opinion, as that's also valid, but there might also be someone who wants to see the Iceberg table from Spark's point of view (restricting usage to Spark only) and doesn't want to be concerned with Iceberg types.

Either direction would be fine with me. WDYT?

Contributor


I think that having multiple tables forces users to look carefully for what is different between them, and increases what they need to pay attention to ("which table do I need?"). I'd like to have as few of these as possible. So I'd remove UUID and add notes for any differences between create and write.

Contributor Author


OK, thanks for the opinion. I consolidated the tables for create and write. Actually, I tried to consolidate all three tables, but it seemed a bit confusing since the directions of conversion are opposite. Please take a look again. Thanks!

| date | date | |
| time | N/A | |
| timestamp with timezone | timestamp | |
| timestamp without timezone | N/A | |
Contributor


Should we convert timestamp without zone to timestamp in Spark when reading?

Contributor Author

@HeartSaVioR Oct 15, 2020


Honestly, I wasn't too concerned about timestamp (I tried my best to treat the type as an epoch based on UTC), so no strong opinion here. If we'd like to also support this (probably adjusting to the time zone in Spark?), it might also be good to do the reverse, so that read and write are symmetric.
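For context, a small sketch of the mapping as it stood (assuming the same SparkSchemaUtil entry point as above; catching a generic RuntimeException because the exact exception type is my guess from the N/A entries):

```java
// Sketch: Spark's TimestampType has UTC-adjusted semantics, so only Iceberg's
// timestamptz maps cleanly; timestamp without zone had no Spark counterpart.
import org.apache.iceberg.Schema;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.types.Types;

public class TimestampMapping {
  public static void main(String[] args) {
    Schema withZone = new Schema(
        Types.NestedField.required(1, "ts", Types.TimestampType.withZone()));
    System.out.println(SparkSchemaUtil.convert(withZone)); // TimestampType

    Schema withoutZone = new Schema(
        Types.NestedField.required(1, "ts", Types.TimestampType.withoutZone()));
    try {
      SparkSchemaUtil.convert(withoutZone); // N/A in the table above
    } catch (RuntimeException e) {
      System.out.println("not supported from Spark: " + e);
    }
  }
}
```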

| date | date | |
| time | | Not supported |
| timestamp with timezone | timestamp | |
| timestamp without timezone | | Not supported |
Contributor


I think this could be merged with the table above by adding lines for time, timestamp without time zone, and fixed. What do you think about that? It would be a lot less documentation and the "notes" column above would be used. Right now, it's blank.

Contributor Author


Personally, I find it more confusing if I have to read the table in the -> direction in some cases and the <- direction in other cases, while also weighing notes to determine asymmetric behavior.

e.g. if we consider conversion from Spark to Iceberg, multiple Spark types convert to one Iceberg type, but that's certainly not true in the opposite direction. That means we'd need to leave a special mark for the opposite case, indicating which Spark type is the result of converting a given Iceberg type.

The more we consolidate, the less intuitive the table becomes. The read case doesn't require any explanation on the table, but once we consolidate it into create/write, the explanations for create/write would cause confusion even for the read case.

Contributor Author


@rdblue Kind reminder to share your thoughts on my comment. If you are still not persuaded, I can raise another PR that follows your suggestion, and we could compare the two and pick one. WDYT?

@rdblue rdblue merged commit eee4f87 into apache:master Nov 3, 2020
@rdblue
Contributor

rdblue commented Nov 3, 2020

Thanks, @HeartSaVioR! I'm okay with adding two tables if you think it's going to be helpful. I merged this and it should go out when we push docs for 0.10.0. Thank you!

@HeartSaVioR
Contributor Author

Thanks for understanding, and thanks again for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the DOC-spark-iceberg-type-compatibility branch November 3, 2020 20:47