DOCS: describe type compatibility between Spark and Iceberg #1611
Conversation
site/docs/spark.md
Outdated
| timestamp with timezone | timestamp | |
| timestamp without timezone | <N/A> | |
| string | string | |
| uuid | string | |
While I simply put uuid -> string based on the implementation of TypeToSparkType, I failed to verify this, as it looks really tricky to write a UUID column.
It doesn't look possible to write a UUID column from Spark. Even adding the UUID type to SUPPORTED_PRIMITIVES in DataTest leads to multiple test failures.
Is there any known way to write a UUID column, or would it be better to simply remove uuid here?
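(For context, a minimal sketch of how the mapping can be checked on the schema-conversion side; it assumes `org.apache.iceberg.spark.SparkSchemaUtil`, which delegates to TypeToSparkType, and is only an illustration, not part of this PR.)

```scala
import org.apache.iceberg.Schema
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.types.Types

// Build an Iceberg schema with a single uuid column and convert it to a
// Spark schema; TypeToSparkType maps uuid to Spark's string type.
val schema = new Schema(Types.NestedField.optional(1, "id", Types.UUIDType.get()))
println(SparkSchemaUtil.convert(schema))
// Writing actual UUID values from Spark is a separate problem, as noted above.
```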
We should ignore UUID for now. No engines support it and we are considering whether we should remove it from the spec.
Followed up at trinodb/trino#6663 and on the mailing list.
| string | string | |
| char | string | |
| varchar | string | |
| binary | binary | |
Binary can be written to a fixed column. It will be validated at write time.
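(To illustrate, a hypothetical sketch of writing Spark binary values into an existing fixed(16) column; the catalog/table names are made up, and the table is assumed to have been created through the Iceberg API since Spark DDL has no fixed type.)

```scala
// Assumes an Iceberg table `db.events` with a column `id` of type fixed(16),
// created outside Spark. Spark only has a binary type, so the values are
// written as binary and the 16-byte length is validated at write time.
import spark.implicits._

val sixteenBytes: Array[Byte] = Array.fill[Byte](16)(0x01)
val df = Seq(Tuple1(sixteenBytes)).toDF("id")

df.writeTo("db.events").append()  // fails at write time if a value is not 16 bytes
```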
I see, you've created two separate tables. I think I would probably combine them, with notes about create vs. write, instead of having two. Having two would be more confusing because readers would need to choose which one they need.
Yeah, I had to split read and write because of UUID, but if we don't mind about UUID then they'll be closely symmetric. I'll try to consolidate the two into one.
I still think we'd be better off having two different matrices, one for create table and one for read/write (consolidated, Iceberg types to (->) Spark types). In many cases the type to pivot on for create table in Spark is the Spark type (as these are the types end users need to type in), while the type to pivot on for reading/writing a table in any engine is the Iceberg type. I originally thought the type to pivot on when writing to a table is the engine's column type, but I changed my mind: the column types are Iceberg types, so end users need to think in terms of those types.
It might be arguable that end users need to think about the final type of the column (the Iceberg type) when creating a table, which would end up pivoting on the Iceberg type for create table as well. I don't have a strong opinion, as that's also a valid view, but there might also be someone who wants to see the Iceberg table from Spark's point of view (restricting usage to Spark only) and doesn't want to be concerned with Iceberg types.
Either direction would be fine for me. WDYT?
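(A hypothetical example of the create-table direction, where the user thinks in Spark types and the stored column types are Iceberg types; the catalog and table names are made up, and the mappings in the comments follow SparkTypeToType.)

```scala
// The user types Spark types in the DDL; Iceberg stores the converted types.
spark.sql("""
  CREATE TABLE iceberg_catalog.db.sample (
    id    TINYINT,      -- stored as Iceberg int
    name  VARCHAR(16),  -- stored as Iceberg string
    ts    TIMESTAMP     -- stored as Iceberg timestamp with timezone
  ) USING iceberg
""")
```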
I think that having multiple tables causes users to need to look carefully for what is different between them, and increases what they need to pay attention to ("which table do I need?"). I'd like to have as few of these as possible. So I'd remove UUID and add notes for any differences between create and write.
OK, thanks for the opinion. I consolidated the create and write tables. I actually tried to consolidate all three tables, but it seemed a bit confusing since the directions of conversion are opposite. Please take a look again. Thanks!
site/docs/spark.md
Outdated
| date | date | |
| time | N/A | |
| timestamp with timezone | timestamp | |
| timestamp without timezone | N/A | |
Should we convert timestamp without zone to timestamp in Spark when reading?
Honestly, I wasn't too concerned about timestamp (I tried my best to treat the type as an epoch based on UTC), so no strong opinion here. If we'd like to support this as well (probably adjusting to the TZ in Spark?), it might also be good to do the same in the opposite direction, so that read and write are symmetric.
| date | date | |
| time | | Not supported |
| timestamp with timezone | timestamp | |
| timestamp without timezone | | Not supported |
I think this could be merged with the table above by adding lines for time, timestamp without time zone, and fixed. What do you think about that? It would be a lot less documentation and the "notes" column above would be used. Right now, it's blank.
Personally, I find it more confusing if I have to read the table in the -> direction in some cases and the <- direction in others, while also checking notes to determine asymmetric behavior.
E.g., if we consider conversion from Spark to Iceberg, multiple Spark types are converted to one Iceberg type, but that's certainly not true for the opposite direction. That means we'd need a special mark for the opposite case, indicating which Spark type is the result of converting a given Iceberg type.
The more we consolidate, the less intuitive the table would be. The read case doesn't require any explanation in the table, and once we consolidate it into create/write, the explanations for create/write would add confusion even to the read case.
@rdblue Kind reminder to share your thoughts on my comment. If you're still not persuaded, I can raise another PR following your suggestion, and we could compare the two and pick one. WDYT?
Thanks, @HeartSaVioR! I'm okay with adding two tables if you think it's going to be helpful. I merged this and it should go out when we push docs for 0.10.0. Thank you!
Thanks for understanding, and thanks again for reviewing and merging!
I've found that the type compatibility is asymmetric among these cases, and it required me to look into the code to find the conversions (TypeToSparkType, SparkTypeToType).
Given that the conversions won't change frequently, it would be nice to document them so that end users aren't surprised. This would also be helpful if end users consider using the same Iceberg table across components: they will have a clear view of the types when they create an Iceberg table from Spark, when they read an Iceberg table from Spark, and when they write to the Iceberg table via Spark.
It might be ideal to have such type compatibility matrices for all supported components, so that end users can look into these matrices and check for type issues in interop.
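For anyone who wants to check the current behavior before the docs land, a rough sketch of exercising both conversion directions through `org.apache.iceberg.spark.SparkSchemaUtil` (which wraps TypeToSparkType and SparkTypeToType); the schema here is just an illustration:

```scala
import org.apache.iceberg.Schema
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.types.Types
import org.apache.spark.sql.types.StructType

// Iceberg -> Spark (TypeToSparkType under the hood)
val icebergSchema = new Schema(
  Types.NestedField.required(1, "ts", Types.TimestampType.withZone()),
  Types.NestedField.optional(2, "data", Types.StringType.get()))
val sparkType: StructType = SparkSchemaUtil.convert(icebergSchema)
println(sparkType.treeString)   // ts maps to timestamp, data to string

// Spark -> Iceberg (SparkTypeToType under the hood)
val roundTripped: Schema = SparkSchemaUtil.convert(sparkType)
println(roundTripped)           // field IDs may be reassigned on this path
```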