DOCS: describe type compatibility between Spark and Iceberg #1611
Conversation
site/docs/spark.md
Outdated
| timestamp with timezone | timestamp | |
| timestamp without timezone | <N/A> | |
| string | string | |
| uuid | string | |
While I simply put uuid -> string based on the implementation of TypeToSparkType, I failed to verify this, as it looks really tricky to write a UUID column.
It doesn't look possible to write a UUID column from Spark. Even adding the UUID type to SUPPORTED_PRIMITIVES in DataTest leads to multiple test failures.
Is there any known way to write a UUID column, or would it be better to simply remove uuid here?
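(For context, a minimal sketch of how the mapping can be checked on the schema-conversion side; it assumes `org.apache.iceberg.spark.SparkSchemaUtil`, which delegates to TypeToSparkType, and is only an illustration, not part of this PR.)

```scala
import org.apache.iceberg.Schema
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.types.Types

// Build an Iceberg schema with a single uuid column and convert it to a
// Spark schema; TypeToSparkType maps uuid to Spark's string type.
val schema = new Schema(Types.NestedField.optional(1, "id", Types.UUIDType.get()))
println(SparkSchemaUtil.convert(schema))
// Writing actual UUID values from Spark is a separate problem, as noted above.
```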
We should ignore UUID for now. No engines support it and we are considering whether we should remove it from the spec.
Followed up at trinodb/trino#6663 and on the mailing list.
| string | string | |
| char | string | |
| varchar | string | |
| binary | binary | |
Binary can be written to a fixed column. It will be validated at write time.
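(To illustrate, a hypothetical sketch of writing Spark binary values into an existing fixed(16) column; the catalog/table names are made up, and the table is assumed to have been created through the Iceberg API since Spark DDL has no fixed type.)

```scala
// Assumes an Iceberg table `db.events` with a column `id` of type fixed(16),
// created outside Spark. Spark only has a binary type, so the values are
// written as binary and the 16-byte length is validated at write time.
import spark.implicits._

val sixteenBytes: Array[Byte] = Array.fill[Byte](16)(0x01)
val df = Seq(Tuple1(sixteenBytes)).toDF("id")

df.writeTo("db.events").append()  // fails at write time if a value is not 16 bytes
```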
I see, you've created two separate tables. I think I would probably combine them, with notes about create vs. write, instead of having two. Having two would be more confusing because readers would need to choose which one they need.
Yeah, I had to split read and write because of UUID, but if we don't mind about UUID then they'll be closely symmetric. I'll try to consolidate the two into one.
I still think we'd be better off having two different matrices, one for create table and one for read/write (consolidated, Iceberg types to (->) Spark types). In many cases the type to pivot on for create table in Spark is the Spark type (as these are the types end users need to type in), while the type to pivot on for reading/writing a table in any engine is the Iceberg type. I originally thought the type to pivot on when writing to a table is the engine's column type, but I changed my mind: the column types are Iceberg types, so end users need to think in terms of those types.
It might be arguable that end users need to think about the final type of the column (the Iceberg type) when creating a table, which would end up pivoting on the Iceberg type for create table as well. I don't have a strong opinion, as that's also a valid view, but there might also be someone who wants to see the Iceberg table from Spark's point of view (restricting usage to Spark only) and doesn't want to be concerned with Iceberg types.
Either direction would be fine for me. WDYT?
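(A hypothetical example of the create-table direction, where the user thinks in Spark types and the stored column types are Iceberg types; the catalog and table names are made up, and the mappings in the comments follow SparkTypeToType.)

```scala
// The user types Spark types in the DDL; Iceberg stores the converted types.
spark.sql("""
  CREATE TABLE iceberg_catalog.db.sample (
    id    TINYINT,      -- stored as Iceberg int
    name  VARCHAR(16),  -- stored as Iceberg string
    ts    TIMESTAMP     -- stored as Iceberg timestamp with timezone
  ) USING iceberg
""")
```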
I think that having multiple tables causes users to need to look carefully for what is different between them, and increases what they need to pay attention to ("which table do I need?"). I'd like to have as few of these as possible. So I'd remove UUID and add notes for any differences between create and write.
OK, thanks for the opinion. I consolidated the create and write tables. I actually tried to consolidate all three tables, but it seemed a bit confusing since the directions of conversion are opposite. Please take a look again. Thanks!
site/docs/spark.md
Outdated
| date | date | |
| time | N/A | |
| timestamp with timezone | timestamp | |
| timestamp without timezone | N/A | |
Should we convert timestamp without zone to timestamp in Spark when reading?
Honestly, I wasn't too concerned about timestamp (I tried my best to treat the type as an epoch based on UTC), so no strong opinion here. If we'd like to support this as well (probably adjusting to the TZ in Spark?), it might also be good to do the same in the opposite direction, so that read and write are symmetric.
| date | date | |
| time | | Not supported |
| timestamp with timezone | timestamp | |
| timestamp without timezone | | Not supported |
I think this could be merged with the table above by adding lines for time, timestamp without time zone, and fixed. What do you think about that? It would be a lot less documentation and the "notes" column above would be used. Right now, it's blank.
Personally, I find it more confusing if I have to read the table in the -> direction in some cases and the <- direction in others, while also checking notes to determine asymmetric behavior.
E.g., if we consider conversion from Spark to Iceberg, multiple Spark types are converted to one Iceberg type, but that's certainly not true for the opposite direction. That means we'd need a special mark for the opposite case, indicating which Spark type is the result of converting a given Iceberg type.
The more we consolidate, the less intuitive the table would be. The read case doesn't require any explanation in the table, and once we consolidate it into create/write, the explanations for create/write would add confusion even to the read case.
@rdblue Kind reminder to share your thoughts on my comment. If you're still not persuaded, I can raise another PR following your suggestion, and we could compare the two and pick one. WDYT?
Thanks, @HeartSaVioR! I'm okay with adding two tables if you think it's going to be helpful. I merged this and it should go out when we push docs for 0.10.0. Thank you!
Thanks for understanding, and thanks again for reviewing and merging!
I've found that the type compatibility is asymmetric among these cases, and it required me to look into the code to find the conversions (TypeToSparkType, SparkTypeToType).
Given that the conversions won't change frequently, it would be nice to document them so that end users aren't surprised. This would also be helpful if end users consider using the same Iceberg table across components: they will have a clear view of the types when they create an Iceberg table from Spark, when they read an Iceberg table from Spark, and when they write to the Iceberg table via Spark.
It might be ideal to have such type compatibility matrices for all supported components, so that end users can look into these matrices and check for type issues in interop.
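For anyone who wants to check the current behavior before the docs land, a rough sketch of exercising both conversion directions through `org.apache.iceberg.spark.SparkSchemaUtil` (which wraps TypeToSparkType and SparkTypeToType); the schema here is just an illustration:

```scala
import org.apache.iceberg.Schema
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.types.Types
import org.apache.spark.sql.types.StructType

// Iceberg -> Spark (TypeToSparkType under the hood)
val icebergSchema = new Schema(
  Types.NestedField.required(1, "ts", Types.TimestampType.withZone()),
  Types.NestedField.optional(2, "data", Types.StringType.get()))
val sparkType: StructType = SparkSchemaUtil.convert(icebergSchema)
println(sparkType.treeString)   // ts maps to timestamp, data to string

// Spark -> Iceberg (SparkTypeToType under the hood)
val roundTripped: Schema = SparkSchemaUtil.convert(sparkType)
println(roundTripped)           // field IDs may be reassigned on this path
```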