Spark: Support metadata columns in 3.2 #3373
```java
import org.apache.spark.sql.connector.catalog.MetadataColumn;
import org.apache.spark.sql.types.DataType;

public class SparkMetadataColumn implements MetadataColumn {
```
As far as I know, Spark does not offer a utility for creating metadata columns the way it does for Expressions. That's why I had to implement it in Iceberg. We should probably move it to Spark.
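For context, such a holder is essentially an immutable name/type/nullability triple. The sketch below is not the actual Iceberg class: the local `MetadataColumnShape` interface and the `String` type field stand in for Spark's `MetadataColumn` and `DataType` so the snippet compiles without Spark on the classpath.

```java
// Minimal local stand-in mirroring the shape of Spark's
// org.apache.spark.sql.connector.catalog.MetadataColumn (name, type, nullability).
interface MetadataColumnShape {
  String name();
  String dataType();      // the real interface returns org.apache.spark.sql.types.DataType
  boolean isNullable();
}

// Sketch of an immutable implementation, analogous to the SparkMetadataColumn
// added in this PR (simplified for illustration).
class SparkMetadataColumn implements MetadataColumnShape {
  private final String name;
  private final String dataType;
  private final boolean isNullable;

  SparkMetadataColumn(String name, String dataType, boolean isNullable) {
    this.name = name;
    this.dataType = dataType;
    this.isNullable = isNullable;
  }

  @Override public String name() { return name; }
  @Override public String dataType() { return dataType; }
  @Override public boolean isNullable() { return isNullable; }
}
```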
```java
DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
return new MetadataColumn[] {
    new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
    new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
```
Only the partition column is nullable (e.g. unpartitioned tables).
I like that we can project the partition. I've been meaning to add a way to project the individual partition fields, but this is probably way easier.
```java
@Override
public Table loadTable(Identifier ident) throws NoSuchTableException {
  String[] parts = ident.name().split("\\$", 2);
```
I removed the ugly workaround we had earlier.
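For reference, `split("\\$", 2)` separates an optional metadata-table suffix from the table name, so an identifier like `tbl$files` resolves to the `files` metadata table of `tbl`, while a plain name yields a single-element array. A small self-contained illustration (the identifier strings here are made up for the example):

```java
// Illustrates how ident.name().split("\\$", 2) behaves for metadata-table
// identifiers. The limit of 2 means any further '$' characters stay in the suffix.
class MetadataTableNameSplit {
  static String[] split(String identName) {
    return identName.split("\\$", 2);
  }
}
```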
pan3793 left a comment
LGTM (non-binding)
```java
if (table == null && namespace.equals(Namespace.of("default"))) {
  table = TestTables.load(tableIdentifier.name());
}

return new SparkTable(table, false);
```
If table is null but the namespace isn't default, what will happen here?
I guess since this is for testing it's not as much of a concern, but should we throw NoSuchTableException anyway to help out test authors (or do I have that completely confused)?
Yeah, the way we use TestSparkCatalog is a little bit weird right now. I just made it work. I can throw an exception too.
If a table column and a metadata column have the same name, the metadata column will never be requested. Currently we have very simple names for Iceberg metadata columns. I wonder if we should make them more complex on the engine side.

That's a valid concern, @jackye1995. At the same time, it is kind of nice to be able to use the exact names as they are documented in the spec. Maybe we can add an alias and support both?

@jackye1995 @rdblue @RussellSpitzer, any thoughts on supporting both real and aliased metadata names?

I don't have a problem with reserved column names for the system, even if they are simple. I think changing the underlying names is fine as well, though. I'm less a fan of aliases, since I think it just makes things more confusing, and behaviors end up being table-dependent then, right?
wypoon left a comment
LGTM otherwise.
One other question: In SupportsMetadataColumns, it says "If a table column and a metadata column have the same name, the metadata column will never be requested. It is recommended that Table implementations reject data column names that conflict with metadata column names." Do we have any logic that does that (reject data column names that conflict with metadata column names)? If not, we should, right?
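One way to implement that recommendation, sketched below with a hard-coded set of reserved names. The `ReservedColumnNames` helper and its method are hypothetical, not existing Iceberg API; a real check would read the names from Iceberg's `MetadataColumns` rather than hard-code them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical validation helper: rejects data column names that collide with
// reserved metadata column names, as the SupportsMetadataColumns docs suggest.
class ReservedColumnNames {
  // Reserved names, hard-coded here for the sketch.
  private static final Set<String> RESERVED = new TreeSet<>(Arrays.asList(
      "_spec_id", "_partition", "_file", "_pos", "_deleted"));

  static void validateDataColumns(List<String> columnNames) {
    for (String name : columnNames) {
      if (RESERVED.contains(name)) {
        throw new IllegalArgumentException(
            "Cannot create column with reserved metadata column name: " + name);
      }
    }
  }
}
```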
```java
new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false)
```
I am not too familiar with the metadata columns, but I see 5 currently defined in Iceberg's MetadataColumns. Is there a reason to omit _deleted here? And it probably doesn't matter, but should we keep the columns in the order defined in MetadataColumns (ordered by id from -1 to -5)?
I can match the order in MetadataColumns for consistency.
I am not sure how useful the _deleted metadata column will be in Spark now. I guess it will always be false?
@jackye1995 @chenjunjiedada @RussellSpitzer @rdblue, any thoughts?
_deleted can be added later. The purpose of that field is to allow us to merge deletes in actions.
Looks great. Thanks, @aokolnychyi!

Sorry, I didn't see this ongoing discussion before I merged. If everyone is okay with it, let's continue to discuss and address it in a follow-up. At least that way we unblock Anton's other work. I would be fine adding a way to detect this case and change the column name. Or just waiting for someone to complain. I doubt many people are using
This PR adds proper support for metadata columns in Spark 3.2.