@aokolnychyi
Contributor

This PR adds proper support for metadata columns in Spark 3.2.

@github-actions github-actions bot added the spark label Oct 26, 2021
@aokolnychyi
Contributor Author

import org.apache.spark.sql.connector.catalog.MetadataColumn;
import org.apache.spark.sql.types.DataType;

public class SparkMetadataColumn implements MetadataColumn {
Contributor Author

@aokolnychyi aokolnychyi Oct 26, 2021

As far as I know, Spark does not offer a utility for creating metadata columns similarly to Expressions. That's why I had to implement it in Iceberg. We should probably move it to Spark.
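For readers without the diff at hand, a minimal sketch of what such a value class might look like follows. To keep it compilable without Spark on the classpath, `MetadataColumnLike` is a hypothetical stand-in for Spark's `MetadataColumn` interface, and the data type is represented as a plain `String` rather than Spark's `DataType`. It is not the code from this PR.

```java
// Hypothetical stand-in for org.apache.spark.sql.connector.catalog.MetadataColumn.
// The real interface returns Spark's DataType; a String type name is used here
// only so the sketch compiles on its own.
interface MetadataColumnLike {
  String name();

  String dataType();

  boolean isNullable();
}

// Immutable holder for a metadata column's name, type, and nullability.
final class SimpleMetadataColumn implements MetadataColumnLike {
  private final String name;
  private final String dataType;
  private final boolean isNullable;

  SimpleMetadataColumn(String name, String dataType, boolean isNullable) {
    this.name = name;
    this.dataType = dataType;
    this.isNullable = isNullable;
  }

  @Override
  public String name() {
    return name;
  }

  @Override
  public String dataType() {
    return dataType;
  }

  @Override
  public boolean isNullable() {
    return isNullable;
  }
}
```

Under these assumptions, `new SimpleMetadataColumn("_spec_id", "int", false)` would describe a non-nullable integer metadata column.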

DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
return new MetadataColumn[] {
new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
Contributor Author

Only the partition column is nullable (e.g. unpartitioned tables).

Contributor

I like that we can project the partition. I've been meaning to add a way to project the individual partition fields, but this is probably way easier.


@Override
public Table loadTable(Identifier ident) throws NoSuchTableException {
String[] parts = ident.name().split("\\$", 2);
Contributor Author

I removed the ugly workaround we had earlier.

Member

@pan3793 pan3793 left a comment

LGTM (non-binding)

Comment on lines +40 to +44
if (table == null && namespace.equals(Namespace.of("default"))) {
table = TestTables.load(tableIdentifier.name());
}

return new SparkTable(table, false);
Contributor

If table is null but namespace isn't default, what will happen here?

I guess since this is for testing it's not as much of a concern, but should we throw NoSuchTableException anyways to help out test authors (or do I possibly have that completely confused)?

Contributor Author

@aokolnychyi aokolnychyi Oct 26, 2021

Yeah, the way we use TestSparkCatalog is a little bit weird right now. I just made it work. I can throw an exception too.
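The behavior discussed in this thread (fall back to the test tables only for the default namespace, and fail loudly otherwise) could be sketched roughly as below. Everything here is a hypothetical stand-in: `TestCatalogSketch`, its two maps, and `TableNotFoundSketchException` replace `TestSparkCatalog`, the real catalog and `TestTables` lookups, and Spark's `NoSuchTableException`, and tables are plain strings. It is not the code that was merged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Spark's NoSuchTableException (checked, like the original).
class TableNotFoundSketchException extends Exception {
  TableNotFoundSketchException(String message) {
    super(message);
  }
}

class TestCatalogSketch {
  // Stand-ins for the real catalog lookup and for TestTables; values are table handles.
  private final Map<String, String> catalogTables = new HashMap<>();
  private final Map<String, String> testTables = new HashMap<>();

  void registerCatalogTable(String namespace, String name, String table) {
    catalogTables.put(namespace + "." + name, table);
  }

  void registerTestTable(String name, String table) {
    testTables.put(name, table);
  }

  String loadTable(String namespace, String name) throws TableNotFoundSketchException {
    String table = catalogTables.get(namespace + "." + name);

    // Fall back to the test tables only for the default namespace.
    if (table == null && "default".equals(namespace)) {
      table = testTables.get(name);
    }

    // Fail explicitly instead of wrapping a null table.
    if (table == null) {
      throw new TableNotFoundSketchException("Table not found: " + namespace + "." + name);
    }

    return table;
  }
}
```

With this shape, a lookup in a non-default namespace that misses the catalog surfaces as an exception at load time rather than as a null table later in the test.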

@jackye1995
Contributor

If a table column and a metadata column have the same name, the metadata column will never be requested. Currently we have very simple names for Iceberg metadata columns. I wonder if we should make it more complex on engine side, such as _iceberg_partition instead of just _partition. Any thoughts?

@aokolnychyi
Contributor Author

That's a valid concern, @jackye1995. At the same time, it is kind of nice to be able to use the exact names as they are documented in the spec. Maybe we can add an alias and support both?

@aokolnychyi
Contributor Author

@jackye1995 @rdblue @RussellSpitzer, any thoughts on supporting both real and aliased metadata names?

@RussellSpitzer
Member

RussellSpitzer commented Oct 28, 2021

I don't have problems with reserved column names for the system even if they are simple. I think changing the underlying names is fine as well, though. I'm less of a fan of aliases, since I think they just make things more confusing, and behavior ends up being table-dependent then, right?

Contributor

@wypoon wypoon left a comment

LGTM otherwise.

One other question: In SupportsMetadataColumns, it says "If a table column and a metadata column have the same name, the metadata column will never be requested. It is recommended that Table implementations reject data column name that conflict with metadata column names." Do we have any logic that does that (reject data column names that conflict with metadata column names)? If not, we should, right?
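A check of the kind asked about here might look like the sketch below. It is only an illustration: `MetadataNameConflictSketch` and its methods are hypothetical, the reserved names are assumed to be the five defined in Iceberg's `MetadataColumns` at the time of this PR, and matching is exact and case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MetadataNameConflictSketch {
  // Assumed reserved names, taken from Iceberg's MetadataColumns at the time of this PR.
  static final Set<String> RESERVED_NAMES = new LinkedHashSet<>(
      Arrays.asList("_file", "_pos", "_deleted", "_spec_id", "_partition"));

  // Returns the data column names that collide with a metadata column name.
  static List<String> conflicts(List<String> dataColumnNames) {
    List<String> found = new ArrayList<>();
    for (String name : dataColumnNames) {
      if (RESERVED_NAMES.contains(name)) {
        found.add(name);
      }
    }
    return found;
  }

  // Rejects a schema whose data columns shadow metadata columns.
  static void validate(List<String> dataColumnNames) {
    List<String> found = conflicts(dataColumnNames);
    if (!found.isEmpty()) {
      throw new IllegalArgumentException(
          "Data column names conflict with metadata column names: " + found);
    }
  }
}
```

Where such a check would run (table creation, schema updates, or only in the Spark integration) is left open in this discussion.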

new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false)
Contributor

I am not too familiar with the metadata columns, but I see 5 currently defined in Iceberg's MetadataColumns. Is there a reason to omit _deleted here? And it probably doesn't matter, but should we keep the columns in the order defined in MetadataColumns (ordered by id from -1 to -5)?

Contributor Author

I can match the order in MetadataColumns for consistency.

I am not sure how useful the _deleted metadata column will be in Spark right now. I guess it will always be false?
@jackye1995 @chenjunjiedada @RussellSpitzer @rdblue, any thoughts?

Contributor

_deleted can be added later. The purpose of that field is to allow us to merge deletes in actions.

@rdblue rdblue merged commit aeccf01 into apache:master Oct 31, 2021
@rdblue
Contributor

rdblue commented Oct 31, 2021

Looks great. Thanks, @aokolnychyi!

@rdblue
Contributor

rdblue commented Oct 31, 2021

> If a table column and a metadata column have the same name, the metadata column will never be requested. Currently we have very simple names for Iceberg metadata columns. I wonder if we should make it more complex on engine side, such as _iceberg_partition instead of just _partition. Any thoughts?

Sorry I didn't see this ongoing discussion before I merged. If everyone is okay with it, let's continue to discuss and address it in a follow-up. At least that way we unblock Anton's other work.

I would be fine adding a way to detect this case and change the column name. Or just waiting for someone to complain. I doubt many people are using _partition so it may be better just to wait until someone hits it in practice.

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 10, 2022
hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 11, 2022
aokolnychyi pushed a commit that referenced this pull request Aug 12, 2022