Spark: Support metadata columns in 3.2 #3373

aokolnychyi · 2021-10-26T05:33:54Z

This PR adds proper support for metadata columns in Spark 3.2.

aokolnychyi · 2021-10-26T05:35:02Z

cc @wypoon @pan3793 @kbendick @flyrain @karuppayya @szehon-ho @rdblue @RussellSpitzer

aokolnychyi · 2021-10-26T05:36:39Z

spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkMetadataColumn.java

+import org.apache.spark.sql.connector.catalog.MetadataColumn;
+import org.apache.spark.sql.types.DataType;
+
+public class SparkMetadataColumn implements MetadataColumn {


As far as I know, Spark does not offer a utility for creating metadata columns the way Expressions does for expressions. That's why I had to implement it in Iceberg. We should probably move it to Spark.
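For readers without the full diff, here is a minimal sketch of what such a holder might look like. It is not the code from this PR. `ColumnLike` is a hypothetical stand-in for Spark's `MetadataColumn` interface (which exposes `name()`, `dataType()`, and `isNullable()`), and the data type is a plain string so the example compiles without Spark on the classpath.

```java
import java.util.Objects;

final class MetadataColumnSketch {

  // Hypothetical stand-in for Spark's MetadataColumn interface so the sketch
  // compiles without Spark; the data type is a plain string here.
  interface ColumnLike {
    String name();

    String dataType();

    boolean isNullable();
  }

  // Immutable holder: all three properties are fixed at construction time.
  static final class SimpleColumn implements ColumnLike {
    private final String name;
    private final String dataType;
    private final boolean nullable;

    SimpleColumn(String name, String dataType, boolean nullable) {
      this.name = Objects.requireNonNull(name, "name");
      this.dataType = Objects.requireNonNull(dataType, "dataType");
      this.nullable = nullable;
    }

    @Override
    public String name() {
      return name;
    }

    @Override
    public String dataType() {
      return dataType;
    }

    @Override
    public boolean isNullable() {
      return nullable;
    }
  }

  public static void main(String[] args) {
    // Assumed example values; only the column name comes from the Iceberg spec.
    ColumnLike partition = new SimpleColumn("_partition", "struct", true);
    System.out.println(partition.name() + " nullable=" + partition.isNullable());
  }
}
```

A value class like this carries no behavior, which is why a shared factory in Spark itself would remove the need for each connector to write its own.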

aokolnychyi · 2021-10-26T05:37:41Z

spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java

+    DataType sparkPartitionType = SparkSchemaUtil.convert(Partitioning.partitionType(table()));
+    return new MetadataColumn[] {
+        new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
+        new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),


Only the partition column is nullable (e.g. for unpartitioned tables).

I like that we can project the partition. I've been meaning to add a way to project the individual partition fields, but this is probably way easier.

aokolnychyi · 2021-10-26T05:38:01Z

spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalog.java


  @Override
  public Table loadTable(Identifier ident) throws NoSuchTableException {
-    String[] parts = ident.name().split("\\$", 2);


I removed the ugly workaround we had earlier.

pan3793

LGTM (non-binding)

kbendick · 2021-10-26T20:01:20Z

spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkCatalog.java

+    if (table == null && namespace.equals(Namespace.of("default"))) {
+      table = TestTables.load(tableIdentifier.name());
    }
+
+    return new SparkTable(table, false);


If table is null but namespace isn't default, what will happen here?

I guess since this is for testing it's not as much of a concern, but should we throw NoSuchTableException anyway to help out test authors (or do I possibly have that completely confused)?

Yeah, the way we use TestSparkCatalog is a little bit weird right now. I just made it work. I can throw an exception too.
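The suggestion above could look roughly like the sketch below. This is not the actual TestSparkCatalog code. The class, the two maps, and `TableNotFoundException` are hypothetical stand-ins (the latter for Spark's `NoSuchTableException`, made unchecked here only to keep the example short), and tables are represented as plain strings.

```java
import java.util.HashMap;
import java.util.Map;

final class TestCatalogLookupSketch {

  // Hypothetical unchecked stand-in for Spark's NoSuchTableException.
  static final class TableNotFoundException extends RuntimeException {
    TableNotFoundException(String message) {
      super(message);
    }
  }

  // Tables known to the catalog, keyed by "namespace.name" (assumed layout).
  private final Map<String, String> catalogTables = new HashMap<>();
  // Stand-in for the TestTables registry, keyed by table name only.
  private final Map<String, String> testTables = new HashMap<>();

  void registerCatalogTable(String namespace, String name, String table) {
    catalogTables.put(namespace + "." + name, table);
  }

  void registerTestTable(String name, String table) {
    testTables.put(name, table);
  }

  String loadTable(String namespace, String name) {
    String table = catalogTables.get(namespace + "." + name);

    // Fall back to the test registry only for the "default" namespace.
    if (table == null && "default".equals(namespace)) {
      table = testTables.get(name);
    }

    // Fail loudly instead of handing a null table to the caller.
    if (table == null) {
      throw new TableNotFoundException("Table not found: " + namespace + "." + name);
    }

    return table;
  }

  public static void main(String[] args) {
    TestCatalogLookupSketch catalog = new TestCatalogLookupSketch();
    catalog.registerTestTable("events", "events-table");
    System.out.println(catalog.loadTable("default", "events"));
  }
}
```

Throwing at the lookup site gives test authors a clear message naming the missing table, rather than a failure later when the null is wrapped or dereferenced.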

jackye1995 · 2021-10-26T22:34:59Z

If a table column and a metadata column have the same name, the metadata column will never be requested. Currently we have very simple names for Iceberg metadata columns. I wonder if we should make it more complex on engine side, such as _iceberg_partition instead of just _partition. Any thoughts?

aokolnychyi · 2021-10-27T01:24:00Z

That's a valid concern, @jackye1995. At the same time, it is kind of nice to be able to use the exact names as they are documented in the spec. Maybe we can add an alias and support both?

aokolnychyi · 2021-10-27T20:51:53Z

@jackye1995 @rdblue @RussellSpitzer, any thoughts on supporting both real and aliased metadata names?

RussellSpitzer · 2021-10-28T15:19:31Z

I don't have problems with reserved column names for the system even if they are simple. I think changing the underlying names is fine as well, though. I'm less a fan of aliases, since I think they just make things more confusing, and behavior ends up being table-dependent then, right?

wypoon

LGTM otherwise.

One other question: In SupportsMetadataColumns, it says "If a table column and a metadata column have the same name, the metadata column will never be requested. It is recommended that Table implementations reject data column name that conflict with metadata column names." Do we have any logic that does that (reject data column names that conflict with metadata column names)? If not, we should, right?
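One way such a check could work is sketched below. This is an illustration only and not the validation that was later added in the follow-up PR. The class and method names are hypothetical, and the reserved names are the five Iceberg metadata column names (`_file`, `_pos`, `_deleted`, `_spec_id`, `_partition`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MetadataColumnNameCheck {

  // Names reserved for Iceberg metadata columns.
  static final Set<String> RESERVED =
      new LinkedHashSet<>(Arrays.asList("_file", "_pos", "_deleted", "_spec_id", "_partition"));

  // Returns the data column names that collide with a reserved name.
  static List<String> conflicts(List<String> dataColumnNames, boolean caseSensitive) {
    List<String> found = new ArrayList<>();
    for (String column : dataColumnNames) {
      for (String reserved : RESERVED) {
        boolean same = caseSensitive ? reserved.equals(column) : reserved.equalsIgnoreCase(column);
        if (same) {
          found.add(column);
          break;
        }
      }
    }
    return found;
  }

  // Rejects a schema whose column names shadow metadata columns.
  static void validate(List<String> dataColumnNames, boolean caseSensitive) {
    List<String> found = conflicts(dataColumnNames, caseSensitive);
    if (!found.isEmpty()) {
      throw new IllegalArgumentException(
          "Table columns conflict with metadata columns: " + found);
    }
  }

  public static void main(String[] args) {
    System.out.println(conflicts(Arrays.asList("id", "_partition", "data"), true));
  }
}
```

The `caseSensitive` flag is an assumption made for the sketch: when the engine resolves column names case-insensitively, a data column named `_FILE` would shadow `_file` just as an exact match would.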

wypoon · 2021-10-30T01:45:04Z

spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java

+        new SparkMetadataColumn(MetadataColumns.SPEC_ID.name(), DataTypes.IntegerType, false),
+        new SparkMetadataColumn(MetadataColumns.PARTITION_COLUMN_NAME, sparkPartitionType, true),
+        new SparkMetadataColumn(MetadataColumns.FILE_PATH.name(), DataTypes.StringType, false),
+        new SparkMetadataColumn(MetadataColumns.ROW_POSITION.name(), DataTypes.LongType, false)


I am not too familiar with the metadata columns, but I see 5 currently defined in Iceberg's MetadataColumns. Is there a reason to omit _deleted here? And it probably doesn't matter, but should we keep the columns in the order defined in MetadataColumns (ordered by id from -1 to -5)?

I can match the order in MetadataColumns for consistency.

I am not sure how useful the _deleted metadata column will be in Spark right now. I guess it will always be false?
@jackye1995 @chenjunjiedada @RussellSpitzer @rdblue, any thoughts?

_deleted can be added later. The purpose of that field is to allow us to merge deletes in actions.

rdblue · 2021-10-31T23:03:08Z

Looks great. Thanks, @aokolnychyi!

rdblue · 2021-10-31T23:06:11Z

@jackye1995 wrote: "If a table column and a metadata column have the same name, the metadata column will never be requested. Currently we have very simple names for Iceberg metadata columns. I wonder if we should make it more complex on engine side, such as _iceberg_partition instead of just _partition. Any thoughts?"

Sorry I didn't see this ongoing discussion before I merged. If everyone is okay with it, let's continue to discuss and address it in a follow-up. At least that way we unblock Anton's other work.

I would be fine adding a way to detect this case and change the column name, or just waiting for someone to complain. I doubt many people are using _partition as a column name, so it may be better to wait until someone hits it in practice.

Spark: Support metadata columns in 3.2

57c3f28

github-actions bot added the spark label Oct 26, 2021

aokolnychyi commented Oct 26, 2021

pan3793 approved these changes Oct 26, 2021

kbendick reviewed Oct 26, 2021

wypoon reviewed Oct 30, 2021

rdblue approved these changes Oct 31, 2021

rdblue merged commit aeccf01 into apache:master Oct 31, 2021

aokolnychyi mentioned this pull request Nov 3, 2021

Spark: Validate table columns don't conflict with metadata columns #3456

Merged

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3373 to Spark 3.1

348f9b5

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 9, 2022

Spark 3.1:Port apache#3373 to Spark 3.1

96e80da

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 10, 2022

Spark 3.1:Port apache#3373 to Spark 3.1

58f8355

hililiwei mentioned this pull request Aug 10, 2022

Spark 3.1: Port some PRs to Spark 3.1 #5479

Closed

hililiwei added a commit to hililiwei/iceberg that referenced this pull request Aug 11, 2022

Spark 3.1:Port apache#3373 to Spark 3.1

e4b63e1

hililiwei mentioned this pull request Aug 11, 2022

Spark 3.1:Port #3373 to Spark 3.1 #5500

Merged

aokolnychyi pushed a commit that referenced this pull request Aug 12, 2022

Spark 3.1: Port #3373 to Spark 3.1 (#5500)

fd4dd10
