Hive: Expose default partition spec and sort order in HMS. #4546

flyrain · 2022-04-12T18:30:46Z

cc @aokolnychyi @RussellSpitzer @szehon-ho @karuppayya @anuragmantri @kbendick @singhpk234

flyrain · 2022-04-12T18:39:09Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+  private void setPartitionSpec(TableMetadata metadata, Map<String, String> parameters) {
+    parameters.remove(TableProperties.DEFAULT_PARTITION_SPEC);
+    if (metadata.spec() != null && metadata.spec().isPartitioned()) {
+      parameters.put(TableProperties.DEFAULT_PARTITION_SPEC, PartitionSpecParser.toJson(metadata.spec()));


PartitionSpecParser.toJson() doesn't convert a source column id to its name. Here is an example of a partition spec json string.

{"spec-id":0,"fields":[{"name":"data_bucket","transform":"bucket[16]","source-id":2,"field-id":1000}]}

It only shows the source id (2); the corresponding column name is data, as the following code shows. It'd be more user-friendly to show the column name.

Schema schema = new Schema(
    required(1, "id", Types.IntegerType.get(), "unique ID"),
    required(2, "data", Types.StringType.get())
);
PartitionSpec spec = PartitionSpec.builderFor(schema)
    .bucket("data", 16)
    .build();

Looked a bit more. The current solution is fine since the partition builder will automatically add a name based on the source column name. For example, data_bucket is the name for the bucket transform on the column data. Moreover, the string generated by PartitionSpecParser.toJson() is consistent with the one in the metadata.json file. Let's keep it in that sense.

Would we need to store schema on HMS side to make sense of the ids? Otherwise today we are depending on the naming convention.

Schema has been stored in HMS:

iceberg/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

Line 247 in 116026b

tbl.setSd(storageDescriptor(metadata, hiveEngineEnabled)); // set to pickup any schema changes

However, the column ids are lost when the Iceberg schema is converted to the Hive table schema. It may make sense to replace the source-id with source-name.

Added source-name in the spec json string. It will be like this

{"spec-id":1,"fields":[{"name":"data_bucket_16","transform":"bucket[16]","source-id":2,"source-name":"data","field-id":1000}]}

(Chat offline)

I'm thinking that instead of changing it, it's cleaner to store the schema object in a separate field. It's more accurate than the Hive schema and a bit less hacky than changing the partition-spec format to add it.

Added a new property current-schema to avoid adding the column name to the partition spec and sort order. It duplicates the schema stored in Hive format, but we think it's worth doing. The Hive format loses important column info (id).

Yeah, also I think the Hive columns may not match the Iceberg schema exactly in some cases. For instance, the type conversions are not an exact 1-to-1 mapping.
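To illustrate why the ids in the spec JSON need a schema with field ids next to them, here is a minimal sketch. It is not Iceberg code: SourceIdResolver, describe, and exampleSchema are hypothetical names, and the id-to-name map stands in for whatever a reader would build from the current-schema property.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Iceberg APIs.
final class SourceIdResolver {
  private SourceIdResolver() {}

  // Describes a partition field as "transform(column)", e.g. "bucket[16](data)".
  // idToName stands in for the id-to-name mapping that a schema with field ids provides.
  static String describe(Map<Integer, String> idToName, int sourceId, String transform) {
    String column = idToName.get(sourceId);
    if (column == null) {
      // Without the field ids, only the raw source id can be shown.
      return transform + "(source-id=" + sourceId + ")";
    }
    return transform + "(" + column + ")";
  }

  // Mirrors the example schema from this thread: 1 -> id, 2 -> data.
  static Map<Integer, String> exampleSchema() {
    Map<Integer, String> idToName = new LinkedHashMap<>();
    idToName.put(1, "id");
    idToName.put(2, "data");
    return idToName;
  }
}
```

With the example schema, source-id 2 and transform bucket[16] resolve to the column data. With an empty map, which is roughly the situation when only the Hive columns are available, the id cannot be resolved.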

hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java

singhpk234 · 2022-04-13T12:58:59Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java


+  private void setPartitionSpec(TableMetadata metadata, Map<String, String> parameters) {
+    parameters.remove(TableProperties.DEFAULT_PARTITION_SPEC);
+    if (metadata.spec() != null && metadata.spec().isPartitioned()) {


[question] In the v1 spec, when a partition field is dropped it is replaced by a VoidTransform. If all the transforms are void, we should consider the table unpartitioned (this may be beyond the scope of the present PR), but at present isPartitioned returns true in this case. Do we want to store the partition spec in this scenario? Your thoughts?

This is based on ticket #3014 @RussellSpitzer filed a while back.

Thanks @singhpk234 for pointing that out. #3059 is trying to fix #3014, and it is almost ready to merge. It should be fine in that case.
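For reference, the "all transforms are void" check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: transforms are modeled as plain strings, VoidAwarePartitionCheck and isEffectivelyPartitioned are hypothetical names, and this is not necessarily how #3059 implements the fix.

```java
import java.util.List;

// Hypothetical sketch: transforms are modeled as plain strings, not Iceberg Transform objects.
final class VoidAwarePartitionCheck {
  private VoidAwarePartitionCheck() {}

  // A spec counts as partitioned only if at least one field has a non-void transform.
  static boolean isEffectivelyPartitioned(List<String> transforms) {
    for (String transform : transforms) {
      if (!"void".equals(transform)) {
        return true;
      }
    }
    return false;
  }
}
```

Under this check, a v1 spec whose fields were all dropped (all void) would be treated as unpartitioned, so no partition spec property would be written for it.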

hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java

kbendick · 2022-04-14T03:15:28Z

core/src/main/java/org/apache/iceberg/TableProperties.java

+   * <p>
+   * This reserved property is used to store the default partition spec.
+   */
+  public static final String DEFAULT_PARTITION_SPEC = "default-partition-spec";


It might be helpful to clarify what is meant by default for both of these.

Each javadoc states that it’s for the “default” spec / sort order, but I’m still not entirely sure what that means.

Is it the current sort order / partition spec that would be used if the user doesn’t override it for an individual query?

Maybe something like “JSON representation of the table’s current configured partition spec, which will be used if not overridden for individual writes”. Kind of wordy but something along those lines would be helpful for me if quickly looking through the JavaDocs etc. Will leave that decision to you though.

Yes, it is for the current partition spec and sort order. Makes sense to me. Will make the change.

I think you reverted too many? Should be 'current'.

I was trying to be consistent with the name in metadata.json, which is default-partition-spec. The same for sort order. I changed the comments though.

OK I see, yea I always find it confusing.

In the comment, maybe we can add that they are equivalent, otherwise the comment is even more confusing:

Reserved table property for the JSON representation of current (default) schema.

It is confusing. I like "current" more. I kept the original name just for consistency.

szehon-ho

I agree, once we change the word default to current, it looks OK to me.

One concern I have in general, even from #4456, is whether all this JSON parsing increases commit times, which can be sensitive for streaming applications. (I guess having that flag to control partition length is good in #4456, as the user can set it to 0 if so, but it doesn't help these ones.) At some point we should measure the commit time perf for large tables. Or we could just have a follow-up PR for a flag to disable all these stats persistence altogether if it's a concern, as right now there's no way to opt out. WDYT?

szehon-ho · 2022-04-14T22:01:25Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+  private void setPartitionSpec(TableMetadata metadata, Map<String, String> parameters) {
+    parameters.remove(TableProperties.DEFAULT_PARTITION_SPEC);
+    if (metadata.spec() != null && metadata.spec().isPartitioned()) {
+      parameters.put(TableProperties.DEFAULT_PARTITION_SPEC, PartitionSpecParser.toJson(metadata.spec()));


Would we need to store schema on HMS side to make sense of the ids? Otherwise today we are depending on the naming convention.

flyrain · 2022-04-15T00:13:58Z

One concern I have in general, even from #4456, is whether all this JSON parsing increases commit times, which can be sensitive for streaming applications. (I guess having that flag to control partition length is good in #4456, as the user can set it to 0 if so, but it doesn't help these ones.) At some point we should measure the commit time perf for large tables. Or we could just have a follow-up PR for a flag to disable all these stats persistence altogether if it's a concern, as right now there's no way to opt out. WDYT?

Of all the table properties synced to HMS, the schema, snapshot summary, and partition spec are the most likely to impact commit perf. We may think of a general way to control that. I'd suggest handling that in another PR.

szehon-ho · 2022-04-15T16:36:55Z

Of all the table properties synced to HMS, the schema, snapshot summary, and partition spec are the most likely to impact commit perf. We may think of a general way to control that. I'd suggest handling that in another PR.

Yeah, as I mentioned, a follow-up PR is fine, unless someone has another idea about this.

core/src/main/java/org/apache/iceberg/PartitionSpecParser.java

szehon-ho

Looks better, thanks @flyrain. A few comments on the code side.

As discussed offline, and as I understand it, these changes are for users of HiveCatalog who are interested in getting a better understanding of Iceberg metadata directly from the Hive API. This could be done through RestCatalog functionality once that sees more adoption.

szehon-ho · 2022-04-22T20:52:00Z

core/src/main/java/org/apache/iceberg/TableProperties.java

+   * <p>
+   * This reserved property is used to store the default partition spec.
+   */
+  public static final String DEFAULT_PARTITION_SPEC = "default-partition-spec";


I think you reverted too many? Should be 'current'.

szehon-ho · 2022-04-22T20:54:11Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java


+  private void setSchema(TableMetadata metadata, Map<String, String> parameters) {
+    parameters.remove(TableProperties.CURRENT_SCHEMA);
+    if (metadata.schema() != null) {


Let's have a common method and reuse it to set the various fields, like:

private void setField(TableMetadata metadata, Map<String, String> parameters, String key, String value) {
  parameters.remove(key);
  if (value.length() <= maxHiveTablePropertySize) {
    parameters.put(key, value);
  } else {
    LOG.warn("Not exposing {} in HMS since it exceeds {} characters", key, maxHiveTablePropertySize);
  }
}

Made the change.

szehon-ho

Optional: Maybe we can add a test, about schema being too big to serialize, like snapshotSummary test

szehon-ho · 2022-04-23T05:47:55Z

FYI @singhpk234 if you want to take another look ?

singhpk234

Looks good to me as well :) !!!

Thanks @flyrain

singhpk234 · 2022-04-24T18:17:04Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+  private void setField(Map<String, String> parameters, String key, String value) {
+    if (value.length() <= maxHiveTablePropertySize) {
+      parameters.put(key, value);
+    } else {
+      LOG.warn("Not exposing {} in HMS since it exceeds {} characters", key, maxHiveTablePropertySize);
+    }


[nit] should we make this :

iceberg/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

Lines 425 to 430 in 449a743

if (summary.length() <= maxHiveTablePropertySize) {
  parameters.put(TableProperties.CURRENT_SNAPSHOT_SUMMARY, summary);
} else {
  LOG.warn("Not exposing the current snapshot({}) summary in HMS since it exceeds {} characters",
      currentSnapshot.snapshotId(), maxHiveTablePropertySize);
}

also use this setter

I was trying to do that, but the warn message needs the snapshot id here. That would require changing the method setField() as below, plus changes to all the other callers. It removes one duplication but adds a few complications. I'd suggest keeping it as is.

setField(Map<String, String> parameters, String key, String value, String warnMessage)
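As a rough, self-contained illustration of the size-limited setter discussed in this thread (not the HiveTableOperations code itself): SizeLimitedProperties is a hypothetical class, and System.err stands in for the LOG.warn call. It follows the earlier suggestion of removing any stale value before checking the size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a size-limited property setter; not the HiveTableOperations code.
final class SizeLimitedProperties {
  private final long maxPropertySize;
  private final Map<String, String> parameters = new HashMap<>();

  SizeLimitedProperties(long maxPropertySize) {
    this.maxPropertySize = maxPropertySize;
  }

  // Drops any stale value, then stores the new one only if it fits. Returns true if stored.
  boolean setField(String key, String value) {
    parameters.remove(key);
    if (value.length() <= maxPropertySize) {
      parameters.put(key, value);
      return true;
    }
    // Stand-in for LOG.warn in the real code.
    System.err.printf("Not exposing %s since it exceeds %d characters%n", key, maxPropertySize);
    return false;
  }

  String get(String key) {
    return parameters.get(key);
  }
}
```

The generic message only needs the key, which is why the snapshot summary case, whose message also includes the snapshot id, does not fit this signature without an extra parameter.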

flyrain · 2022-04-25T01:50:51Z

Optional: Maybe we can add a test, about schema being too big to serialize, like snapshotSummary test

Made the change

szehon-ho

Thanks @flyrain, I had a few more comments. Can you see if they make sense?

hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java

szehon-ho · 2022-04-25T18:00:29Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

+    }
+  }
+
+  private void setField(Map<String, String> parameters, String key, String value) {


One performance suggestion: if the user sets it to 0 (disabling this feature), we can skip the serialization.

Maybe the easiest way is to add a boolean function like exposeInHmsProperties() that checks whether the value is 0, and use it in all the methods? (Open to better names.)

if (exposeInHmsProperties() && metadata.sortOrder() != null && metadata.sortOrder().isSorted()) {

Setting it to 0 can control some of them (snapshot summary, schema, partition spec, and sort order), but not all of them. I'd suggest creating another PR to make sure all of them are taken care of.

Makes sense, but could we at least take care of the ones in this PR (schema, partition spec, sort order)? We can have a follow-up for the other ones not touched by this PR.

Just didn't want to leave it in a state where we are wasting CPU cycles (JSON serialization) needlessly if the user turns off this feature, as this is done in the critical commit block, unlike the original serialization, which happens before. The other HMS table properties are, to me, also less CPU intensive as they are just getting a field.

Makes sense, made the change.
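The idea of skipping serialization when the feature is off can be sketched as follows. This is an illustration only: GatedExposure and setIfEnabled are hypothetical, and the Supplier stands in for a call such as PartitionSpecParser.toJson(...) so that the JSON is only produced when it will actually be used.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: defers serialization until we know the result will be exposed.
final class GatedExposure {
  private final long maxPropertySize;

  GatedExposure(long maxPropertySize) {
    this.maxPropertySize = maxPropertySize;
  }

  // A max size of 0 (or less) means nothing can be exposed, so the feature is off.
  boolean exposeInHmsProperties() {
    return maxPropertySize > 0;
  }

  void setIfEnabled(Map<String, String> parameters, String key, Supplier<String> serializer) {
    parameters.remove(key);
    if (!exposeInHmsProperties()) {
      return; // the serializer is never invoked, so no CPU is spent on JSON
    }
    String value = serializer.get();
    if (value.length() <= maxPropertySize) {
      parameters.put(key, value);
    }
  }
}
```

Passing a supplier rather than an already serialized string keeps the check and the serialization in one place, which is one way to avoid repeating the exposeInHmsProperties() condition in every setter.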

szehon-ho · 2022-04-25T18:05:18Z

hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

 import org.apache.hadoop.hive.metastore.api.SerDeInfo;
 import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
 import org.apache.hadoop.hive.metastore.api.Table;
 import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;


Two other suggestions for this class: can we add one more sentence to the comment of "HIVE_TABLE_PROPERTY_MAX_SIZE" to let users know how to turn off the feature?

// set to 0 to not expose Iceberg metadata in HMS Table properties

And also, a precondition in HiveTableOperations constructor to check if value is non-negative.

See #4546 (comment).

Added the comment. Negative is fine, right?

I think it's probably better to disallow negative values, as they make little sense? But to me it's OK either way.

Would throwing an exception be too much in that case? We may just log a warning.
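The two options on the table here can be compared side by side. This is a sketch under assumptions, not what was merged: MaxSizeValidation, strict, and lenient are hypothetical names, System.err stands in for a logger, and treating a negative value as 0 in the lenient variant is an assumption (a negative limit would reject every value anyway, since no string length can be at or below it).

```java
// Hypothetical sketch comparing the two ways to handle a negative max property size.
final class MaxSizeValidation {
  private MaxSizeValidation() {}

  // Option 1: fail fast, as a constructor precondition would.
  static long strict(long configured) {
    if (configured < 0) {
      throw new IllegalArgumentException(
          "Max table property size must be non-negative: " + configured);
    }
    return configured;
  }

  // Option 2: warn and fall back to 0, i.e. treat the feature as disabled (assumption).
  static long lenient(long configured) {
    if (configured < 0) {
      System.err.println(
          "Negative max table property size " + configured + ", treating it as 0 (disabled)");
      return 0;
    }
    return configured;
  }
}
```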

flyrain · 2022-04-25T22:38:02Z

hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java

  }

+  @Test
+  public void testNotExposeTableProperties() {


A new test added for method exposeInHmsProperties()

szehon-ho · 2022-04-25T23:28:39Z

hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java

+  public void testNotExposeTableProperties() {
+    Configuration conf = new Configuration();
+    conf.set("iceberg.hive.table-property-max-size", "0");
+    HiveTableOperations spyOps = spy(new HiveTableOperations(conf, null, null, catalog.name(), DB_NAME, "tbl"));


Nit: Unnecessary spy

szehon-ho

Thanks @flyrain looks good to me now, will merge tomorrow unless others have some review comment.

szehon-ho · 2022-04-26T19:04:27Z

Merged, thanks @flyrain for the contribution and @singhpk234 for the additional review.

flyrain · 2022-04-26T19:23:17Z

Thanks @szehon-ho for the detailed review and merge! Thanks @kbendick and @singhpk234 for the review.

pvary · 2022-04-26T20:14:55Z

Maybe a little late in the game (I was on a longer PTO), and only partially related - since this is a Hive 4 feature -, but we did some work to display the partitioning information in the DESCRIBE FORMATTED command:
See: https://issues.apache.org/jira/browse/HIVE-25326

# Partition Transform Information	 	 
# col_name            	transform_type      	 
b                   	IDENTITY            	 
c                   	IDENTITY

flyrain · 2022-04-26T20:24:33Z

Thanks @pvary. We want to support the use case that doesn't need storage access. Does HIVE-25326 need to access the metadata.json file to show partition spec?

pvary · 2022-04-26T20:37:56Z

@flyrain: Yes, it uses the HiveIcebergStorageHandler.getPartitionTransformSpec which in turn uses IcebergTableUtil.getTable to load the table and read the spec from the table snapshot.

When we started working on Hive table creation, we used the spec and the schema table property fields, but then we decided to remove them and keep the metadata json as the single source of truth. See: HiveIcebergMetaHook.PROPERTIES_TO_REMOVE. I hope this does not cause issues with the new code when the table is created.

flyrain · 2022-04-26T22:01:09Z

Got it. It should not. This PR uses a different name than InputFormatConfig.PARTITION_SPEC (iceberg.mr.table.partition.spec). Thanks for the reminder.

(cherry picked from commit ded9a4d)

github-actions bot added core hive labels Apr 12, 2022

flyrain commented Apr 12, 2022

singhpk234 reviewed Apr 13, 2022

kbendick reviewed Apr 14, 2022

szehon-ho reviewed Apr 14, 2022

flyrain added 3 commits April 21, 2022 12:12

Expose default partition spec and sort order in HMS.

e5cf435

Resolve comments.

409872b

Add source name in spec json string

9bf2a0f

flyrain force-pushed the spec-sort-order branch from 0e857b8 to 9bf2a0f Compare April 21, 2022 19:20

flyrain commented Apr 21, 2022

flyrain added 3 commits April 21, 2022 15:27

Add check for thread hold

8016618

Resolve comments.

65cd12c

Fix the test failure.

cceba61

github-actions bot added the MR label Apr 22, 2022

szehon-ho reviewed Apr 22, 2022

flyrain added 2 commits April 22, 2022 14:11

Resolve comments.

88f8c7f

Resolve comments.

7c9f0d0

szehon-ho approved these changes Apr 23, 2022

singhpk234 approved these changes Apr 24, 2022

Resolve comments.

49796e7

szehon-ho reviewed Apr 25, 2022

Resolve comments

fcb5d48

flyrain commented Apr 25, 2022

szehon-ho reviewed Apr 25, 2022

Remove the spy in tests.

72dc061

Remove the spy in tests.

0c178a7

szehon-ho approved these changes Apr 26, 2022

szehon-ho merged commit ded9a4d into apache:master Apr 26, 2022

singhpk234 mentioned this pull request May 16, 2022

[WIP] AWS: Add partition info to Glue #4775

Closed

InvisibleProgrammer mentioned this pull request Dec 14, 2022

HIVE-26822: port changes before spotless apache/hive#3857

Closed

InvisibleProgrammer mentioned this pull request Jan 3, 2023

HIVE-26808: port iceberg catalog changes apache/hive#3907

Merged

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023

Hive: Expose default partition spec and sort order in HMS (apache#4546)

65ea52a

(cherry picked from commit ded9a4d)
