-
Notifications
You must be signed in to change notification settings - Fork 3k
Hive: Expose default partition spec and sort order in HMS. #4546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 9 commits
e5cf435
409872b
9bf2a0f
8016618
65cd12c
cceba61
88f8c7f
7c9f0d0
49796e7
fcb5d48
72dc061
0c178a7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -52,8 +52,11 @@ | |||||||||||||
| import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants; | ||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Two other suggestion for this class: can we add in comment of "HIVE_TABLE_PROPERTY_MAX_SIZE" , one more sentence to let user know how to turn off feature? And also, a precondition in HiveTableOperations constructor to check if value is non-negative.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #4546 (comment).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added the comment. Negative is fine, right?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think probably better to disallow negative as it makes little sense? But to me its ok either way.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would throwing exception be too much in that case? May just log a warning. |
||||||||||||||
| import org.apache.iceberg.BaseMetastoreTableOperations; | ||||||||||||||
| import org.apache.iceberg.ClientPool; | ||||||||||||||
| import org.apache.iceberg.PartitionSpecParser; | ||||||||||||||
| import org.apache.iceberg.SchemaParser; | ||||||||||||||
| import org.apache.iceberg.Snapshot; | ||||||||||||||
| import org.apache.iceberg.SnapshotSummary; | ||||||||||||||
| import org.apache.iceberg.SortOrderParser; | ||||||||||||||
| import org.apache.iceberg.TableMetadata; | ||||||||||||||
| import org.apache.iceberg.TableProperties; | ||||||||||||||
| import org.apache.iceberg.exceptions.AlreadyExistsException; | ||||||||||||||
|
|
@@ -399,6 +402,9 @@ private void setHmsTableParameters(String newMetadataLocation, Table tbl, TableM | |||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| setSnapshotStats(metadata, parameters); | ||||||||||||||
| setSchema(metadata, parameters); | ||||||||||||||
| setPartitionSpec(metadata, parameters); | ||||||||||||||
| setSortOrder(metadata, parameters); | ||||||||||||||
|
|
||||||||||||||
| tbl.setParameters(parameters); | ||||||||||||||
| } | ||||||||||||||
|
|
@@ -433,6 +439,38 @@ void setSnapshotSummary(Map<String, String> parameters, Snapshot currentSnapshot | |||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| private void setSchema(TableMetadata metadata, Map<String, String> parameters) { | ||||||||||||||
| parameters.remove(TableProperties.CURRENT_SCHEMA); | ||||||||||||||
| if (metadata.schema() != null) { | ||||||||||||||
|
||||||||||||||
| String schema = SchemaParser.toJson(metadata.schema()); | ||||||||||||||
| setField(parameters, TableProperties.CURRENT_SCHEMA, schema); | ||||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| private void setPartitionSpec(TableMetadata metadata, Map<String, String> parameters) { | ||||||||||||||
| parameters.remove(TableProperties.DEFAULT_PARTITION_SPEC); | ||||||||||||||
| if (metadata.spec() != null && metadata.spec().isPartitioned()) { | ||||||||||||||
|
||||||||||||||
| String spec = PartitionSpecParser.toJson(metadata.spec()); | ||||||||||||||
| setField(parameters, TableProperties.DEFAULT_PARTITION_SPEC, spec); | ||||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| private void setSortOrder(TableMetadata metadata, Map<String, String> parameters) { | ||||||||||||||
| parameters.remove(TableProperties.DEFAULT_SORT_ORDER); | ||||||||||||||
| if (metadata.sortOrder() != null && metadata.sortOrder().isSorted()) { | ||||||||||||||
| String sortOrder = SortOrderParser.toJson(metadata.sortOrder()); | ||||||||||||||
| setField(parameters, TableProperties.DEFAULT_SORT_ORDER, sortOrder); | ||||||||||||||
| } | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| private void setField(Map<String, String> parameters, String key, String value) { | ||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One performance suggestion, if the user sets to 0 (disable this feature), we can skip the serialization for performance. Maybe , easiest, we can we add some boolean function like exposeInHmsProperties() that checks if value is 0, and use it in all the methods? (open to better names)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Setting it to 0 can control some of them(snapshot summary, schema, partition spec, and sort order), but not all of them. I'd suggest to create another PR to make sure all are taken care.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Make sense, but could we at least take care of the ones in this PR? ( schema, partition spec, sort order)? We can have a follow up for the other ones not touched by this PR. Just didn't want to leave it in a state where we are wasting CPU cycle (JSON serialization) needlessly if the user turns off this feature. As this is done in the critical commit block, unlike the original serialization which happens before. The other HMS table properties to me are also less CPU intensive as they are just getting a field.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Make sense, made the change. |
||||||||||||||
| if (value.length() <= maxHiveTablePropertySize) { | ||||||||||||||
| parameters.put(key, value); | ||||||||||||||
| } else { | ||||||||||||||
| LOG.warn("Not exposing {} in HMS since it exceeds {} characters", key, maxHiveTablePropertySize); | ||||||||||||||
| } | ||||||||||||||
|
Comment on lines
+471
to
+476
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [nit] should we make this : iceberg/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java Lines 425 to 430 in 449a743
also use this setter
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was trying to do that, but the warn message needs snapshot id here. But it requires changes for method |
||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| private StorageDescriptor storageDescriptor(TableMetadata metadata, boolean hiveEngineEnabled) { | ||||||||||||||
|
|
||||||||||||||
| final StorageDescriptor storageDescriptor = new StorageDescriptor(); | ||||||||||||||
|
|
||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful to clarify what is meant by
defaultfor both of these.Each javadoc states that it’s for the “default” spec / sort order, but I’m still not entirely sure what that means.
Is it the current sort order / partition spec that would be used if the user doesn’t override it for an individual query ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like “JSON representation of the table’s current configured partition spec, which will be used if not overridden for individual writes”. Kind of wordy but something along those lines would be helpful for me if quickly looking through the JavaDocs etc. Will leave that decision to you though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is for the current partition spec and sort order. Make sense to me. Will make the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you reverted too many? Should be 'current'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to be consistent with the name in metadata.json, which is
default-partition-spec. The same for sort order. I changed the comments though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see, yea I always find it confusing.
In the comment, maybe we can add that they are equivalent, otherwise the comment is even more confusing:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is confusing. I like the
currentmore. I keep the original name just for consistency.