HIVE-27421: Do not set column stats in metastore when non-native table can store column stats in its own format #4397
// Set table or partition column statistics in metastore.
db.setPartitionColumnStatistics(request);
}
db.setPartitionColumnStatistics(request);
I'm a little confused about the change.
If we cannot get stats from Puffin due to some exception, we can fall back to getting stats from the metastore. So I think writing stats to both places may be meaningful. Please correct me if I misunderstand. Thanks.
@zhangbutao I agree with your point. However, storing stats in two places has its pros and cons.
Pros:
- We can fall back to the metastore by setting hive.iceberg.stats.source=metastore if we are not able to get stats from Puffin files.
Cons:
- Any change made to Puffin files by external clients is not visible to the metastore.
- The metastore DB calls needed to store column stats have a performance cost.
With the approach in this PR, users who want to fall back to the metastore when they cannot get stats from Puffin can set hive.iceberg.stats.source=metastore and execute ANALYZE TABLE <tableName> COMPUTE STATISTICS FOR COLUMNS. (This has the overhead of one more ANALYZE query.)
I will leave it to the community to decide whether it is best to store stats in two places or whether a single place is sufficient. If the community thinks it is best to store them in two places, I won't proceed further. Otherwise, I will continue with the patch.
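For illustration, the manual fallback described above could be scripted roughly like this. It is only a sketch: the class and method names are made up, and it assumes both statements are sent over an existing HiveServer2 JDBC connection in the same session.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hive: switch the stats source and recompute column stats. */
final class MetastoreStatsFallback {

  // Accepts "tbl" or "db.tbl"; anything else is rejected to avoid building unsafe SQL.
  private static final String IDENTIFIER =
      "[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?";

  /** Builds the two statements mentioned in the comment, in execution order. */
  static List<String> statements(String tableName) {
    if (tableName == null || !tableName.matches(IDENTIFIER)) {
      throw new IllegalArgumentException("Unexpected table name: " + tableName);
    }
    return Arrays.asList(
        "SET hive.iceberg.stats.source=metastore",
        "ANALYZE TABLE " + tableName + " COMPUTE STATISTICS FOR COLUMNS");
  }

  /** Runs the statements on one connection so the SET applies to the ANALYZE. */
  static void run(Connection conn, String tableName) throws SQLException {
    try (Statement st = conn.createStatement()) {
      for (String sql : statements(tableName)) {
        st.execute(sql);
      }
    }
  }
}
```

The extra ANALYZE pass is the overhead mentioned above; nothing is written to the metastore until a user explicitly asks for it.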
> If we cannot get stats from Puffin due to some exception, we can fall back to getting stats from the metastore. So I think writing stats to both places may be meaningful.
Storing in two places has additional costs during writes. Currently we have two modes, "iceberg" and "metastore", and each denotes where the stats are stored.
Storing in both would effectively be a third mode, something like "both". At present we also have no fallback logic on the read side that goes to the metastore when the Puffin files are inaccessible.
If we want such a thing, we can add a new mode at a later stage if we feel it is required.
For now, I think "iceberg" mode should store stats only in Puffin and "metastore" mode should store them only in the metastore.
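A rough sketch of the one-mode-one-store behavior, to make the idea concrete. All names here (`ColumnStatsWriter`, `StatsSource`, `StatsSink`) are hypothetical and are not Hive's actual classes; only the two config values "iceberg" and "metastore" come from this discussion.

```java
import java.util.Locale;

/** Hypothetical sketch: route a column-stats write to exactly one store. */
final class ColumnStatsWriter {

  /** The two modes named in the discussion. */
  enum StatsSource {
    ICEBERG, METASTORE;

    /** Parses a config value such as "iceberg" or "metastore". */
    static StatsSource fromConfig(String value) {
      return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
  }

  /** Hypothetical stand-in for a place where stats can be written. */
  interface StatsSink {
    void write(String table, String stats);
  }

  private final StatsSink puffin;
  private final StatsSink metastore;

  ColumnStatsWriter(StatsSink puffin, StatsSink metastore) {
    this.puffin = puffin;
    this.metastore = metastore;
  }

  /** Writes to Puffin or to the metastore, never to both. */
  void write(StatsSource mode, String table, String stats) {
    switch (mode) {
      case ICEBERG:
        puffin.write(table, stats);
        break;
      case METASTORE:
        metastore.write(table, stats);
        break;
      default:
        throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }
}
```

A future "both" mode would be one more enum constant and one more case, which is why it can be deferred until someone actually needs it.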
hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java
Lines 308 to 311 in 1060039
Long rowCnt = getRowCnt(pctx, tsOp, tbl);
// if we can not have correct table stats, then both the table stats and column stats are not useful.
if (rowCnt == null) {
hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java
Lines 932 to 950 in 1060039
private Long getRowCnt(
    ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException {
  Long rowCnt = 0L;
  if (tbl.isPartitioned()) {
    for (Partition part : pctx.getPrunedPartitions(
        tsOp.getConf().getAlias(), tsOp).getPartitions()) {
      if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) {
        return null;
      }
      long partRowCnt = Long.parseLong(part.getParameters().get(StatsSetupConst.ROW_COUNT));
      rowCnt += partRowCnt;
    }
  } else { // unpartitioned table
    if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
      return null;
    }
    rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
  }
  return rowCnt;
Currently, as in HIVE-27347, the count(*) optimization always uses the Iceberg basic stats from the metastore. We should consider how to do this when only Puffin stats are used.
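To make the dependency explicit, here is a simplified, self-contained restatement of the row-count logic quoted above. `PartitionStats` is a hypothetical stand-in for the partition parameters held in the metastore; this is not the Hive code itself.

```java
import java.util.List;

/** Simplified sketch of the all-or-nothing row count used for count(*) answering. */
final class RowCountSketch {

  /** Hypothetical stand-in for one partition's basic stats. */
  static final class PartitionStats {
    final boolean upToDate;
    final long rowCount;

    PartitionStats(boolean upToDate, long rowCount) {
      this.upToDate = upToDate;
      this.rowCount = rowCount;
    }
  }

  /** Returns the summed row count, or null as soon as any partition's stats are stale. */
  static Long totalRowCount(List<PartitionStats> partitions) {
    long total = 0L;
    for (PartitionStats p : partitions) {
      if (!p.upToDate) {
        return null; // one stale partition makes the whole answer unusable
      }
      total += p.rowCount;
    }
    return total;
  }
}
```

Whatever replaces the metastore as the source of basic stats would need to provide both pieces of information per partition: a row count and a way to tell whether it is current.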
Created #5400 to address the above.
The above item is resolved now, so we can proceed with the merge.
Kudos, SonarCloud Quality Gate passed!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
…tore stats in its own format
6bcb236 to
d5bb8c7
Compare
What changes were proposed in this pull request?
Do not set column stats in metastore when non-native table can store column stats in its own format
Why are the changes needed?
Non-native table formats like Iceberg can store column stats in their own format (for Iceberg, they are stored in Puffin files).
However, these stats are also written to the metastore after being set in the table's own format. We should avoid setting column stats in two places and set them in only one.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Qtest