-
Notifications
You must be signed in to change notification settings - Fork 4.8k
[WIP] Iceberg: Do not use HMS stats when statsSource is Iceberg #5400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
| } | ||
| rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT)); | ||
| Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ? | ||
| tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discuss:
Shoule we regard the stats is always accurate when statsSource is iceberg?
If so, we need always to keep the configuration iceberg.hive.keep.stats true when statsSource is iceberg, so that we can optimization the count(*) when statsSource is iceberg by StatsOptimizer. Same idea i wanted to do is #5215
hive/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
Line 174 in 4f7200d
| boolean keepHiveStats = conf.getBoolean(ConfigProperties.KEEP_HIVE_STATS, false); |
hive/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
Lines 225 to 227 in 4f7200d
| if (!keepHiveStats) { | |
| StatsSetupConst.setBasicStatsState(tbl.getParameters(), StatsSetupConst.FALSE); | |
| StatsSetupConst.clearColumnStatsState(tbl.getParameters()); |
| if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I missed the #5215
I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.
Since we have a statsSource flag I just wanted to be consistent where we take a stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao, could you please check the comments in #5215 and maybe incorporate changes from this PR into yours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.
Iceberg table with equal delets should not be optimized by the count query optimization, so we can skip to get the stats or return null in case of existing deletes.
yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg
iceberg.hive.keep.stats should be always enabled(true) when stats source is iceberg. iceberg.hive.keep.stats true means that the stats is accurate.
HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) write the table but not update the HMS stats.
But if the statsSource is iceberg, the stats is retrieved from iceberg SnapshotSummary which is real time and accurate.
could you please check the comments in #5215 and maybe incorporate changes from this PR into yours
#5215 I want to optimize the count query with no care the values of iceberg.hive.keep.stats. But i am not do the limit like your change that not use HMS stats when statsSource is Iceberg. I am ok to incorporate this thange as well as some other supplements to #5215, we can continue to disscuss there.
| } | ||
| } | ||
| return false; | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of delete files, if Iceberg statsSource can not compute query( eg. count(*)) using stats, i think HMS stats can't either. They both come from the same place -- SnapshotSummary.
We should not give a wrong impression that HMS can give the accurate stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if we do alter table compute stats?
need to check how HMS stats works for ACID table deletes, does it stay accurate or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
In case of delete files, analyze table compute stats job can get the accurate stats as it will launch tez task to compute the stats.
And after the job analyze table compute stats, the HMS stats will be updated & accurate and iceberg.hive.keep.stats will be true, so we can use the HMS stats to optimize the count query.
But if the statsSource is Iceberg & in case of delete files, even we have done the job analyze table compute stats, we won't update the Iceberg SnapshotSummary, so we can not optimize the count query.
This will look a little weird. Users do a job analyze table compute stats to update the stats, but they can not optimize the count query if the statsSource is Iceberg & in case of delete files.
| @Override | ||
| public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Table hmsTable) { | ||
| if (getStatsSource().equals(HiveMetaHook.ICEBERG) && hmsTable.getMetaTable() == null) { | ||
| if (hmsTable.getMetaTable() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query against Iceberg Branch/Tag can also benefit from the stats. We can optimize this later.
| } | ||
| rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT)); | ||
| Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ? | ||
| tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.
Iceberg table with equal delets should not be optimized by the count query optimization, so we can skip to get the stats or return null in case of existing deletes.
yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg
iceberg.hive.keep.stats should be always enabled(true) when stats source is iceberg. iceberg.hive.keep.stats true means that the stats is accurate.
HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) write the table but not update the HMS stats.
But if the statsSource is iceberg, the stats is retrieved from iceberg SnapshotSummary which is real time and accurate.
could you please check the comments in #5215 and maybe incorporate changes from this PR into yours
#5215 I want to optimize the count query with no care the values of iceberg.hive.keep.stats. But i am not do the limit like your change that not use HMS stats when statsSource is Iceberg. I am ok to incorporate this thange as well as some other supplements to #5215, we can continue to disscuss there.
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |



What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
Is the change a dependency upgrade?
How was this patch tested?