Skip to content

Conversation

@deniskuzZ
Copy link
Member

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

@deniskuzZ deniskuzZ changed the title Do not use HMS stats when statsSource is Iceberg Iceberg: Do not use HMS stats when statsSource is Iceberg Aug 21, 2024
@sonarqubecloud
Copy link

}
rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ?
tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discuss:
Shoule we regard the stats is always accurate when statsSource is iceberg?
If so, we need always to keep the configuration iceberg.hive.keep.stats true when statsSource is iceberg, so that we can optimization the count(*) when statsSource is iceberg by StatsOptimizer. Same idea i wanted to do is #5215

boolean keepHiveStats = conf.getBoolean(ConfigProperties.KEEP_HIVE_STATS, false);

if (!keepHiveStats) {
StatsSetupConst.setBasicStatsState(tbl.getParameters(), StatsSetupConst.FALSE);
StatsSetupConst.clearColumnStatsState(tbl.getParameters());

if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I missed the #5215
I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.
Since we have a statsSource flag I just wanted to be consistent where we take a stats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, could you please check the comments in #5215 and maybe incorporate changes from this PR into yours

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.

Iceberg table with equal delets should not be optimized by the count query optimization, so we can skip to get the stats or return null in case of existing deletes.

yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg

iceberg.hive.keep.stats should be always enabled(true) when stats source is iceberg. iceberg.hive.keep.stats true means that the stats is accurate.
HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) write the table but not update the HMS stats.
But if the statsSource is iceberg, the stats is retrieved from iceberg SnapshotSummary which is real time and accurate.

could you please check the comments in #5215 and maybe incorporate changes from this PR into yours

#5215 I want to optimize the count query with no care the values of iceberg.hive.keep.stats. But i am not do the limit like your change that not use HMS stats when statsSource is Iceberg. I am ok to incorporate this thange as well as some other supplements to #5215, we can continue to disscuss there.

@deniskuzZ deniskuzZ marked this pull request as draft August 24, 2024 10:41
}
}
return false;
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of delete files, if Iceberg statsSource can not compute query( eg. count(*)) using stats, i think HMS stats can't either. They both come from the same place -- SnapshotSummary.

We should not give a wrong impression that HMS can give the accurate stats.

Copy link
Member Author

@deniskuzZ deniskuzZ Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if we do alter table compute stats?
need to check how HMS stats works for ACID table deletes, does it stay accurate or not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
In case of delete files, analyze table compute stats job can get the accurate stats as it will launch tez task to compute the stats.

And after the job analyze table compute stats, the HMS stats will be updated & accurate and iceberg.hive.keep.stats will be true, so we can use the HMS stats to optimize the count query.

But if the statsSource is Iceberg & in case of delete files, even we have done the job analyze table compute stats, we won't update the Iceberg SnapshotSummary, so we can not optimize the count query.

This will look a little weird. Users do a job analyze table compute stats to update the stats, but they can not optimize the count query if the statsSource is Iceberg & in case of delete files.

@Override
public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Table hmsTable) {
if (getStatsSource().equals(HiveMetaHook.ICEBERG) && hmsTable.getMetaTable() == null) {
if (hmsTable.getMetaTable() != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query against Iceberg Branch/Tag can also benefit from the stats. We can optimize this later.

}
rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ?
tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very certain how accurate is the TOTAL_RECORDS_PROP from the snapshot summary especially when there are deletes.

Iceberg table with equal delets should not be optimized by the count query optimization, so we can skip to get the stats or return null in case of existing deletes.

yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg

iceberg.hive.keep.stats should be always enabled(true) when stats source is iceberg. iceberg.hive.keep.stats true means that the stats is accurate.
HMS stats for iceberg can be not accurate if some other engines(Spark、Trino) write the table but not update the HMS stats.
But if the statsSource is iceberg, the stats is retrieved from iceberg SnapshotSummary which is real time and accurate.

could you please check the comments in #5215 and maybe incorporate changes from this PR into yours

#5215 I want to optimize the count query with no care the values of iceberg.hive.keep.stats. But i am not do the limit like your change that not use HMS stats when statsSource is Iceberg. I am ok to incorporate this thange as well as some other supplements to #5215, we can continue to disscuss there.

@deniskuzZ deniskuzZ changed the title Iceberg: Do not use HMS stats when statsSource is Iceberg WIP: Iceberg: Do not use HMS stats when statsSource is Iceberg Sep 3, 2024
@deniskuzZ deniskuzZ changed the title WIP: Iceberg: Do not use HMS stats when statsSource is Iceberg [WIP] Iceberg: Do not use HMS stats when statsSource is Iceberg Sep 3, 2024
zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 11, 2024
zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 11, 2024
zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 25, 2024
@github-actions
Copy link

github-actions bot commented Nov 3, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Nov 3, 2024
@github-actions github-actions bot closed this Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants