[WIP] Iceberg: Do not use HMS stats when statsSource is Iceberg #5400

deniskuzZ · 2024-08-21T15:37:10Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

sonarqubecloud · 2024-08-22T16:24:48Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

zhangbutao · 2024-08-23T02:31:05Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

        }
-        rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
+        Map<String, String> basicStats = MetaStoreUtils.isNonNativeTable(tbl.getTTable()) ?
+          tbl.getStorageHandler().getBasicStatistics(tbl) : tbl.getParameters();


Discuss:
Should we regard the stats as always accurate when statsSource is iceberg?
If so, we need to always keep iceberg.hive.keep.stats set to true when statsSource is iceberg, so that StatsOptimizer can optimize count(*) in that case. I wanted to do the same thing in #5215.

hive/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

Line 174 in 4f7200d

boolean keepHiveStats = conf.getBoolean(ConfigProperties.KEEP_HIVE_STATS, false);

hive/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java

Lines 225 to 227 in 4f7200d

if (!keepHiveStats) {
  StatsSetupConst.setBasicStatsState(tbl.getParameters(), StatsSetupConst.FALSE);
  StatsSetupConst.clearColumnStatsState(tbl.getParameters());

hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

Line 945 in 4f7200d

if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
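
A minimal sketch of how the three referenced snippets interact, under stated assumptions. This is not Hive code: the class `KeepStatsFlowSketch`, the `basicStatsAccurate` key and both methods are hypothetical simplifications of the state that StatsSetupConst keeps in the table parameters. Only the `numRows` key is meant to mirror StatsSetupConst.ROW_COUNT.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the commit path and the optimizer gate; not Hive code.
final class KeepStatsFlowSketch {
  // Assumed stand-in for the basic-stats state tracked through StatsSetupConst.
  static final String BASIC_STATS_ACCURATE = "basicStatsAccurate";
  // Intended to mirror StatsSetupConst.ROW_COUNT.
  static final String ROW_COUNT = "numRows";

  // Models the commit: without keep-stats, the basic stats are marked as not accurate.
  static void onCommit(Map<String, String> tableParams, boolean keepHiveStats) {
    if (!keepHiveStats) {
      tableParams.put(BASIC_STATS_ACCURATE, "false");
    }
  }

  // Models the optimizer gate: count(*) is answered from stats only when they are marked accurate.
  static boolean canAnswerCountFromStats(Map<String, String> tableParams) {
    return "true".equals(tableParams.get(BASIC_STATS_ACCURATE))
        && tableParams.containsKey(ROW_COUNT);
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(BASIC_STATS_ACCURATE, "true");
    params.put(ROW_COUNT, "42");

    onCommit(params, true);
    System.out.println(canAnswerCountFromStats(params)); // true

    onCommit(params, false);
    System.out.println(canAnswerCountFromStats(params)); // false
  }
}
```

With the flag at its default of false, the second case applies after every commit, which is why the count(*) rewrite never triggers unless the flag is turned on.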

Oh sorry, I missed #5215.
I am not sure how accurate TOTAL_RECORDS_PROP from the snapshot summary is, especially when there are deletes.
Since we have a statsSource flag, I just wanted to be consistent about where we take the stats from.

yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg

@zhangbutao, could you please check the comments in #5215 and maybe incorporate changes from this PR into yours

I am not sure how accurate TOTAL_RECORDS_PROP from the snapshot summary is, especially when there are deletes.

Iceberg tables with equality deletes should not go through the count query optimization, so we can skip fetching the stats, or return null, when deletes exist.
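
The "return null when deletes exist" idea could look roughly like the sketch below. It is illustrative only and not taken from this PR or from #5215: `SnapshotRowCountSketch` and `rowCountOrNull` are hypothetical names, the string keys are assumed to match the Iceberg snapshot summary properties, and treating position deletes the same way as equality deletes is a conservative assumption of this sketch.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of this PR or of #5215.
final class SnapshotRowCountSketch {
  // Assumed to match the Iceberg snapshot summary property names.
  static final String TOTAL_RECORDS = "total-records";
  static final String TOTAL_EQ_DELETES = "total-equality-deletes";
  static final String TOTAL_POS_DELETES = "total-position-deletes";

  // Returns the row count from the summary, or null when it cannot be trusted.
  static Long rowCountOrNull(Map<String, String> summary) {
    if (summary == null || !summary.containsKey(TOTAL_RECORDS)) {
      return null;
    }
    try {
      // Any recorded deletes mean total-records may overstate the live row count.
      if (parseOrZero(summary.get(TOTAL_EQ_DELETES)) > 0
          || parseOrZero(summary.get(TOTAL_POS_DELETES)) > 0) {
        return null;
      }
      return Long.parseLong(summary.get(TOTAL_RECORDS));
    } catch (NumberFormatException e) {
      // A malformed summary value is treated like missing stats.
      return null;
    }
  }

  private static long parseOrZero(String value) {
    return value == null ? 0L : Long.parseLong(value);
  }

  public static void main(String[] args) {
    System.out.println(rowCountOrNull(Map.of(TOTAL_RECORDS, "100")));
    System.out.println(rowCountOrNull(Map.of(TOTAL_RECORDS, "100", TOTAL_EQ_DELETES, "3")));
  }
}
```

A null result would tell the caller to fall back to a regular scan instead of answering count(*) from stats.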

yes, i think iceberg.hive.keep.stats should be enabled when stats source is not iceberg

iceberg.hive.keep.stats should always be enabled (true) when the stats source is iceberg. iceberg.hive.keep.stats=true means that the stats are accurate.
HMS stats for an Iceberg table can be inaccurate if other engines (Spark, Trino) write to the table without updating the HMS stats.
But if statsSource is iceberg, the stats are retrieved from the Iceberg SnapshotSummary, which is real-time and accurate.

could you please check the comments in #5215 and maybe incorporate changes from this PR into yours

In #5215 I want to optimize the count query regardless of the value of iceberg.hive.keep.stats. But I did not add the restriction from your change, which is to not use HMS stats when statsSource is Iceberg. I am OK with incorporating this change, along with some other additions, into #5215; we can continue the discussion there.

zhangbutao · 2024-08-26T08:09:24Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

      }
    }
-    return false;
+    return true;


When there are delete files, if the Iceberg statsSource cannot compute a query (e.g. count(*)) using stats, I think HMS stats can't either. They both come from the same place: SnapshotSummary.

We should not give the wrong impression that HMS can provide accurate stats.

Even if we run alter table compute stats?
We need to check how HMS stats work for ACID table deletes: do they stay accurate or not?

Good catch!
When there are delete files, an analyze table compute stats job can get accurate stats, since it launches a Tez task to compute them.

After the analyze table compute stats job, the HMS stats are updated and accurate and iceberg.hive.keep.stats is true, so we can use the HMS stats to optimize the count query.

But if statsSource is Iceberg and there are delete files, even after running analyze table compute stats we do not update the Iceberg SnapshotSummary, so we cannot optimize the count query.

This looks a little odd: users run analyze table compute stats to update the stats, but the count query still cannot be optimized if statsSource is Iceberg and there are delete files.
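
The cases described above can be summarized as a small decision function. This is only a sketch of the behavior as discussed in this thread, not the actual logic of HiveIcebergStorageHandler or StatsOptimizer; the class, the `StatsSource` enum and the method name are hypothetical.

```java
// Hypothetical decision sketch based on the discussion; not Hive code.
final class CountOptimizationSketch {
  // Assumed stand-in for the configured stats source.
  enum StatsSource { ICEBERG, METASTORE }

  static boolean canOptimizeCount(StatsSource source, boolean hasDeleteFiles, boolean hmsStatsAccurate) {
    if (source == StatsSource.ICEBERG) {
      // The snapshot summary is not refreshed by a compute stats job,
      // so delete files always block the optimization here.
      return !hasDeleteFiles;
    }
    // With HMS as the source, a compute stats job can make the stats accurate again.
    return hmsStatsAccurate;
  }

  public static void main(String[] args) {
    // The case called out as odd: stats were recomputed, but the source is Iceberg.
    System.out.println(canOptimizeCount(StatsSource.ICEBERG, true, true));   // false
    System.out.println(canOptimizeCount(StatsSource.METASTORE, true, true)); // true
  }
}
```

The first call shows the asymmetry: the same table with freshly computed HMS stats gets the optimization only when the stats source is not Iceberg.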

zhangbutao · 2024-08-26T08:15:10Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

  @Override
  public boolean canComputeQueryUsingStats(org.apache.hadoop.hive.ql.metadata.Table hmsTable) {
-    if (getStatsSource().equals(HiveMetaHook.ICEBERG) && hmsTable.getMetaTable() == null) {
+    if (hmsTable.getMetaTable() != null) {


Queries against an Iceberg branch or tag can also benefit from the stats. We can optimize this later.

github-actions · 2024-11-03T00:28:01Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

Do not use HMS stats when statsSource is Iceberg

da0518e

deniskuzZ changed the title ~~Do not use HMS stats when statsSource is Iceberg~~ Iceberg: Do not use HMS stats when statsSource is Iceberg Aug 21, 2024

asf-ci-hive added the tests pending label Aug 21, 2024

deniskuzZ mentioned this pull request Aug 21, 2024

HIVE-27421: Do not set column stats in metastore when non-native table can store column stats in its own format #4397

Closed

asf-ci-hive added tests passed and removed tests pending labels Aug 21, 2024

fix

09c586a

asf-ci-hive added tests pending tests failed and removed tests passed tests pending tests failed labels Aug 21, 2024

asf-ci-hive added tests passed and removed tests pending labels Aug 22, 2024

zhangbutao reviewed Aug 23, 2024

deniskuzZ marked this pull request as draft August 24, 2024 10:41

zhangbutao reviewed Aug 26, 2024

deniskuzZ changed the title ~~Iceberg: Do not use HMS stats when statsSource is Iceberg~~ WIP: Iceberg: Do not use HMS stats when statsSource is Iceberg Sep 3, 2024

deniskuzZ changed the title ~~WIP: Iceberg: Do not use HMS stats when statsSource is Iceberg~~ [WIP] Iceberg: Do not use HMS stats when statsSource is Iceberg Sep 3, 2024

zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 11, 2024

Do not use HMS stats when statsSource is Iceberg apache#5400

1347fef

zhangbutao mentioned this pull request Oct 11, 2024

HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215

Merged

zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 11, 2024

Do not use HMS stats when statsSource is Iceberg apache#5400

7b2ffb2

zhangbutao pushed a commit to zhangbutao/hive that referenced this pull request Oct 25, 2024

Do not use HMS stats when statsSource is Iceberg apache#5400

22a6a98

github-actions bot added the stale label Nov 3, 2024

github-actions bot closed this Nov 11, 2024
