HIVE-27421: Do not set column stats in metastore when non-native table can store column stats in its own format #4397

SourabhBadhya · 2023-06-08T14:34:59Z

What changes were proposed in this pull request?

Do not set column stats in metastore when non-native table can store column stats in its own format

Why are the changes needed?

Non-native table formats like Iceberg can store column stats in their own format (for Iceberg, they are stored in Puffin files).

However, after the column stats are written in the table's own format, they are also stored in the metastore. We should avoid writing column stats in two places and write them in a single place only.
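
As a rough illustration of the intended behavior (not the actual patch), the write path can be sketched as a single either/or decision. The names `ColStatsCapableHandler`, `FakeMetastore`, and `ColStatsRoutingSketch` are hypothetical stand-ins and do not correspond to Hive classes; failing hard when the handler cannot store the stats is also an assumption made only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a storage handler of a non-native table; not Hive's interface.
interface ColStatsCapableHandler {
    boolean canStoreColStats();
    boolean storeColStats(String table, List<String> colStats);
}

// Hypothetical stand-in for the metastore client; it only records the writes it receives.
class FakeMetastore {
    final List<String> writes = new ArrayList<>();

    void setColumnStatistics(String table, List<String> colStats) {
        writes.add(table + ":" + colStats);
    }
}

class ColStatsRoutingSketch {
    // Writes column stats to exactly one place and reports which one was used.
    static String persist(String table, List<String> colStats,
                          ColStatsCapableHandler handler, FakeMetastore metastore) {
        if (handler != null && handler.canStoreColStats()) {
            // Assumption: a failed handler write is surfaced as an error
            // instead of silently falling back to the metastore.
            if (!handler.storeColStats(table, colStats)) {
                throw new IllegalStateException("Could not store column stats for " + table);
            }
            return "handler";
        }
        // Native tables, or handlers without their own stats format, use the metastore.
        metastore.setColumnStatistics(table, colStats);
        return "metastore";
    }
}
```

In this sketch a table whose handler reports `canStoreColStats()` never reaches `setColumnStatistics`, which is the single-location behavior described above.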

Does this PR introduce any user-facing change?

No

How was this patch tested?

Qtest

zhangbutao · 2023-06-08T15:50:02Z

ql/src/java/org/apache/hadoop/hive/ql/stats/ColStatsProcessor.java

+        // Set table or partition column statistics in metastore.
+        db.setPartitionColumnStatistics(request);
      }
-      db.setPartitionColumnStatistics(request);


I'm a little confused about the change.
If we cannot get stats from Puffin files due to some exception, we can fall back to getting stats from the metastore. So I think writing stats to both places may be meaningful. Please correct me if I misunderstand. Thanks.

@zhangbutao I agree with your point. However, storing stats in two places has its pros and cons.

Pros:

We can fall back to the metastore by setting hive.iceberg.stats.source=metastore if we are not able to get stats from Puffin files.

Cons:

Any change made to the Puffin files by external clients is not visible to the metastore.

Executing the metastore DB calls to store column stats has a performance cost.

With the approach in this PR, users who want to fall back to the metastore when they cannot get stats from Puffin would set hive.iceberg.stats.source=metastore and run ANALYZE TABLE <tableName> COMPUTE STATISTICS FOR COLUMNS. (This adds the overhead of one more ANALYZE query.)

I will leave it to the community to decide whether it is best to store stats in two places or whether storing them in a single place is sufficient. If the community thinks it is best to store them in two places, I won't proceed further. Otherwise, I will continue with the patch.

> If we cannot get stats from Puffin files due to some exception, we can fall back to getting stats from the metastore. So I think writing stats to both places may be meaningful.

Storing in two places has an additional cost during writes. Currently we have two modes, "iceberg" and "metastore", and each denotes where the stats are stored.

Storing in both places amounts to a third mode, something like "both". At present we also have no fallback logic on the read side, i.e., nothing that goes to the metastore when the Puffin files are inaccessible.

If we want such a thing, we can add a new mode at a later stage, should we feel it is required.

As of now, I think "iceberg" mode should store only in Puffin and "metastore" mode should store only in the metastore.
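
To make the two-mode idea concrete, here is a minimal in-memory sketch. It assumes only the two values "iceberg" and "metastore" mentioned in this thread; `StatsSource` and `SingleSourceStats` are hypothetical names for illustration and model neither Hive nor Iceberg code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// The two modes discussed in this thread.
enum StatsSource {
    ICEBERG, METASTORE;

    // Parses a config value such as "iceberg" or "metastore".
    static StatsSource fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}

// Hypothetical in-memory model of the two stores (column name -> some stat value).
class SingleSourceStats {
    private final Map<String, Long> puffin = new HashMap<>();
    private final Map<String, Long> metastore = new HashMap<>();

    private Map<String, Long> storeFor(StatsSource source) {
        return source == StatsSource.ICEBERG ? puffin : metastore;
    }

    // Each mode writes to exactly one store.
    void write(StatsSource source, String column, long value) {
        storeFor(source).put(column, value);
    }

    // Reads consult only the configured store; there is no fallback to the other one.
    Optional<Long> read(StatsSource source, String column) {
        return Optional.ofNullable(storeFor(source).get(column));
    }
}
```

With this model, stats written in "iceberg" mode are not found after switching to "metastore" until they are computed again, which corresponds to the extra ANALYZE step mentioned earlier. A "both" mode would be a third enum value with its own write and read rules.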

hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

Lines 308 to 311 in 1060039

    Long rowCnt = getRowCnt(pctx, tsOp, tbl);
    // if we can not have correct table stats, then both the table stats and column stats are not useful.
    if (rowCnt == null) {

hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

Lines 932 to 950 in 1060039

    private Long getRowCnt(
        ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException {
      Long rowCnt = 0L;
      if (tbl.isPartitioned()) {
        for (Partition part : pctx.getPrunedPartitions(
            tsOp.getConf().getAlias(), tsOp).getPartitions()) {
          if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) {
            return null;
          }
          long partRowCnt = Long.parseLong(part.getParameters().get(StatsSetupConst.ROW_COUNT));
          rowCnt += partRowCnt;
        }
      } else { // unpartitioned table
        if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
          return null;
        }
        rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
      }
      return rowCnt;

Currently, as in this example, HIVE-27347 always uses the Iceberg basic stats from the metastore to optimize count(*) queries. We should consider how to do this if only Puffin stats are used.
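
The dependency can be illustrated with a simplified, self-contained model of the excerpt above. This is only a sketch: `RowCountSketch` and both parameter keys are hypothetical and chosen for illustration, and the freshness check is reduced to a single flag.

```java
import java.util.List;
import java.util.Map;

class RowCountSketch {
    // Illustrative parameter keys; treat them as assumptions, not Hive constants.
    static final String ROW_COUNT = "numRows";
    static final String BASIC_STATS_FRESH = "basicStatsFresh";

    // Sums per-partition row counts, or returns null as soon as one partition
    // has stale or missing basic stats, mirroring the early "return null" above.
    static Long totalRowCount(List<Map<String, String>> partitionParams) {
        long total = 0L;
        for (Map<String, String> params : partitionParams) {
            String rows = params.get(ROW_COUNT);
            if (!"true".equals(params.get(BASIC_STATS_FRESH)) || rows == null) {
                return null;
            }
            total += Long.parseLong(rows);
        }
        return total;
    }
}
```

A null result means the count(*) shortcut cannot be applied, so whichever store is chosen has to keep the basic stats that this check relies on readable.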

Created #5400 to address the above.

The above item is resolved now, so we can proceed with the merge.

sonarqubecloud · 2023-06-08T17:25:32Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

github-actions · 2023-08-10T00:21:35Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.


sonarqubecloud · 2023-08-16T13:35:15Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

github-actions · 2023-10-16T00:19:25Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

kgyrtkirk added the tests pending label Jun 8, 2023

SourabhBadhya changed the title ~~HIVE-27421: Do not set stats in metastore when non-native table can store stats in its own format~~ HIVE-27421: Do not set column stats in metastore when non-native table can store stats in its own format Jun 8, 2023

SourabhBadhya changed the title ~~HIVE-27421: Do not set column stats in metastore when non-native table can store stats in its own format~~ HIVE-27421: Do not set column stats in metastore when non-native table can store column stats in its own format Jun 8, 2023

zhangbutao reviewed Jun 8, 2023

kgyrtkirk added tests unstable and removed tests pending labels Jun 8, 2023

SourabhBadhya mentioned this pull request Jun 26, 2023

HIVE-27455: Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables #4440

Merged

github-actions bot added the stale label Aug 10, 2023

deniskuzZ removed the stale label Aug 16, 2023

HIVE-27421: Do not set stats in metastore when non-native table can s…

d5bb8c7

…tore stats in its own format

SourabhBadhya force-pushed the HIVE-27421 branch from 6bcb236 to d5bb8c7 Compare August 16, 2023 11:42

asf-ci-hive added tests pending and removed tests unstable labels Aug 16, 2023

asf-ci-hive added tests unstable and removed tests pending labels Aug 16, 2023

github-actions bot added the stale label Oct 16, 2023

github-actions bot closed this Oct 24, 2023

deniskuzZ reopened this Feb 26, 2025

asf-ci-hive added tests pending, tests failed and removed tests unstable, tests pending labels Feb 26, 2025

github-actions bot closed this Mar 6, 2025

SourabhBadhya commented Jun 8, 2023 (edited)

zhangbutao Jun 8, 2023

SourabhBadhya Jun 9, 2023 (edited)

ayushtkn Jun 9, 2023

zhangbutao Jun 10, 2023

deniskuzZ Aug 21, 2024

deniskuzZ Feb 26, 2025

sonarqubecloud bot commented Jun 8, 2023

github-actions bot commented Aug 10, 2023

sonarqubecloud bot commented Aug 16, 2023

github-actions bot commented Oct 16, 2023

6 participants

