Skip to content

Conversation

@SourabhBadhya
Copy link
Contributor

@SourabhBadhya SourabhBadhya commented Jun 8, 2023

What changes were proposed in this pull request?

Do not set column stats in metastore when non-native table can store column stats in its own format

Why are the changes needed?

Non-native table formats like Iceberg has the capability to store column stats in its own format (for Iceberg: Its stored in Puffin files).

However, these stats are stored in metastore as well after setting the column stats in its own format. We must avoid setting column stats in 2 places and must set only in a single place.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Qtest

@SourabhBadhya SourabhBadhya changed the title HIVE-27421: Do not set stats in metastore when non-native table can store stats in its own format HIVE-27421: Do not set column stats in metastore when non-native table can store stats in its own format Jun 8, 2023
@SourabhBadhya SourabhBadhya changed the title HIVE-27421: Do not set column stats in metastore when non-native table can store stats in its own format HIVE-27421: Do not set column stats in metastore when non-native table can store column stats in its own format Jun 8, 2023
// Set table or partition column statistics in metastore.
db.setPartitionColumnStatistics(request);
}
db.setPartitionColumnStatistics(request);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about the change.
If we can not get stats from puffine due to some exception, we can fallback get stats from metastore. So i think maybe write stats into the two places is meaningful. Please correct me if i misunderstand. Thansk.

Copy link
Contributor Author

@SourabhBadhya SourabhBadhya Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao I agree with your point. However, storing stats in 2 places has its pros & cons -
Pros -

  1. We can fallback to metastore by changing the config - hive.iceberg.stats.source=metastore if we are not able to get stats from Puffin files.

Cons -

  1. Any change in Puffin files by external clients is not visible to metastore.
  2. Performance effect of executing these metastore DB calls to store column stats.

In the approach mentioned in the PR, if users want to use metastore to get stats if they are not able to get stats from Puffin, then set hive.iceberg.stats.source=metastore and execute ANALYZE TABLE <tableName> COMPUTE STATISTICS FOR COLUMNS. (This will have an overhead of one more ANALYZE query).

I will leave it to the community to decide if its best to store stats in 2 places or storing it in a single place is sufficient. If the community thinks that this it is best to store in 2 places, then I won't proceed further. Otherwise, I will continue with the patch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can not get stats from puffine due to some exception, we can fallback get stats from metastore. So i think maybe write stats into the two places is meaningful

Storing at two places have additional costs during write & currently we have two modes, "iceberg" & "metastore", so both denotes where to store the stats.

Storing at both sides, seems to be a third mode, like "both" and presently we don't have a fallback logic either during read side, that if puffin file are inaccessible then go to metastore kind of thing.

May be if we want such a thing, we can have a new mode, if we feel that is required in future stages.

As of now, I think, "iceberg" mode should store only in puffin and "metastore" mode should store only in "metastore"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long rowCnt = getRowCnt(pctx, tsOp, tbl);
// if we can not have correct table stats, then both the table stats and column stats are not useful.
if (rowCnt == null) {

private Long getRowCnt(
ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException {
Long rowCnt = 0L;
if (tbl.isPartitioned()) {
for (Partition part : pctx.getPrunedPartitions(
tsOp.getConf().getAlias(), tsOp).getPartitions()) {
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) {
return null;
}
long partRowCnt = Long.parseLong(part.getParameters().get(StatsSetupConst.ROW_COUNT));
rowCnt += partRowCnt;
}
} else { // unpartitioned table
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
return null;
}
rowCnt = Long.valueOf(tbl.getProperty(StatsSetupConst.ROW_COUNT));
}
return rowCnt;

Currently, Like this example HIVE-27347 always uses the iceberg basic stats from metatstore to optimize count(*) query. We should consider how to do this if only using puffin stats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #5400. to address above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above item is resolved now, so we could proceed with the merge

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jun 8, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Aug 10, 2023
@deniskuzZ deniskuzZ removed the stale label Aug 16, 2023
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

warning The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
Read more here

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants