HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215
filterExpr: (a = 22) (type: boolean)
Snapshot ref: branch_test1
Statistics: Num rows: 3 Data size: 291 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: COMPLETE
Before this PR, the row count for a branch, tag, or time-travel query was always taken from the current snapshot's summary, which is incorrect.
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java
ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java
LGTM +1, pending tests
thanks @zhangbutao for addressing the review comment quickly!

What changes were proposed in this pull request?
At present, when `iceberg.hive.keep.stats=true` and `hive.compute.query.using.stats=true`, HS2 runs a fetch task that reads the Iceberg table's `numRows` property from HMS to optimize `count` queries.
If `iceberg.hive.keep.stats=false`, HS2 always launches a Tez task to compute the table's row count when a `count` query is issued.
However, Iceberg table metadata already carries some stats information, so HS2 can also start just a fetch task that retrieves the row count from Iceberg's snapshot summary when `iceberg.hive.keep.stats=false` or when no stats are stored in HMS. This avoids launching a Tez task to compute the table's row count.
In addition, a time-travel, branch, or tag query has different stats from the current snapshot, so we need to resolve the specific snapshot ID for the Iceberg version being queried. Otherwise, time-travel, branch, and tag queries get the wrong stats.
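The idea can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class and method names are hypothetical, and plain maps stand in for Iceberg snapshots. The summary keys `total-records`, `total-position-deletes`, and `total-equality-deletes` are the ones Iceberg writes into a snapshot summary. Declining to answer when delete files are present is an assumption of this sketch, as are the row counts in `main`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

class SnapshotRowCountSketch {

  // Keys that Iceberg writes into a snapshot's summary map.
  static final String TOTAL_RECORDS = "total-records";
  static final String TOTAL_POS_DELETES = "total-position-deletes";
  static final String TOTAL_EQ_DELETES = "total-equality-deletes";

  /**
   * Picks the summary of the snapshot a query targets (hypothetical helper).
   * A null refName means the current snapshot; otherwise the branch/tag is
   * looked up by name. Returns null when the ref is unknown.
   */
  static Map<String, String> resolveSummary(
      Map<String, String> currentSummary,
      Map<String, Map<String, String>> summariesByRef,
      String refName) {
    return refName == null ? currentSummary : summariesByRef.get(refName);
  }

  /**
   * Returns the row count recorded in the summary, or empty when the summary
   * cannot answer the question and the caller should compute the count.
   */
  static OptionalLong rowCount(Map<String, String> summary) {
    if (summary == null || !summary.containsKey(TOTAL_RECORDS)) {
      return OptionalLong.empty();
    }
    try {
      long posDeletes = Long.parseLong(summary.getOrDefault(TOTAL_POS_DELETES, "0"));
      long eqDeletes = Long.parseLong(summary.getOrDefault(TOTAL_EQ_DELETES, "0"));
      // Assumption of this sketch: with delete files present, total-records
      // may overcount live rows, so fall back to computing the count.
      if (posDeletes > 0 || eqDeletes > 0) {
        return OptionalLong.empty();
      }
      return OptionalLong.of(Long.parseLong(summary.get(TOTAL_RECORDS)));
    } catch (NumberFormatException e) {
      // Malformed summary value: treat as "no usable stats".
      return OptionalLong.empty();
    }
  }

  public static void main(String[] args) {
    // Illustrative numbers only.
    Map<String, String> current = Map.of(TOTAL_RECORDS, "10");
    Map<String, Map<String, String>> refs = new HashMap<>();
    refs.put("branch_test1", Map.of(TOTAL_RECORDS, "4"));

    System.out.println(rowCount(resolveSummary(current, refs, null)));
    System.out.println(rowCount(resolveSummary(current, refs, "branch_test1")));
  }
}
```

Resolving the summary per ref, rather than always reading the current snapshot, is what keeps a query against a branch or tag from reporting the main table's row count.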
Why are the changes needed?
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Qtest