Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Handle partition evolution case in PartitionStatsUtil#computeStats #12137

Merged
merged 1 commit into from
Feb 20, 2025

Conversation

deniskuzZ
Copy link
Member

@deniskuzZ deniskuzZ commented Jan 30, 2025

Description:
PartitionMap creates internal wrappers based on table specs partitionType, however, we are passing coercedPartition with unified partition type considering all specs as a KEY, leading to wrong evaluation.

#12168

Repro:
https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/test/queries/positive/iceberg_bucket_map_join_4.q

Testing:

  • TestPartitionStatsUtil#testPartitionStatsWithSchemaEvolution2

@github-actions github-actions bot added the core label Jan 30, 2025
@deniskuzZ deniskuzZ changed the title BugFix: PartitionStatsUtil#computeStats returns incomplete stats in case of partition evolution Core: BugFix: PartitionStatsUtil#computeStats returns incomplete stats in case of partition evolution Jan 30, 2025
@deniskuzZ
Copy link
Member Author

deniskuzZ commented Jan 31, 2025

@ajantha-bhat, @aokolnychyi, could you please take a look?

@deniskuzZ
Copy link
Member Author

ping @ajantha-bhat

partitionData.set(0, c2);
partitionData.set(1, c3);
return partitionData;
@org.junit.Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use junit5 like other tests. (Used wrong annotation?)

I don't think the tests are running right now.
So, please fix it and I will try to understand the test case.

The original intention is to store the unified tuple as the key itself. So, it can handle all the partition evolution. So, from the test I am not yet clear what was expected and what went wrong easily.

Copy link
Member Author

@deniskuzZ deniskuzZ Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ajantha-bhat for checking the PR, I almost gave up expecting an answer :)

  1. removed the wrong test annotation
  2. I understood the intention, however, there is a bug with the key matching: PartitionMap is initialized with individual partition specs [], but the supplied key is a unified tuple.

For the created test case, the existing code returns just 3 partition values instead of 6. The test simulates partition evolution by changing the number of buckets from 2 to 4: spec(c2, bucket(c1, 2)) -> spec(c2, bucket(c1, 4))

[foo, 0, null]
[foo, 1, null]
[bar, null, 0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expecting actual:
  [(PartitionData{c2=foo, c1_bucket=0, c1_bucket_4=null}, 0, 100290L, 1, 38596L, 0L, 0, 0L, 0, 0L, 1739473244037L, 1L),
    (PartitionData{c2=foo, c1_bucket=1, c1_bucket_4=null}, 0, 100293L, 1, 43840L, 0L, 0, 0L, 0, 0L, 1739473244037L, 1L),
    (PartitionData{c2=bar, c1_bucket=null, c1_bucket_4=0}, 1, 402736L, 4, 46675L, 0L, 0, 0L, 0, 0L, 1739473244341L, 2L)]
to contain exactly in any order:
  [(PartitionData{c2=foo, c1_bucket=0, c1_bucket_4=null}, 0, 100290L, 1, 38596L, 0L, 0, 0L, 0, 0L, 1739473244037L, 1L),
    (PartitionData{c2=foo, c1_bucket=1, c1_bucket_4=null}, 0, 100293L, 1, 43840L, 0L, 0, 0L, 0, 0L, 1739473244037L, 1L),
    (PartitionData{c2=bar, c1_bucket=null, c1_bucket_4=0}, 1, 100747L, 1, 5616L, 0L, 0, 0L, 0, 0L, 1739473244341L, 2L),
    (PartitionData{c2=bar, c1_bucket=null, c1_bucket_4=1}, 1, 100122L, 1, 7260L, 0L, 0, 0L, 0, 0L, 1739473244341L, 2L),
    (PartitionData{c2=bar, c1_bucket=null, c1_bucket_4=2}, 1, 100914L, 1, 13793L, 0L, 0, 0L, 0, 0L, 1739473244341L, 2L),
    (PartitionData{c2=bar, c1_bucket=null, c1_bucket_4=3}, 1, 100953L, 1, 20006L, 0L, 0, 0L, 0, 0L, 1739473244341L, 2L)]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see,

PartitionMap is initialized with individual partition specs [], but the supplied key is a unified tuple

I assumed partitionMap will use the key supplied. Now I see that it is based on the individual spec

return StructLikeMap.create(spec.partitionType());

Since, we don't combine stats for different spec id, your solution works.

Thanks for the fix.

@deniskuzZ deniskuzZ force-pushed the basic_stats_collection_fix branch from 63d1a22 to a36a8bd Compare February 13, 2025 19:28
Copy link
Member

@ajantha-bhat ajantha-bhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@aokolnychyi: Could you please help in merging this PR?

@ajantha-bhat
Copy link
Member

PR title can be Core: Handle partition evolution case in partition stats compute

@ajantha-bhat ajantha-bhat requested a review from pvary February 19, 2025 07:03
@deniskuzZ deniskuzZ changed the title Core: BugFix: PartitionStatsUtil#computeStats returns incomplete stats in case of partition evolution Core: Handle partition evolution case in PartitionStatsUtil#computeStats Feb 19, 2025
@deniskuzZ deniskuzZ changed the title Core: Handle partition evolution case in PartitionStatsUtil#computeStats Core: Handle partition evolution in PartitionStatsUtil#computeStats Feb 19, 2025
@deniskuzZ deniskuzZ changed the title Core: Handle partition evolution in PartitionStatsUtil#computeStats Core: Handle partition evolution case in PartitionStatsUtil#computeStats Feb 19, 2025
statsMap.computeIfAbsent(specId, key, () -> new PartitionStats(key, specId));
statsMap.computeIfAbsent(
specId,
((PartitionData) file.partition()).copy(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about these casts.

Could keyTemplate.copyFor(file.partition()) work?

Copy link
Member Author

@deniskuzZ deniskuzZ Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @pvary, thanks for taking a look.
keyTemplate is a unified tuple struct, so it can't be used. btw, if you grep iceberg code, you'll find a number of same casts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the tests are broken by the infra 😢

Could you help me understand the original issue? I thought that I understand, but seems like I still miss something.

I get that there is difference between the partitionType (which is the unified spec), the coercedPartition which is the partition key for the unified partition type.

What is the key we would like to see in the statsMap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed in private. @pvary, please let me know if any questions left.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with @RussellSpitzer, and he confirmed that casting is indeed correct.

@deniskuzZ deniskuzZ force-pushed the basic_stats_collection_fix branch from a4c0df5 to 6399f60 Compare February 19, 2025 13:49
Copy link
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM
When the infra recovers, and ww have a green run, we can commit

@deniskuzZ deniskuzZ force-pushed the basic_stats_collection_fix branch from 6399f60 to 986a53f Compare February 20, 2025 10:31
@pvary pvary merged commit de64415 into apache:main Feb 20, 2025
42 checks passed
@pvary
Copy link
Contributor

pvary commented Feb 20, 2025

Merged to master.
Thanks @deniskuzZ for the fix and @ajantha-bhat for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants