Lazily load hive partition information#10215
Conversation
bd737f6 to
b044ed3
Compare
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveConnectorTest.java
Outdated
Show resolved
Hide resolved
426ffc3 to
d7d383b
Compare
d7d383b to
5e0f123
Compare
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Outdated
Show resolved
Hide resolved
5e0f123 to
79c8012
Compare
|
@sopel39 AC |
79c8012 to
34c436f
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionResult.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
It's odd that we don't need to filter partition names while we had to do .filter(partition -> partitionMatches(partitionColumns, effectivePredicate, predicate, partition)) above.
Comment would be nice
There was a problem hiding this comment.
ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions?
There was a problem hiding this comment.
partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.
Why there can be partition names and not getPartitions?
The contract is that a table can either load a partition details (if it is less than a threshold) or could maintain it as rawStringif it crosses a threshold. We don't have an intermediate state here
There was a problem hiding this comment.
partitionMatches effectively converts the partition names from String to HivePartitionInformation and then tries to match with the filter predicate.
So in this branch:
partitionNames = hiveTableHandle.getPartitionNames().get();
we don't do any filtering. Does it mean we return excessive partitionNames?
There was a problem hiding this comment.
we don't do any filtering.
We do kind of partitial filterting, like if we have a TupleDomain (with Domain on Partition columns) then the filtering would be applied at a metastore layer but we won't perform partitionMatches (and materializing it into HivePartition)
Does it mean we return excessive partitionNames?
Since we dont invoke partitionMatches there is a chance that we could return excessive partition names.
There was a problem hiding this comment.
Since we dont invoke partitionMatches there is a chance that we could return excessive partition names.
That doesn't seem correct, does it?
There was a problem hiding this comment.
I think in this case we don't have specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT ?
There was a problem hiding this comment.
I think in this case we don't have specify the enforcedTupleDomain as TupleDomain so that the filter expression is not lost. It will be applied during the next applyFilter optimizer. WDYT ?
Depends on contract. If we say getPartitions returns partitions that match constrant, then it should be the case
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
|
@sopel39 Added comments. |
There was a problem hiding this comment.
Is it filtered by table predicate?
Can partitionNames be loaded independently of partitions?
There was a problem hiding this comment.
Is it filtered by table predicate?
Yes but partially.
Can partitionNames be loaded independently of partitions?
If partitions is loaded then partitionNames is reset to Optional.empty
There was a problem hiding this comment.
Could you add that as a comment?
There was a problem hiding this comment.
Is it filtered by table predicate?
Yes but partially.
that must be documented
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
ping, I still don't understand why we don't have to filter here.
Why there can be partition names and not getPartitions?
There was a problem hiding this comment.
Is it filtered by table predicate?
Yes but partially.
that must be documented
There was a problem hiding this comment.
use assertThat(query as in the next assertion
There was a problem hiding this comment.
You can add this test in the prep commit, like you did with the other test.
i understand the query fails before changes?
There was a problem hiding this comment.
Document why we're making this choice.
In any case, putting this boolean in HivePartitionResult is wrong.
The user of HivePartitionResult should do this logic (getPartitionsAsList or caller of it)
Remove HivePartitionResult.canParsePartitions fielld
There was a problem hiding this comment.
This looks wasteful. If we knot the partitions (the objects), we don't need the names anymore.
There was a problem hiding this comment.
since we will calculate List<HivePartition> partitionList, the partitionNames list won't be needed.
no need to materialize the list.
There was a problem hiding this comment.
This looks as potentially expensive and is also being thrown away.
Document why we believe this is not a problem.
There was a problem hiding this comment.
I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.
There was a problem hiding this comment.
can partitionColumns be empty here? The outer block of code looks like dealing with partitioned table
There was a problem hiding this comment.
In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.
There was a problem hiding this comment.
Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks
|
Just a general question, when is |
|
@alexjo2144 seems the only use of |
|
Thanks Piotr, I ask mainly because of the check in |
may or may not |
0ef477f to
68a9551
Compare
There was a problem hiding this comment.
In case of unpartitioned table too we might get a single entry for HivePartition, so we need this check.
There was a problem hiding this comment.
I think this is being seen in the current master too. Since getTableProperties doesn't allow us to update TableHandle this issue is seen.
68a9551 to
ba60f3a
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Why is this block conditional on partitions being loaded? Only the columns list is used here, not the values of the partitions.
There was a problem hiding this comment.
If the HiveTableHandle has only raw partition names, then there is a good chance that the Domain for partition columns partially enforced, so we compute the enforced tuple domain post materialization of HivePartition.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
The condition here (partitionNames list size, constraint.predicate() being present) doesn't not match the objects being used in the enclosed code (partitionColumns). I don't find the code-level documentation explanatory for this. Can you elaborate?
Also, the logic partitionNames.orElseGet(ImmutableList::of).size() <= maxPartitions is a tricky hack, as partitionNames being empty doesn't mean no partitions being scanned (empty list).
Split into .isEmpty / .isPresent check and separate size check.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
"post"?
did you mean
// Note that the computation is not persisted in the table handle, so can be redone many times
?
There was a problem hiding this comment.
i am not sure this is a great idea (would rather have a new API rather than make a seemingly read-only method not be a read-only method). However, if this is something to address, we should have an issue.
Also, can this cause planning time performance degradation? @sopel39 i think it can.
Note that this new code triggers not only for queries that would fail, but also for queries where we loaded many partition names, and didn't convert them and narrow down to partition objects, out of fear that there will be too many.
Should we have a kill switch?
There was a problem hiding this comment.
Also, document why we're doing this. One can intuitively want to return "no known constraint here", and i currently wouldn't be able to explain why not.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Why does the query pass only if you disable join reordering?
There was a problem hiding this comment.
When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.
There was a problem hiding this comment.
- this should be a code comment
- that means the actual user problem has not been solved, right?
There was a problem hiding this comment.
that means the actual user problem has not been solved, right?
No. The actual problem that we are trying to solve is -
Query currently fails when we apply a criteria like partition_column like '%abc%' - if the initial number of partitions crosses the threshold - even if only a lesser number of partitions satisfies it.
There was a problem hiding this comment.
Where is the test showing the successful execution of a query benefiting from the changes here?
i see one in io.trino.plugin.hive.BaseHiveConnectorTest#testPartitionPerScanLimitWithMultiplePartitionColumns but it's simple SELECT (good test case to test)
i also see io.trino.plugin.hive.optimizer.TestHivePlans#testQueryScanningForTooManyPartitions covering Joins, but it fails with defaults, and requires disabling join reordering to pass.
The solution should work with defaults setting, or any other setting that are probable to be used by a user.
There was a problem hiding this comment.
We have fixed for fetching table statistics - now it passes during planning while it fails during execution - due to scanning maximum partitions.
b706a93 to
741ac2c
Compare
Praveen2112
left a comment
There was a problem hiding this comment.
@findepi Thanks for the feedback, have applied the comments.
There was a problem hiding this comment.
When we disable join reordering, the table statistics are not computed during the planning, so the query crosses the planning phase but it fails during execution.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Yeah !! In case of HiveTableHandle it is a bit strict (like it fails if we have both partitionNames and partitions).. we check it when we are initializing.
There was a problem hiding this comment.
Constraint.alwaysTrue sets the predicate as Optional#empty but we need to pass an default predicate for the API..
Can we modify Constraint.alwaysTrue like we do for Constraint.alwaysFalse ?
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Yes so in that case we try to compute the remaining tuple domain.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Why didn't you add same check in HivePartitionResult::new?
There was a problem hiding this comment.
Partitions are not loaded, since it's a throw-away information.
It's conceivable that hiveTableHandle is later consumed with a slightly different logic, making the enforcedTupleDomain untrue.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Can we modify
Constraint.alwaysTruelike we do forConstraint.alwaysFalse?
that's effective today, but basing on unwritten non-API assumptions sounds like a brittle hack.
There was a problem hiding this comment.
That's a prescribed solution, and one which i am not sold on yet.
File a ticket about this problem, but avoiding prescribing solutions, to avoid channelling the discussion.
Link it here.
There was a problem hiding this comment.
- this should be a code comment
- that means the actual user problem has not been solved, right?
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
There are some callers of getPartitions which don't then call this method before collecting the Partitions. getTableStatistics for example, should this be checked there?
There was a problem hiding this comment.
Is this change worth doing until we have that done? Seems like loading the partition information once eagerly is better than lazily doing it multiple times
There was a problem hiding this comment.
I'm just asking if this change to lazily load the partition information is actually an improvement until that linked issue is completed.
There was a problem hiding this comment.
Or if that issue is a blocker for this to be merged.
There was a problem hiding this comment.
We lazily evaluate so that it will be loaded after all the filters are pushed to the Hive.
|
@alexjo2144 AC |
bdbd541 to
47f12ba
Compare
There was a problem hiding this comment.
please switch to Constraint.alwaysTrue()
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Still, can we have a check there?
95cc629 to
47f12ba
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Thanks for adding a comment.
as a followup please refactor this code. Doing some partition related work over 15 lines only to eventually check that... table isn't actually partitioned. We should reverse the checks
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
if (hiveTableHandle.getPartitions().isPresent()) then why do we pass partitionNames further, as if we're preserving some value?
move the assignment under if (hiveTableHandle.getPartitions().isPresent()) block, and set the value to Optional.empty() explicitly
There was a problem hiding this comment.
We can set it as Optional.empty() and no need to move the assignment.
There was a problem hiding this comment.
enforcedConstraint or effectivePredicate ?
it's not enforced yet...
also includes Domains on non-partitioning columns, right? (otherwise you wouldn't do partitions.getEffectivePredicate().filter((column, domain) -> partitionColumns.contains(column)) below)
There was a problem hiding this comment.
Yeah. I think it has to be TupleDomain#all
There was a problem hiding this comment.
if we skip the if (canPartitionsBeLoaded(partitions) || constraint.predicate().isPresent()) block above (i.e. condition was false), then this variable contains partitions.getEffectivePredicate() which looks like not enforced and potentially containing Domains on non-partitioning columns (so something that won't be enforced)
this is here passed to HiveTableHandle#enforcedConstraint
There was a problem hiding this comment.
then this variable contains partitions.getEffectivePredicate()
Actually it would contain TupleDomain#all - but now it will be initialized to HiveTableHandle#enforcedConstraint
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HivePartitionManager.java
Outdated
Show resolved
Hide resolved
* Test to ensure filter on build side table is derived from table properties * Test to ensure query fails if it scans too many partitions
16a81d2 to
8afe1db
Compare
|
@findepi AC. Since there was a conflict had to rebase it. |
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/optimizer/TestHivePlans.java
Outdated
Show resolved
Hide resolved
|
@findepi AC |
We defer the initial loading of HivePartitionInformation if the number of partitions crosses a limit. This allows further invocation of applyFilter which could reduce the number of partitions to be scanned.
c68acf2 to
d7a6b9c
Compare
No description provided.