[HUDI-3839] Fixing incorrect selection of MT partitions to be updated#5274
[HUDI-3839] Fixing incorrect selection of MT partitions to be updated#5274codope merged 3 commits intoapache:masterfrom
Conversation
| } | ||
| } | ||
|
|
||
| private Set<String> getMetadataPartitionsToUpdate() { |
There was a problem hiding this comment.
I don't think we can remove this. this is catered towards async indexing flows.
There was a problem hiding this comment.
This logic is broken currently -- it will not update files partition if Column Stats is enabled.
Can you elaborate on how you think it would affect Async indexing flows?
There was a problem hiding this comment.
I will try to explain to the best of my understanding. but will let @codope chime in as well.
Case 1:
Existing MDT from 0.10.0, gets upgraded to 0.11 w/o enabling any new partitions.
on first commit, after realizing FILES partition is already initialized, we will update the table config w/ "FILES" for completed MDT partitions.
Case 2:
Existing MDT from 0.10.0, gets upgraded to 0.11 w/ all partitions enabled (synchronous flow).
On first commit, we will realize 2 new columns (col stats and bloom filter) are added and will initialize the new partitions. at the end of it, we will update the table Config w/ all 3 partitions to completed MDT partitions.
Case3:
For a fresh table, use wishes to enable async indexing for col stats and bloom filter. w/ regular writer, async indexing has to be enabled for these 2 partitions. So, with a diff process altogether, user is expected to schedule and execute the index building. During scheduling, both partitions (col stats and bloom filter) will be added to table config for the list of MDT partitions being built. Once this is updated, with regular writer process, a data table commit when getting applied to MDT, will update all 3 partitions in MDT (FILES as part of completed MDT partitions and other 2 as part of MDT partitions being built out). This is case where we are in need of getMetadataPartitionsToUpdate() for writers to know what all partitions to update.
I listed Case 1 and Case2 just for completeness. but case 3 is where we might be in need of partitions being built out.
There was a problem hiding this comment.
Sorry, i'm not sure i understand what's not going to work in these scenarios.
Let's take a closer look at what i'm removing:
// fetch partitions to update from table config
Set<String> partitionsToUpdate = getCompletedMetadataPartitions(dataMetaClient.getTableConfig());
// add inflight indexes as well because the file groups have already been initialized, so writers can log updates
partitionsToUpdate.addAll(getInflightMetadataPartitions(dataMetaClient.getTableConfig()));
if (!partitionsToUpdate.isEmpty()) {
return partitionsToUpdate;
}
// fallback to all enabled partitions if table config returned no partitions
return getEnabledPartitionTypes().stream().map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
As you can see, we always add both already completed and inflight partitions as the ones in need of update.
There was a problem hiding this comment.
@alexeykudinkin The motivation behind the change was that the table config should be the source of truth for what partitions are available for update or reads. Case 3 is a valid one. Note that even if an MDT partition is inflight (by async indexer), it must have already been initialized and hence ready to accept updates. You mentioned
it will not update files partition if Column Stats is enabled.
Can you explain the scenario when this happened? From my understanding and testing, files partition should already be present in the table config because the files partition is initialized synchronously with the initialization of metadata writer. Perhaps, we're missing some corner case.
There was a problem hiding this comment.
Discussed this case offline. This happened when you create a managed table from an existing hudi table. The existing table config was updated correctly but the managed table did not inherit that config.
Tips
What is the purpose of the pull request
Fixing incorrect selection of MT partitions to be updated.
Brief change log
See above.
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.