
Conversation

@majian1998
Contributor

In the current version, HoodieFileIndex is a member variable of HoodieBaseRelation. PR #7871 made Hudi's relation resolution behave more like Spark's. In Spark, however, the relation is cached as follows:

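// Simplified from Spark's HiveMetastoreCatalog.readDataSourceTable:
// the resolved relation is cached per qualified table name.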
catalog.getCachedPlan(qualifiedTableName, () => {
  val dataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
    partitionColumns = table.partitionColumnNames,
    bucketSpec = table.bucketSpec,
    className = table.provider.get,
    options = dsOptions,
    catalogTable = Some(table)
  )
  LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
})

As a result, the same HoodieFileIndex instance is reused across queries.

However, HoodieFileIndex holds cached state such as cachedAllPartitionPaths and cachedAllInputFileSlices, which is only reset when a new HoodieFileIndex is created. A SparkSession only executes refreshTable when actions like insert are performed, so a session that only runs queries will keep seeing whatever snapshot was cached during the initial query. This is not the expected behavior. By comparison, Delta Lake appears to attempt a snapshot update on every listFiles call, and before PR #7871 Hudi recreated the relation on each query and therefore always obtained the latest snapshot. I believe there should be a check at the start of listFiles to determine whether the cache (snapshot) needs to be updated, as sketched below.
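A minimal sketch of this idea (TimelineAwareFileIndex, refreshIndex and doListFiles are illustrative stand-ins, not the actual HoodieFileIndex internals): reload the active timeline, compare its latest completed instant with the one recorded when the caches were built, and refresh only when the timeline has moved.

import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.PartitionDirectory

// Hedged sketch, not the actual HoodieFileIndex code.
class TimelineAwareFileIndex(metaClient: HoodieTableMetaClient) {

  // Timestamp of the latest completed instant observed when the caches
  // (cachedAllPartitionPaths, cachedAllInputFileSlices, ...) were built.
  @volatile private var cachedInstant: Option[String] = None

  private def latestCompletedInstant(): Option[String] = {
    val timeline = metaClient.reloadActiveTimeline().filterCompletedInstants()
    val last = timeline.lastInstant()
    if (last.isPresent) Some(last.get().getTimestamp) else None
  }

  def listFiles(partitionFilters: Seq[Expression],
                dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
    val fresh = latestCompletedInstant()
    if (fresh != cachedInstant) { // the timeline moved: cached slices may be stale
      refreshIndex()              // rebuild the cached partition paths / file slices
      cachedInstant = fresh
    }
    doListFiles(partitionFilters, dataFilters)
  }

  private def refreshIndex(): Unit = ???                 // placeholder
  private def doListFiles(partitionFilters: Seq[Expression],
                          dataFilters: Seq[Expression]): Seq[PartitionDirectory] = ???
}

The cost concern raised in the discussion below stems from reloadActiveTimeline(), which touches storage on every listing even when nothing has changed.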

Change Logs

Determine whether to update the cached file index based on timeline changes before listing files.

Impact

None

Risk level (write none, low, medium or high below)

medium

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@bvaradar
Contributor

@majian1998

Before PR https://github.com/apache/hudi/pull/7871, Hudi recreated the relation on each query and therefore always obtained the latest snapshot. I believe there should be a check at the start of listFiles to determine whether the cache (snapshot) needs to be updated.

Can you check whether HoodieFileIndex is used anywhere apart from Spark SQL execution?

Also, if the relation is recreated with each query, wouldn't the file index be force-refreshed anyway? I am not sure I am following your comment correctly.

Since the check incurs the cost of reloading the timeline, I wonder whether it is needed on every listing. Should we instead perform the refresh explicitly on the caller side, when we know a refresh is needed?
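For illustration, that caller-side alternative might look like the sketch below. FileIndex.refresh() is part of Spark's FileIndex trait; the surrounding wiring (listWithExplicitRefresh, tableMayHaveChanged) is assumed, not an actual Hudi API.

import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory}

// Hedged sketch: refresh explicitly only at call sites that know the
// snapshot may be stale, instead of paying a timeline reload inside
// every listFiles call.
def listWithExplicitRefresh(index: FileIndex,
                            tableMayHaveChanged: Boolean): Seq[PartitionDirectory] = {
  if (tableMayHaveChanged) {
    index.refresh() // invalidates any cached file listings held by the index
  }
  index.listFiles(partitionFilters = Seq.empty, dataFilters = Seq.empty)
}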

@majian1998
Contributor Author

@bvaradar:
If the relation is recreated, the file index is forcefully refreshed (the object is rebuilt).
In practice, this modification increases the cost of all queries in order to handle certain situations, and it also significantly changes the logic for some queries, which led to several test failures. I will therefore close this PR for now and record the interesting phenomena and potential errors identified along the way in a Jira issue. If you're interested, I can cc you on it.
