
Conversation

@majian1998
Contributor

In the current version, HoodieFileIndex is a member variable of HoodieBaseRelation. PR #7871 made Hudi's relation resolution behave more like Spark's. In Spark, however, the relation is cached as follows:

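// Simplified from Spark's HiveMetastoreCatalog.readDataSourceTable:
// the resolved relation is cached per qualified table name.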
catalog.getCachedPlan(qualifiedTableName, () => {
  val dataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
    partitionColumns = table.partitionColumnNames,
    bucketSpec = table.bucketSpec,
    className = table.provider.get,
    options = dsOptions,
    catalogTable = Some(table)
  )
  LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
})

As a result, the same HoodieFileIndex instance is reused across queries.

However, HoodieFileIndex holds cached state such as cachedAllPartitionPaths and cachedAllInputFileSlices, which is only reset when a new HoodieFileIndex is created. A SparkSession only executes refreshTable when actions like insert are performed, so a session that only runs queries will keep seeing whatever snapshot was cached during the initial query. This is not the expected behavior. By comparison, Delta Lake appears to attempt a snapshot update on every listFiles call, and before PR #7871 Hudi recreated the relation on each query and therefore always obtained the latest snapshot. I believe there should be a check at the start of listFiles to determine whether the cache (snapshot) needs to be updated, as sketched below.
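A minimal sketch of this idea (TimelineAwareFileIndex, refreshIndex and doListFiles are illustrative stand-ins, not the actual HoodieFileIndex internals): reload the active timeline, compare its latest completed instant with the one recorded when the caches were built, and refresh only when the timeline has moved.

import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.PartitionDirectory

// Hedged sketch, not the actual HoodieFileIndex code.
class TimelineAwareFileIndex(metaClient: HoodieTableMetaClient) {

  // Timestamp of the latest completed instant observed when the caches
  // (cachedAllPartitionPaths, cachedAllInputFileSlices, ...) were built.
  @volatile private var cachedInstant: Option[String] = None

  private def latestCompletedInstant(): Option[String] = {
    val timeline = metaClient.reloadActiveTimeline().filterCompletedInstants()
    val last = timeline.lastInstant()
    if (last.isPresent) Some(last.get().getTimestamp) else None
  }

  def listFiles(partitionFilters: Seq[Expression],
                dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
    val fresh = latestCompletedInstant()
    if (fresh != cachedInstant) { // the timeline moved: cached slices may be stale
      refreshIndex()              // rebuild the cached partition paths / file slices
      cachedInstant = fresh
    }
    doListFiles(partitionFilters, dataFilters)
  }

  private def refreshIndex(): Unit = ???                 // placeholder
  private def doListFiles(partitionFilters: Seq[Expression],
                          dataFilters: Seq[Expression]): Seq[PartitionDirectory] = ???
}

The cost concern raised in the discussion below stems from reloadActiveTimeline(), which touches storage on every listing even when nothing has changed.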

Change Logs

Determine whether to update the cached file index based on timeline changes before listing files.

Impact

None

Risk level (write none, low, medium or high below)

medium

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@bvaradar
Contributor

@majian1998

Before PR https://github.com/apache/hudi/pull/7871, Hudi recreated the relation on each query and therefore always obtained the latest snapshot. I believe there should be a check at the start of listFiles to determine whether the cache (snapshot) needs to be updated.

Can you check whether HoodieFileIndex is used anywhere apart from Spark SQL execution?

Also, if the relation is recreated with each query, wouldn't the file index be force-refreshed anyway? I am not sure I am following your comment correctly.

Since the check incurs the cost of reloading the timeline, I wonder whether it is needed on every listing. Should we instead perform the refresh explicitly on the caller side, when we know a refresh is needed?
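For illustration, that caller-side alternative might look like the sketch below. FileIndex.refresh() is part of Spark's FileIndex trait; the surrounding wiring (listWithExplicitRefresh, tableMayHaveChanged) is assumed, not an actual Hudi API.

import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory}

// Hedged sketch: refresh explicitly only at call sites that know the
// snapshot may be stale, instead of paying a timeline reload inside
// every listFiles call.
def listWithExplicitRefresh(index: FileIndex,
                            tableMayHaveChanged: Boolean): Seq[PartitionDirectory] = {
  if (tableMayHaveChanged) {
    index.refresh() // invalidates any cached file listings held by the index
  }
  index.listFiles(partitionFilters = Seq.empty, dataFilters = Seq.empty)
}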

@majian1998
Contributor Author

@bvaradar:
If the relation is recreated, the file index is forcefully refreshed (the object is rebuilt).
In practice, this modification increases the cost of all queries in order to handle certain situations, and it also significantly changes the logic for some queries, which led to several test failures. I will therefore close this PR for now and record the interesting phenomena and potential errors identified along the way in a Jira issue. If you're interested, I can cc you on it.
