Skip to content

Conversation

@TheR1sing3un
Copy link
Member

In the consistent-bucket mode, the creation of the hash_metadata_file may be read by another task or job. In this case, the Metadata may be loaded incorrectly because the hash_metadata_file may be read in the intermediate state.
For example, two task load p_date=20241202 partition's hash-metadata:

  1. task-a load from fs but found nothing.
  2. task-a start to create initial-hash-metadata-file
  3. initial-hash-metadata-file created
  4. task-b load form fs and found created but empty initial-hash-metadata-file
  5. task-b throw exception
  6. task-a write metadata content to initial-hash-metadata-file

So task-b can see the ntermediate state of hash-metadata-file.

I wrote a UT to verify it.
image

image

Change Logs

  1. using rename to create consistent-hash-metadata file

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

1. using rename to create consistent-hash-metadata file

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un
Copy link
Member Author

@hudi-bot run azure

…ting uncompleted file

1. using HoodieStorage.createImmutableFileInPath to avoid visiting uncompleted file

Signed-off-by: TheR1sing3un <[email protected]>
OutputStream fsout = null;
StoragePath tmpPath = null;

boolean needTempFile = needCreateTempFile();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need an explicit flag for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need an explicit flag for this?

HoodieStorage::needCreateTempFile only be true for HDFS, but for other fs, such as local fs, we should also avoid visiting the intermediate state of the file. So completely, regardless of the underlying file system, we will always use the temp file method to create files in this scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have assumption that the storage has two categories: HDFS and Object Store, the later should always be atomic for file creation so that the renaming is only needful for HDFS.

If you think there is other fs system we need to support, at least to fix the HoodieStorage::needCreateTempFile itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have assumption that the storage has two categories: HDFS and Object Store, the later should always be atomic for file creation so that the renaming is only needful for HDFS.

If you think there is other fs system we need to support, at least to fix the HoodieStorage::needCreateTempFile itself.

I fix HoodieStorage::needCreateTempFile to for all storage without write-transaction should create temp file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not do this, the flag isWriteTransactional does not set up right for many object stores. Let's just keep it as it is, local fs is only for testing purpose so we should be good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not do this, the flag isWriteTransactional does not set up right for many object stores. Let's just keep it as it is, local fs is only for testing purpose so we should be good.

Got it! But can I just add a judgment of whether it's local fs or not? Because ut's tests rely on local fs, if we do not add this judgment, ut will have unexpected logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change this logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change this logic.

Ok, I have restored the original logic~

1. for storage without write-transaction should create temp file

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the fix_incompleted_hash_metadata branch from 49eda0a to 4f75f9f Compare December 12, 2024 02:53
…of local fs

1. keep original `needCreateTempFile` logic but add judgement of local fs

Signed-off-by: TheR1sing3un <[email protected]>
1. keep original `needCreateTempFile` logic

Signed-off-by: TheR1sing3un <[email protected]>
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit 5fed226 into apache:master Dec 12, 2024
41 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants