@satishkotha
Member

@satishkotha satishkotha commented Aug 27, 2020

What is the purpose of the pull request

Add replace as a top-level action. Implement the insert_overwrite operation on top of the replace action.

Brief change log

  1. All post-commit actions work on top of HoodieCommitMetadata. Create HoodieReplaceMetadata as a subclass of HoodieCommitMetadata.
  2. Add insertOverwrite as a new operation on HoodieWriteClient. insertOverwrite uses the REPLACE action to mark all existing file groups as 'invalid'.
  3. Change archival to delete replaced files before archiving REPLACE action metadata.
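
The first two items of the change log can be pictured with a small, self-contained sketch. Every name below (`CommitMetadataSketch`, `ReplaceMetadataSketch`, `insertOverwrite` and its parameters) is a hypothetical stand-in chosen for illustration; this is not the actual Hudi code or its signatures, only one way the subclass relationship and the "mark existing file groups as replaced" idea might look.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for HoodieCommitMetadata / HoodieReplaceMetadata.
class ReplaceActionSketch {

    /** Records newly written file ids per partition, as a plain commit would. */
    static class CommitMetadataSketch {
        final Map<String, List<String>> writtenFileIds = new HashMap<>();

        void addWrittenFile(String partition, String fileId) {
            writtenFileIds.computeIfAbsent(partition, p -> new ArrayList<>()).add(fileId);
        }
    }

    /** Adds the replaced file ids on top of what a commit records. */
    static class ReplaceMetadataSketch extends CommitMetadataSketch {
        final Map<String, List<String>> replacedFileIds = new HashMap<>();

        void addReplacedFile(String partition, String fileId) {
            replacedFileIds.computeIfAbsent(partition, p -> new ArrayList<>()).add(fileId);
        }
    }

    /**
     * Models insert-overwrite for one partition: every existing file group is
     * marked as replaced and one new file group is recorded as written.
     */
    static ReplaceMetadataSketch insertOverwrite(String partition,
                                                 List<String> existingFileIds,
                                                 String newFileId) {
        ReplaceMetadataSketch metadata = new ReplaceMetadataSketch();
        for (String fileId : existingFileIds) {
            metadata.addReplacedFile(partition, fileId);
        }
        metadata.addWrittenFile(partition, newFileId);
        return metadata;
    }

    public static void main(String[] args) {
        ReplaceMetadataSketch m = insertOverwrite("2020/08/27", Arrays.asList("f1", "f2"), "f3");
        System.out.println("replaced=" + m.replacedFileIds + " written=" + m.writtenFileIds);
    }
}
```

Because the replace metadata is a subclass in this sketch, any code that only reads `writtenFileIds` keeps working unchanged, which is the motivation given in item 1.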

There are two other things that need to be addressed:

  1. Everywhere we invoke 'getCommitsTimeline'/'filterCommits', we need to review and make sure the caller can handle replace actions, OR create new methods and refactor all invocations of getCommitsTimeline to call the new methods.
  2. FileSystemView#getAllFileGroups (and other methods in the view) by default excludes file groups that have been replaced. We have to make sure callers can handle file groups that existed before but have since been replaced. (Example: say there is a CompactionPlan that includes file id f1, and f1 is replaced at a later instant. If compaction runs after the replace, it won't be able to find the corresponding file group. We need to make sure compaction can make progress instead of throwing errors here.)
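
The compaction example in item 2 can be modeled as follows. This is a sketch under assumptions: `FileSystemViewSketch` and `runnableCompactionOps` are hypothetical names, and skipping replaced file groups is only one possible way to let compaction make progress; the PR text does not say which behavior was chosen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model; not the real FileSystemView or compaction code in Hudi.
class ReplacedFileGroupSketch {

    /** Minimal view: knows every file id written and which ones were replaced. */
    static class FileSystemViewSketch {
        private final Set<String> allFileIds = new LinkedHashSet<>();
        private final Set<String> replacedFileIds = new LinkedHashSet<>();

        void addFileGroup(String fileId) {
            allFileIds.add(fileId);
        }

        void markReplaced(String fileId) {
            replacedFileIds.add(fileId);
        }

        /** Mirrors the described default: replaced file groups are excluded. */
        List<String> getAllFileGroups() {
            List<String> visible = new ArrayList<>(allFileIds);
            visible.removeAll(replacedFileIds);
            return visible;
        }
    }

    /**
     * Keeps only the planned file ids that are still visible, so a compaction
     * run skips replaced file groups instead of failing on a missing one.
     */
    static List<String> runnableCompactionOps(List<String> plannedFileIds,
                                              FileSystemViewSketch view) {
        Set<String> visible = new LinkedHashSet<>(view.getAllFileGroups());
        List<String> runnable = new ArrayList<>();
        for (String fileId : plannedFileIds) {
            if (visible.contains(fileId)) {
                runnable.add(fileId);
            }
        }
        return runnable;
    }

    public static void main(String[] args) {
        FileSystemViewSketch view = new FileSystemViewSketch();
        view.addFileGroup("f1");
        view.addFileGroup("f2");
        view.markReplaced("f1"); // f1 is replaced after the plan was created
        System.out.println(runnableCompactionOps(Arrays.asList("f1", "f2"), view));
    }
}
```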

Verify this pull request

This change added tests. Verified basic actions using the quick start and docker setup. Added tests for insertOverwrite and the FileSystemView changes. Additional tests for backward compatibility still need to be added, and the integration tests need to be changed to include insertOverwrite operations. Created follow-up tickets here: https://issues.apache.org/jira/browse/HUDI-868

This is an example .hoodie folder from quick start setup:
-rw-r--r-- 1 satishkotha wheel 1933 Aug 27 16:39 20200827163904.commit
-rw-r--r-- 1 satishkotha wheel 0 Aug 27 16:39 20200827163904.commit.requested
-rw-r--r-- 1 satishkotha wheel 1015 Aug 27 16:39 20200827163904.inflight
-rw-r--r-- 1 satishkotha wheel 2610 Aug 27 16:39 20200827163927.replace
-rw-r--r-- 1 satishkotha wheel 1024 Aug 27 16:39 20200827163927.replace.inflight
-rw-r--r-- 1 satishkotha wheel 0 Aug 27 16:39 20200827163927.replace.requested
drwxr-xr-x 2 satishkotha wheel 64 Aug 27 16:39 archived
-rw-r--r-- 1 satishkotha wheel 235 Aug 27 16:39 hoodie.properties
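
To make the listing easier to read, here is a small sketch that splits each timeline file name into timestamp, action, and state. The naming rules are inferred only from the listing above, and the parser itself (`TimelineFileNameSketch`, `parse`) is hypothetical rather than Hudi code. In particular, treating the bare `.inflight` file as belonging to a commit action is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical parser; the rules are inferred from the example listing only.
class TimelineFileNameSketch {

    static class Instant {
        final String timestamp;
        final String action;
        final String state;

        Instant(String timestamp, String action, String state) {
            this.timestamp = timestamp;
            this.action = action;
            this.state = state;
        }

        @Override
        public String toString() {
            return timestamp + " " + action + " " + state;
        }
    }

    static Instant parse(String fileName) {
        String[] parts = fileName.split("\\.");
        if (parts.length < 2 || parts.length > 3 || !parts[0].matches("\\d+")) {
            throw new IllegalArgumentException("Not a timeline file name: " + fileName);
        }
        if (parts.length == 2 && parts[1].equals("inflight")) {
            // Assumption: a bare ".inflight" file belongs to a commit action.
            return new Instant(parts[0], "commit", "INFLIGHT");
        }
        if (parts.length == 2) {
            // "<timestamp>.<action>" with no suffix is treated as completed.
            return new Instant(parts[0], parts[1], "COMPLETED");
        }
        // "<timestamp>.<action>.<state>"
        return new Instant(parts[0], parts[1], parts[2].toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "20200827163904.commit",
            "20200827163904.commit.requested",
            "20200827163904.inflight",
            "20200827163927.replace",
            "20200827163927.replace.inflight",
            "20200827163927.replace.requested");
        for (String name : names) {
            System.out.println(parse(name));
        }
    }
}
```

Read this way, the listing shows one completed commit followed by one completed replace, each with its requested and inflight markers.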

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@satishkotha
Member Author

@vinothchandar @bvaradar FYI. There are a few things that I'm not fully happy with, but I would like to get initial feedback and agreement on the high-level approach.

@satishkotha satishkotha force-pushed the sk/replaceTopHudi branch 2 times, most recently from d259d1f to c864f50 Compare August 28, 2020 05:28
Contributor

@bvaradar bvaradar left a comment

@satishkotha : I made an initial pass and have given comments.

Regarding metadata, let's:

  1. Create HoodieReplaceMetadata : You need a json representation (active timeline) and an avro representation (for archived).
  2. Have HoodieReplaceMetadata extend HoodieCommitMetadata in json structure
  3. Ensure the avro representation has the same structure
  4. Add test-cases to ensure you are able to read committed replace metadata.

The above mechanism seems to be the cleanest possible way without boiling the ocean w.r.t. the commit logic.

Contributor

@bvaradar bvaradar left a comment

In terms of next steps after this PR,

  1. Support for DeltaStreamer.
  2. CLI commands to display replace stats.
  3. Documentation Update for Insert Overwrite.

Please file a JIRA to track these. Also add anything else that I have missed.

@bvaradar
Contributor

@satishkotha : How is incremental reading handled? Are we going to support it? If not, are you going to throw exceptions? Since we clean up replaced files only during archiving but perform normal cleanup during cleaner operations, there will be cases where file versions of new files get cleaned up before the old file versions are removed.

@satishkotha
Member Author

@bvaradar I addressed your comments. I also created follow-up items here https://issues.apache.org/jira/browse/HUDI-868 as you suggested.

Please take another look when possible.

@satishkotha
Member Author

satishkotha commented Sep 4, 2020

@vinothchandar @bvaradar Addressed most of your suggestions. There are a couple of other follow-up items I need your help on:

  1. You suggested removing HoodieReplaceStat. I ran into a minor implementation issue removing it. Basically, HoodieWriteClient operations return JavaRDD[WriteStatus]. SparkSqlWriter uses these WriteStatus objects to create metadata (.commit/.replace etc.). Each WriteStatus comes with a HoodieWriteStat (which is expected to be non-null in many places). This HoodieWriteStat is used for many post-commit operations. So if we want to remove HoodieReplaceStat, we can either
    a) change the signature of WriteClient operations to return a new structured object instead of just JavaRDD[WriteStatus]. This object would contain List[HoodieFileGroupId] for tracking replaced file groups and JavaRDD[WriteStatus] for newly created file groups. We would have to change post-commit operations to look at this new object instead of WriteStatus.
    OR
    b) return a WriteStatus for replaced file groups too. WriteClient operations can continue to return JavaRDD[WriteStatus]. Each WriteStatus has a HoodieWriteStat, which can be a token value (null?) for replaced file groups.

Either way, this is a somewhat involved change, so I would like to get your feedback before starting implementation. What do you guys think?

  2. Deleting replaced file groups during archival vs. clean. I have this deletion logic implemented in archival per our earlier conversation. But, as I mentioned, this may lead to storage inefficiency. For example, if a) clean retain is set to 1 commit and b) archival is set to run after 24 commits, we keep all the data for replaced files until archival happens.

Let me know if you guys have any other comments.
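
For reference, option (a) from the first item could look roughly like the sketch below. All names (`WriteResult`, `WriteStatusSketch`, `describeCommit`) are hypothetical, plain `List`s stand in for `JavaRDD` so the example stays self-contained, and nothing here reflects the signatures that were eventually adopted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical shapes; plain Lists stand in for JavaRDD in this sketch.
class WriteResultSketch {

    /** Stand-in for a WriteStatus describing one newly written file group. */
    static class WriteStatusSketch {
        final String fileId;
        final long recordsWritten;

        WriteStatusSketch(String fileId, long recordsWritten) {
            this.fileId = fileId;
            this.recordsWritten = recordsWritten;
        }
    }

    /** Option (a): one structured object carrying both kinds of information. */
    static class WriteResult {
        final List<WriteStatusSketch> writeStatuses;
        final List<String> replacedFileGroupIds;

        WriteResult(List<WriteStatusSketch> writeStatuses, List<String> replacedFileGroupIds) {
            this.writeStatuses = Collections.unmodifiableList(new ArrayList<>(writeStatuses));
            this.replacedFileGroupIds = Collections.unmodifiableList(new ArrayList<>(replacedFileGroupIds));
        }
    }

    /** A post-commit step that reads the structured result instead of raw statuses. */
    static String describeCommit(WriteResult result) {
        long totalRecords = 0;
        for (WriteStatusSketch status : result.writeStatuses) {
            totalRecords += status.recordsWritten;
        }
        return "wrote " + result.writeStatuses.size() + " file group(s), "
            + totalRecords + " record(s); replaced " + result.replacedFileGroupIds;
    }

    public static void main(String[] args) {
        WriteResult result = new WriteResult(
            Arrays.asList(new WriteStatusSketch("f3", 100L)),
            Arrays.asList("f1", "f2"));
        System.out.println(describeCommit(result));
    }
}
```

The trade-off described above is visible here: every post-commit consumer has to accept `WriteResult` rather than the bare list of statuses, whereas option (b) keeps the return type and instead changes what a status can mean.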

@vinothchandar
Member

You suggested to remove HoodieReplaceStat

I think the suggestion was to simplify HoodieReplaceMetadata so that it only contains the extra information about replaced file groups, and to use HoodieCommitMetadata and its HoodieWriteStat for tracking the new file groups written.
Could we make HoodieReplaceStat part of the WriteStatus itself, for tracking the additional information about replaced file groups?

On cleaning vs archival, it would be good if we can implement this in cleaning. But can that be a follow-on item? Practically speaking, typical deployments don't configure cleaning that low.

@satishkotha
Member Author

@vinothchandar As discussed, I added a boolean in WriteStatus and removed HoodieReplaceStat. See this diff. I committed it as a separate git SHA because this still looks somewhat awkward IMO. Please take a look, and I can revert or reimplement it in a different way.
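
A rough picture of the boolean-flag approach, for readers without access to the diff: the field name `replaced` and the helper `fileIdsWhere` are assumptions made for this sketch, since the comment does not show the actual field or how the metadata is built from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a WriteStatus carrying a "replaced" flag.
class ReplaceFlagSketch {

    static class WriteStatusSketch {
        final String fileId;
        final boolean replaced; // assumed flag name; true for a replaced file group

        WriteStatusSketch(String fileId, boolean replaced) {
            this.fileId = fileId;
            this.replaced = replaced;
        }
    }

    /** Collects file ids whose flag matches, e.g. to fill the replace metadata. */
    static List<String> fileIdsWhere(List<WriteStatusSketch> statuses, boolean replaced) {
        List<String> ids = new ArrayList<>();
        for (WriteStatusSketch status : statuses) {
            if (status.replaced == replaced) {
                ids.add(status.fileId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<WriteStatusSketch> statuses = Arrays.asList(
            new WriteStatusSketch("f1", true),
            new WriteStatusSketch("f2", true),
            new WriteStatusSketch("f3", false));
        System.out.println("replaced=" + fileIdsWhere(statuses, true)
            + " written=" + fileIdsWhere(statuses, false));
    }
}
```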

Also, I created https://issues.apache.org/jira/browse/HUDI-1276 for cleaning replaced files during clean.

I also renamed 'replace' to 'replacecommit' everywhere as you suggested.

Please let me know if you have additional comments/suggestions

Member

@vinothchandar vinothchandar left a comment

Just getting started

@satishkotha satishkotha force-pushed the sk/replaceTopHudi branch 4 times, most recently from 9de32ee to a546978 Compare September 18, 2020 17:14
@bvaradar
Contributor

@satishkotha : Please ping me in the PR when you have updates and I can give incremental comments if needed.

@satishkotha
Member Author

satishkotha commented Sep 21, 2020

@satishkotha : Please ping me in the PR when you have updates and I can give incremental comments if needed.

@bvaradar Incremental FileSystem restore is the only big pending item. I'll get to it in the later part of this week.

@satishkotha satishkotha force-pushed the sk/replaceTopHudi branch 3 times, most recently from 7f3de1b to 980fed7 Compare September 23, 2020 20:39
Contributor

@bvaradar bvaradar left a comment

@satishkotha : Please go ahead and add unit-tests.

@bvaradar
Contributor

bvaradar commented Sep 28, 2020

@satishkotha : When the conflicts are resolved and the comments addressed, let me know and I will take a final pass to land.

@bvaradar bvaradar mentioned this pull request Sep 28, 2020
@satishkotha satishkotha changed the title [HUDI-1072][WIP] Introduce REPLACE top level action [HUDI-1072] Introduce REPLACE top level action Sep 29, 2020
Contributor

@bvaradar bvaradar left a comment

Thanks @satishkotha. Looks good, assuming the follow-up tasks: https://issues.apache.org/jira/browse/HUDI-1042

@bvaradar bvaradar merged commit a99e93b into apache:master Sep 30, 2020
prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Feb 22, 2021