Skip to content

[HUDI-544] Archived commits command code cleanup#1242

Merged
n3nash merged 2 commits intoapache:masterfrom
hddong:hudi-544
Sep 25, 2020
Merged

[HUDI-544] Archived commits command code cleanup#1242
n3nash merged 2 commits intoapache:masterfrom
hddong:hudi-544

Conversation

@hddong
Copy link
Contributor

@hddong hddong commented Jan 17, 2020

What is the purpose of the pull request

Now, archive path have two different default value: "archived" and "". It cause a bug.

Brief change log

  • Change default path to archive
  • Cli read archive by path stored in .hoodie
  • User define archive path

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@vinothchandar vinothchandar self-assigned this Jan 18, 2020
@hddong
Copy link
Contributor Author

hddong commented Jan 19, 2020

@vinothchandar It's ok now, please have a review when you are free.

@hddong hddong changed the title HUDI-544 Adjust the read and write path of archive [HUDI-544] Adjust the read and write path of archive Jan 20, 2020
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need @bvaradar @n3nash to chime in here... Assigning to @n3nash to make the call here..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already overridden at the datasource and deltastreamer level.. Please revert this since it can affect all the old tables written into . ..

@n3nash can you advise here.. only reason to not change is all the old data at uber..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinothchandar that's correct, we don't override this and all commits archived data at Uber is written to .hoodie/.

@hddong this change doesn't seem necessary but has serious concerns for older data written (even by other users who might not be overriding the archived folder name), could you please revert it ?

@hddong
Copy link
Contributor Author

hddong commented Jan 31, 2020

@n3nash @vinothchandar thanks for review. I had revert all which relate to DEFAULT_ARCHIVELOG_FOLDER.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If remove this option, user can not use specified path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If remove this option, user can not use specified path.

IMO, it is not need here, archiveFolder is stored in .hoodie.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config is here to allow for users to be able to provide a full path of the files under the archive folder and read it, lets leave it here - Btw, these commands are being reworked in this PR - #1274.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If leave it there, this pr will has nothing to be commited, just enhancing #1274

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming HoodieCLI.getTableMetaClient().getArchivePath() returns the path with .hoodie ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public String getArchivePath() {
  String archiveFolder = tableConfig.getArchivelogFolder();
  if (archiveFolder.equals(HoodieTableConfig.DEFAULT_ARCHIVELOG_FOLDER)) {
    return getMetaPath();
  } else {
    return getMetaPath() + "/" + archiveFolder;
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming HoodieCLI.getTableMetaClient().getArchivePath() returns the path with .hoodie ?

Yes, it read from metadata and include .hoodie.

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hddong one minor comment after which we can merge it

@hddong
Copy link
Contributor Author

hddong commented Feb 4, 2020

@n3nash addressed the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can affect all the old tables read archives.

Copy link
Contributor

@n3nash n3nash Feb 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

archivePath = new Path(metaClient.getArchivePath() + "/.commits_.archive*") is equivalent of new Path(basePath + "/.hoodie/.commits_.archive*"). the metaClient.getArchivePath() should return basePath + "/.hoodie" for all old tables.
@hmatu what concerns do you have ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n3nash, they are different,
one: /table/.hoodie/archived/.commits_.archive*
another: /table/.hoodie/.commits_archive*

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right way is add archiveFolderPattern to show archived commits command
like show archived commit stats dose.

Copy link
Contributor Author

@hddong hddong Feb 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmatu @n3nash It return /table/.hoodie/.commits_archive* if old tables use DEFAULT '' and return /table/.hoodie/archived/.commits_.archive* if old tables use path archived to archive. So it will return the correct path with archive path stored in .hoodie.
On the other hand,archiveFolderPattern allow for users to be able to provide a full path of the files under the archive folder and read it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explain, but I still think it's a right way that add archiveFolderPattern to show archived commits command like show archived commit stats dose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmatu I think this is just a clean up of code to remove the hard-coded "archived" and providing it through the archive folder name, this is fine IMO since this does not break any backwards compatibility @hddong confirm ?
What you are referring to is also correct, but that should be another PR to allow for reading a path pattern for show archived commits just like others have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n3nash, this pr aims to "Adjust the read and write path of archive". IMO, it is unnecessary to create an new pr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmatu I agree, I think this PR is just doing some basic cleanup so we should just rename this PR and merge it and fix the adjusting of the read/write path in a different one.
@hddong please rename this PR as "archived commits command code cleanup" and we can merge this

@hmatu
Copy link
Contributor

hmatu commented Feb 4, 2020

@hddong @n3nash @vinothchandar, ArchivedCommitsCommand contains two bellow commands,
if just using metaClient.getArchivePath() + "/.commits_.archive*", it will affect old tables.

Compare to changes:

  • metaClient.getArchivePath() + "/.commits_.archive*"
  • basePath + "/.hoodie/.commits_.archive*"

These two archive paths are different:

  • /table/.hoodie/archived/.commits_.archive*
  • /table/.hoodie/.commits_archive*

So the better way is add archiveFolderPattern to show archived commits command.

show archived commit stats
show archived commits

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hddong left a comment for you to confirm and then can merge this one

@n3nash
Copy link
Contributor

n3nash commented Feb 14, 2020

@hddong please take a look at the last comment and squash all your commits please

@hddong hddong changed the title [HUDI-544] Adjust the read and write path of archive [HUDI-544] Archived commits command code cleanup Feb 17, 2020
@hddong
Copy link
Contributor Author

hddong commented Feb 17, 2020

@n3nash Sorry to feedback late. Have renamed this PR. BTW, I think new Path(basePath + "/.hoodie/.commits_.archive*") will cause some bug like HUDI-540.

@hddong
Copy link
Contributor Author

hddong commented Feb 26, 2020

@n3nash @vinothchandar please review this agian.

@vinothchandar
Copy link
Member

@n3nash is shepherding this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hddong I'm accepting and will merge after you create another ticket for us to add to the release notes that if someone was actually overriding HOODIE_ARCHIVELOG_FOLDER_PROP_NAME that was not being honored before and now will be honored for those cases it can break (since this is the right thing to do)

Copy link
Contributor

@n3nash n3nash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hddong left 1 comment, ptal and then I'll merge it once you've created that ticket and squash all your commits to 1

@hddong hddong force-pushed the hudi-544 branch 2 times, most recently from 316dc5c to ad68190 Compare April 2, 2020 03:27
@hddong
Copy link
Contributor Author

hddong commented Apr 2, 2020

@n3nash squashed all commits to 1. I had do something for create another ticket for us, but I am not sure if it is right.

@n3nash
Copy link
Contributor

n3nash commented Jun 12, 2020

@hddong

@hddong I'm accepting and will merge after you create another ticket for us to add to the release notes that if someone was actually overriding HOODIE_ARCHIVELOG_FOLDER_PROP_NAME that was not being honored before and now will be honored for those cases it can break (since this is the right thing to do)

This is what I had commented. So this change will break if someone was overriding this prop name and expecting that path to be used (but it was actually not since it was never used). can you open a ticket for us to add this to the release notes and we can merge then.

@hddong
Copy link
Contributor Author

hddong commented Jun 13, 2020

@n3nash : Sorry, I understand the reason, but I don't quite understand what the mean of open a ticket and how to open a ticket for you. Can you give me some detailed instructions?

@n3nash
Copy link
Contributor

n3nash commented Jul 8, 2020

@hddong Please create a JIRA ticket here -> https://issues.apache.org/jira/projects/HUDI/issues and add the tag of documentation/release notes update.

@hddong
Copy link
Contributor Author

hddong commented Jul 10, 2020

@n3nash : Had rebase this PR and create a new jira https://issues.apache.org/jira/browse/HUDI-1085.

@n3nash
Copy link
Contributor

n3nash commented Aug 5, 2020

@hddong Sorry this fell through, please rebase to resolve conflicts and I will merge this asap

@hddong
Copy link
Contributor Author

hddong commented Aug 6, 2020

@n3nash : had rebase this, please have a review when free.

@hddong hddong force-pushed the hudi-544 branch 2 times, most recently from 47ca9e0 to 1808c57 Compare August 12, 2020 10:06
@hddong
Copy link
Contributor Author

hddong commented Aug 13, 2020

@n3nash : had rebase this again, please have a review when free.

@n3nash
Copy link
Contributor

n3nash commented Sep 11, 2020

@hddong Extremely sorry, this fell through the crack, please rebase and I will merge this right after.

@vinothchandar
Copy link
Member

@n3nash if you wish you can also rebase this yourself and push. See https://cwiki.apache.org/confluence/display/HUDI/Resources#Resources-PushingChangesToPRs for a how-to

@hddong
Copy link
Contributor Author

hddong commented Sep 21, 2020

@n3nash: Had rebase this.

@n3nash n3nash merged commit 2eaba09 into apache:master Sep 25, 2020
prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Feb 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants