@rkkalluri

What is the purpose of the pull request

Switching from a non-partitioned to a partitioned key gen currently does not throw any exception; it dumps the partitioned data next to the previously written unpartitioned data files.

The purpose of this pull request is to validate the switch from a non-partitioned to a partitioned key gen mechanism.

Brief change log

  • HoodieWriterUtils.getOriginKeyGenerator now returns the default value of KEYGENERATOR_CLASS_NAME when it is not provided as a config. This allows the validations to correctly catch an invalid key gen switch.
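
A minimal sketch of the fallback described in this change log, for illustration only. The object name, the key string, and the default class name below are assumed stand-ins, not the actual Hudi config objects; check them against your Hudi version.

```scala
object KeyGenDefaults {
  // Assumed stand-ins for KEYGENERATOR_CLASS_NAME.key() and its default value.
  val KeyGenClassKey = "hoodie.datasource.write.keygenerator.class"
  val DefaultKeyGenClass = "org.apache.hudi.keygen.SimpleKeyGenerator"

  // Before: a missing config yields null, so a later null-guarded
  // comparison against the table config never runs.
  def originKeyGenOrNull(parameters: Map[String, String]): String =
    parameters.getOrElse(KeyGenClassKey, null)

  // After: a missing config resolves to the default class name,
  // so the comparison has a concrete value to work with.
  def originKeyGenOrDefault(parameters: Map[String, String]): String =
    parameters.getOrElse(KeyGenClassKey, DefaultKeyGenClass)
}
```

With the second form, a write that omits the key gen config against a non-partitioned table still produces a class name that the validation can compare.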

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added TestHoodieSparkSqlWriter.testNonpartitonedToDefaultKeyGen to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • [x] Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Contributor

@nsivabalan nsivabalan left a comment

One minor comment. LGTM
@xushiyan @YannByron : you folks wanna take a look at changes for extra validation added for sql dml.

def getOriginKeyGenerator(parameters: Map[String, String]): String = {
val kg = parameters.getOrElse(KEYGENERATOR_CLASS_NAME.key(), null)
//first check table config for key generator
var kg = parameters.getOrElse(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key, null)
Contributor

kg -> keyGenClass

Contributor

why can't we directly do getOrDefault in L120?

Author

You mean like this ?

var kg = parameters.getOrElse(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key, parameters.getOrElse(KEYGENERATOR_CLASS_NAME.key(), KEYGENERATOR_CLASS_NAME.defaultValue()))

The only drawback to this is it might affect readability.
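
If readability is the concern, one possible alternative is to chain `Option`s instead of nesting `getOrElse`. This is only a sketch: the object name and the three string constants are hypothetical stand-ins for HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key, KEYGENERATOR_CLASS_NAME.key() and KEYGENERATOR_CLASS_NAME.defaultValue().

```scala
object KeyGenLookup {
  // Hypothetical stand-ins for the Hudi config keys and default value.
  val TableKeyGenKey = "hoodie.table.keygenerator.class"
  val WriterKeyGenKey = "hoodie.datasource.write.keygenerator.class"
  val DefaultKeyGen = "org.apache.hudi.keygen.SimpleKeyGenerator"

  // Nested form, equivalent to the one-liner above.
  def nested(parameters: Map[String, String]): String =
    parameters.getOrElse(TableKeyGenKey,
      parameters.getOrElse(WriterKeyGenKey, DefaultKeyGen))

  // Chained form: table config first, then writer config, then the default.
  def chained(parameters: Map[String, String]): String =
    parameters.get(TableKeyGenKey)
      .orElse(parameters.get(WriterKeyGenKey))
      .getOrElse(DefaultKeyGen)
}
```

Both forms apply the same precedence; the chained one reads top to bottom in lookup order.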

Contributor

yes. but getOrElse or getOrDefault is used widely across the code base. should be ok

Author

This code has been reverted since we don't need to change the logic here.

@nsivabalan nsivabalan changed the title [HUDI-3726] Switching from non-partitioned to partitioned key gen [HUDI-3726] Harden constraints around switching between different key generators Apr 2, 2022
@YannByron
Contributor

LGTM. @rkkalluri

@rkkalluri
Author

@hudi-bot run azure

@xushiyan
Member

xushiyan commented Apr 3, 2022

@rkkalluri can you rebase instead of merging from master? it's hard to review the diff and the commit history

@rkkalluri rkkalluri force-pushed the hudi-3726-nonpartitioned-to-partitioned branch 2 times, most recently from 529fbf1 to 10a638b Compare April 4, 2022 02:31
@rkkalluri
Author

@xushiyan @nsivabalan can you guys take a look at this PR again. I will rebase it once I see some comments again. Currently the build is passing and there are no conflicts.

Contributor

@nsivabalan nsivabalan left a comment

LGTM. But do file a follow-up JIRA for validating switches between any key gens in general.

@rkkalluri rkkalluri force-pushed the hudi-3726-nonpartitioned-to-partitioned branch 2 times, most recently from 154e5ee to 264c63a Compare April 6, 2022 13:38
@rkkalluri rkkalluri force-pushed the hudi-3726-nonpartitioned-to-partitioned branch from 264c63a to 1864fc9 Compare April 6, 2022 15:14
@hudi-bot
Collaborator

hudi-bot commented Apr 6, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan merged commit 939b3d1 into apache:master Apr 6, 2022
@nsivabalan
Contributor

Good job on the first patch 👏

@rkkalluri rkkalluri deleted the hudi-3726-nonpartitioned-to-partitioned branch April 6, 2022 17:39
Member

@xushiyan xushiyan left a comment

this looks helpful. 👏🏼 on the first patch.

Comment on lines +186 to +190
if (null != tableConfigKeyGen && null != datasourceKeyGen) {
val nonPartitionedTableConfig = tableConfigKeyGen.equals(classOf[NonpartitionedKeyGenerator].getCanonicalName)
val simpleKeyDataSourceConfig = datasourceKeyGen.equals(classOf[SimpleKeyGenerator].getCanonicalName)
if (nonPartitionedTableConfig && simpleKeyDataSourceConfig) {
diffConfigs.append(s"KeyGenerator:\t$datasourceKeyGen\t$tableConfigKeyGen\n")
Member

I wonder if there are more cases we need to catch here, as this check is very specific. What about the cases where users subclassed the built-in keygen? Is there any generic way to prevent discrepancy?

Contributor

Yes, I also prefer to check all possible combinations of switches.
I have created a follow-up task: https://issues.apache.org/jira/browse/HUDI-3820
@rkkalluri : there are 2 work items in there. a: adding validations and tests for switching between different key gens. b: with the insert_overwrite_table operation, we should not do the validation, and should overwrite the table config if the key gen is changed.
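
To make the two work items concrete, here is a rough sketch of what a more general check could look like. It is not the merged code and not Hudi's API: the object, the function, and its parameters are hypothetical, and it compares plain class-name strings, so it does not by itself answer the question about user-defined subclasses of the built-in key gens.

```scala
object KeyGenValidation {
  // Returns a diff line when the two key generators disagree,
  // or None when the write may proceed.
  def keyGenDiff(tableConfigKeyGen: Option[String],
                 datasourceKeyGen: Option[String],
                 operation: String): Option[String] = {
    // Work item b: insert_overwrite_table replaces the table contents,
    // so the validation is skipped for that operation.
    if (operation == "insert_overwrite_table") None
    else for {
      t <- tableConfigKeyGen
      d <- datasourceKeyGen
      if t != d // work item a: flag any mismatch, not only one specific pair
    } yield s"KeyGenerator:\t$d\t$t\n"
  }
}
```

The returned string uses the same tab-separated layout as the diffConfigs entry in the snippet under review, so a caller could append it the same way.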

xushiyan pushed a commit that referenced this pull request Apr 14, 2022