
Conversation

@lokeshj1703 (Collaborator) commented Jan 13, 2023

Change Logs

Recently we added support for auto generation of record keys for hudi tables, but some configs need to be tweaked when such auto generation of record keys is used.

  1. Disallow de-dup. If de-dup is enabled (combine before insert), we will fail the write.
  2. Fail if "upsert" is explicitly set as the operation type. With auto generation of record keys, we can't support upsert.
  3. Fail if "hoodie.merge.allow.duplicate.on.inserts" is not enabled, so that hudi does not unintentionally de-dup due to small file handling.
  4. Fail if someone chooses the MOR table type, for two reasons. a: there are no updates, so there is no point in choosing MOR. b: preCombine is a mandatory field w/ MOR tables, but for tables w/ auto-generated record keys, preCombine is not required to be set.
  5. Fail if the preCombine or record key field is set.

This patch also fixes auto generation of record keys w/ spark-sql writes. Added tests for the same (create table + insert, and CTAS).
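Taken together, the checks above could be sketched roughly as follows. This is a hypothetical illustration, not the patch's actual code: the class name is invented, and the config key strings are illustrative stand-ins rather than Hudi's exact constants.

```java
import java.util.Map;

// Hypothetical sketch of the five validations above; the class name and
// config key strings are illustrative, not Hudi's actual constants.
public class AutoKeyGenValidations {
    public static void validate(Map<String, String> cfg) {
        // 1. Disallow de-dup (combine before insert).
        if (Boolean.parseBoolean(cfg.getOrDefault("hoodie.combine.before.insert", "false"))) {
            throw new IllegalArgumentException("de-dup (combine before insert) is not supported with auto-generated record keys");
        }
        // 2. Disallow an explicit upsert operation.
        if ("upsert".equals(cfg.get("hoodie.datasource.write.operation"))) {
            throw new IllegalArgumentException("upsert is not supported with auto-generated record keys");
        }
        // 3. Require duplicates to be allowed on inserts (small file handling).
        if (!Boolean.parseBoolean(cfg.getOrDefault("hoodie.merge.allow.duplicate.on.inserts", "false"))) {
            throw new IllegalArgumentException("hoodie.merge.allow.duplicate.on.inserts must be enabled");
        }
        // 4. Disallow MOR tables.
        if ("MERGE_ON_READ".equals(cfg.get("hoodie.datasource.write.table.type"))) {
            throw new IllegalArgumentException("MOR is not supported with auto-generated record keys; use COW");
        }
        // 5. Disallow an explicit record key or preCombine field.
        if (cfg.containsKey("hoodie.datasource.write.recordkey.field")
                || cfg.containsKey("hoodie.datasource.write.precombine.field")) {
            throw new IllegalArgumentException("record key / preCombine field must not be set");
        }
    }
}
```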

Impact

Enables smoother usage of auto record key generation.

Risk level (write none, low medium or high below)

Medium (Added tests)

Documentation Update

Adds support for KeylessGenerator for insert operations. This ensures users don't need to configure a record key for inserts into an immutable dataset.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@lokeshj1703 (Collaborator, Author)

@nsivabalan Can you please take a look?

@danny0405 danny0405 added writer-core priority:high Significant impact; potential bugs and removed priority:high Significant impact; potential bugs labels Jan 13, 2023
private var asyncCompactionTriggerFnDefined: Boolean = false
private var asyncClusteringTriggerFnDefined: Boolean = false

def changeOperationToInsertIfRequired(writeOperationType: WriteOperationType, hoodieConfig: HoodieConfig)
Contributor:

afaik in this case, when changing the op from upsert to insert, one should also make sure that the insert mode (SQL_INSERT_MODE) is set to "upsert"; otherwise either a duplicate record will be created (non-strict mode) or an error will be thrown (strict mode)
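The mode behavior being described could be modeled in miniature like this; the enum and method names are hypothetical, and the real handling lives in Hudi's Spark SQL write path.

```java
// Hypothetical model of the SQL insert-mode behavior described in the
// comment above; names are illustrative, not Hudi's actual enums.
enum SqlInsertMode { NON_STRICT, STRICT, UPSERT }

class InsertModeBehavior {
    // Outcome when an incoming record's key already exists in the table
    // and the operation has been switched from upsert to plain insert.
    static String onDuplicateKey(SqlInsertMode mode) {
        switch (mode) {
            case NON_STRICT: return "duplicate record written";
            case STRICT:     return "error thrown";
            default:         return "existing record updated"; // UPSERT
        }
    }
}
```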

Contributor:

@kazdy: Would you be able to help us w/ a contribution on this? You could take this branch and work on top of it. As of this patch, we have tested that auto generation of record keys works for spark-sql (CTAS and create table + inserts). So, would appreciate it if you can put up a fix for the strict/non-strict mode.

@lokeshj1703 lokeshj1703 force-pushed the HUDI-2681 branch 2 times, most recently from 304400b to a586f65 Compare January 14, 2023 07:29
@kazdy (Contributor) commented Jan 18, 2023

@lokeshj1703 I also noticed that in flink no_precombine is already supported and is guarded behind a config.
Maybe it would be good to do the same in spark ds?

public static final String NO_PRE_COMBINE = "no_precombine";

@nsivabalan nsivabalan changed the title [HUDI-2681] Make hoodie record_key and preCombine_key optional [HUDI-2681] Some fixes and tweaks to configs when auto generation of record keys is enabled Jan 18, 2023
@lokeshj1703 lokeshj1703 changed the title [HUDI-2681] Some fixes and tweaks to configs when auto generation of record keys is enabled [HUDI-2681] Some fixes and config validation when auto generation of record keys is enabled Jan 22, 2023
throw new HoodieKeyGeneratorException(s"Config ${DataSourceWriteOptions.TABLE_TYPE.key()} should be set to " +
s"COW_TABLE_TYPE_OPT_VAL when $autoGenerateRecordKey is used")
}
if (hoodieConfig.getString(OPERATION) == UPSERT_OPERATION_OPT_VAL) {
Collaborator (Author):

Should we change the condition to succeed only if the operation is insert or bulk insert? Is it possible for some operation other than {insert, upsert, bulk insert} to reach this function?

@nsivabalan (Contributor)

Few high-level points:

Dis-allow de-dup. If de-dup is enabled (combine before insert), we will fail the write.
Siva: looks ok to me.

Fail if "upsert" is set. With auto generation of record keys, we can't support upsert. I mean, every record is treated as a new record, so doing an index lookup only adds unnecessary overhead.
Siva: I feel we can automatically switch to insert. It's an implementation detail. We will anyways document that auto generation of record keys is meant to be used only for immutable use-cases. So, rather than failing, I would prefer to auto switch to "insert".

Fail if "hoodie.merge.allow.duplicate.on.inserts" is not enabled so that hudi does not unintentionally de-dup due to small file handling.
Siva: we should automatically enable this since this is more of an implementation detail. I mean, we should not fail if the user does not set this.

Fail if someone chooses to use the MOR table type, for two reasons. a: there are no updates, so there is no point in choosing MOR. b: preCombine is a mandatory field w/ MOR tables, but for tables w/ auto-generated record keys, preCombine is not required to be set.
Siva: seems ok.

Fail if the preCombine or record key field is set.
Siva: seems ok to me.
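The "auto switch rather than fail" suggestion above could be sketched like this; the helper class and method names are hypothetical, while the config keys are the ones discussed in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the auto-switch approach suggested above: silently adjust the
// offending configs instead of failing. Class and method names are hypothetical.
class AutoKeyGenAdjuster {
    static final String OP_KEY = "hoodie.datasource.write.operation";

    static Map<String, String> adjust(Map<String, String> params) {
        Map<String, String> out = new HashMap<>(params);
        if ("upsert".equals(out.get(OP_KEY))) {
            out.put(OP_KEY, "insert"); // auto switch to insert rather than fail
        }
        // keep small-file handling from silently de-duping inserts
        out.put("hoodie.merge.allow.duplicate.on.inserts", "true");
        return out;
    }
}
```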

@xushiyan xushiyan added priority:blocker Production down; release blocker and removed priority:high Significant impact; potential bugs labels Jan 24, 2023
…record keys is enabled

(cherry picked from commit 6616b7afc7ba5d4b8721582f8e553cefbf07fd88)
(cherry picked from commit 2b0fe973c7fa19eda39ecf52622dd07b0976f38e)
private static void validateRecordKey(String recordKeyField) {
checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
private static void validateRecordKey(String recordKeyField, boolean isAutoGenerateRecordKeyEnabled) {
checkArgument(recordKeyField == null || !recordKeyField.isEmpty() || isAutoGenerateRecordKeyEnabled,
Member:

Should the validation on the record key field be:

  1. It is not empty.
  2. Or, isAutoGenerateRecordKeyEnabled is true.

I am assuming the record key field can be null only when isAutoGenerateRecordKeyEnabled is true. We can take it as a follow-up after confirming why a null record key field was allowed in the first place.
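The stricter check being proposed would read roughly as follows; the class and method names are hypothetical, not the patch's actual code.

```java
// Sketch of the stricter validation proposed above: the record key field
// must be a non-empty string unless auto record key generation is enabled.
// Class and method names are hypothetical.
class RecordKeyValidation {
    static void validateRecordKeyStrict(String recordKeyField, boolean isAutoGenerateRecordKeyEnabled) {
        boolean nonEmpty = recordKeyField != null && !recordKeyField.isEmpty();
        if (!(nonEmpty || isAutoGenerateRecordKeyEnabled)) {
            throw new IllegalArgumentException(
                "record key field must be non-empty unless auto record key generation is enabled");
        }
    }
}
```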


import org.apache.hudi.exception.HoodieException
import org.apache.hudi.{HoodieSparkSqlWriter, SparkAdapterSupport}
import org.apache.hudi.{DataSourceWriteOptions, HoodieSparkSqlWriter, SparkAdapterSupport}
Member:

nit: unused import

@hudi-bot (Collaborator)

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan (Contributor)

We are closing this for now; will put up a proper fix later.


Labels

priority:blocker Production down; release blocker


7 participants