[HUDI-5514] Add in support for a keyless workflow #7640

the-other-tim-brown · 2023-01-11T00:54:59Z

Change Logs

Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.

Impact

No impact to existing users since this is a new KeyGenerator that users will need to opt into.

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

nsivabalan

Can you fill in the PR description?
LGTM.

hudi-bot · 2023-01-12T00:48:36Z

CI report:

70468d9 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

nsivabalan · 2023-01-12T15:02:05Z

CI failure due to unrelated flaky test.

nsivabalan

Wondering if we should call this InternalKeyGenerator (or AutoKeyGenerator) rather than KeylessKeyGenerator, as the current name sounds contradictory.

nsivabalan · 2023-01-12T15:03:57Z

Will go ahead and land this for now. We will get to consensus async.

kazdy · 2023-01-12T18:48:36Z

Hi @the-other-tim-brown
I'm interested in this functionality and have some questions. If I understand correctly, the UUID will be the same for the same set of values in the columns it's based on?

So this generator can't be used for generating a surrogate key (a standard practice in data warehousing), since the key is derived from the data? My understanding of the keyless model is that the record key is a surrogate key that's globally unique.

I'm wondering if there's something that prevents creating globally unique IDs via the key generator interface (maybe virtual keys support)?
At the same time, in the context of this PR, what's the place of UuidKeyGenerator? Could it be used to generate surrogate keys that are globally unique?

the-other-tim-brown · 2023-01-13T15:59:44Z

Yes, it is correct that the keys are not guaranteed to be unique here. The issue with using a random UUID for us was that we were using DeltaStreamer, and if the DAG was ever retriggered, the data was regenerated with new random UUIDs. That could cause records to be written to different file groups, leading to duplicate or lost data due to some internals of how Hudi works. @nsivabalan had some similar thoughts around other approaches, can you chime in here?
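To make the retrigger concern concrete, here is a minimal sketch (not the code from this PR) contrasting a random UUID with a name-based UUID derived from record content. The class name `KeyStabilityDemo`, the helper names, and the pipe-joined record string are assumptions made for illustration only; `UUID.randomUUID()` and `UUID.nameUUIDFromBytes()` are standard JDK calls.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class KeyStabilityDemo {
    // Name-based (type 3) UUID: the same input bytes always produce the same UUID,
    // so recomputing the key after a retry gives the same value.
    static String deterministicKey(String recordContent) {
        return UUID.nameUUIDFromBytes(recordContent.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Random (type 4) UUID: every call produces a fresh value,
    // so a recomputed stage would assign the record a new key.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Hypothetical record content, joined into one string for the example.
        String record = "rider-A|driver-B|27.70";

        boolean stable = deterministicKey(record).equals(deterministicKey(record));
        System.out.println("deterministic key stable across attempts: " + stable);

        boolean sameRandom = randomKey().equals(randomKey());
        System.out.println("random key stable across attempts: " + sameRandom);
    }
}
```

The trade-off discussed in this thread follows directly: two records with identical content also get identical deterministic keys, which is why such keys are not guaranteed to be unique.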

nsivabalan · 2023-01-13T18:34:15Z

Hey @kazdy: we also jammed quite a bit before arriving at this solution. For example, we did take a stab at generating unique IDs for every record here, but given the problem Tim described, that approach might not work for 7622. If we zoom into what happens for a commit in Hudi, the flow is:
keyGen -> index lookup -> upsert partitioner -> write files by executor (merge handle, create handle, or append handle) -> maybe write to metadata table -> complete commit.

The main crux here is that, in the upsert partitioner, we assign records to different insert buckets based on the record key hash. Let's say the upsert partitioner decided to add 3 new insert buckets and split 30k records among those 3 insert buckets (file groups). This assignment is done by hashing the record key.

Given this, if the keyGen stage is retriggered for a subset of Spark partitions due to failures, a record that reaches the upsert partitioner again could be assigned to a different insert bucket than in its first attempt. So there is a chance we will miss some records or pack more records into one file group than we intended.

Let me know if this makes sense. Happy to jam to see if we can really pull this off by generating some sort of row ID rather than basing the key on the record payload.
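The bucket assignment described above can be sketched as follows. This is a simplified stand-in, not Hudi's actual upsert partitioner: `InsertBucketDemo` and `bucketFor` are hypothetical names, and a plain hash-modulo is assumed in place of whatever weighting the real partitioner applies.

```java
import java.util.UUID;

class InsertBucketDemo {
    // Hypothetical bucket assignment: hash the record key and map it onto
    // one of numBuckets insert buckets (file groups).
    static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 3;

        // A key derived from record content is identical on a retry,
        // so the record lands in the same bucket both times.
        String stableKey = "key-derived-from-record-content";
        int firstAttempt = bucketFor(stableKey, numBuckets);
        int retryAttempt = bucketFor(stableKey, numBuckets);
        System.out.println("stable key, same bucket: " + (firstAttempt == retryAttempt));

        // A random key is regenerated on a retry, so the same logical record
        // may be routed to a different bucket than in the first attempt.
        int randomFirst = bucketFor(UUID.randomUUID().toString(), numBuckets);
        int randomRetry = bucketFor(UUID.randomUUID().toString(), numBuckets);
        System.out.println("random key buckets: first=" + randomFirst + ", retry=" + randomRetry);
    }
}
```

When only some Spark partitions are recomputed, records with regenerated keys can shift between buckets while the others stay put, which is how records could be dropped or a file group overfilled.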

kazdy · 2023-01-17T14:17:40Z

Thanks for the explanation. So it seems the key generator must be deterministic and there's no way around it.

What I do with Hudi datasets where I need a surrogate key is generate a column with a UUID using the built-in Spark uuid() function. I think it's a valid way to do it :)
I guess using an engine-specific uuid() function in the keygen would not change anything.

alexeykudinkin · 2023-01-17T17:38:54Z

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeylessKeyGenerator.java

+        continue;
+      }
+      nonNullFields++;
+      key.append(value.hashCode());


@the-other-tim-brown this is an incorrect way to generate the hash/key: we can't distinguish between the cases hash_1=12 and hash_1=1, hash_2=2.

the-other-tim-brown · 2023-01-17

Do you mean we should append some sort of delimiter after each hash code?
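The ambiguity raised in this review thread can be reproduced in a few lines. The sketch below is illustrative only: `HashConcatDemo` and both helper names are hypothetical, and the delimiter variant is one possible fix suggested by the question above, not necessarily what was adopted.

```java
import java.util.Arrays;
import java.util.List;

class HashConcatDemo {
    // Mirrors the reviewed approach: append each hash code with no separator.
    static String concatNoDelimiter(List<Integer> hashes) {
        StringBuilder key = new StringBuilder();
        for (int h : hashes) {
            key.append(h);
        }
        return key.toString();
    }

    // Possible fix (an assumption): terminate each hash code with a delimiter
    // that cannot appear in the decimal form of an int.
    static String concatWithDelimiter(List<Integer> hashes, char delimiter) {
        StringBuilder key = new StringBuilder();
        for (int h : hashes) {
            key.append(h).append(delimiter);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        List<Integer> oneField = Arrays.asList(12);     // hash_1 = 12
        List<Integer> twoFields = Arrays.asList(1, 2);  // hash_1 = 1, hash_2 = 2

        // Both inputs collapse to the string "12" without a delimiter.
        System.out.println(concatNoDelimiter(oneField).equals(concatNoDelimiter(twoFields)));

        // With a delimiter the strings are "12," and "1,2," and no longer collide.
        System.out.println(concatWithDelimiter(oneField, ',').equals(concatWithDelimiter(twoFields, ',')));
    }
}
```

A delimiter only removes the concatenation ambiguity. Two different field values can still share the same `hashCode()`, which is a separate source of collisions.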

…udi (#7726)" (#7747) * Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (#7726)" This reverts commit 2fc20c1. * Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (#7640)" This reverts commit eacae1e.

kazdy · 2023-01-30T15:11:47Z

Let me know if this makes sense. happy to jam to see if we can really pull this off by a row Id sort of generating rather than based on record payload.

@nsivabalan I did some reading and found out that Oracle and Postgres both use pseudo/system columns to imitate a PK if one is not defined.
Do you think it would be possible to do something similar to the Oracle ROWID pseudo column or the Postgres ctid system column?

Rowids contain the following information:
The data block of the data file containing the row. The length of this string depends on your operating system.

The row in the data block.

The database file containing the row. The first data file has the number 1. The length of this string depends on your operating system.

The data object number, which is an identification number assigned to every database segment. You can retrieve the data object number from the data dictionary views USER_OBJECTS, DBA_OBJECTS, and ALL_OBJECTS. Objects that share the same segment (clustered tables in the same cluster, for example) have the same object number.

It seems like it would be doable with the vectorized Parquet reader's rowId/column vector etc. in place of the "row in data block"; the file name is known and saved in the meta columns.
I'm not 100% convinced that it would be possible to retrieve this data from the Parquet file.
I don't know how Hudi would handle the first write, since there's no information about the column vector and the hash cannot be generated. So this may be impossible to use in the upsert partitioner, in which case the whole idea does not make any sense :).
I also don't know how it would play out with clustering, since files need to be rewritten and therefore the ROWID/record key would change.

I already see some restrictions:

only support it in CoW (because the Parquet vectorized reader needs to be used?),
only available with virtual keys (?),
no incremental queries allowed (I think CDC from a non-PK table is not supported in Oracle RDBMS),
no support for datasource writes with upsert when the ROWID/record key is not provided (should not be a problem with Spark SQL, since it first queries the Hudi table and could therefore get the ROWID) (?)

But it would allow doing DS+SQL inserts, SQL updates, and SQL deletes without the need to define a PK on the table.
Does it make any sense?
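The clustering concern in the comment above can be illustrated with a small sketch of a position-based identifier. Everything here is hypothetical: `RowIdDemo`, the `RowId` class, its two fields, and the file names are made up for illustration and do not correspond to any existing Hudi, Parquet, or Oracle API.

```java
import java.util.Objects;

class RowIdDemo {
    // Hypothetical ROWID-like identifier: the file holding the row plus the
    // row's position inside that file.
    static final class RowId {
        final String fileName;
        final long rowIndex;

        RowId(String fileName, long rowIndex) {
            this.fileName = fileName;
            this.rowIndex = rowIndex;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof RowId)) {
                return false;
            }
            RowId that = (RowId) other;
            return rowIndex == that.rowIndex && fileName.equals(that.fileName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fileName, rowIndex);
        }
    }

    public static void main(String[] args) {
        // Identifier of a row as first written (made-up file name).
        RowId beforeRewrite = new RowId("file-a.parquet", 42);
        // After a rewrite such as clustering, the same logical row may sit in
        // another file at another position, so its identifier differs.
        RowId afterRewrite = new RowId("file-b.parquet", 3);
        System.out.println("identifier survives rewrite: " + beforeRewrite.equals(afterRewrite));
    }
}
```

Because the identifier is purely physical, any operation that moves rows would need to either preserve a mapping from old to new identifiers or accept that keys change.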

nsivabalan reviewed Jan 11, 2023

danny0405 assigned nsivabalan Jan 11, 2023

danny0405 added the priority:blocker (Production down; release blocker) and writer-core labels Jan 11, 2023

the-other-tim-brown changed the title ~~Add in support for a keyless workflow by building an ID based off of …~~ Add in support for a keyless workflow Jan 11, 2023

the-other-tim-brown changed the title ~~Add in support for a keyless workflow~~ [HUDI-5532] Add in support for a keyless workflow Jan 11, 2023

the-other-tim-brown marked this pull request as ready for review January 11, 2023 14:52

nsivabalan changed the title ~~[HUDI-5532] Add in support for a keyless workflow~~ [HUDI-5514] Add in support for a keyless workflow Jan 11, 2023

Add in support for a keyless workflow by building an ID based off of values within the record

70468d9

nsivabalan force-pushed the keyless-keygenerator branch from ea4b269 to 70468d9 Compare January 11, 2023 15:03

nsivabalan approved these changes Jan 12, 2023

nsivabalan merged commit eacae1e into apache:master Jan 12, 2023

the-other-tim-brown deleted the keyless-keygenerator branch January 13, 2023 16:00

alexeykudinkin reviewed Jan 17, 2023

This was referenced Jan 18, 2023

[HUDI-2681] Some fixes and config validation when auto generation of record keys is enabled #7668

Closed

[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi #7726

Merged

codope mentioned this pull request Jan 25, 2023

Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (#7726)" #7747

Merged

4 tasks

codope added a commit to codope/hudi that referenced this pull request Jan 25, 2023

Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)"

ffd309f

This reverts commit eacae1e.

alexeykudinkin Jan 17, 2023

the-other-tim-brown Jan 17, 2023

alexeykudinkin Jan 18, 2023

6 participants
