[HUDI-5514] Add in support for a keyless workflow #7640
nsivabalan
left a comment
Can you fill in the PR description?
LGTM.
…values within the record
ea4b269 to 70468d9
The CI failure is due to an unrelated flaky test.
nsivabalan
left a comment
Wondering if we should call this InternalKeyGenerator (or AutoKeyGenerator) rather than KeyLessKeyGenerator, as the current name sounds contradictory.
Will go ahead and land this for now. We will reach consensus async.
Hi @the-other-tim-brown, so this generator can't be used to generate a surrogate key (a standard practice in data warehousing), since the key is derived from the data? My understanding of the keyless model is that the record key is a globally unique surrogate key. I'm wondering whether something prevents creating globally unique IDs via the key generator interface (maybe virtual keys support)?
Yes, it is correct that the keys are not guaranteed to be unique here. The issue with using a random UUID was that we were using deltastreamer, and if the DAG was ever retriggered, the data was regenerated with new random UUIDs. Those records could then be written to different file groups, causing duplicate or lost data due to some internals of how Hudi works. @nsivabalan had some similar thoughts around other approaches; can you chime in here?
Hey @kazdy: we also jammed quite a bit before arriving at this solution. For example, we did take a stab at generating unique IDs for every record here, but because of the problem Tim described, that might not work for 7622. If we zoom into what happens during a commit in Hudi, the crux is that the upsert partitioner assigns records to different insert buckets based on the record key hash. Let's say the upsert partitioner decides to add 3 new insert buckets and splits 30k records among those 3 insert buckets (file groups). This assignment is done by hashing the record key. Given this, if the keyGen stage is retriggered for a subset of Spark partitions due to failures, a record that reaches the upsert partitioner again could be assigned to a different insert bucket than on its first attempt. So there is a chance we will miss some records, or pack more records into one file group than we intended. Let me know if this makes sense. Happy to jam to see if we can really pull this off with some sort of row ID generation rather than something based on the record payload.
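To make the retry problem above concrete, here is a minimal sketch. It is not Hudi's actual upsert partitioner: `bucketFor`, `randomKey`, and `deterministicKey` are hypothetical helpers, and the modulo-of-hash assignment is only an assumed stand-in for "assignment is done by hashing the record key". It shows that a key derived from the payload maps to the same bucket on every attempt, while a random UUID gives no such guarantee.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class BucketAssignmentDemo {
    // Hypothetical stand-in for hash-based insert bucket assignment.
    public static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    // Non-deterministic: a fresh key every time the stage runs.
    public static String randomKey(String payload) {
        return UUID.randomUUID().toString();
    }

    // Deterministic: the same payload always yields the same key.
    public static String deterministicKey(String payload) {
        return UUID.nameUUIDFromBytes(payload.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String payload = "rider-A,driver-X,27.70";
        int numBuckets = 3;

        // Simulate the first attempt and a retriggered attempt for the same record.
        int randomFirst = bucketFor(randomKey(payload), numBuckets);
        int randomRetry = bucketFor(randomKey(payload), numBuckets);
        int stableFirst = bucketFor(deterministicKey(payload), numBuckets);
        int stableRetry = bucketFor(deterministicKey(payload), numBuckets);

        // The random pair may or may not match; the deterministic pair always matches.
        System.out.println("random key buckets:        " + randomFirst + " vs " + randomRetry);
        System.out.println("deterministic key buckets: " + stableFirst + " vs " + stableRetry);
    }
}
```

With random keys, a partial retry can place the same logical record in a different bucket than the first attempt did, which is the duplicate or missing data scenario described in the comment.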
Thanks for the explanation. So it seems the key generator must be deterministic and there's no way around it. For Hudi datasets where I need a surrogate key, I just generate a column with a UUID using the built-in Spark uuid() function. I think it's a valid way to do it :)
    continue;
  }
  nonNullFields++;
  key.append(value.hashCode());
@the-other-tim-brown this is an incorrect way to generate the hash/key: we can't distinguish between the case hash_1=12 and the case hash_1=1, hash_2=2.
Do you mean we should append some sort of delimiter after each hashcode?
Correct
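The collision raised in this thread can be reproduced with a small sketch. This is not the PR's code: `concatHashes` only mirrors the reviewed `key.append(value.hashCode())` pattern, and `concatHashesWithDelimiter` is a hypothetical fix that assumes a delimiter which can never appear in a printed `int` (so `'_'` rather than `'-'`, since hash codes can be negative).

```java
import java.util.Arrays;
import java.util.List;

public class HashKeyDelimiterDemo {
    // Mirrors the reviewed pattern: hash codes are appended back to back.
    public static String concatHashes(List<Object> values) {
        StringBuilder key = new StringBuilder();
        for (Object value : values) {
            if (value == null) {
                continue; // null fields are skipped, as in the reviewed snippet
            }
            key.append(value.hashCode());
        }
        return key.toString();
    }

    // Hypothetical fix: separate hash codes so their boundaries stay visible.
    public static String concatHashesWithDelimiter(List<Object> values, char delimiter) {
        StringBuilder key = new StringBuilder();
        for (Object value : values) {
            if (value == null) {
                continue;
            }
            if (key.length() > 0) {
                key.append(delimiter);
            }
            key.append(value.hashCode());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int itself, so these match the review example.
        List<Object> oneField = Arrays.<Object>asList(12);
        List<Object> twoFields = Arrays.<Object>asList(1, 2);

        System.out.println(concatHashes(oneField));   // prints "12"
        System.out.println(concatHashes(twoFields));  // prints "12" (collision)

        System.out.println(concatHashesWithDelimiter(oneField, '_'));  // prints "12"
        System.out.println(concatHashesWithDelimiter(twoFields, '_')); // prints "1_2"
    }
}
```

A delimiter removes only the ambiguity about where one hash code ends and the next begins. Two different values that share a `hashCode()` still collide.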
… an ID based off of values within the record (apache#7640)" This reverts commit eacae1e.
@nsivabalan I did some reading and found out that Oracle and Postgres both use pseudo/system columns to imitate a PK if one is not defined.
It seems like it would be doable with the vectorized Parquet reader (rowId, ColumnVector, etc.): instead of "row in data block", the file name is known and saved in the meta columns. I already see some restrictions:
But it would allow doing DS+SQL inserts, SQL updates, and SQL deletes without the need to define a PK on the table.
…ased off of values within the record (apache#7640) - Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.
…udi (apache#7726)" (apache#7747) * Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)" This reverts commit 2fc20c1. * Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)" This reverts commit eacae1e.
Change Logs
Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.
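As a rough illustration of what "deterministically generates a UUID based off a subset of fields" could look like, here is a standalone sketch. It is not the KeyGenerator added by this PR: `generateId`, the `Map`-based record, and the length-prefixed encoding are all assumptions made for the example. It uses `UUID.nameUUIDFromBytes`, which produces a name-based (version 3) UUID from a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class DeterministicIdSketch {
    // Hypothetical helper: builds an ID from the chosen fields of a record.
    public static String generateId(Map<String, Object> record, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            Object value = record.get(field);
            if (value == null) {
                continue; // skip missing or null fields
            }
            String text = String.valueOf(value);
            // The length prefix keeps field boundaries unambiguous.
            sb.append(field).append('=').append(text.length()).append(':').append(text).append(';');
        }
        byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(bytes).toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("rider", "rider-A");
        record.put("driver", "driver-X");
        record.put("fare", 27.70);

        List<String> fields = Arrays.asList("rider", "driver", "fare");

        // Re-running on the same record yields the same ID.
        System.out.println(generateId(record, fields));
        System.out.println(generateId(record, fields));
    }
}
```

As discussed in the conversation, an ID derived from record values is stable across retries but not guaranteed unique: two records with identical values in the chosen fields get the same ID.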
Impact
No impact to existing users since this is a new KeyGenerator that users will need to opt into.
Risk level (write none, low medium or high below)
Low