Skip to content

Conversation

@the-other-tim-brown
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown commented Jan 11, 2023

Change Logs

Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.

Impact

No impact to existing users since this is a new KeyGenerator that users will need to opt into.

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fill in the PR description.
LGTM.

@danny0405 danny0405 added priority:blocker Production down; release blocker writer-core labels Jan 11, 2023
@the-other-tim-brown the-other-tim-brown changed the title Add in support for a keyless workflow by building an ID based off of … Add in support for a keyless workflow Jan 11, 2023
@the-other-tim-brown the-other-tim-brown changed the title Add in support for a keyless workflow [HUDI-5532] Add in support for a keyless workflow Jan 11, 2023
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review January 11, 2023 14:52
@nsivabalan nsivabalan changed the title [HUDI-5532] Add in support for a keyless workflow [HUDI-5514] Add in support for a keyless workflow Jan 11, 2023
@nsivabalan nsivabalan force-pushed the keyless-keygenerator branch from ea4b269 to 70468d9 Compare January 11, 2023 15:03
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor

CI failure due to unrelated flaky test.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we should call this InternalKeyGenerator(or AutoKeyGenerator), rather than KeyLessKeyGenerator as it sounds contradicting.

@nsivabalan
Copy link
Contributor

will go ahead and land this for now. will get to consensus async.

@nsivabalan nsivabalan merged commit eacae1e into apache:master Jan 12, 2023
@kazdy
Copy link
Contributor

kazdy commented Jan 12, 2023

Hi @the-other-tim-brown
I'm interested in this functionality and have some questions, if I understand correctly the UUID will be the same for the same set of values in columns that it's based on?

So this generator can't be used for generating a surrogate key (a standard practice in data warehousing) as key is derived from data? My understanding of keyless model is that record key is a surrogate key that's globally unique.

I'm wondering if there's something that does not allow to create globally unique ids via the key generator interface (maybe virtual keys support)?
At the same time in context of this PR, what's the place of UuidKeyGenerator? Could it be used to generate surrogate keys that are globally unique?

@the-other-tim-brown
Copy link
Contributor Author

Hi @the-other-tim-brown I'm interested in this functionality and have some questions, if I understand correctly the UUID will be the same for the same set of values in columns that it's based on?

So this generator can't be used for generating a surrogate key (a standard practice in data warehousing) as key is derived from data? My understanding of keyless model is that record key is a surrogate key that's globally unique.

I'm wondering if there's something that does not allow to create globally unique ids via the key generator interface (maybe virtual keys support)? At the same time in context of this PR, what's the place of UuidKeyGenerator? Could it be used to generate surrogate keys that are globally unique?

Yes it is correct that the keys are not guaranteed to be unique here. The issue with using a random UUID for us was that we were using deltastreamer and if the dag ever retriggered we were seeing data generated with new random UUIDs which could cause the records to be written to different filegroups causing an issue with duplicate/lost data due to some internals of how Hudi works. @nsivabalan had some similar thoughts around other approaches, can you chime in here?

@the-other-tim-brown the-other-tim-brown deleted the keyless-keygenerator branch January 13, 2023 16:00
@nsivabalan
Copy link
Contributor

nsivabalan commented Jan 13, 2023

hey @kazdy : we also jammed quite a bit before arriving at this solution. For eg, we did take a stab at generating unique Ids for every record here, but the problem as stated by Tim might not work for 7622. for eg, if we zoom into what happens for a commit in hudi is,
keyGen -> index look up -> upsert partitioner -> write files by executor (merge handle or create handle or append handle) -> may be write to metadata table -> complete commit.

Main crux here is that, in Upsert partitioner, we assign records to diff insert buckets based on record key hash. lets say upsert partitioner determined to add 3 new insert buckets and split 30k records among 3 insert bucket (file groups). This assignment is done using hashing of record key.

Given this, if due to failures, if keyGen stage was retriggered for a subset of spark partitions again, and when it reaches the upsert partitioner, it could get assigned to a diff insert bucket compared to its 1st attempt and so there are chances we will miss some records or add pack more records to one file group that what we intended.

Let me know if this makes sense. happy to jam to see if we can really pull this off by a row Id sort of generating rather than based on record payload.

@kazdy
Copy link
Contributor

kazdy commented Jan 17, 2023

Thanks for the explanation, so it seems like key generator must be deterministic and there's no way around it.

What I do with hudi datasets where I need a surrogate key is that I just generate a column with UUID using built-in spark uuid() function. I think it's a valid way to do it :)
I guess using engine-specific uuid() function in keygen would not change anything.

continue;
}
nonNullFields++;
key.append(value.hashCode());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@the-other-tim-brown @the-other-tim-brown this is incorrect way of hash/key generation, we can't distinguish b/w cases of hash_1=12 and hash_1=1, hash_2=2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should append some sort of delimiter after each hashcode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

codope added a commit to codope/hudi that referenced this pull request Jan 25, 2023
… an ID based off of values within the record (apache#7640)"

This reverts commit eacae1e.
codope added a commit that referenced this pull request Jan 25, 2023
…udi (#7726)" (#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (#7640)"

This reverts commit eacae1e.
@kazdy
Copy link
Contributor

kazdy commented Jan 30, 2023

Let me know if this makes sense. happy to jam to see if we can really pull this off by a row Id sort of generating rather than based on record payload.

@nsivabalan I did some reading and found out that oracle and postgres both use pseudo/ system columns to imitate PK if not defined.
Do you think it would be possible to do something similar to oracle ROWID pseudo column or postgres ctid system column?

Rowids contain the following information:
The data block of the data file containing the row. The length of this string depends on your operating system.

  • The row in the data block.
  • The database file containing the row. The first data file has the number 1. The length of this string depends on your operating system.
  • The data object number, which is an identification number assigned to every database segment. You can retrieve the data object number from the data dictionary views USER_OBJECTS, DBA_OBJECTS, and ALL_OBJECTS. Objects that share the same segment (clustered tables in the same cluster, for example) have the same object number.

It seems like it would be doable with vectorized parquet reader rowId/ Column Vector etc. instead of "row in data block", the file name is known and saved in meta columns.
I'm not 100% convinced that it would be possible to retrieve this data from the parquet file.
I don't know how Hudi will handle the first write since there's no information about the column vector and the hash can not be generated. So this can be impossible to use in an upsert partitioner and the whole idea does make any sense :).
I also don't know how it would play out with clustering since files need to be rewritten and therefore ROWID/record key would change.

I already see some restrictions:

  • only support it in CoW (bc. parquet vectorized reader needs to be used?),
  • only available with Virtual Keys (?),
  • no incremental queries allowed (I think cdc from non-pk table is not supported in oracle rdbms),
  • no support for datasource write with upsert when ROWID/ recordkey is not provided (should not be a problem with spark sql since it first queries Hudi table and therefore it would be possible to get ROWID) (?)

But it would allow doing DS+SQL insert, SQL updates and SQL deletes without the need to define PK on the table.
Does it make any sense?

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…ased off of values within the record (apache#7640)

- Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…udi (apache#7726)" (apache#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)"

This reverts commit eacae1e.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…ased off of values within the record (apache#7640)

- Adds a new KeyGenerator that does not require the user to specify any fields to use for the record key and instead deterministically generates a UUID based off a subset of fields in the incoming record.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…udi (apache#7726)" (apache#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)"

This reverts commit eacae1e.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants