Skip to content

Conversation

@codope
Copy link
Member

@codope codope commented Jul 12, 2022

What is the purpose of the pull request

While #5771 works for immutable tables (bulk insert), this PR is to support upsert without a record key.

Brief change log

  • Generate commit sequence id and use it as a record key.
  • Use Simple keygen.
  • Config to guard the change.
  • Added a test.

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope codope added the priority:blocker Production down; release blocker label Jul 12, 2022
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan self-assigned this Jul 19, 2022

val hoodieAllIncomingRecords = genericRecords.map(gr => {
val processedRecord = getProcessedRecord(partitionColumns, gr, dropPartitionColumns)
val csn = HoodieRecord.generateSequenceId(instantTime, partitionId, recordIndex.getAndIncrement())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought plan was to re-use the commit seq no that we generate within the writer (i.e. by executors). or is my understanding wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets discuss around this tomorrow when we meet up.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some clarifying comments

@codope
Copy link
Member Author

codope commented Jul 20, 2022

Ensure that bulk_insert, insert, upsert follow the same logic.
Ensure that the commit sequence number is generated in the executor.

Should we also decouple the record key and partition path in keygen i.e. a separate keygen for the record key and partition path?

@codope codope added priority:critical Production degraded; pipelines stalled writer-core and removed priority:blocker Production down; release blocker labels Jul 20, 2022
@codope
Copy link
Member Author

codope commented Jul 20, 2022

@nsivabalan I have lowered the prioirty considering the impact of this change and some deisgn considerations as we discussed. From usability point of view, we can still take #5771 in the upcoming release, where we can support ingestion w/o record key just for bulk insert. Further things that need to be discussed in this PR:

  • Consider separate commit sequence number based keygen.
  • Consider decoupling key generator for record key and partition path.
  • It also interplays with RFC-46 and in long-term we may get rid of getting partition path from keygen and instead have logical paritioning.

@yihua yihua added priority:high Significant impact; potential bugs and removed priority:critical Production degraded; pipelines stalled labels Sep 14, 2022
@xushiyan
Copy link
Member

@codope should we close this as start new?

@codope
Copy link
Member Author

codope commented Nov 1, 2022

Yeah, let's close it. The implementation has some holes and requires a more rigorous design.

@codope codope closed this Nov 1, 2022
@codope codope changed the title [HUDI-4071][Stacked on 5771] Support upsert without record key [HUDI-4071][Stacked on 5771] Support insert without record key Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high Significant impact; potential bugs

Projects

Status: 👤 User Action
Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants