
Conversation

@wangxianghu (Contributor) commented Mar 1, 2022

What is the purpose of the pull request

Currently, we have Transformer to transform the source into the target dataset before writing, but it operates on a Dataset.
In some scenarios, our Kafka data is not in the format we need, for example binlog JSON.
We need a way to extract/prepare the data we need from the raw records before they are converted into a Dataset.

Brief change log

  • Introduce JsonKafkaSourcePostProcessor to support post processing of Kafka JSON records before they are converted into a Dataset

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@wangxianghu wangxianghu changed the title [HUDI-3525] Introduce JsonkafkaSourceProcessor to support data prepro… [HUDI-3525] Introduce JsonkafkaSourcePostProcessor to support data preprocess before it is transformed to DataSet Mar 1, 2022
@nsivabalan (Contributor)

Can you check the CI failure, please?

@nsivabalan (Contributor)

@pratyakshsharma : Can you assist in reviewing this patch.

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs priority:medium Moderate impact; usability gaps and removed priority:high Significant impact; potential bugs labels Mar 1, 2022
@pratyakshsharma (Contributor)

@pratyakshsharma : Can you assist in reviewing this patch.

Ack. Will review it today.

@nsivabalan nsivabalan self-assigned this Mar 2, 2022

@nsivabalan nsivabalan left a comment

Can we add a test case by adding a test JsonKafkaPostProcessor and ensuring it works?
Also add a test where you set an invalid class for the new config, and assert that an exception is thrown.

@wangxianghu (Contributor, Author)

Can we add a test case by adding a test JsonKafkaPostProcessor and ensuring it works? Also add a test where you set an invalid class for the new config, and assert that an exception is thrown.

done
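As a rough illustration of the negative test the reviewer asked for, here is a minimal, hedged sketch (this is not Hudi's actual code; the class name and exception type are hypothetical stand-ins for Hudi's reflection utilities and exception hierarchy):

```java
// Hypothetical sketch of resolving a post processor class from config.
// An invalid class name should surface a clear error rather than fail
// silently; a test can then assert that this exception is thrown.
class PostProcessorLoader {

  static Object createProcessor(String className) {
    try {
      return Class.forName(className).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // Hudi wraps this in its own exception type; IllegalArgumentException
      // stands in here to keep the sketch dependency-free.
      throw new IllegalArgumentException("Invalid post processor class: " + className, e);
    }
  }
}
```

A test would configure a non-existent class name and assert that the exception above is raised.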

/**
 * Base class for Json kafka source post processor. User can define their own processor that extends this class to do
 * some post process on the incoming json string records before the records are converted to DataSet<T>.
 */
public abstract class JsonKafkaSourcePostProcessor implements Serializable {
@wangxianghu (Contributor, Author):

Changed to an abstract class so as to provide a unified constructor.
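To illustrate the unified-constructor design, here is a minimal standalone sketch. The real Hudi base class is built around TypedProperties and Spark's JavaRDD<String>; plain java.util.Properties and List<String> stand in here so the sketch is self-contained, and the class names are hypothetical:

```java
import java.io.Serializable;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Sketch of the abstract base: every subclass inherits the same
// constructor shape, so processors can be instantiated uniformly
// from configuration.
abstract class SourcePostProcessorSketch implements Serializable {
  protected final Properties props;

  protected SourcePostProcessorSketch(Properties props) {
    this.props = props;
  }

  // Post process incoming JSON records before they become a Dataset.
  public abstract List<String> process(List<String> inputJsonRecords);
}

// Example subclass: trims whitespace and drops empty records.
class TrimmingPostProcessor extends SourcePostProcessorSketch {
  TrimmingPostProcessor(Properties props) {
    super(props);
  }

  @Override
  public List<String> process(List<String> inputJsonRecords) {
    return inputJsonRecords.stream()
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }
}
```

Because the constructor signature is fixed by the base class, the engine can load any configured subclass reflectively and pass the properties through without special-casing.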

@wangxianghu (Contributor, Author)

@hudi-bot run azure

1 similar comment

@hudi-bot (Collaborator) commented Mar 3, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@wangxianghu (Contributor, Author)

@nsivabalan please take another look when free :)

@pratyakshsharma (Contributor)

@wangxianghu The changes look good to me. I have a high level query. You have mentioned facing issues with binlog json format for example. Do you mean to say transformation is not possible with data in binlog json format? Data coming from binlogs also has a structure/schema assigned to it as far as I remember. Can you post a sample event where you feel this new PostProcessor you introduced might be useful?
Basically I want to understand the motivation behind introducing this PR.

@wangxianghu (Contributor, Author) commented Mar 6, 2022

@wangxianghu The changes look good to me. I have a high level query. You have mentioned facing issues with binlog json format for example. Do you mean to say transformation is not possible with data in binlog json format? Data coming from binlogs also has a structure/schema assigned to it as far as I remember. Can you post a sample event where you feel this new PostProcessor you introduced might be useful? Basically I want to understand the motivation behind introducing this PR.

It is possible to deal with data in binlog JSON format, but it is not very convenient.

  1. For Maxwell (our company uses it to capture changed data), a record looks like this:
{
    "database": "test",
    "table": "maxwell",
    "type": "update",
    "ts": 1449786341,
    "xid": 940786,
    "commit": true,
    "data": {"id":1, "daemon": "Firebus!  Firebus!","update_time" : "2022-02-03 12:22:42"},
    "old":  {"daemon": "Stanislaw Lem"}
  }

All we want is just:

{
    "id": 1, 
    "daemon": "Firebus!  Firebus!", 
    "update_time": "2022-02-03 12:22:42"
}

We can write a processor to extract the data we need from the entire JSON, and maybe do some custom processing, without configuring a huge schema file (covering all the fields in the binlog JSON, whether we need them or not).

  2. In some scenarios, we need to encode some fields for safety purposes; the processor can help us do that.

  3. Sometimes our data quality is not very good; a key field, say the precombine field, may have null values, and we can use the processor to fix it.

  4. When our schema is read from JDBC or Hive, we can use the processor to adjust our Kafka data to be compatible with it.

All in all, with a custom processor we can do anything we want to the incoming JSON data before it is converted into a Dataset.
Of course, Transformer is a very useful feature too, but it is based on a Spark Dataset and has certain requirements for data quality.
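To make the Maxwell case concrete, here is a minimal sketch of the extraction step (names are hypothetical; a real post processor would extend JsonKafkaSourcePostProcessor and parse the record with a JSON library such as Jackson, rather than the brace counting used here to keep the sketch dependency-free):

```java
// Hypothetical helper extracting the "data" object from a Maxwell-style
// binlog record. Brace counting avoids any JSON-library dependency but
// ignores braces inside string values, so a real implementation should
// parse the record properly (e.g. with Jackson's ObjectMapper).
class MaxwellDataExtractor {

  /** Returns the balanced {...} object following "data", or null if absent. */
  static String extractData(String record) {
    int key = record.indexOf("\"data\"");
    if (key < 0) {
      return null;
    }
    int start = record.indexOf('{', key);
    if (start < 0) {
      return null;
    }
    int depth = 0;
    for (int i = start; i < record.length(); i++) {
      char c = record.charAt(i);
      if (c == '{') {
        depth++;
      } else if (c == '}' && --depth == 0) {
        return record.substring(start, i + 1);
      }
    }
    return null; // unbalanced record
  }
}
```

Applied to the Maxwell sample above, this keeps only the inner payload, which is exactly the shape the downstream Dataset conversion expects.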

@pratyakshsharma (Contributor)

Right. With Debezium, this extraction of data is already supported. But with maxwell and any other similar service, this processor will be helpful.

@wangxianghu wangxianghu changed the title [HUDI-3525] Introduce JsonkafkaSourcePostProcessor to support data preprocess before it is transformed to DataSet [HUDI-3525] Introduce JsonkafkaSourcePostProcessor to support data post process before it is transformed to DataSet Mar 6, 2022
@wangxianghu (Contributor, Author)

Hi @nsivabalan, any other concerns?


@nsivabalan nsivabalan left a comment

LGTM

@nsivabalan nsivabalan merged commit c9ffdc4 into apache:master Mar 6, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022

Labels

priority:medium Moderate impact; usability gaps

4 participants