[HUDI-3525] Introduce JsonkafkaSourcePostProcessor to support data post process before it is transformed to DataSet #4930
Can you check the CI failure, please?
@pratyakshsharma: Can you assist in reviewing this patch?
Ack. Will review it today.
nsivabalan
left a comment
Can we add a test case by adding a test JsonKafkaPostProcessor and ensure it works?
Also add a test where you set an invalid class for the newly added config, and assert that an exception is thrown.
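The two requested tests could look roughly like the sketch below. It is not the PR's code: `SketchPostProcessor` is a simplified stand-in that works on `List<String>` rather than a Spark RDD, and `ProcessorLoader` and the config key `example.json.kafka.processor.class` are hypothetical names used only for illustration.

```java
import java.io.Serializable;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Simplified stand-in for the PR's base class (assumption: List<String> instead of an RDD).
abstract class SketchPostProcessor implements Serializable {
  protected final Properties props;

  SketchPostProcessor(Properties props) {
    this.props = props;
  }

  abstract List<String> process(List<String> jsonRecords);
}

// Trivial test processor: trims whitespace around each JSON record.
class TrimmingPostProcessor extends SketchPostProcessor {
  TrimmingPostProcessor(Properties props) {
    super(props);
  }

  @Override
  List<String> process(List<String> jsonRecords) {
    return jsonRecords.stream().map(String::trim).collect(Collectors.toList());
  }
}

class ProcessorLoader {
  // Hypothetical config key, used only in this sketch.
  static final String PROCESSOR_CLASS_KEY = "example.json.kafka.processor.class";

  // Instantiates the configured processor reflectively through its (Properties) constructor.
  // Any reflection failure, including an unknown class name, becomes an IllegalArgumentException.
  static SketchPostProcessor load(Properties props) {
    String className = props.getProperty(PROCESSOR_CLASS_KEY);
    try {
      return (SketchPostProcessor) Class.forName(className)
          .getDeclaredConstructor(Properties.class)
          .newInstance(props);
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Could not load post processor: " + className, e);
    }
  }
}
```

A positive test would configure `TrimmingPostProcessor` and check the processed records; a negative test would configure a class name that does not exist and assert that `load` throws.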
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java
…cess before it is transformed to DataSet
done
 * Base class for Json kafka source post processor. User can define their own processor that extends this class to do
 * some post process on the incoming json string records before the records are converted to DataSet<T>.
 */
public abstract class JsonKafkaSourcePostProcessor implements Serializable {
Changed to an abstract class in order to provide a unified constructor.
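To illustrate why a unified constructor helps, here is a minimal sketch under simplifying assumptions: the base class below takes `java.util.Properties` and processes a `List<String>`, which is not necessarily the signature used in the PR. `ConfiguredPostProcessor`, `ContainsFilterProcessor`, and the key `example.filter.contains` are hypothetical names.

```java
import java.io.Serializable;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Stand-in base class: the single constructor guarantees every subclass receives the config.
abstract class ConfiguredPostProcessor implements Serializable {
  protected final Properties props;

  protected ConfiguredPostProcessor(Properties props) {
    this.props = props;
  }

  abstract List<String> process(List<String> jsonRecords);
}

// Example subclass: keeps only the records that contain a configured substring.
class ContainsFilterProcessor extends ConfiguredPostProcessor {
  // Hypothetical config key, used only in this sketch.
  static final String FILTER_KEY = "example.filter.contains";

  ContainsFilterProcessor(Properties props) {
    super(props);
  }

  @Override
  List<String> process(List<String> jsonRecords) {
    // With no value configured, the empty string matches every record.
    String needle = props.getProperty(FILTER_KEY, "");
    return jsonRecords.stream()
        .filter(record -> record.contains(needle))
        .collect(Collectors.toList());
  }
}
```

Because every subclass exposes the same constructor shape, a caller can instantiate any user-supplied processor reflectively without knowing its concrete type.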
@hudi-bot run azure
@nsivabalan please take another look when free :)
@wangxianghu The changes look good to me. I have a high level query. You have mentioned facing issues with binlog json format for example. Do you mean to say transformation is not possible with data in binlog json format? Data coming from binlogs also has a structure/schema assigned to it as far as I remember. Can you post a sample event where you feel this new PostProcessor you introduced might be useful?
It is possible to deal with data in binlog json format, but it is not very convenient.
All we want is to be able to write a processor that extracts the data from the entire json and maybe applies some custom processing, without configuring a huge schema file (covering every field in the binlog json, whether we need it or not).
All in all, with a custom processor we can do anything we want to the incoming json data before it is converted into a DataSet.
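As a rough illustration of that extraction, the sketch below pulls the nested `data` object out of a Maxwell-style event such as `{"database":"test","table":"maxwell","type":"insert","data":{"id":1}}`. `MaxwellDataProcessor` is a hypothetical name, and the brace matching is a deliberately naive stand-in for a real JSON parser: it assumes compact JSON in which the first `"data":` key holds an object.

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

class MaxwellDataProcessor {

  // Returns the JSON object stored under the first "data" key, or null if none is found.
  // Naive sketch: a production processor would use a JSON library instead.
  static String extractData(String json) {
    int key = json.indexOf("\"data\":");
    if (key < 0) {
      return null;
    }
    int start = json.indexOf('{', key);
    if (start < 0) {
      return null;
    }
    int depth = 0;
    boolean inString = false;
    for (int i = start; i < json.length(); i++) {
      char c = json.charAt(i);
      if (inString) {
        if (c == '\\') {
          i++; // skip the escaped character
        } else if (c == '"') {
          inString = false;
        }
      } else if (c == '"') {
        inString = true;
      } else if (c == '{') {
        depth++;
      } else if (c == '}' && --depth == 0) {
        return json.substring(start, i + 1);
      }
    }
    return null; // unbalanced braces
  }

  // Keeps only the "data" payload of each event; events without one are dropped.
  static List<String> process(List<String> jsonRecords) {
    return jsonRecords.stream()
        .map(MaxwellDataProcessor::extractData)
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
  }
}
```

With a processor like this, the downstream schema only needs to describe the table's own columns rather than the full binlog envelope.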
Right. With Debezium, this extraction of data is already supported. But with Maxwell and any other similar service, this processor will be helpful.
Hi @nsivabalan, any other concerns?
nsivabalan
left a comment
LGTM
…cess before it is transformed to DataSet (apache#4930)
What is the purpose of the pull request
Currently, we have Transform to transform the source into the target dataset before writing, but it is based on DataSet. In some scenarios, our Kafka data is not in the format we need, such as binlog json format.
We need a way to extract/prepare the data we need from the original data before converting it into a DataSet.