Conversation

@nsivabalan
Contributor

What is the purpose of the pull request

Redo of #1565

Brief change log

  • Added HoodieKafkaAvroDecoder and AbstractHoodieKafkaAvroDeserializer to assist in deserializing kafka avro data.
  • Introduced a property for configuring AvroKafkaSource with or without schema-registry setup.
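
The description doesn't show the new property itself, so here is a hypothetical sketch of how such a toggle could drive the choice of Kafka value deserializer. The property key is an assumption for illustration, not the actual key introduced in this PR; the two class names are the Confluent deserializer and the decoder added here.

```java
import java.util.Properties;

// Hypothetical sketch of the schema-registry toggle described above.
// The property key below is an assumption, not the PR's actual config key.
public class AvroKafkaSourceConfigSketch {
  static final String USE_SCHEMA_REGISTRY = "hoodie.kafka.use.schema.registry"; // assumed key

  static String chooseDeserializer(Properties props) {
    boolean useRegistry = Boolean.parseBoolean(props.getProperty(USE_SCHEMA_REGISTRY, "true"));
    // With a registry: Confluent's KafkaAvroDeserializer.
    // Without one: the new HoodieKafkaAvroDecoder, which reads schemas from files.
    return useRegistry
        ? "io.confluent.kafka.serializers.KafkaAvroDeserializer"
        : "org.apache.hudi.utilities.serde.HoodieKafkaAvroDecoder";
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(USE_SCHEMA_REGISTRY, "false");
    System.out.println(chooseDeserializer(props));
    // -> org.apache.hudi.utilities.serde.HoodieKafkaAvroDecoder
  }
}
```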

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestHoodieKafkaAvroDecoder to verify the change.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io

codecov-io commented Dec 25, 2020

Codecov Report

Merging #2380 (0b111c8) into master (286055c) will decrease coverage by 0.18%.
The diff coverage is 1.58%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2380      +/-   ##
============================================
- Coverage     52.20%   52.01%   -0.19%     
- Complexity     2659     2660       +1     
============================================
  Files           335      338       +3     
  Lines         14981    15043      +62     
  Branches       1505     1509       +4     
============================================
+ Hits           7821     7825       +4     
- Misses         6535     6593      +58     
  Partials        625      625              
Flag                 Coverage Δ               Complexity Δ
hudicli              38.83% <ø> (ø)           0.00 <ø> (ø)
hudiclient           100.00% <ø> (ø)          0.00 <ø> (ø)
hudicommon           54.80% <ø> (+0.03%)      0.00 <ø> (ø)
hudihadoopmr         33.29% <ø> (ø)           0.00 <ø> (ø)
huditimelineservice  65.30% <ø> (ø)           0.00 <ø> (ø)
hudiutilities        67.45% <1.58%> (-2.21%)  0.00 <0.00> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...ies/serde/AbstractHoodieKafkaAvroDeserializer.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...e/hudi/utilities/serde/HoodieKafkaAvroDecoder.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...e/config/HoodieKafkaAvroDeserializationConfig.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...apache/hudi/utilities/sources/AvroKafkaSource.java 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...hudi/utilities/schema/FilebasedSchemaProvider.java 83.33% <100.00%> (+0.98%) 5.00 <0.00> (ø)
...ache/hudi/common/fs/inline/InMemoryFileSystem.java 89.65% <0.00%> (+10.34%) 16.00% <0.00%> (+1.00%)

@nsivabalan
Contributor Author

nsivabalan commented Dec 25, 2020

@afilipchik @vinothchandar : I have taken a stab at #1565. I did not have permission to update Pratyaksh's repo, hence created a new one.

Basically, AbstractHoodieKafkaAvroDeserializer initializes a SchemaProvider based on configs to fetch the source schema and target schema. In other words, I have combined #1562 and #1565.

If deserialize() is called w/ a reader schema, that one is used; if not, the one from the schema provider is used. In either case, the writer schema is fetched from the schema provider. In the previous patch from Pratyaksh, we were using the passed-in schema as both reader and writer schema, and hence schema evolution could run into issues.
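
A minimal sketch of that selection logic. Schemas are plain Strings here to keep the sketch self-contained (the real code would use org.apache.avro.Schema and the configured SchemaProvider), and the method names are illustrative rather than the PR's exact API.

```java
// Sketch of the reader/writer schema selection described above.
public class SchemaSelectionSketch {
  private final String providerSourceSchema; // acts as the writer schema
  private final String providerTargetSchema; // fallback reader schema

  public SchemaSelectionSketch(String sourceSchema, String targetSchema) {
    this.providerSourceSchema = sourceSchema;
    this.providerTargetSchema = targetSchema;
  }

  // If a reader schema is passed in, it is used; otherwise fall back to the
  // schema provider's target schema.
  public String resolveReaderSchema(String passedInReaderSchema) {
    return passedInReaderSchema != null ? passedInReaderSchema : providerTargetSchema;
  }

  // The writer schema always comes from the schema provider, never from the
  // passed-in reader schema, so evolved records still decode correctly.
  public String resolveWriterSchema() {
    return providerSourceSchema;
  }
}
```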

But AbstractHoodieKafkaAvroDeserializer is inspired by the Confluent schema-registry repo. I am not sure how to make this generic (as of now it assumes the schema id at the beginning, followed by the length and data). I haven't worked w/ schema registries nor w/ kafka/avro before, so I will have to research what other ways there are to deserialize kafka avro data. But as per Pratyaksh's comment, it looks like non-schema-registry flows are discouraged in general. So, I am not sure how much value we would add by supporting all the different ways to deserialize kafka avro data (i.e. beyond the Confluent way).

Let me know your thoughts. I am looking to get this into 0.7.0 (will be cutting a release in a week's time), so I would appreciate a response whenever you can.

@nsivabalan nsivabalan added priority:blocker Production down; release blocker area:schema Schema evolution and data types labels Dec 26, 2020
@vinothchandar vinothchandar self-assigned this Dec 26, 2020
@vinothchandar
Member

@afilipchik could you take a pass at this PR this week?

return deserialize(null, null, payload, readerSchema);
}

protected Object deserialize(String topic, Boolean isKey, byte[] payload, Schema readerSchema) {

The original AbstractKafkaAvroDeserializer has a check:

if (payload == null) {
      return null;
}

Is removal of this check safe?
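
For context, a sketch of what keeping that guard looks like: Kafka delivers a null value for tombstone records in compacted topics, so a deserializer without the check would throw on them. The method body here is illustrative, not the PR's actual implementation.

```java
// Minimal sketch of the null-payload guard the reviewer is asking about.
public class NullPayloadGuardSketch {
  static Object deserialize(byte[] payload) {
    if (payload == null) {
      return null; // Kafka tombstone record: there is no value to decode
    }
    // ... decode the Avro payload here (omitted in this sketch) ...
    return new Object();
  }

  public static void main(String[] args) {
    System.out.println(deserialize(null)); // prints "null" instead of throwing
  }
}
```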

props.put("value.deserializer", KafkaAvroDeserializer.class);
} else {
DataSourceUtils.checkRequiredProperties(props, Collections.singletonList(FilebasedSchemaProvider.Config.SOURCE_SCHEMA_FILE_PROP));
props.put("value.deserializer", HoodieKafkaAvroDecoder.class);

It would be nice to have the ability to configure which decoder class to use for value.deserializer, to be able to handle internal data decoding specifics.
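
A sketch of that suggestion: read the deserializer class name from a property instead of hard-coding it. The property key here is a hypothetical name for illustration only.

```java
import java.util.Properties;

// Hypothetical sketch of the configurable-deserializer suggestion above.
public class ConfigurableDeserializerSketch {
  static final String DESERIALIZER_CLASS_PROP = "hoodie.kafka.value.deserializer.class"; // assumed key
  static final String DEFAULT_DESERIALIZER =
      "org.apache.hudi.utilities.serde.HoodieKafkaAvroDecoder";

  // Fall back to the Hudi decoder when no override is configured.
  static String resolveDeserializer(Properties props) {
    return props.getProperty(DESERIALIZER_CLASS_PROP, DEFAULT_DESERIALIZER);
  }
}
```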

return deserialize(null, null, payload, readerSchema);
}

protected Object deserialize(String topic, Boolean isKey, byte[] payload, Schema readerSchema) {

Do you know where readerSchema comes from?


It should be coming from the configured schema provider.

protected Object deserialize(String topic, Boolean isKey, byte[] payload, Schema readerSchema) {
try {
ByteBuffer buffer = this.getByteBuffer(payload);
int id = buffer.getInt();

This assumes the message starts with the schema version (the code looks like it comes from the Confluent deserializer). It doesn't belong in AbstractHoodieKafkaAvroDeserializer.java.
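
For context, the Confluent wire format being referred to is: one magic byte (0x0), a 4-byte big-endian schema id, then the Avro-encoded bytes. A sketch of parsing that framing, which is why the reviewer considers it Confluent-specific rather than generic:

```java
import java.nio.ByteBuffer;

// Sketch of parsing the Confluent schema-registry wire format:
// [magic byte 0x0][4-byte big-endian schema id][avro payload bytes]
public class ConfluentWireFormatSketch {
  static final byte MAGIC_BYTE = 0x0;

  static int readSchemaId(byte[] payload) {
    ByteBuffer buffer = ByteBuffer.wrap(payload);
    if (buffer.get() != MAGIC_BYTE) {
      throw new IllegalArgumentException("Unknown magic byte!");
    }
    return buffer.getInt(); // schema id used to look up the writer schema
  }

  public static void main(String[] args) {
    // magic byte 0, schema id 42, then the avro data bytes
    byte[] msg = {0, 0, 0, 0, 42, 1, 2, 3};
    System.out.println(readSchemaId(msg)); // 42
  }
}
```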

@afilipchik
Contributor

On making AbstractHoodieKafkaAvroDeserializer abstract - it looks like a modified Confluent deserializer, so I believe it should be called like that. If we want to support the Confluent schema registry, we need to use the schema id to acquire the writer's schema; otherwise schema evolution will be a pain.

I.e. to deserialize an avro message we need 2 things: the schema it was written with (it can come from properties, but with Confluent the id at the beginning of the message tells you the exact version and can be used to fetch it from the schema registry) and the reader schema (the schema on the reader side, which comes from the schema provider).
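
This two-schema resolution is what Avro's GenericDatumReader performs when given both schemas: bytes are decoded with the writer schema and resolved against the reader schema, so a field added on the reader side is filled from its default. A small sketch (requires the org.apache.avro:avro library; the record schemas here are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Sketch of writer-schema + reader-schema deserialization with plain Avro.
public class TwoSchemaDeserializeSketch {
  static final String WRITER = "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"}]}";
  static final String READER = "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"note\",\"type\":\"string\",\"default\":\"n/a\"}]}";

  static String decodeWithReaderSchema() throws Exception {
    Schema writerSchema = new Schema.Parser().parse(WRITER);
    Schema readerSchema = new Schema.Parser().parse(READER);

    // Encode a record with the (older) writer schema.
    GenericRecord rec = new GenericData.Record(writerSchema);
    rec.put("id", 7);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, encoder);
    encoder.flush();

    // Decode with both schemas: Avro resolves the difference, filling the
    // reader-only field "note" from its schema default.
    Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord decoded =
        new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
    return decoded.get("id") + ":" + decoded.get("note");
  }

  public static void main(String[] args) throws Exception {
    System.out.println(decodeWithReaderSchema()); // 7:n/a
  }
}
```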

@vinothchandar vinothchandar removed the priority:blocker Production down; release blocker label Jan 10, 2021
@nsivabalan nsivabalan added status:in-progress Work in progress priority:high Significant impact; potential bugs labels Feb 11, 2021
@vinothchandar vinothchandar added priority:critical Production degraded; pipelines stalled and removed priority:high Significant impact; potential bugs labels Feb 11, 2021
@nsivabalan
Contributor Author

Closing this in favor of #2598

@nsivabalan nsivabalan closed this Mar 9, 2021

Labels

area:schema (Schema evolution and data types)
priority:critical (Production degraded; pipelines stalled)
status:in-progress (Work in progress)


6 participants