Support multiple Avro schema version in Pulsar SQL #4847

congbobo184 · 2019-07-30T02:43:02Z

Motivation

Pulsar SQL Avro schema support for multiple schema versions.

Verifying this change

Add the tests for it

Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (no)
The public API: (no)
The schema: (yes)
The default values of configurations: (no)
The wire protocol: (no)
The rest endpoints: (no)
The admin cli options: (no)
Anything that affects deployment: (no)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

congbobo184 · 2019-07-30T07:15:38Z

run Integration Tests

congbobo184 · 2019-07-30T07:15:49Z

run java8 tests

congbobo184 · 2019-07-30T09:17:03Z

run Integration Tests

congbobo184 · 2019-07-30T09:17:11Z

run java8 tests

sijie

@congbobo184 as we discussed in the previous pull request (the primitive one), we agreed on that we want to improve the schema implementation (to support ByteBuf and ByteBuffer) before changing the json and avro schema support in presto. Do you mind picking up that change first?

congbobo184 · 2019-07-31T01:59:41Z

@congbobo184 as we discussed in the previous pull request (the primitive one), we agreed on that we want to improve the schema implementation (to support ByteBuf and ByteBuffer) before changing the json and avro schema support in presto. Do you mind picking up that change first?

I don't mind picking it up :)

…schema_version # Conflicts: # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/JSONSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarPrimitiveSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java

…schema_version

codelipenghui · 2019-12-09T08:12:51Z

pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessageImpl.java

+
+    @Override
+    public byte[] getSchemaVersion() {
+        if (msgMetadataBuilder != null && msgMetadataBuilder.hasSchemaVersion()) {


You can get the schema version from msgMetadata directly.

codelipenghui · 2019-12-09T08:33:41Z

...ql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSqlSchemaInfoProvider.java

+    private static long bytes2Long(byte[] byteNum) {
+        long num = 0;
+        for (int ix = 0; ix < 8; ++ix) {
+            num <<= 8;
+            num |= (byteNum[ix] & 0xff);
+        }
+        return num;


It's better to implement this in LongSchemaVersion and move SchemaVersion to the pulsar-common module, so that other connectors can use it conveniently when working with multi-version schema support.

codelipenghui · 2019-12-09T08:35:19Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java

-        } finally {
-            ReferenceCountUtil.safeRelease(heapBuffer);
+    @Override
+    public Object deserialize(RawMessage rawMessage) {


You can just add a method deserialize(ByteBuf byteBuf, byte[] schemaVersion) instead of using RawMessage as an input param.

…schema_version

congbobo184 · 2019-12-10T02:19:17Z

run java8 tests

congbobo184 · 2019-12-10T03:25:14Z

run java8 tests

congbobo184 · 2019-12-10T03:26:09Z

run java8 tests

congbobo184 · 2019-12-10T03:26:17Z

run cpp tests

congbobo184 · 2019-12-10T06:09:08Z

run cpp tests

congbobo184 · 2019-12-10T12:31:50Z

run cpp tests

…schema_version # Conflicts: # pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/StructSchema.java # pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessage.java # pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessageImpl.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarRecordCursor.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/SchemaHandler.java

congbobo184 · 2020-03-10T07:38:39Z

/pulsarbot run-failure-checks

congbobo184 · 2020-03-10T15:57:03Z

/pulsarbot run-failure-checks

congbobo184 · 2020-03-11T01:58:37Z

/pulsarbot run-failure-checks

codelipenghui · 2020-03-15T10:00:33Z

pulsar-common/src/main/java/org/apache/pulsar/common/schema/LongSchemaVersion.java

@@ -66,4 +69,13 @@ public String toString() {
            .add("version", version)
            .toString();
    }
+
+    public static long bytes2Long(byte[] byteNum) {


You can use ByteBuffer.wrap(byte[]).getLong() directly.
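
To illustrate the suggestion above, here is a minimal sketch comparing the hand-written shift loop with ByteBuffer. The class name SchemaVersionBytes is hypothetical and not part of the PR. The two agree because ByteBuffer reads in big-endian order by default, which is the same order the loop assumes.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class, for illustration only.
final class SchemaVersionBytes {
    // Manual big-endian decode, as in the helper under review.
    static long bytes2Long(byte[] byteNum) {
        long num = 0;
        for (int ix = 0; ix < 8; ++ix) {
            num <<= 8;
            num |= (byteNum[ix] & 0xff);
        }
        return num;
    }

    // Same result via ByteBuffer, whose default byte order is big-endian.
    static long viaByteBuffer(byte[] byteNum) {
        return ByteBuffer.wrap(byteNum).getLong();
    }

    public static void main(String[] args) {
        byte[] version = ByteBuffer.allocate(Long.BYTES).putLong(42L).array();
        System.out.println(bytes2Long(version) + " " + viaByteBuffer(version)); // prints "42 42"
    }
}
```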

codelipenghui · 2020-03-15T10:01:29Z

pulsar-common/src/main/java/org/apache/pulsar/common/schema/package-info.java

+ * under the License.
+ */
+/**
+ * Implementation of Simple Authentication and Security Layer.


The Javadoc does not match this package.

codelipenghui · 2020-03-15T10:21:14Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java

-    public Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) {
-        return null;
+    public Object deserialize(ByteBuf payload, byte[] schemaVersion) {
+        return genericAvroSchema.decode(payload, schemaVersion);
    }

    @Override
    public Object extractField(int index, Object currentRecord) {


I don't quite understand the reason for the change in this method. Supporting decoding of multiple schema versions should not affect extracting a field from the GenericRecord, right?

I only add a deserialize method, and add a default interface method for the keyPayload deserialization.

I'm talking about the extractField(int index, Object currentRecord) method. I noticed that the new change reads by field name, whereas we read by position index before. Is the previous approach not enough to support multiple schema versions?

codelipenghui · 2020-03-15T10:26:06Z

...ql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSqlSchemaInfoProvider.java

+            LOG.error("Can't get generic schema for topic {} schema version {}",
+                    topicName.toString(), new String(schemaVersion, StandardCharsets.UTF_8), e);
+            throw new RuntimeException("Can't get generic schema for topic " + topicName.toString());


You should complete the future exceptionally instead of throwing.

I will fix this, and the same issue below.
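
A minimal sketch of what completing the future exceptionally could look like. SchemaLookupSketch, fetchSchema and getSchemaByVersion are hypothetical stand-ins, not the PR's actual code; the only real API used is CompletableFuture.completeExceptionally.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical sketch; names are assumptions, not taken from the PR.
final class SchemaLookupSketch {
    // Stand-in for an admin call that may fail.
    static String fetchSchema(String topic, long version) throws Exception {
        if (version < 0) {
            throw new Exception("no schema for version " + version);
        }
        return topic + "@v" + version;
    }

    static CompletableFuture<String> getSchemaByVersion(String topic, long version) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            future.complete(fetchSchema(topic, version));
        } catch (Exception e) {
            // Report the failure through the future instead of throwing from this method.
            future.completeExceptionally(e);
        }
        return future;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getSchemaByVersion("my-topic", 1).get());
        try {
            getSchemaByVersion("my-topic", -1).get();
        } catch (ExecutionException e) {
            System.out.println("failed: " + e.getCause().getMessage());
        }
    }
}
```

With this shape the caller sees every failure in one place (the future), whether the lookup fails synchronously or asynchronously.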

codelipenghui · 2020-03-15T10:26:29Z

...ql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSqlSchemaInfoProvider.java

+        } catch (PulsarAdminException e) {
+            LOG.error("Can't get current schema for topic {}",
+                    topicName.toString(), e);
+            throw new RuntimeException("Can't get current schema for topic " + topicName.toString());


Same as above

codelipenghui · 2020-03-15T10:27:44Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/SchemaHandler.java

+    default Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) {
+        return deserialize(dataPayload);
+    }


Why skip key payload deserialization?

Only KeyValueSchemaHandler can deserialize the keyPayload, so a handler that does not override this method cannot deserialize the keyPayload.

Thanks, I understand what you mean. Can you add some comments for these three methods? It will be easier to read.

Ok, I will add the comment.

codelipenghui · 2020-03-15T13:54:17Z

/cc @gaoran10 Please help take a look at this PR.

congbobo184 · 2020-03-19T09:09:39Z

/pulsarbot run-failure-checks

congbobo184 · 2020-03-19T10:17:31Z

/pulsarbot run-failure-checks

…schema_version

sijie · 2020-03-26T06:58:44Z

@congbobo184 can you rebase to latest master?

codelipenghui

Overall looks good, @congbobo184 I left some minor comments, please take a look.

codelipenghui · 2020-03-29T01:20:59Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java

+        this.schemaInfo = schemaInfo;
+        this.genericAvroSchema = new GenericAvroSchema(schemaInfo);
+        this.genericAvroSchema
+                .setSchemaInfoProvider(
+                        new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()));
+        this.columnHandles = columnHandles;


Suggested change

-        this.schemaInfo = schemaInfo;
-        this.genericAvroSchema = new GenericAvroSchema(schemaInfo);
-        this.genericAvroSchema
-                .setSchemaInfoProvider(
-                        new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()));
-        this.columnHandles = columnHandles;
+        this(new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()), schemaInfo, columnHandles);

codelipenghui · 2020-03-29T01:40:38Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/KeyValueSchemaHandler.java

+        return new KeyValue<>(keyObj, valueObj);
+    }
+
+    private KeyValue<ByteBuf, ByteBuf> deserializeCommon(ByteBuf keyPayload, ByteBuf dataPayload) {


I think this method does not do any deserialization work, so would it be better to rename it to getKeyValueByteBuf? Also, to reduce KeyValue object creation, I think you can implement the key-value schema deserialization in deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion) and check whether schemaVersion is null there. Then deserialize(ByteBuf keyPayload, ByteBuf dataPayload) can simply call deserialize(keyPayload, dataPayload, null), and we don't need to create a KeyValue object for every message.

With multiple schema versions we only use the latest schema, and adding or removing a field shifts the positional index, so we need to look fields up by name.

I agree with renaming it to getKeyValueByteBuf. I don't think we need to check whether the schema version is null: if it is null, we use the latest version's reader to decode, and all we care about is whether the field names match.
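
The point about field positions shifting between schema versions can be illustrated with a small sketch. FieldLookupSketch and its Row class are hypothetical and use plain lists rather than Avro's GenericRecord, so this shows only the idea, not the connector's code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the PR's extractField implementation.
final class FieldLookupSketch {
    // A decoded row: writer-schema field names plus values in the same order.
    static final class Row {
        final List<String> fieldNames;
        final List<Object> values;

        Row(List<String> fieldNames, List<Object> values) {
            this.fieldNames = fieldNames;
            this.values = values;
        }

        // Positional lookup: depends on the writer schema's field order.
        Object byIndex(int index) {
            return index < values.size() ? values.get(index) : null;
        }

        // Name lookup: independent of field order; null if the field is absent.
        Object byName(String name) {
            int index = fieldNames.indexOf(name);
            return index >= 0 ? values.get(index) : null;
        }
    }

    public static void main(String[] args) {
        // Version 1 wrote (id, name); version 2 inserted "age" before "name".
        Row v1 = new Row(Arrays.asList("id", "name"), Arrays.<Object>asList(1, "a"));
        Row v2 = new Row(Arrays.asList("id", "age", "name"), Arrays.<Object>asList(2, 30, "b"));

        System.out.println(v1.byIndex(1));     // a
        System.out.println(v2.byIndex(1));     // 30, which is "age" rather than "name"
        System.out.println(v2.byName("name")); // b
    }
}
```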

codelipenghui · 2020-03-29T01:46:40Z

pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarRecordCursor.java

+            if (currentMessage.getSchemaVersion() != null) {
+                currentRecord = this.schemaHandler.deserialize(keyByteBuf,
+                        this.currentMessage.getData(), this.currentMessage.getSchemaVersion());
+            } else {
+                currentRecord = this.schemaHandler.deserialize(keyByteBuf, this.currentMessage.getData());
+            }
+        } else if (currentMessage.getSchemaVersion() != null) {
+            currentRecord = this.schemaHandler.deserialize(this.currentMessage.getData(),
+                    this.currentMessage.getSchemaVersion());


Related to the above comment. If we can use null for schemaVersion in method deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion), we don't need to add if-else here right?

Yes, you are right, we don't need to add if-else here.
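
A sketch of how accepting a null schemaVersion removes the branching at the call site. SketchSchemaHandler and SketchStringHandler are hypothetical, and byte[] stands in for ByteBuf so the example needs no Netty dependency; the real SchemaHandler interface in the PR may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface, loosely modelled on the overloads discussed in this thread.
interface SketchSchemaHandler {
    // Decode a value payload; a null schemaVersion means "use the latest schema".
    Object deserialize(byte[] dataPayload, byte[] schemaVersion);

    default Object deserialize(byte[] dataPayload) {
        return deserialize(dataPayload, null);
    }

    // Handlers without a key schema ignore the key payload;
    // a key-value handler would override this method.
    default Object deserialize(byte[] keyPayload, byte[] dataPayload, byte[] schemaVersion) {
        return deserialize(dataPayload, schemaVersion);
    }
}

// Toy handler that tags the decoded string with the version used.
final class SketchStringHandler implements SketchSchemaHandler {
    @Override
    public Object deserialize(byte[] dataPayload, byte[] schemaVersion) {
        String tag = schemaVersion == null ? "latest" : "v" + schemaVersion[0];
        return tag + ":" + new String(dataPayload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SketchSchemaHandler handler = new SketchStringHandler();
        byte[] data = "x".getBytes(StandardCharsets.UTF_8);
        // One call covers both cases: the version may or may not be present.
        System.out.println(handler.deserialize(null, data, null));            // latest:x
        System.out.println(handler.deserialize(null, data, new byte[] {2}));  // v2:x
    }
}
```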

codelipenghui · 2020-03-29T01:50:09Z

...ql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSqlSchemaInfoProvider.java

+
+
+/**
+ * Multi version generic schema provider by guava cache.


Suggested change

- * Multi version generic schema provider by guava cache.
+ * Multi version schema info provider for Pulsar SQL leverage guava cache.
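
A sketch of a per-version schema cache. It deliberately swaps the Guava cache mentioned in the Javadoc for a plain ConcurrentHashMap so that it runs without dependencies, which means it has no eviction or expiry. VersionedSchemaCacheSketch and its loader are hypothetical. It also shows why a raw byte[] is a poor cache key: arrays use identity equals and hashCode, so the version bytes are converted to a long first.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

// Hypothetical sketch; the PR itself uses a Guava cache.
final class VersionedSchemaCacheSketch {
    private final ConcurrentHashMap<Long, String> cache = new ConcurrentHashMap<>();
    private final LongFunction<String> loader;

    VersionedSchemaCacheSketch(LongFunction<String> loader) {
        this.loader = loader;
    }

    String get(byte[] schemaVersion) {
        // Two arrays with equal contents are different map keys,
        // so key the cache by the decoded long value instead.
        long key = ByteBuffer.wrap(schemaVersion).getLong();
        return cache.computeIfAbsent(key, k -> loader.apply(k));
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        VersionedSchemaCacheSketch schemas = new VersionedSchemaCacheSketch(v -> {
            loads.incrementAndGet();
            return "schema-" + v;
        });
        byte[] a = ByteBuffer.allocate(Long.BYTES).putLong(3L).array();
        byte[] b = a.clone(); // same contents, different array instance
        System.out.println(schemas.get(a) + " " + schemas.get(b) + " loads=" + loads.get());
    }
}
```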

…schema_version

codelipenghui · 2020-04-02T13:45:25Z

@gaoran10 Please also help take a look at this PR, since you have done some work related to Pulsar SQL schema.

codelipenghui · 2020-05-07T15:47:46Z

ping @gaoran10 @sijie Please help review this PR.

Support multiple Avro schema version in Pulsar SQL

Pulsar sql support schema version

0e0e9a0

sijie reviewed Jul 30, 2019

congbobo184 closed this Jul 31, 2019

congbobo184 deleted the pulsar_sql_support_schema_version branch July 31, 2019 01:55

congbobo184 restored the pulsar_sql_support_schema_version branch July 31, 2019 01:56

congbobo184 reopened this Jul 31, 2019

congbobo added 8 commits September 5, 2019 18:46

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

277ecfb

…schema_version

Add the decode byteBuf and modify test

b87e142

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

244d22f

…schema_version

modify the code style

1771681

Modify the code check style

137e1be

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

235048a

…schema_version

Fix schema version provider key is byte[]

123f5d8

codelipenghui reviewed Dec 9, 2019

congbobo added 3 commits December 9, 2019 17:18

Fix some comments

e06f6be

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

1779434

…schema_version

Modify the codeStyle

f8c11f0

Modify the check style

78e86b5

sijie added this to the 2.6.0 milestone Feb 16, 2020

Key value add multiVersionSchema

57132c4

codelipenghui reviewed Mar 15, 2020

Fix some comment

51239bd

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

bfce092

…schema_version

no message

59e2c40

codelipenghui reviewed Mar 29, 2020

congbobo added 3 commits March 30, 2020 00:37

Merge remote-tracking branch 'apache/master' into pulsar_sql_support_…

1cac019

…schema_version

Fix some comments

13ecb9a

Fix some comments

c9bd00f

codelipenghui approved these changes Mar 30, 2020

no message

47aaccd

codelipenghui requested review from sijie, codelipenghui and jiazhai April 2, 2020 00:15

codelipenghui merged commit 097108a into apache:master May 19, 2020

codelipenghui changed the title ~~Pulsar sql avro support schema version~~ Support multiple Avro schema version in Pulsar SQL May 19, 2020

sijie mentioned this pull request May 20, 2020

[discussion] Pulsar release 2.6.0 #5819

Closed

Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this pull request May 27, 2020

Support multiple Avro schema version in Pulsar SQL (apache#4847)

7018e4c

Support multiple Avro schema version in Pulsar SQL

huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020

Support multiple Avro schema version in Pulsar SQL (apache#4847)

b2d65ad

Support multiple Avro schema version in Pulsar SQL
