-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple Avro schema version in Pulsar SQL #4847
Support multiple Avro schema version in Pulsar SQL #4847
Conversation
run Integration Tests |
run java8 tests |
run Integration Tests |
run java8 tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@congbobo184 as we discussed in the previous pull request (the primitive one), we agreed on that we want to improve the schema implementation (to support ByteBuf and ByteBuffer) before changing the json and avro schema support in presto. Do you mind picking up that change first?
I don't mind picking up it :) |
…schema_version # Conflicts: # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/JSONSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarPrimitiveSchemaHandler.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java
|
||
@Override | ||
public byte[] getSchemaVersion() { | ||
if (msgMetadataBuilder != null && msgMetadataBuilder.hasSchemaVersion()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get schema version by msgMetadata directly.
private static long bytes2Long(byte[] byteNum) { | ||
long num = 0; | ||
for (int ix = 0; ix < 8; ++ix) { | ||
num <<= 8; | ||
num |= (byteNum[ix] & 0xff); | ||
} | ||
return num; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to implement in LongSchemaVersion and move the SchemaVersions to pulsar-common module, so that can be used conveniently by other connectors while work with multi-version schema supporting
} finally { | ||
ReferenceCountUtil.safeRelease(heapBuffer); | ||
@Override | ||
public Object deserialize(RawMessage rawMessage) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just add a method deserialize(ByteBuf byteBuf, byte[] schemaVersion) to instead use RawMessage as an input param.
run java8 tests |
run java8 tests |
1 similar comment
run java8 tests |
run cpp tests |
2 similar comments
run cpp tests |
run cpp tests |
…schema_version # Conflicts: # pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/StructSchema.java # pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessage.java # pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessageImpl.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarRecordCursor.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java # pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/SchemaHandler.java
/pulsarbot run-failure-checks |
/pulsarbot run-failure-checks |
1 similar comment
/pulsarbot run-failure-checks |
@@ -66,4 +69,13 @@ public String toString() { | |||
.add("version", version) | |||
.toString(); | |||
} | |||
|
|||
public static long bytes2Long(byte[] byteNum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use ByteBuffer.wrap(byte[]).getLong()
directly.
* under the License. | ||
*/ | ||
/** | ||
* Implementation of Simple Authentication and Security Layer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The java doc does not match the package info.
public Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) { | ||
return null; | ||
public Object deserialize(ByteBuf payload, byte[] schemaVersion) { | ||
return genericAvroSchema.decode(payload, schemaVersion); | ||
} | ||
|
||
@Override | ||
public Object extractField(int index, Object currentRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the reason for the change in this method. I think to support multiple schema version decode does not affect extract field from the GenericRecord right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only add a deserialize method, and add a default interface for the keyPayload deserialize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm talking about extractField(int index, Object currentRecord)
method. I noticed the new change is read by field names and we use read by position index before, is the previous method not enough to support multiple schema versions?
LOG.error("Can't get generic schema for topic {} schema version {}", | ||
topicName.toString(), new String(schemaVersion, StandardCharsets.UTF_8), e); | ||
throw new RuntimeException("Can't get generic schema for topic " + topicName.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should complete the future with exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this same as below.
} catch (PulsarAdminException e) { | ||
LOG.error("Can't get current schema for topic {}", | ||
topicName.toString(), e); | ||
throw new RuntimeException("Can't get current schema for topic " + topicName.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
default Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) { | ||
return deserialize(dataPayload); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why skip key payload deserialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only keyValueSchemaHandle can deserialize the keyPayload, so if you don't implement this method and then it can't deserialize the keyPayload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I understand what you mean. Can you add some comments for these three methods? It will be easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will add the comment.
/cc @gaoran10 Please help take a look this PR. |
/pulsarbot run-failure-checks |
1 similar comment
/pulsarbot run-failure-checks |
@congbobo184 can you rebase to latest master? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, @congbobo184 I left some minor comments, please take a look.
this.schemaInfo = schemaInfo; | ||
this.genericAvroSchema = new GenericAvroSchema(schemaInfo); | ||
this.genericAvroSchema | ||
.setSchemaInfoProvider( | ||
new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin())); | ||
this.columnHandles = columnHandles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.schemaInfo = schemaInfo; | |
this.genericAvroSchema = new GenericAvroSchema(schemaInfo); | |
this.genericAvroSchema | |
.setSchemaInfoProvider( | |
new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin())); | |
this.columnHandles = columnHandles; | |
this(new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()), schemaInfo, columnHandles); |
public Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) { | ||
return null; | ||
public Object deserialize(ByteBuf payload, byte[] schemaVersion) { | ||
return genericAvroSchema.decode(payload, schemaVersion); | ||
} | ||
|
||
@Override | ||
public Object extractField(int index, Object currentRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm talking about extractField(int index, Object currentRecord)
method. I noticed the new change is read by field names and we use read by position index before, is the previous method not enough to support multiple schema versions?
return new KeyValue<>(keyObj, valueObj); | ||
} | ||
|
||
private KeyValue<ByteBuf, ByteBuf> deserializeCommon(ByteBuf keyPayload, ByteBuf dataPayload) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method did not do any work related to deserialization. So is it better rename it to getKeyValueByteBuf
? And for reducing KeyValue object creation, I think you can implement the key-value schema deserialization in deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion)
, and check the schemaVersion
is or null, so that the deserialize(ByteBuf keyPayload, ByteBuf dataPayload)
and straightforward call deserialize(keyPayload, dataPayload, null)
, so that we don't need to create KeyValue object for every message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multi schema versions only use the last schema, if reduce or add increase field will upset the index, so we only need to use field name to find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you to rename it to getKeyValueByteBuf. I don't think we need to check the schema version is or null, if it is null we will use the last version reader to decode it, we only care about is field name does is match.
if (currentMessage.getSchemaVersion() != null) { | ||
currentRecord = this.schemaHandler.deserialize(keyByteBuf, | ||
this.currentMessage.getData(), this.currentMessage.getSchemaVersion()); | ||
} else { | ||
currentRecord = this.schemaHandler.deserialize(keyByteBuf, this.currentMessage.getData()); | ||
} | ||
} else if (currentMessage.getSchemaVersion() != null) { | ||
currentRecord = this.schemaHandler.deserialize(this.currentMessage.getData(), | ||
this.currentMessage.getSchemaVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the above comment. If we can use null for schemaVersion
in method deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion)
, we don't need to add if-else here right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right, we don't nee to add if-else here.
|
||
|
||
/** | ||
* Multi version generic schema provider by guava cache. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Multi version generic schema provider by guava cache. | |
* Multi version schema info provider for Pulsar SQL leverage guava cache. |
default Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) { | ||
return deserialize(dataPayload); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I understand what you mean. Can you add some comments for these three methods? It will be easier to read.
@gaoran10 Please also help take a look at this PR since you have done some work that related Pulsar SQL schema. |
Support multiple Avro schema version in Pulsar SQL
Support multiple Avro schema version in Pulsar SQL
Motivation
pulsa sql avro schema support schema version
Verifying this change
Add the tests for it
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
Dependencies (does it add or upgrade a dependency): (no)
The public API: (no)
The schema: (yes)
The default values of configurations: (no)
The wire protocol: (no)
The rest endpoints: (no)
The admin cli options: (no)
Anything that affects deployment: (no)
Documentation
Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation