Support multiple Avro schema version in Pulsar SQL #4847

Merged
merged 23 commits into from
May 19, 2020

Conversation

congbobo184
Contributor

Motivation

Pulsar SQL's Avro schema handling should support multiple schema versions.

Verifying this change

Add the tests for it

Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (no)
The public API: (no)
The schema: (yes)
The default values of configurations: (no)
The wire protocol: (no)
The rest endpoints: (no)
The admin cli options: (no)
Anything that affects deployment: (no)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@congbobo184
Contributor Author

run Integration Tests

@congbobo184
Contributor Author

run java8 tests

@congbobo184
Contributor Author

run Integration Tests

@congbobo184
Contributor Author

run java8 tests

Member

@sijie sijie left a comment

@congbobo184 as we discussed in the previous pull request (the primitive one), we agreed that we want to improve the schema implementation (to support ByteBuf and ByteBuffer) before changing the JSON and Avro schema support in Presto. Do you mind picking up that change first?

@congbobo184 congbobo184 deleted the pulsar_sql_support_schema_version branch July 31, 2019 01:55
@congbobo184 congbobo184 restored the pulsar_sql_support_schema_version branch July 31, 2019 01:56
@congbobo184 congbobo184 reopened this Jul 31, 2019
@congbobo184
Contributor Author

@congbobo184 as we discussed in the previous pull request (the primitive one), we agreed that we want to improve the schema implementation (to support ByteBuf and ByteBuffer) before changing the JSON and Avro schema support in Presto. Do you mind picking up that change first?

I don't mind picking it up :)

…schema_version

# Conflicts:
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/AvroSchemaHandler.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/JSONSchemaHandler.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarPrimitiveSchemaHandler.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java

@Override
public byte[] getSchemaVersion() {
if (msgMetadataBuilder != null && msgMetadataBuilder.hasSchemaVersion()) {
Contributor

You can get the schema version from msgMetadata directly.

Comment on lines 102 to 108
private static long bytes2Long(byte[] byteNum) {
    long num = 0;
    for (int ix = 0; ix < 8; ++ix) {
        num <<= 8;
        num |= (byteNum[ix] & 0xff);
    }
    return num;
Contributor

It would be better to implement this in LongSchemaVersion and move SchemaVersions to the pulsar-common module, so that other connectors can use it conveniently when working with multi-version schema support.

} finally {
ReferenceCountUtil.safeRelease(heapBuffer);
@Override
public Object deserialize(RawMessage rawMessage) {
Contributor

You can just add a method deserialize(ByteBuf byteBuf, byte[] schemaVersion) instead of using RawMessage as an input param.

@congbobo184
Contributor Author

run java8 tests

@congbobo184
Contributor Author

run java8 tests

@congbobo184
Contributor Author

run java8 tests

@congbobo184
Contributor Author

run cpp tests

@congbobo184
Contributor Author

run cpp tests

@congbobo184
Contributor Author

run cpp tests

@sijie sijie added this to the 2.6.0 milestone Feb 16, 2020
…schema_version

# Conflicts:
#	pulsar-client/src/main/java/org/apache/pulsar/client/impl/schema/StructSchema.java
#	pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessage.java
#	pulsar-common/src/main/java/org/apache/pulsar/common/api/raw/RawMessageImpl.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarRecordCursor.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/PulsarSchemaHandlers.java
#	pulsar-sql/presto-pulsar/src/main/java/org/apache/pulsar/sql/presto/SchemaHandler.java
@congbobo184
Contributor Author

/pulsarbot run-failure-checks

@congbobo184
Contributor Author

/pulsarbot run-failure-checks

@congbobo184
Contributor Author

/pulsarbot run-failure-checks

@@ -66,4 +69,13 @@ public String toString() {
.add("version", version)
.toString();
}

public static long bytes2Long(byte[] byteNum) {
Contributor

You can use ByteBuffer.wrap(byte[]).getLong() directly.
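
To illustrate the suggestion above, here is a minimal sketch comparing the hand-written loop with `ByteBuffer.wrap(byte[]).getLong()`. The class name `SchemaVersionBytes` and the method name `viaByteBuffer` are made up for this example; the point is only that `ByteBuffer` reads big-endian by default, which matches what the loop does.

```java
import java.nio.ByteBuffer;

final class SchemaVersionBytes {
    // Manual big-endian decoding, mirroring the bytes2Long helper under review.
    static long bytes2Long(byte[] byteNum) {
        long num = 0;
        for (int ix = 0; ix < 8; ++ix) {
            num <<= 8;
            num |= (byteNum[ix] & 0xff);
        }
        return num;
    }

    // ByteBuffer uses big-endian byte order by default, so this reads the same value.
    static long viaByteBuffer(byte[] byteNum) {
        return ByteBuffer.wrap(byteNum).getLong();
    }

    public static void main(String[] args) {
        byte[] version = ByteBuffer.allocate(Long.BYTES).putLong(42L).array();
        // Both approaches should agree on the decoded version number.
        System.out.println(bytes2Long(version) == viaByteBuffer(version));
    }
}
```

One difference worth noting: with fewer than 8 bytes, the loop fails with `ArrayIndexOutOfBoundsException`, while `getLong()` throws `BufferUnderflowException`.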

* under the License.
*/
/**
* Implementation of Simple Authentication and Security Layer.
Contributor

The Javadoc does not match the package info.

public Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) {
return null;
public Object deserialize(ByteBuf payload, byte[] schemaVersion) {
return genericAvroSchema.decode(payload, schemaVersion);
}

@Override
public Object extractField(int index, Object currentRecord) {
Contributor

I don't quite understand the reason for the change in this method. Supporting decoding of multiple schema versions should not affect extracting a field from the GenericRecord, right?

Contributor Author

I only added a deserialize method, plus a default interface method for deserializing the keyPayload.

Contributor

I'm talking about the extractField(int index, Object currentRecord) method. I noticed that the new code reads by field name, whereas we read by position index before. Is the previous approach not enough to support multiple schema versions?

Comment on lines 75 to 77
LOG.error("Can't get generic schema for topic {} schema version {}",
topicName.toString(), new String(schemaVersion, StandardCharsets.UTF_8), e);
throw new RuntimeException("Can't get generic schema for topic " + topicName.toString());
Contributor

You should complete the future with the exception instead.

Contributor Author

I will fix this, and the same issue below.
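
As a rough sketch of what the reviewer is asking for, the snippet below reports a failed lookup through the returned `CompletableFuture` rather than throwing a `RuntimeException` from the method body. `SchemaLookupSketch`, `fetchSchema`, and `getSchema` are hypothetical names for illustration and are not the PR's actual code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class SchemaLookupSketch {
    // Hypothetical stand-in for the admin call that may fail.
    static String fetchSchema(String topic) throws Exception {
        if (topic.isEmpty()) {
            throw new IllegalArgumentException("empty topic");
        }
        return "schema-of-" + topic;
    }

    // Failures are delivered through the future, so callers handle them in one place.
    static CompletableFuture<String> getSchema(String topic) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            future.complete(fetchSchema(topic));
        } catch (Exception e) {
            future.completeExceptionally(e);
        }
        return future;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            getSchema("").get();
        } catch (ExecutionException e) {
            // The original exception is available as the cause.
            System.out.println("lookup failed: " + e.getCause().getMessage());
        }
    }
}
```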

} catch (PulsarAdminException e) {
LOG.error("Can't get current schema for topic {}",
topicName.toString(), e);
throw new RuntimeException("Can't get current schema for topic " + topicName.toString());
Contributor

Same as above

Comment on lines 34 to 36
default Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) {
    return deserialize(dataPayload);
}
Contributor

Why skip key payload deserialization?

Contributor Author

Only keyValueSchemaHandle can deserialize the keyPayload, so a handler that doesn't implement this method can't deserialize the keyPayload.

Contributor

Thanks, I understand what you mean. Can you add some comments for these three methods? It will be easier to read.

Contributor Author

Ok, I will add the comment.

@codelipenghui
Contributor

/cc @gaoran10 Please help take a look at this PR.

@congbobo184
Contributor Author

/pulsarbot run-failure-checks

@congbobo184
Contributor Author

/pulsarbot run-failure-checks

@sijie
Member

sijie commented Mar 26, 2020

@congbobo184 can you rebase to latest master?

Contributor

@codelipenghui codelipenghui left a comment

Overall looks good, @congbobo184 I left some minor comments, please take a look.

Comment on lines 52 to 57
this.schemaInfo = schemaInfo;
this.genericAvroSchema = new GenericAvroSchema(schemaInfo);
this.genericAvroSchema
.setSchemaInfoProvider(
new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()));
this.columnHandles = columnHandles;
Contributor

Suggested change
this.schemaInfo = schemaInfo;
this.genericAvroSchema = new GenericAvroSchema(schemaInfo);
this.genericAvroSchema
.setSchemaInfoProvider(
new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()));
this.columnHandles = columnHandles;
this(new PulsarSqlSchemaInfoProvider(topicName, pulsarConnectorConfig.getPulsarAdmin()), schemaInfo, columnHandles);

public Object deserialize(ByteBuf keyPayload, ByteBuf dataPayload) {
return null;
public Object deserialize(ByteBuf payload, byte[] schemaVersion) {
return genericAvroSchema.decode(payload, schemaVersion);
}

@Override
public Object extractField(int index, Object currentRecord) {
Contributor

I'm talking about the extractField(int index, Object currentRecord) method. I noticed that the new code reads by field name, whereas we read by position index before. Is the previous approach not enough to support multiple schema versions?

return new KeyValue<>(keyObj, valueObj);
}

private KeyValue<ByteBuf, ByteBuf> deserializeCommon(ByteBuf keyPayload, ByteBuf dataPayload) {
Contributor

I think this method does not do any deserialization work, so would it be better to rename it to getKeyValueByteBuf? Also, to reduce KeyValue object creation, I think you can implement the key-value schema deserialization in deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion) and check there whether schemaVersion is null. Then deserialize(ByteBuf keyPayload, ByteBuf dataPayload) can simply call deserialize(keyPayload, dataPayload, null), and we don't need to create a KeyValue object for every message.

Contributor Author

With multiple schema versions we only use the latest schema; removing or adding a field would upset the index, so we need to use the field name to find it.
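
The concern in this reply can be sketched with a toy model in plain Java. This is not Avro's API: `FieldLookupSketch` and its helpers are hypothetical, and the record is just an ordered map standing in for a record decoded with an older writer schema. It assumes the column index is taken from the latest schema, where a field has since been removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldLookupSketch {
    // Builds a toy record: field name -> value, kept in the writer schema's field order.
    static Map<String, Object> record(Object... nameValuePairs) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            fields.put((String) nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return fields;
    }

    // Positional lookup: only correct if the record's layout matches the column layout.
    static Object byIndex(Map<String, Object> rec, int index) {
        return new ArrayList<>(rec.values()).get(index);
    }

    // Name lookup: independent of where the field sits in a given version.
    static Object byName(Map<String, Object> rec, String name) {
        return rec.get(name);
    }

    public static void main(String[] args) {
        // Older version had (id, nickname, name); assume the latest has (id, name).
        Map<String, Object> oldRecord = record("id", 1, "nickname", "bob", "name", "Robert");
        int nameIndexInLatest = 1; // position of "name" in the assumed latest schema

        System.out.println(byIndex(oldRecord, nameIndexInLatest)); // returns the nickname value
        System.out.println(byName(oldRecord, "name"));             // returns the name value
    }
}
```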

Contributor Author

I agree with renaming it to getKeyValueByteBuf. I don't think we need to check whether the schema version is null: if it is null, we use the latest version's reader to decode it. All we care about is whether the field names match.

Comment on lines 426 to 434
    if (currentMessage.getSchemaVersion() != null) {
        currentRecord = this.schemaHandler.deserialize(keyByteBuf,
                this.currentMessage.getData(), this.currentMessage.getSchemaVersion());
    } else {
        currentRecord = this.schemaHandler.deserialize(keyByteBuf, this.currentMessage.getData());
    }
} else if (currentMessage.getSchemaVersion() != null) {
    currentRecord = this.schemaHandler.deserialize(this.currentMessage.getData(),
            this.currentMessage.getSchemaVersion());
Contributor

Related to the above comment: if we can pass null for schemaVersion to deserialize(ByteBuf keyPayload, ByteBuf dataPayload, byte[] schemaVersion), we don't need the if-else here, right?

Contributor Author

Yes, you are right, we don't need the if-else here.
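
A minimal sketch of the shape agreed on in this thread, assuming a null schemaVersion means "decode with the latest schema". To stay self-contained it uses `String` payloads in place of `ByteBuf`, and the `Handler` interface is a hypothetical stand-in for the PR's `SchemaHandler`, not its real definition.

```java
import java.nio.ByteBuffer;

final class DeserializeOverloadSketch {
    interface Handler {
        // Single implementation point; a null schemaVersion means "use the latest schema".
        Object deserialize(String keyPayload, String dataPayload, byte[] schemaVersion);

        // The two-argument form just delegates, so callers never need to branch.
        default Object deserialize(String keyPayload, String dataPayload) {
            return deserialize(keyPayload, dataPayload, null);
        }
    }

    // Toy handler that only reports which schema version it would have used.
    static final Handler HANDLER = (key, data, version) ->
            data + "@" + (version == null ? "latest" : "v" + ByteBuffer.wrap(version).getLong());

    public static void main(String[] args) {
        byte[] maybeVersion = null; // e.g. a message that carries no schema version
        // One call site, with or without a schema version on the message.
        System.out.println(HANDLER.deserialize("k", "payload", maybeVersion));
    }
}
```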



/**
* Multi version generic schema provider by guava cache.
Contributor

Suggested change
* Multi version generic schema provider by guava cache.
* Multi version schema info provider for Pulsar SQL leverage guava cache.
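
For readers unfamiliar with the idea behind this class, here is a rough sketch of caching schema lookups per version. It deliberately swaps Guava's cache for the JDK's `ConcurrentHashMap.computeIfAbsent` so the example has no dependencies; `VersionedSchemaCacheSketch` and its loader are hypothetical and are not the PR's `PulsarSqlSchemaInfoProvider`. The key is the decoded `long` version because `byte[]` has no value-based `equals`/`hashCode`.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

final class VersionedSchemaCacheSketch {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();
    private final LongFunction<String> loader; // stands in for a remote schema lookup

    VersionedSchemaCacheSketch(LongFunction<String> loader) {
        this.loader = loader;
    }

    // Loads each schema version at most once; later calls are served from the map.
    String getSchemaByVersion(byte[] schemaVersion) {
        long version = ByteBuffer.wrap(schemaVersion).getLong();
        return cache.computeIfAbsent(version, v -> loader.apply(v));
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        VersionedSchemaCacheSketch provider = new VersionedSchemaCacheSketch(v -> {
            loads.incrementAndGet();
            return "schema-v" + v;
        });
        byte[] v2 = ByteBuffer.allocate(Long.BYTES).putLong(2L).array();
        provider.getSchemaByVersion(v2);
        provider.getSchemaByVersion(v2.clone()); // different array, same version
        System.out.println("loader calls: " + loads.get());
    }
}
```

Unlike a Guava cache, this map never evicts entries, which may be acceptable when the number of schema versions per topic is small.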


@codelipenghui
Contributor

@gaoran10 Please also help take a look at this PR, since you have done some work related to the Pulsar SQL schema.

@codelipenghui
Contributor

ping @gaoran10 @sijie Please help review this PR.

@codelipenghui codelipenghui merged commit 097108a into apache:master May 19, 2020
@codelipenghui codelipenghui changed the title Pulsar sql avro support schema version Support multiple Avro schema version in Pulsar SQL May 19, 2020
Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this pull request May 27, 2020
Support multiple Avro schema version in Pulsar SQL
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
Support multiple Avro schema version in Pulsar SQL
Labels
area/sql Pulsar SQL related features

4 participants