I just read through the protobuf deserializer code in anticipation of migrating away from our homegrown implementation, and it looks like the implementation in this repo parses the schema retrieved from schema registry, along with the schemas of all of its dependencies, for every message that is deserialized. The results of these parse operations are only used to determine the name of the protobuf message, which is then instantiated by MessageFactory, so no significant value is even being extracted from ensuring that the MessageDescriptor is an accurate representation of the version of the proto carried in the buffer. The only information taken from the parse is the FileDescriptor, which is used to work out which message is referenced by the message indexes in the buffer.
As I understand it, the schema registry returns a new schemaId for a given subject whenever the schema changes in any way, so schemaId alone could probably be used as a cache key to look up already-parsed message or file descriptors. The schemaId combined with the message-index bytes could certainly be used as a cache key, which would also cover the use case (not actually supported by other Confluent tools, I think) where more than one proto message from the same schema file is written to the same topic via the same subject.
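To make the idea concrete, here is a minimal sketch of what such a cache might look like. Everything in it is hypothetical (names like `descriptorCache` and `getOrParse` are not part of the existing deserializer): the key combines the schema ID with the message-index bytes, and the existing parsing logic is passed in as a callback so it only runs on a cache miss.

```go
// Hypothetical sketch, not existing code in this repo: a thread-safe cache of
// parsed descriptors keyed by schema ID plus the message-index bytes.
package descriptorcache

import "sync"

// cacheKey combines the registry schema ID with the raw message-index bytes
// from the wire format, so distinct messages within one schema file map to
// distinct entries.
type cacheKey struct {
	schemaID   int
	msgIndexes string // index bytes converted to string so the key is comparable
}

// descriptorCache memoizes the result of an expensive parse step
// (schema text plus references -> FileDescriptor -> MessageDescriptor).
type descriptorCache[T any] struct {
	mu      sync.RWMutex
	entries map[cacheKey]T
}

func newDescriptorCache[T any]() *descriptorCache[T] {
	return &descriptorCache[T]{entries: make(map[cacheKey]T)}
}

// getOrParse returns the cached descriptor for key, or runs parse once for
// that key and stores the result. parse stands in for whatever parsing the
// deserializer already does today.
func (c *descriptorCache[T]) getOrParse(key cacheKey, parse func() (T, error)) (T, error) {
	c.mu.RLock()
	if d, ok := c.entries[key]; ok {
		c.mu.RUnlock()
		return d, nil
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	if d, ok := c.entries[key]; ok { // another goroutine may have parsed it already
		return d, nil
	}
	d, err := parse()
	if err != nil {
		var zero T
		return zero, err
	}
	c.entries[key] = d
	return d, nil
}
```

The deserializer would then call `getOrParse` with the schema ID and index bytes it already reads from the message header, so the hot path only pays for the parse the first time a given (schemaId, message indexes) pair is seen.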
The registry client already has a cache for schema contents, so there seems to be no reason not to also cache the parsed representation of those contents. Ideally the two caches would be linked in some way so that cache invalidation cascades from the schema registry client to the protobuf deserializer.
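Continuing the sketch above, cascading invalidation could be as simple as the method below. The hook itself is made up; the registry client would need to expose some notification when it evicts a schema for this to actually work.

```go
// invalidateSchema drops every cached descriptor derived from a given schema
// ID, e.g. when the registry client's own cache evicts or refreshes that
// schema. Hypothetical, following the descriptorCache sketch above.
func (c *descriptorCache[T]) invalidateSchema(schemaID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.schemaID == schemaID {
			delete(c.entries, k)
		}
	}
}
```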
This would dramatically speed up deserialization and cut down on CPU utilization when processing large volumes of messages.