Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protobuf deserializer (inefficiently) parses schema file contents for every incoming message #969

Closed
ideasculptor opened this issue Mar 29, 2023 · 0 comments · Fixed by #1128

Comments

@ideasculptor
Copy link
Contributor

Description

I just read through the protobuf deserializer code in anticipation of migrating away from our homegrown implementation, but it looks like the implementation in this repo is parsing the schema retrieved from schema registry for every message that is deserialized - along with parsing the schema of all of its dependencies. The results of these parse operations are only used for determining the name of the protobuf which will then be instantiated by MessageFactory, so there isn't even any significant value being extracted from ensuring that the MessageDescriptor is an accurate representation of the version of the proto carried in the buffer. The only information extracted from the parse that is used is the FileDescriptor in order to work out which message is referenced by the message indexes in the buffer.

As I understand it, the schema registry updates the schemaId returned for a given subject whenever the schema changes in any way, so schemaId alone could probably be used as a cache key to look up already parsed message or file descriptors, but schemaId and message indexes bytes could certainly be used as a cache key in the (not actually supported by other confluent tools, I think) use-case where more than one proto message from the same schema file is being written to the same topic via the same subject.

The registry client has a cache for schema contents so there would seem to be no reason not to use a cache for the parsed representation of those schema contents, though it would probably be nice if the two caches could be linked in some way so that cache invalidation would cascade from schema client to protobuf deserializer.

This would dramatically speed up deserialization and cut down on cpu utilization when processing large volumes of messages.

rayokota added a commit that referenced this issue Jan 23, 2024
* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* chore: update repo semaphore config

* Cache parsed file descriptors

Fixes #969

* Clean up import

---------

Co-authored-by: Confluent Jenkins Bot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants