protobuf deserializer (inefficiently) parses schema file contents for every incoming message #969

ideasculptor · 2023-03-29T18:24:09Z

Description

I just read through the protobuf deserializer code in anticipation of migrating away from our homegrown implementation. It looks like the implementation in this repo parses the schema retrieved from schema registry, along with the schemas of all of its dependencies, for every message that is deserialized. The results of these parse operations are only used to determine the name of the protobuf message, which is then instantiated by MessageFactory. So there isn't even any significant value gained from ensuring that the MessageDescriptor accurately represents the version of the proto carried in the buffer. The only thing taken from the parse is the FileDescriptor, which is used to work out which message the message indexes in the buffer refer to.

As I understand it, the schema registry returns a new schemaId for a given subject whenever the schema changes in any way, so schemaId alone could probably be used as a cache key to look up already parsed message or file descriptors. The combination of schemaId and the message indexes bytes could certainly be used as a cache key in the use-case (not actually supported by other Confluent tools, I think) where more than one proto message from the same schema file is written to the same topic via the same subject.
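To make the cache-key idea concrete, here is a minimal sketch of a cache keyed by schemaId plus the message indexes bytes. It is written in Go on the assumption that the deserializer in question is the Go client's (the MessageFactory reference suggests so). All names here (`cacheKey`, `descriptorCache`, `getOrParse`) are hypothetical and are not part of this repo's API. The cached value is just the resolved message name, and the `parse` callback stands in for the real schema parse.

```go
package main

import (
	"fmt"
	"sync"
)

// cacheKey combines the registry schema ID with the raw message-index bytes
// from the message header. The indexes are stored as a string so the struct
// is comparable and can be used as a map key.
type cacheKey struct {
	schemaID       int
	messageIndexes string
}

// descriptorCache memoizes the message name resolved from a parsed schema.
// Hypothetical type: a real implementation would more likely cache the
// parsed file or message descriptor itself.
type descriptorCache struct {
	mu      sync.RWMutex
	entries map[cacheKey]string
}

func newDescriptorCache() *descriptorCache {
	return &descriptorCache{entries: make(map[cacheKey]string)}
}

// getOrParse returns the cached name for (schemaID, indexes) and calls parse
// only on a cache miss.
func (c *descriptorCache) getOrParse(schemaID int, indexes []byte, parse func() (string, error)) (string, error) {
	key := cacheKey{schemaID: schemaID, messageIndexes: string(indexes)}

	c.mu.RLock()
	name, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return name, nil
	}

	// Parse without holding the lock. Concurrent misses for the same key may
	// parse more than once, which is wasteful but harmless as long as parsing
	// a given schema always yields the same result.
	name, err := parse()
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.entries[key] = name
	c.mu.Unlock()
	return name, nil
}

func main() {
	cache := newDescriptorCache()
	parses := 0
	parse := func() (string, error) {
		parses++ // stands in for the expensive schema parse
		return "example.Order", nil
	}

	for i := 0; i < 1000; i++ {
		if _, err := cache.getOrParse(42, []byte{0}, parse); err != nil {
			panic(err)
		}
	}
	fmt.Println("parse calls:", parses) // prints "parse calls: 1"
}
```

With this shape, a thousand messages carrying the same schemaId and message indexes trigger a single parse. If only one message type per subject is ever used, the `messageIndexes` field could be dropped from the key.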

The registry client already has a cache for schema contents, so there would seem to be no reason not to cache the parsed representation of those contents as well. It would probably be nice if the two caches could be linked in some way, so that cache invalidation cascades from the schema registry client to the protobuf deserializer.
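One possible way to link the two caches is an eviction callback: the cache of raw schema contents notifies listeners when an entry is dropped, and the parsed-descriptor cache drops its derived entry in response. The sketch below is an illustration only. `schemaCache`, `parsedCache`, `onEvict` and the other names are hypothetical, and I have not checked whether the registry client exposes any hook like this.

```go
package main

import (
	"fmt"
	"sync"
)

// schemaCache is a hypothetical stand-in for the registry client's cache of
// raw schema contents, keyed by schema ID.
type schemaCache struct {
	mu        sync.Mutex
	schemas   map[int]string
	listeners []func(schemaID int)
}

func newSchemaCache() *schemaCache {
	return &schemaCache{schemas: make(map[int]string)}
}

// onEvict registers a callback that runs whenever a schema is evicted.
func (c *schemaCache) onEvict(fn func(schemaID int)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.listeners = append(c.listeners, fn)
}

func (c *schemaCache) put(id int, text string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.schemas[id] = text
}

// evict removes the raw schema and then notifies listeners so that caches of
// derived data can drop their entries too.
func (c *schemaCache) evict(id int) {
	c.mu.Lock()
	delete(c.schemas, id)
	listeners := append([]func(int){}, c.listeners...)
	c.mu.Unlock()

	// Callbacks run without the lock held, so a listener may safely call
	// back into this cache.
	for _, fn := range listeners {
		fn(id)
	}
}

// parsedCache holds values derived from the raw schema contents.
type parsedCache struct {
	mu      sync.Mutex
	entries map[int]string
}

func (p *parsedCache) put(id int, v string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.entries[id] = v
}

func (p *parsedCache) drop(id int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.entries, id)
}

func (p *parsedCache) has(id int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.entries[id]
	return ok
}

func main() {
	schemas := newSchemaCache()
	parsed := &parsedCache{entries: make(map[int]string)}
	schemas.onEvict(parsed.drop) // link the two caches

	schemas.put(42, `syntax = "proto3"; message Order {}`)
	parsed.put(42, "Order")
	fmt.Println(parsed.has(42)) // prints "true"

	schemas.evict(42)
	fmt.Println(parsed.has(42)) // prints "false"
}
```

If schema IDs really are immutable once assigned, eviction would matter mainly for bounding memory rather than for correctness, so a simpler unlinked cache may be enough.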

This would dramatically speed up deserialization and cut down on CPU utilization when processing large volumes of messages.

* chore: update repo semaphore config
* Cache parsed file descriptors. Fixes #969
* Clean up import

Co-authored-by: Confluent Jenkins Bot <[email protected]>

rayokota mentioned this issue Jan 18, 2024

Cache parsed file descriptors #1128

Merged

rayokota closed this as completed in #1128 Jan 23, 2024

rayokota added the schema registry label Sep 27, 2024
