When we built Avro support into Arroyo, the arrow-rs Avro implementation was not complete enough to use, so we took a bit of a shortcut with the Avro-to-JSON approach.
It's not straightforward to support all Avro features as SQL data types (for example, arbitrary unions), so today, for any field that has an unsupported data type, we use "raw JSON" encoding: we re-encode those columns as JSON and make them available for querying with JSON functions. This allows us to support any Avro schema.
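To make the "raw JSON" fallback concrete, here is a minimal, std-only sketch of the idea. It is not Arroyo's actual code: `AvroValue`, `Cell`, `to_json`, and `to_cell` are hypothetical stand-ins invented for illustration, and the sketch assumes that only unions are treated as unsupported.

```rust
/// Toy stand-in for a decoded Avro value (hypothetical; not the apache-avro `Value` type).
#[derive(Debug, Clone, PartialEq)]
enum AvroValue {
    Null,
    Long(i64),
    String(String),
    /// A union value: the index of the selected branch plus its payload.
    Union(u32, Box<AvroValue>),
}

/// Toy stand-in for one cell of a SQL column (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Null,
    Int64(i64),
    Utf8(String),
    /// Values with no direct SQL type are kept as a JSON-encoded string.
    RawJson(String),
}

/// Minimal string escaping for the sketch: handles only backslash and double quote.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Encode a value as JSON text; a union is encoded as its selected branch.
fn to_json(v: &AvroValue) -> String {
    match v {
        AvroValue::Null => "null".to_string(),
        AvroValue::Long(n) => n.to_string(),
        AvroValue::String(s) => format!("\"{}\"", escape(s)),
        AvroValue::Union(_, inner) => to_json(inner),
    }
}

/// Map supported types to typed cells; fall back to raw JSON for everything else.
fn to_cell(v: &AvroValue) -> Cell {
    match v {
        AvroValue::Null => Cell::Null,
        AvroValue::Long(n) => Cell::Int64(*n),
        AvroValue::String(s) => Cell::Utf8(s.clone()),
        // Assumption for this sketch: unions are the unsupported case.
        other => Cell::RawJson(to_json(other)),
    }
}
```

With these definitions, `to_cell(&AvroValue::Long(7))` yields a typed `Cell::Int64(7)`, while a union wrapping the same kind of value ends up as `Cell::RawJson` and would be queried through JSON functions instead.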
Arroyo is a very good library, but we ran into some performance issues when using it: we found a large volume of decoding operations, as shown below.
I analyzed the code:
https://github.com/zhuliquan/arroyo/blob/776965ae9d6ee818595197288d5cca379c564368/crates/arroyo-formats/src/de.rs#L338-L355
We found that the Avro data consumed from Kafka is first converted to an Avro `Value`, then to a `JsonValue`, then serialized to bytes, and finally converted to a `RecordBatch`. My question is: why not convert directly from Avro to `RecordBatch`? arrow-rs also supports the Avro format (https://github.com/apache/arrow-rs/tree/master/arrow-avro).