support avro to record batch directly #768

Open
zhuliquan opened this issue Oct 28, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@zhuliquan
Contributor

Arroyo is a very good library, but we ran into some performance issues when using it. We found that decoding operations account for a large amount of the work, as shown below.
[image]
I analyzed the code:
https://github.com/zhuliquan/arroyo/blob/776965ae9d6ee818595197288d5cca379c564368/crates/arroyo-formats/src/de.rs#L338-L355
Avro data consumed from Kafka is first converted to an Avro Value, then to a JSON Value, then serialized to bytes, and finally converted to a RecordBatch. My question is: why not convert directly from Avro to RecordBatch? arrow-rs also supports the Avro format (https://github.com/apache/arrow-rs/tree/master/arrow-avro).
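
To make the question concrete, here is a minimal sketch of what a direct path could look like: each decoded field is appended straight into a column, with no JSON or byte round-trip in between. Everything in it is an assumption for illustration: `AvroValue`, `Column`, and `rows_to_columns` are hypothetical toy types built on the standard library, not the `apache_avro`, arrow-rs, or Arroyo APIs.

```rust
// Hypothetical stand-in for a decoded Avro value; NOT the apache_avro type.
#[derive(Debug, Clone, PartialEq)]
pub enum AvroValue {
    Null,
    Long(i64),
    String(String),
}

// Toy column, standing in for an Arrow array builder (assumption).
#[derive(Debug, PartialEq)]
pub enum Column {
    Long(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

impl Column {
    // Append one value directly; a mismatched type is reported as an error.
    pub fn append(&mut self, value: &AvroValue) -> Result<(), String> {
        match (self, value) {
            (Column::Long(col), AvroValue::Long(v)) => col.push(Some(*v)),
            (Column::Long(col), AvroValue::Null) => col.push(None),
            (Column::Utf8(col), AvroValue::String(s)) => col.push(Some(s.clone())),
            (Column::Utf8(col), AvroValue::Null) => col.push(None),
            (_, v) => return Err(format!("type mismatch for {:?}", v)),
        }
        Ok(())
    }
}

// Convert positional rows into columns in one pass: one append per field.
// `columns` holds one empty column per field, in schema order.
pub fn rows_to_columns(
    mut columns: Vec<Column>,
    rows: &[Vec<AvroValue>],
) -> Result<Vec<Column>, String> {
    for row in rows {
        if row.len() != columns.len() {
            return Err(format!(
                "expected {} fields, got {}",
                columns.len(),
                row.len()
            ));
        }
        for (column, value) in columns.iter_mut().zip(row) {
            column.append(value)?;
        }
    }
    Ok(columns)
}
```

In this toy version, the only per-field work is the type check and the push. The current pipeline additionally builds a JSON value, serializes it to bytes, and parses those bytes again.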

@mwylde
Member

mwylde commented Oct 28, 2024

The answer has two parts:

  1. When we built Avro support into Arroyo, the arrow-rs Avro implementation was not complete enough to use, so we took a bit of a shortcut with the Avro-to-JSON approach.
  2. It's not straightforward to support all Avro features as SQL data types (for example, arbitrary unions). So today, for any field that has an unsupported data type, we use "raw json" encoding: we re-encode those columns as JSON and make them available for querying with JSON functions. This allows us to support any Avro schema.

Assuming we can find a pathway to support (2) with the arrow-rs implementation (and it's reasonably complete and fast), we can move to that. The approach might look like what we already do for JSON in our arrow-rs fork: https://github.com/ArroyoSystems/arrow-rs/blob/52.1.0/json/arrow-json/src/reader/json_array.rs
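
As a rough illustration of the "raw json" fallback in point 2, the sketch below keeps a typed cell for a field with a direct SQL mapping and re-encodes anything else as JSON text. It is not Arroyo's implementation: `Datum`, `FieldKind`, `Cell`, `to_json`, and `encode_field` are hypothetical names, and the JSON writer is a minimal standard-library stand-in for a real encoder.

```rust
// Hypothetical decoded value; a real Avro value has many more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Datum {
    Null,
    Long(i64),
    Text(String),
    Array(Vec<Datum>),
}

// How a field is stored: as a typed SQL column, or as raw JSON when the
// Avro type has no direct SQL mapping (e.g. an arbitrary union).
pub enum FieldKind {
    Long,
    RawJson,
}

#[derive(Debug, PartialEq)]
pub enum Cell {
    Null,
    Long(i64),
    Json(String),
}

// Escape a string for embedding in JSON text.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Minimal JSON encoder for the toy value type.
pub fn to_json(value: &Datum) -> String {
    match value {
        Datum::Null => "null".to_string(),
        Datum::Long(v) => v.to_string(),
        Datum::Text(s) => format!("\"{}\"", escape(s)),
        Datum::Array(items) => {
            let parts: Vec<String> = items.iter().map(to_json).collect();
            format!("[{}]", parts.join(","))
        }
    }
}

// Typed fields keep their type; unsupported ones fall back to JSON text.
pub fn encode_field(kind: &FieldKind, value: &Datum) -> Result<Cell, String> {
    match (kind, value) {
        (FieldKind::Long, Datum::Long(v)) => Ok(Cell::Long(*v)),
        (FieldKind::Long, Datum::Null) => Ok(Cell::Null),
        (FieldKind::Long, other) => Err(format!("expected long, got {:?}", other)),
        (FieldKind::RawJson, v) => Ok(Cell::Json(to_json(v))),
    }
}
```

A direct Avro-to-Arrow decoder would need an equivalent of the `RawJson` branch for the columns it cannot type, which is the part that still has to be designed.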

@mwylde added the enhancement (New feature or request) label Nov 18, 2024