GH-698: Improve and fix Avro read consumers #718
Conversation
…sed by the consumers
…mentation needs revising)
lidavidm left a comment:
I think in the interest of trying to keep semver, we should avoid breaking changes if possible. Any thoughts @jbonofre @laurentgo? Or we could just call the next release 19.0.0...
If it helps we could just have a single flag for "old" behavior?
case List:
case FixedSizeList:
  return buildArraySchema(builder.array(), field, namespace);
// Arrow uses "$data$" as the field name for list items, that is not a valid Avro name
The funny thing is, arrow-java shouldn't be doing that, it was just never corrected...
  return buildArraySchema(builder.array(), field, namespace);
// Arrow uses "$data$" as the field name for list items, that is not a valid Avro name
Field itemField = field.getChildren().get(0);
if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {
Do we perhaps want to check for invalid names more generally and mangle/normalize them?
Or just normalize all field names to something consistent in Avro?
Hm, I think for list / map types using the constant defined names for children makes sense, with "item" instead of "$data$" for list items. More generally, we could normalise illegal chars to "_" to match the Avro name rules. Per my understanding similar rules are already enforced in C++, but are not part of the Arrow spec or Java implementation.
Very happy to put the normalisation in, it's probably a more useful behaviour than throwing an error in the adapter. Would you like me to do it?
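For illustration, here is a minimal sketch of the kind of normalisation discussed above (the class and method names are placeholders, not this PR's actual code). Avro names must match `[A-Za-z_][A-Za-z0-9_]*`, so any other character is mapped to "_":

```java
// Hypothetical sketch, not the PR's code: map characters that are illegal in
// Avro names to "_". Avro names must match [A-Za-z_][A-Za-z0-9_]*.
public final class AvroNameUtil {

  private AvroNameUtil() {}

  public static String normalizeAvroName(String name) {
    if (name == null || name.isEmpty()) {
      return "_"; // assumed fallback for missing names
    }
    StringBuilder sb = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean valid =
          (c >= 'A' && c <= 'Z')
              || (c >= 'a' && c <= 'z')
              || c == '_'
              || (i > 0 && c >= '0' && c <= '9'); // digits allowed after the first char
      sb.append(valid ? c : '_');
    }
    return sb.toString();
  }
}
```

Under these rules, `normalizeAvroName("$data$")` would yield `_data_`, while the constant child names ("item" for lists) would pass through unchanged.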
/**
 * Producer wrapper which producers nullable types to an avro encoder. Write the data to the
 * underlying {@link FieldVector}.
 * Producer wrapper which producers nullable types to an avro encoder. Reed data from the underlying
Suggested change:
- * Producer wrapper which producers nullable types to an avro encoder. Reed data from the underlying
+ * Producer wrapper which produces nullable types to an avro encoder. Read data from the underlying
;-)
  return skipFieldNames;
}

public boolean isHandleNullable() {
nit: could we perhaps get a more descriptive name for this parameter overall? "handleUnionOfNullAsNullable"? (As enterprise-java-y as that is...)
This is now part of the legacyMode parameter
Is there an opportunity to structure this as a parameterized test?
I have factored out the common code as a helper. Don't think it can go all the way to being parameterised because the types need to be set up differently.
    new FieldType(nullable, arrowType, /* dictionary= */ null, getMetaData(schema));
vector = createVector(consumerVector, fieldType, name, allocator);
consumer = new AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
if (decimalType.getPrecision() <= 38) {
Hmm, it's technically possible to have a decimal256 with a smaller precision though
I guess in that case it would round-trip to the smaller type?
For FIXED decimals I am using the fixedSize to choose the decimal type. For BYTES decimals yes they would just come back as the smaller type if the precision fits.
I used FIXED as the default output for decimals in the producers, because it is closer to the Arrow representation, but on reflection Avro as a format is very focused on keeping data compact, maybe BYTES makes more sense. It seems to be the default choice in Avro. Do you think we should go with that?
I think FIXED is okay to allow round-trip by default even if it's not technically as compact.
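As a hedged sketch of the fixed-size-driven choice described above (the helper name and error handling are assumptions, not the PR's code): a FIXED of up to 16 bytes fits a 128-bit Arrow decimal (precision up to 38), while larger sizes up to 32 bytes need a 256-bit decimal (precision up to 76).

```java
import org.apache.arrow.vector.types.pojo.ArrowType;

// Hedged sketch: pick the Arrow decimal width from the Avro FIXED size.
// The helper name and bounds handling are assumptions, not the PR's code.
static ArrowType decimalTypeForFixed(int fixedSize, int precision, int scale) {
  if (fixedSize <= 16) {
    // up to 16 bytes fits a 128-bit decimal (precision <= 38)
    return new ArrowType.Decimal(precision, scale, 128);
  } else if (fixedSize <= 32) {
    // up to 32 bytes needs a 256-bit decimal (precision <= 76)
    return new ArrowType.Decimal(precision, scale, 256);
  }
  throw new IllegalArgumentException("FIXED size too large for an Arrow decimal: " + fixedSize);
}
```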
A major version bump feels a bit extreme for something as small as this! Let me try the single flag approach. There might be complications with the change for zone-aware vs local timestamps, because the types are different, but I think so long as the check happens before the types are decided it should be ok.
…rt of the core Avro format
Ok, here is an update. I have put a flag "legacyMode" in the config object and used that to control all the places where the old logic is impacted. I have allowed decimal 256 to come through in legacy mode and also timestamp-nanos with the old semantics. The local-timestamp-xxx types do not come through in legacy mode, because timestamp-xxx is already treated as local. I have replaced the original code for the logical types test and those are all passing.
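For a sense of how such a flag might gate one of these decisions, here is a hedged sketch for the timestamp-millis mapping (the flag and this exact mapping are assumptions based on the discussion above, not the PR's actual code):

```java
import org.apache.arrow.vector.types.TimeUnit;
import org.apache.arrow.vector.types.pojo.ArrowType;

// Hedged sketch of legacy-mode gating for the timestamp-millis mapping.
// The flag and this mapping are assumptions, not the PR's actual code.
static ArrowType timestampMillisType(boolean legacyMode) {
  if (legacyMode) {
    // old behaviour: treat timestamp-millis as a local (zone-less) timestamp
    return new ArrowType.Timestamp(TimeUnit.MILLISECOND, null);
  }
  // new behaviour: Avro timestamp-millis is zone-aware, normalised to UTC
  return new ArrowType.Timestamp(TimeUnit.MILLISECOND, "UTC");
}
```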
I have removed the breaking-changes text from the headline but am not able to remove the label.
Looks like there are some lint errors to be fixed
Apologies - I have reapplied spotless, should be ok now!
Hi @lidavidm - here is part 2 in my Avro series, apologies for the delay, it's the usual work / contention story!
What's Changed
This PR relates to #698 and is the second in a series intended to provide full Avro read / write support in native Java. It adds round-trip tests for both schemas (Arrow schema -> Avro -> Arrow) and data (Arrow VSR -> Avro block -> Arrow VSR). It also adds a number of fixes and improvements to the Avro Consumers so that data arrives back in its original form after a round trip. The main changes are:
Breaking changes have been removed from this PR.
Per discussion below, all breaking changes are now behind a "legacyMode" flag in the AvroToArrowConfig object, which is enabled by default in all the original code paths.
Closes #698.
This change is meant to allow round-tripping of schemas and individual Avro data blocks (one Avro data block -> one VSR). File-level capabilities are not included. I have not included anything to recycle the VSR as part of the read API; that feels like it belongs with the file-level piece. Also, I have not done anything specific for enums / dict encoding as of yet. A sketch of the round-trip property the new tests check follows below.
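As a sketch of that round-trip property, the two converters are passed in as functions here because the real adapter entry points belong to this PR series; the names are placeholders, not the actual adapter API.

```java
import java.util.function.Function;
import org.apache.arrow.vector.types.pojo.Schema;

// Hedged sketch of the schema round-trip check (Arrow -> Avro -> Arrow).
// The converter parameters are placeholders for the real entry points.
final class SchemaRoundTrip {

  static void assertRoundTrip(
      Schema arrowSchema,
      Function<Schema, org.apache.avro.Schema> arrowToAvro,
      Function<org.apache.avro.Schema, Schema> avroToArrow) {
    Schema roundTripped = avroToArrow.apply(arrowToAvro.apply(arrowSchema));
    if (!arrowSchema.equals(roundTripped)) {
      throw new AssertionError("schema changed in round trip: " + roundTripped);
    }
  }
}
```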