Conversation

@martin-traverse
Contributor

@martin-traverse martin-traverse commented Apr 14, 2025

Hi @lidavidm - here is part 2 in my Avro series, apologies for the delay, it's the usual work / contention story!

What's Changed

This PR relates to #698 and is the second in a series intended to provide full Avro read / write support in native Java. It adds round-trip tests for both schemas (Arrow schema -> Avro -> Arrow) and data (Arrow VSR -> Avro block -> Arrow VSR). It also adds a number of fixes and improvements to the Avro Consumers so that data arrives back in its original form after a round trip. The main changes are:

  • Added a top-level method in AvroToArrow to convert an Avro schema directly to an Arrow schema (this may exist elsewhere, but is needed to provide an API that matches the logic of this implementation)
  • Avro unions of [ type, null ] or [ null, type ] now have special handling: they are interpreted as a single nullable type rather than a union. Setting legacyMode = false in the AvroToArrowConfig object is required to enable this behaviour, otherwise unions are interpreted literally. Unions with more than 2 elements are always interpreted literally (but, per [Java] Type-ids in UnionVector are erroneously coupled to the Arrow types of the underlying vectors #108, in practice Java's current Union implementation is probably not usable with Avro at the moment).
  • Added support for new logical types (decimal 256, timestamp-nanos and the 3 local timestamp types)
  • Existing timestamp-millis and timestamp-micros types are now interpreted as zone-aware (previously they were interpreted as local; now the local timestamp types are interpreted as local, which I think is correct per the Avro spec). Requires setting legacyMode = false.
  • Removed namespaces from generated Arrow field names in complex types. E.g. the Avro field myNamespace.outerRecord.structField.intField should be called just "intField" inside the Arrow struct. This doesn't affect the skip field logic, which still works using the qualified names. This requires setting legacyMode = false.
  • Removed unexpected metadata in generated Arrow fields (empty alias lists and attributes interpreted as part of the field schema). This requires setting legacyMode = false.
  • Use the expected child vector names for Arrow LIST and MAP types when reading. For LIST, the default child vector is called "$data$", which is illegal in Avro, so the child field name is also changed to "item" in the producers. This requires setting legacyMode = false.
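The nullable-union rule above can be sketched as follows. This is an illustration only: it uses plain type-name strings rather than real Avro Schema objects, and the class and method names are hypothetical, not code from this PR.

```java
import java.util.List;

public class NullableUnionRule {

    /**
     * Returns the single non-null branch name if the union is [type, null]
     * or [null, type]; otherwise returns null to signal that the union
     * should be interpreted literally.
     */
    static String asNullableType(List<String> unionBranches) {
        if (unionBranches.size() != 2) {
            return null; // unions with more (or fewer) than 2 branches stay literal
        }
        boolean firstIsNull = "null".equals(unionBranches.get(0));
        boolean secondIsNull = "null".equals(unionBranches.get(1));
        if (firstIsNull == secondIsNull) {
            return null; // no null branch (or both null) - interpret literally
        }
        return firstIsNull ? unionBranches.get(1) : unionBranches.get(0);
    }
}
```

With legacyMode = false, a branch returned here becomes a single nullable Arrow field; a null return means the union is mapped literally.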

Breaking changes have been removed from this PR.

Per discussion below, all breaking changes are now behind a "legacyMode" flag in the AvroToArrowConfig object, which is enabled by default in all the original code paths.

Closes #698.

This change is meant to allow for round trip of schemas and individual Avro data blocks (one Avro data block -> one VSR). File-level capabilities are not included. I have not included anything to recycle the VSR as part of the read API, this feels like it belongs with the file-level piece. Also I have not done anything specific for enums / dict encoding as of yet.


@lidavidm lidavidm added the enhancement PRs that add or improve features. label Apr 15, 2025
@github-actions github-actions bot added this to the 18.3.0 milestone Apr 15, 2025
Member

@lidavidm lidavidm left a comment


I think in the interest of trying to keep semver, we should avoid breaking changes if possible. Any thoughts @jbonofre @laurentgo? Or we could just call the next release 19.0.0...

If it helps we could just have a single flag for "old" behavior?

case List:
case FixedSizeList:
return buildArraySchema(builder.array(), field, namespace);
// Arrow uses "$data$" as the field name for list items, that is not a valid Avro name
Member

The funny thing is, arrow-java shouldn't be doing that, it was just never corrected...

return buildArraySchema(builder.array(), field, namespace);
// Arrow uses "$data$" as the field name for list items, that is not a valid Avro name
Field itemField = field.getChildren().get(0);
if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {
Member

Do we perhaps want to check for invalid names more generally and mangle/normalize them?

Member

Or just normalize all field names to something consistent in Avro?

Contributor Author

Hm, I think for list / map types using the constant defined names for children makes sense, with "item" instead of "$data$" for list items. More generally, we could normalise illegal chars to "_" to match the Avro name rules. Per my understanding similar rules are already enforced in C++, but are not part of the Arrow spec or Java implementation.

Very happy to put the normalisation in, it's probably a more useful behaviour than throwing an error in the adapter. Would you like me to do it?
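If that normalisation did go in, a minimal name mangler might look like this. This is a hypothetical helper, not code from this PR, assuming Avro's naming rule that a name starts with [A-Za-z_] and continues with [A-Za-z0-9_]:

```java
public class AvroNameUtil {

    /** Replace characters that are illegal in an Avro name with '_'. */
    static String normalize(String name) {
        if (name == null || name.isEmpty()) {
            return "_";
        }
        StringBuilder sb = new StringBuilder(name.length());
        char first = name.charAt(0);
        sb.append(isAsciiLetter(first) || first == '_' ? first : '_');
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean legal = isAsciiLetter(c) || (c >= '0' && c <= '9') || c == '_';
            sb.append(legal ? c : '_');
        }
        return sb.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
}
```

Under this rule "$data$" would come out as "_data_", which is why using the constant name "item" for list children is the nicer special case.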

/**
* Producer wrapper which producers nullable types to an avro encoder. Write the data to the
* underlying {@link FieldVector}.
* Producer wrapper which producers nullable types to an avro encoder. Reed data from the underlying
Member

Suggested change
* Producer wrapper which producers nullable types to an avro encoder. Reed data from the underlying
* Producer wrapper which produces nullable types to an avro encoder. Read data from the underlying

Contributor Author

;-)

return skipFieldNames;
}

public boolean isHandleNullable() {
Member

nit: could we perhaps get a more descriptive name for this parameter overall? "handleUnionOfNullAsNullable"? (As enterprise-java-y as that is...)

Contributor Author

This is now part of the legacyMode parameter

Member

Is there an opportunity to structure this as a parameterized test?

Contributor Author

I have factored out the common code as a helper. Don't think it can go all the way to being parameterised because the types need to be set up differently.

new FieldType(nullable, arrowType, /* dictionary= */ null, getMetaData(schema));
vector = createVector(consumerVector, fieldType, name, allocator);
consumer = new AvroDecimalConsumer.BytesDecimalConsumer((DecimalVector) vector);
if (decimalType.getPrecision() <= 38) {
Member

Hmm, it's technically possible to have a decimal256 with a smaller precision though

Member

I guess in that case it would round-trip to the smaller type?

Contributor Author

For FIXED decimals I am using the fixedSize to choose the decimal type. For BYTES decimals yes they would just come back as the smaller type if the precision fits.

I used FIXED as the default output for decimals in the producers, because it is closer to the Arrow representation, but on reflection Avro as a format is very focused on keeping data compact, maybe BYTES makes more sense. It seems to be the default choice in Avro. Do you think we should go with that?

Member

I think FIXED is okay to allow round-trip by default even if it's not technically as compact.

@martin-traverse
Contributor Author

I think in the interest of trying to keep semver, we should avoid breaking changes if possible. Any thoughts @jbonofre @laurentgo? Or we could just call the next release 19.0.0...

If it helps we could just have a single flag for "old" behavior?

A major version bump feels a bit extreme for something as small as this! Let me try the single flag approach. There might be complications with the change for zone-aware vs local timestamps, because the types are different, but I think so long as the check happens before the types are decided it should be ok.

@martin-traverse
Contributor Author

Ok, here is an update. I have put a flag "legacyMode" in the config object and used that to control all the places where the old logic is impacted. I have allowed decimal 256 to come through in legacy mode and also timestamp-nanos with the old semantics. The local-timestamp-xxx types do not come through in legacy mode, because timestamp-xxx is already treated as local. I have replaced the original code for the logical types test and those are all passing.
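As a sketch of the shape this takes (illustrative only; the class and method names here are invented, not the exact AvroToArrowConfig API), a single default-on flag gates each behaviour change so existing callers are unaffected:

```java
public class LegacyModeSketch {

    private final boolean legacyMode;

    /** Legacy behaviour is the default on all original code paths. */
    public LegacyModeSketch() {
        this(true);
    }

    public LegacyModeSketch(boolean legacyMode) {
        this.legacyMode = legacyMode;
    }

    // Each new behaviour is enabled only when legacy mode is switched off.
    boolean twoBranchNullUnionIsNullable() { return !legacyMode; }
    boolean timestampMillisIsZoneAware()   { return !legacyMode; }
    boolean stripNamespacesFromNames()     { return !legacyMode; }
}
```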

@martin-traverse
Contributor Author

I have removed the breaking changes text from the headline text but am not able to remove the label.


@lidavidm
Member

Looks like there are some lint errors to be fixed

@martin-traverse
Contributor Author

Looks like there are some lint errors to be fixed

Apologies - I have reapplied spotless, should be ok now!

@lidavidm lidavidm mentioned this pull request Apr 22, 2025
@lidavidm lidavidm merged commit d2465c3 into apache:main Apr 23, 2025
25 of 26 checks passed


Development

Successfully merging this pull request may close these issues.

Avro support - Improve existing read capabilities
