Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
1872d41
Provide an API for converting schemas directly, with the same rules u…
martin-traverse Apr 1, 2025
d07b417
Schema round trip for primitive types, nullable and non-nullable
martin-traverse Apr 1, 2025
d2b33ae
Fix field translation for nullable primitive types
martin-traverse Apr 1, 2025
1766e8b
Add schema round-trip tests for logical types
martin-traverse Apr 1, 2025
6d77139
Add schema round-trip tests for complex types
martin-traverse Apr 2, 2025
47572c0
Remove schema round trip test for fixed size list (does not exist in …
martin-traverse Apr 4, 2025
331dabc
Use expected child field name for list round trip
martin-traverse Apr 4, 2025
5361009
Support new logical types for LONG
martin-traverse Apr 4, 2025
c2bd7b3
Fix child field names for ARRAY and MAP types
martin-traverse Apr 4, 2025
3d9faf6
Fix handling of list item field names ($data$ is not legal in Avro)
martin-traverse Apr 4, 2025
8897832
Include child fields when handling nullable types
martin-traverse Apr 7, 2025
eaef6c8
Support decimal 256 when converting field types
martin-traverse Apr 7, 2025
0f94366
Updates tests to expect type support for decimal 256
martin-traverse Apr 7, 2025
2cfb81a
Add first round trip data tests
martin-traverse Apr 8, 2025
f7aba15
Add nullable consumer wrapper
martin-traverse Apr 9, 2025
dde2dca
Data round trip tests for primitive types, nullable and non-nullable
martin-traverse Apr 11, 2025
8525421
Factor out common logic in round trip data test
martin-traverse Apr 11, 2025
1ac4fae
Round trip test cases for logical types
martin-traverse Apr 11, 2025
b9cf11b
Add support for Decimal 256
martin-traverse Apr 11, 2025
c61cc3a
Respect nullability of null vectors
martin-traverse Apr 11, 2025
84ac2c7
Add round trip tests for complex types
martin-traverse Apr 11, 2025
0fd70de
Fix unexpected metadata generated by consumers
martin-traverse Apr 11, 2025
737ca3b
Add (and fix) consumers for all timestamp types (local and zone-aware)
martin-traverse Apr 12, 2025
4e6efdb
Update consumer tests for logical timestamp types
martin-traverse Apr 12, 2025
8350f40
Improve nullable type handling (allow to work with list and map)
martin-traverse Apr 12, 2025
6788d51
Fix one doc comment
martin-traverse Apr 13, 2025
7f7c4e7
Implementation updates for Struct type
martin-traverse Apr 14, 2025
e0550da
Nullable vectors for enums
martin-traverse Apr 14, 2025
0eb69a3
Tidy up union handling (will need revising along with union vector it…
martin-traverse Apr 14, 2025
bccc14a
Remove schema round trip test for unions (the main union vector imple…
martin-traverse Apr 14, 2025
577f8aa
Apply spotless
martin-traverse Apr 14, 2025
1a03b8a
Rename legacy mode flag in AvroToArrowConfig
martin-traverse Apr 18, 2025
141b6fd
Use legacy mode flag to guard handling of empty alias lists in extern…
martin-traverse Apr 18, 2025
3c173ec
Use legacy mode flag to guard handling of metadata fields that are pa…
martin-traverse Apr 18, 2025
f8aa184
Use legacy mode flag to guard handling of namespaces in field names
martin-traverse Apr 18, 2025
bff71c4
Use legacy mode flag to guard handling of zone-aware vs local timestamps
martin-traverse Apr 18, 2025
74717c3
Replace original code for Avro to Arrow logical types test
martin-traverse Apr 18, 2025
6c75d87
Fix typos in comment
martin-traverse Apr 18, 2025
9aaf5cf
Comments on the legacy mode config parameter
martin-traverse Apr 18, 2025
62d08a9
Factor out repeated code in RT schema test
martin-traverse Apr 18, 2025
d6538e6
Apply spotless
martin-traverse Apr 22, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -332,7 +332,17 @@ private static <T> T buildBaseTypeSchema(

case List:
case FixedSizeList:
return buildArraySchema(builder.array(), field, namespace);
// Arrow uses "$data$" as the field name for list items, that is not a valid Avro name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The funny thing is, arrow-java shouldn't be doing that, it was just never corrected...

Field itemField = field.getChildren().get(0);
if (ListVector.DATA_VECTOR_NAME.equals(itemField.getName())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we perhaps want to check for invalid names more generally and mangle/normalize them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just normalize all field names to something consistent in Avro?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think for list / map types using the constant defined names for children makes sense, with "item" instead of "$data$" for list items. More generally, we could normalise illegal chars to "_" to match the Avro name rules. Per my understanding similar rules are already enforced in C++, but are not part of the Arrow spec or Java implementation.

Very happy to put the normalisation in, it's probably a more useful behaviour than throwing an error in the adapter. Would you like me to do it?

Field safeItemField =
new Field("item", itemField.getFieldType(), itemField.getChildren());
Field safeListField =
new Field(field.getName(), field.getFieldType(), List.of(safeItemField));
return buildArraySchema(builder.array(), safeListField, namespace);
} else {
return buildArraySchema(builder.array(), field, namespace);
}

case Map:
return buildMapSchema(builder.map(), field, namespace);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -59,4 +59,23 @@ public static AvroToArrowVectorIterator avroToArrowIterator(

return AvroToArrowVectorIterator.create(decoder, schema, config);
}

/**
* Convert an Avro schema to its Arrow equivalent.
*
* <p>The resulting set of Arrow fields matches what would be set in the VSR after calling
* avroToArrow() or avroToArrowIterator(), respecting the configuration in the config parameter.
*
* @param schema The Avro schema to convert
* @param config Configuration options for conversion
* @return The equivalent Arrow schema
*/
public static org.apache.arrow.vector.types.pojo.Schema avroToAvroSchema(
Schema schema, AvroToArrowConfig config) {

Preconditions.checkNotNull(schema, "Avro schema object cannot be null");
Preconditions.checkNotNull(config, "config cannot be null");

return AvroToArrowUtils.createArrowSchema(schema, config);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,12 @@ public class AvroToArrowConfig {
/** The field names which to skip when reading decoder values. */
private final Set<String> skipFieldNames;

/**
* Use legacy-mode to keep compatibility with old behavior (pre-2025), enabled by default. This
* affects how the AvroToArrow code interprets the Avro schema.
*/
private final boolean legacyMode;

/**
* Instantiate an instance.
*
Expand All @@ -64,6 +70,37 @@ public class AvroToArrowConfig {
this.targetBatchSize = targetBatchSize;
this.provider = provider;
this.skipFieldNames = skipFieldNames;

// Default values for optional parameters
legacyMode = true; // Keep compatibility with old behavior by default
}

/**
* Instantiate an instance.
*
* @param allocator The memory allocator to construct the Arrow vectors with.
* @param targetBatchSize The maximum rowCount to read each time when partially convert data.
* @param provider The dictionary provider used for enum type, adapter will update this provider.
* @param skipFieldNames Field names which to skip.
* @param legacyMode Keep compatibility with old behavior (pre-2025)
*/
AvroToArrowConfig(
BufferAllocator allocator,
int targetBatchSize,
DictionaryProvider.MapDictionaryProvider provider,
Set<String> skipFieldNames,
boolean legacyMode) {

Preconditions.checkArgument(
targetBatchSize == AvroToArrowVectorIterator.NO_LIMIT_BATCH_SIZE || targetBatchSize > 0,
"invalid targetBatchSize: %s",
targetBatchSize);

this.allocator = allocator;
this.targetBatchSize = targetBatchSize;
this.provider = provider;
this.skipFieldNames = skipFieldNames;
this.legacyMode = legacyMode;
}

public BufferAllocator getAllocator() {
Expand All @@ -81,4 +118,8 @@ public DictionaryProvider.MapDictionaryProvider getProvider() {
public Set<String> getSkipFieldNames() {
return skipFieldNames;
}

public boolean isLegacyMode() {
return legacyMode;
}
}

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.arrow.adapter.avro.consumers;

import java.io.IOException;
import org.apache.arrow.vector.FieldVector;
import org.apache.avro.io.Decoder;

/**
* Consumer wrapper which consumes nullable type values from avro decoder. Write the data to the
* underlying {@link FieldVector}.
*
* @param <T> The vector within consumer or its delegate.
*/
public class AvroNullableConsumer<T extends FieldVector> extends BaseAvroConsumer<T> {

private final Consumer<T> delegate;
private final int nullIndex;

/** Instantiate a AvroNullableConsumer. */
@SuppressWarnings("unchecked")
public AvroNullableConsumer(Consumer<T> delegate, int nullIndex) {
super((T) delegate.getVector());
this.delegate = delegate;
this.nullIndex = nullIndex;
}

@Override
public void consume(Decoder decoder) throws IOException {
int typeIndex = decoder.readInt();
if (typeIndex == nullIndex) {
decoder.readNull();
delegate.addNull();
} else {
delegate.consume(decoder);
}
currentIndex++;
}

@Override
public void addNull() {
// Can be called by containers of nullable types
delegate.addNull();
currentIndex++;
}

@Override
public void setPosition(int index) {
if (index < 0 || index > vector.getValueCount()) {
throw new IllegalArgumentException("Index out of bounds");
}
delegate.setPosition(index);
super.setPosition(index);
}

@Override
public boolean resetValueVector(T vector) {
boolean delegateOk = delegate.resetValueVector(vector);
boolean thisOk = super.resetValueVector(vector);
return thisOk && delegateOk;
}

@Override
public void close() throws Exception {
super.close();
delegate.close();
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.arrow.adapter.avro.consumers.logical;

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.arrow.adapter.avro.consumers.BaseAvroConsumer;
import org.apache.arrow.util.Preconditions;
import org.apache.arrow.vector.Decimal256Vector;
import org.apache.avro.io.Decoder;

/**
* Consumer which consume 256-bit decimal type values from avro decoder. Write the data to {@link
* Decimal256Vector}.
*/
public abstract class AvroDecimal256Consumer extends BaseAvroConsumer<Decimal256Vector> {

protected AvroDecimal256Consumer(Decimal256Vector vector) {
super(vector);
}

/** Consumer for decimal logical type with 256 bit width and original bytes type. */
public static class BytesDecimal256Consumer extends AvroDecimal256Consumer {

private ByteBuffer cacheBuffer;

/** Instantiate a BytesDecimal256Consumer. */
public BytesDecimal256Consumer(Decimal256Vector vector) {
super(vector);
}

@Override
public void consume(Decoder decoder) throws IOException {
cacheBuffer = decoder.readBytes(cacheBuffer);
byte[] bytes = new byte[cacheBuffer.limit()];
Preconditions.checkArgument(bytes.length <= 32, "Decimal bytes length should <= 32.");
cacheBuffer.get(bytes);
vector.setBigEndian(currentIndex++, bytes);
}
}

/** Consumer for decimal logical type with 256 bit width and original fixed type. */
public static class FixedDecimal256Consumer extends AvroDecimal256Consumer {

private final byte[] reuseBytes;

/** Instantiate a FixedDecimal256Consumer. */
public FixedDecimal256Consumer(Decimal256Vector vector, int size) {
super(vector);
Preconditions.checkArgument(size <= 32, "Decimal bytes length should <= 32.");
reuseBytes = new byte[size];
}

@Override
public void consume(Decoder decoder) throws IOException {
decoder.readFixed(reuseBytes);
vector.setBigEndian(currentIndex++, reuseBytes);
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
import org.apache.avro.io.Decoder;

/**
* Consumer which consume date timestamp-micro values from avro decoder. Write the data to {@link
* Consumer which consumes local-timestamp-micros values from avro decoder. Write the data to {@link
* TimeStampMicroVector}.
*/
public class AvroTimestampMicrosConsumer extends BaseAvroConsumer<TimeStampMicroVector> {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.arrow.adapter.avro.consumers.logical;

import java.io.IOException;
import org.apache.arrow.adapter.avro.consumers.BaseAvroConsumer;
import org.apache.arrow.vector.TimeStampMicroTZVector;
import org.apache.avro.io.Decoder;

/**
* Consumer which consumes timestamp-micros values from avro decoder. Write the data to {@link
* TimeStampMicroTZVector}.
*/
public class AvroTimestampMicrosTzConsumer extends BaseAvroConsumer<TimeStampMicroTZVector> {

/** Instantiate a AvroTimestampMicrosTzConsumer. */
public AvroTimestampMicrosTzConsumer(TimeStampMicroTZVector vector) {
super(vector);
}

@Override
public void consume(Decoder decoder) throws IOException {
vector.set(currentIndex++, decoder.readLong());
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
import org.apache.avro.io.Decoder;

/**
* Consumer which consume date timestamp-millis values from avro decoder. Write the data to {@link
* Consumer which consume local-timestamp-millis values from avro decoder. Write the data to {@link
* TimeStampMilliVector}.
*/
public class AvroTimestampMillisConsumer extends BaseAvroConsumer<TimeStampMilliVector> {
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.arrow.adapter.avro.consumers.logical;

import java.io.IOException;
import org.apache.arrow.adapter.avro.consumers.BaseAvroConsumer;
import org.apache.arrow.vector.TimeStampMilliTZVector;
import org.apache.avro.io.Decoder;

/**
* Consumer which consume timestamp-millis values from avro decoder. Write the data to {@link
* TimeStampMilliTZVector}.
*/
public class AvroTimestampMillisTzConsumer extends BaseAvroConsumer<TimeStampMilliTZVector> {

/** Instantiate a AvroTimestampMillisTzConsumer. */
public AvroTimestampMillisTzConsumer(TimeStampMilliTZVector vector) {
super(vector);
}

@Override
public void consume(Decoder decoder) throws IOException {
vector.set(currentIndex++, decoder.readLong());
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.arrow.adapter.avro.consumers.logical;

import java.io.IOException;
import org.apache.arrow.adapter.avro.consumers.BaseAvroConsumer;
import org.apache.arrow.vector.TimeStampNanoVector;
import org.apache.avro.io.Decoder;

/**
* Consumer which consume local-timestamp-nanos values from avro decoder. Write the data to {@link
* TimeStampNanoVector}.
*/
public class AvroTimestampNanosConsumer extends BaseAvroConsumer<TimeStampNanoVector> {

/** Instantiate a AvroTimestampNanosConsumer. */
public AvroTimestampNanosConsumer(TimeStampNanoVector vector) {
super(vector);
}

@Override
public void consume(Decoder decoder) throws IOException {
vector.set(currentIndex++, decoder.readLong());
}
}
Loading
Loading