Conversation

Contributor
@singhpk234 singhpk234 commented Oct 17, 2024

About the change

The UUID type in the Parquet writer expects a ByteBuffer rather than a UUID; otherwise the writer fails with:

class java.util.UUID cannot be cast to class [B (java.util.UUID and [B are in module java.base of loader 'bootstrap')

The fixed-length type needs a byte array rather than a ByteBuffer; otherwise one gets this error:

class java.nio.HeapByteBuffer cannot be cast to class [B (java.nio.HeapByteBuffer and [B are in module java.base of loader 'bootstrap')
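
A minimal sketch of the two conversions the fix performs (the helper class and method names below are illustrative, not the actual RecordConverter code; it assumes Iceberg's UUIDUtil and ByteBuffers utilities):

import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.iceberg.util.ByteBuffers;
import org.apache.iceberg.util.UUIDUtil;

// Illustrative helpers showing the representation each Parquet writer expects.
class UuidFixedConversions {
  // Parquet's UUID writer takes a 16-byte ByteBuffer, not a java.util.UUID.
  static ByteBuffer uuidForParquet(UUID uuid) {
    return UUIDUtil.convertToByteBuffer(uuid);
  }

  // Parquet's fixed-length writer takes byte[], not a (Heap)ByteBuffer.
  static byte[] fixedForParquet(ByteBuffer buffer) {
    return ByteBuffers.toByteArray(buffer);
  }
}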

Testing

Added new tests

cc @bryanck

@singhpk234 singhpk234 marked this pull request as draft October 17, 2024 22:16
@singhpk234 singhpk234 changed the title from "[KafkaConnect] Fix RecordConverter" to "[KafkaConnect] Fix RecordConverter for UUID and Fixed Types" Oct 18, 2024
@singhpk234 singhpk234 marked this pull request as ready for review October 18, 2024 15:38
@github-actions github-actions bot removed the ORC label Oct 18, 2024
@RussellSpitzer RussellSpitzer added this to the Iceberg 1.7.0 milestone Oct 22, 2024

-public class RecordConverterTest {
+@ExtendWith(ParameterizedTestExtension.class)
+public class RecordConverterTest extends BaseWriterTest {
Contributor

I was hoping we'd keep this test specific to the conversion functions, and keep writer tests separate. Do you have thoughts on that?

Contributor Author

I was thinking along the lines that the conversion functions are no longer format-agnostic, since we now factor format info into deciding the record conversion, hence I thought it would be fair to test this end-to-end here.

Please let me know your thoughts considering the above.

Member

Maybe we can create a dedicated test class for the writer?

Member

Shouldn't we have this parameterized by file type? The tests here make sense to me, but I am only looking at this module for the first time for this PR.

Member
@RussellSpitzer RussellSpitzer Oct 24, 2024

Ah, I see: the format only comes into play for UUID, so the other parameterizations are essentially no-ops. Perhaps we just need one specialized test then, "testParquetUUIDSerialization".

Contributor Author

Sure, added this test. Thanks for suggesting it!
I also think we need an end-to-end test with the writer; I can take that as a follow-up to this PR, as it would require refactoring the writer tests.
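
A rough sketch of what such a focused test might look like (the convertForFormat helper and the assertion style below are assumptions for illustration, not the actual test added in this PR):

@Test
public void testParquetUUIDConversion() {
  UUID expected = UUID.randomUUID();

  // convertForFormat is a hypothetical helper standing in for the record
  // conversion path as it would run for a Parquet table
  Object converted = convertForFormat(expected, FileFormat.PARQUET);

  // for Parquet, the converter should produce a 16-byte ByteBuffer, not a UUID
  assertThat(converted).isEqualTo(UUIDUtil.convertToByteBuffer(expected));
}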

@jbonofre
Member

This change looks good; I'm just wondering about the test. I would have kept the original test and created a new one dedicated to the writer.

Member
@RussellSpitzer RussellSpitzer left a comment

This looks good to me on the fix side, but I agree with the others that we need to adjust the tests to be a bit more specific to this fix.

@RussellSpitzer RussellSpitzer merged commit 9ecd97b into apache:main Oct 25, 2024
@RussellSpitzer
Member

Thanks @singhpk234 for the PR, and @jbonofre, @bryanck, and @ajantha-bhat for the review!

@Gezi-lzq

Gezi-lzq commented Nov 28, 2024

When writing UUIDs, should we handle the conversion directly within BaseParquetWriter, by modifying BaseParquetWriter#primitive to check whether the LogicalTypeAnnotation is a UUID and then using a UUIDWriter to write it, instead of performing the conversion based on the file type before writing?

@Override
public ParquetValueWriter<?> primitive(PrimitiveType primitive) {
  // ...
  switch (primitive.getPrimitiveTypeName()) {
    case FIXED_LEN_BYTE_ARRAY:
      // route UUID-annotated fixed columns to a dedicated writer
      if (LogicalTypeAnnotation.uuidType().equals(primitive.getLogicalTypeAnnotation())) {
        return new UUIDWriter(desc);
      }
      return new FixedWriter(desc);
    // ...
  }
}

private static class UUIDWriter extends ParquetValueWriters.PrimitiveWriter<UUID> {
  private UUIDWriter(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public void write(int repetitionLevel, UUID value) {
    // serialize the UUID to its 16-byte form before writing it as binary
    column.writeBinary(repetitionLevel, Binary.fromReusedByteArray(UUIDUtil.convert(value)));
  }
}

Similar to the approach taken in #7399 in Apache Iceberg.
@singhpk234 @bryanck @openinx @RussellSpitzer
