Conversation

@Zouxxyy (Contributor) commented Jun 5, 2023

Change Logs

Minor modification for #7173

  • boolean supportAvroRead = false;
  • code reformat

Impact

Fixes HoodieAvroParquetReader.

Risk level (write none, low, medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@Zouxxyy (Contributor, Author) commented Jun 5, 2023

In addition, the following problem was encountered when using Hive 2:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

Since the correct logical type cannot be obtained from the footer, subsequent reads still fail.

@cdmikechen (Contributor) commented

@Zouxxyy
We added Hive 2 and Hive 3 compatible adaptations for timestamp types in HUDI-5189, so Hive fields can be declared with the correct timestamp types.
May I ask whether you tested with the latest master branch? This feature was only recently merged; in the past it really wasn't possible.

@Zouxxyy (Contributor, Author) commented Jun 6, 2023

@cdmikechen have you tried it with Hive 2 (parquet-hadoop-bundle 1.8.1)?

Here is my test:

With Hive 2 (parquet-hadoop-bundle 1.8.1):

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <!-- hive2 -->
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.10.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.parquet</groupId>
                    <artifactId>parquet-hadoop</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

Using fileFooter.getFileMetaData().getSchema() returns the wrong schema: a_timestamp comes back as a plain optional int64 with no logical type. Since the correct logical type cannot be obtained from the footer, subsequent reads still fail.

    // Read the footer directly with parquet-hadoop on the classpath
    ParquetMetadata fileFooter = ParquetFileReader.readFooter(
        new Configuration(), new Path("file:///Users/zxy/Desktop/mz_parquet_to_hudi/486591f0-3e2e-457a-9f3a-d8305d5850fd-0_0-11-11_20230601185040816.parquet"),
        ParquetMetadataConverter.NO_FILTER);

    MessageType schema = fileFooter.getFileMetaData().getSchema();

Result:

message hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  optional int32 a_tinyint;
  optional int32 a_smallint;
  optional int32 a_int;
  optional int64 a_bigint;
  optional float a_float;
  optional double a_double;
  optional fixed_len_byte_array(5) a_decimal (DECIMAL(10,2));
  optional binary a_varchar (UTF8);
  optional binary a_char (UTF8);
  optional binary a_string (UTF8);
  optional int64 a_timestamp;
  optional binary a_binary;
  optional boolean a_boolean;
}
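
A quick way to confirm the missing annotation (a minimal sketch, assuming the schema object from the snippet above; the 1.8.1 metadata converter predates TIMESTAMP_MICROS and silently drops the annotation while decoding the footer):

    // Hypothetical check; needs:
    // import org.apache.parquet.schema.OriginalType;
    // import org.apache.parquet.schema.PrimitiveType;
    PrimitiveType ts = schema.getType("a_timestamp").asPrimitiveType();
    System.out.println(ts.getPrimitiveTypeName()); // INT64 under both versions
    System.out.println(ts.getOriginalType());      // null on 1.8.1, TIMESTAMP_MICROS on 1.10.1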

With Hive 3 (parquet-hadoop-bundle 1.10.1):

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <!-- hive3: no parquet-hadoop 1.8.1 pin and no exclusion, so
             parquet-avro pulls in parquet-hadoop 1.10.1 transitively -->
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.10.1</version>
        </dependency>
    </dependencies>

Using fileFooter.getFileMetaData().getSchema() returns the correct schema:

message hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  optional int32 a_tinyint;
  optional int32 a_smallint;
  optional int32 a_int;
  optional int64 a_bigint;
  optional float a_float;
  optional double a_double;
  optional fixed_len_byte_array(5) a_decimal (DECIMAL(10,2));
  optional binary a_varchar (UTF8);
  optional binary a_char (UTF8);
  optional binary a_string (UTF8);
  optional int64 a_timestamp (TIMESTAMP_MICROS);
  optional binary a_binary;
  optional boolean a_boolean;
}

Here is the footer read by parquet-tools:

extra:                  hoodie_min_record_key = a_int:3
extra:                  parquet.avro.schema = {"type":"record","name":"mz_parquet_to_hudi_record","namespace":"hoodie.mz_parquet_to_hudi","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a_tinyint","type":["null","int"],"default":null},{"name":"a_smallint","type":["null","int"],"default":null},{"name":"a_int","type":["null","int"],"default":null},{"name":"a_bigint","type":["null","long"],"default":null},{"name":"a_float","type":["null","float"],"default":null},{"name":"a_double","type":["null","double"],"default":null},{"name":"a_decimal","type":["null",{"type":"fixed","name":"fixed","namespace":"hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record.a_decimal","size":5,"logicalType":"decimal","precision":10,"scale":2}],"default":null},{"name":"a_varchar","type":["null","string"],"default":null},{"name":"a_char","type":["null","string"],"default":null},{"name":"a_string","type":["null","string"],"default":null},{"name":"a_timestamp","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},{"name":"a_binary","type":["null","bytes"],"default":null},{"name":"a_boolean","type":["null","boolean"],"default":null}]}
extra:                  writer.model.name = avro
extra:                  hoodie_max_record_key = a_int:3

file schema:            hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
a_tinyint:              OPTIONAL INT32 R:0 D:1
a_smallint:             OPTIONAL INT32 R:0 D:1
a_int:                  OPTIONAL INT32 R:0 D:1
a_bigint:               OPTIONAL INT64 R:0 D:1
a_float:                OPTIONAL FLOAT R:0 D:1
a_double:               OPTIONAL DOUBLE R:0 D:1
a_decimal:              OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
a_varchar:              OPTIONAL BINARY O:UTF8 R:0 D:1
a_char:                 OPTIONAL BINARY O:UTF8 R:0 D:1
a_string:               OPTIONAL BINARY O:UTF8 R:0 D:1
a_timestamp:            OPTIONAL INT64 R:0 D:1
a_binary:               OPTIONAL BINARY R:0 D:1
a_boolean:              OPTIONAL BOOLEAN R:0 D:1

row group 1:            RC:1 TS:1528 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time:     BINARY GZIP DO:0 FPO:4 SZ:142/124/0.87 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_commit_seqno:    BINARY GZIP DO:0 FPO:146 SZ:162/144/0.89 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_record_key:      BINARY GZIP DO:0 FPO:308 SZ:92/74/0.80 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_partition_path:  BINARY GZIP DO:0 FPO:400 SZ:57/39/0.68 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_file_name:       BINARY GZIP DO:0 FPO:457 SZ:412/401/0.97 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_tinyint:               INT32 GZIP DO:0 FPO:869 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_smallint:              INT32 GZIP DO:0 FPO:942 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_int:                   INT32 GZIP DO:0 FPO:1015 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_bigint:                INT64 GZIP DO:0 FPO:1088 SZ:90/75/0.83 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_float:                 FLOAT GZIP DO:0 FPO:1178 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_double:                DOUBLE GZIP DO:0 FPO:1251 SZ:91/75/0.82 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_decimal:               FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:1342 SZ:80/60/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_varchar:               BINARY GZIP DO:0 FPO:1422 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_char:                  BINARY GZIP DO:0 FPO:1489 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_string:                BINARY GZIP DO:0 FPO:1556 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_timestamp:             INT64 GZIP DO:0 FPO:1623 SZ:95/75/0.79 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_binary:                BINARY GZIP DO:0 FPO:1718 SZ:72/54/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_boolean:               BOOLEAN GZIP DO:0 FPO:1790 SZ:60/40/0.67 VC:1 ENC:BIT_PACKED,RLE,PLAIN

It seems Hive 2 can't read timestamp-micros from parquet.avro.schema.
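
Note that the logical type is still recoverable on 1.8.1 from the footer's key/value metadata, since the writer embeds the full Avro schema there (a hypothetical workaround sketch, not what this PR implements):

    // Hypothetical sketch; assumes the fileFooter object from the earlier snippet
    // and import org.apache.avro.Schema;
    String avroJson = fileFooter.getFileMetaData()
        .getKeyValueMetaData()
        .get("parquet.avro.schema");
    Schema avroSchema = new Schema.Parser().parse(avroJson);
    // prints ["null",{"type":"long","logicalType":"timestamp-micros"}]
    System.out.println(avroSchema.getField("a_timestamp").schema());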

So maybe we should shade parquet-hadoop 1.10.1 in hadoop-mr-bundle; I found a related revert PR: #6930

@cdmikechen @xicm @xushiyan @danny0405

@danny0405 self-assigned this Jun 6, 2023
@danny0405 added the engine:hive (Hive integration) and issue:version-compatibility (Version compatibility issues) labels Jun 6, 2023
@hudi-bot (Collaborator) commented Jun 6, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405 merged commit 2294c52 into apache:master Jun 6, 2023