Conversation

@Zouxxyy (Contributor) commented Jun 5, 2023

Change Logs

Minor modification for #7173

  • boolean supportAvroRead = false;
  • code reformat

Impact

Fixes HoodieAvroParquetReader.

Risk level (write none, low, medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@Zouxxyy (Contributor, Author) commented Jun 5, 2023

In addition, the following problem was encountered when using Hive 2:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

Since the correct logical type cannot be obtained from the footer, subsequent reads still fail.

@cdmikechen (Contributor) commented

@Zouxxyy
We added Hive 2 and Hive 3 compatible adaptations for timestamp types in HUDI-5189, so Hive fields can be declared with the correct timestamp types.
May I ask whether you tested with the latest master branch? This feature was only recently merged; in the past it really wasn't possible.

@Zouxxyy (Contributor, Author) commented Jun 6, 2023

@cdmikechen have you tried it with Hive 2 (parquet-hadoop-bundle 1.8.1)?

Here is my test:

With Hive 2 (parquet-hadoop-bundle 1.8.1):

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <!-- hive2 -->
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.10.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.parquet</groupId>
                    <artifactId>parquet-hadoop</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

Using fileFooter.getFileMetaData().getSchema() returns the wrong schema: a_timestamp comes back as a plain optional int64 with no logical type. Since the correct logical type cannot be obtained from the footer, subsequent reads still fail.

    // Read the footer directly with parquet-hadoop on the classpath
    ParquetMetadata fileFooter = ParquetFileReader.readFooter(
        new Configuration(), new Path("file:///Users/zxy/Desktop/mz_parquet_to_hudi/486591f0-3e2e-457a-9f3a-d8305d5850fd-0_0-11-11_20230601185040816.parquet"),
        ParquetMetadataConverter.NO_FILTER);

    MessageType schema = fileFooter.getFileMetaData().getSchema();

Result:

message hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  optional int32 a_tinyint;
  optional int32 a_smallint;
  optional int32 a_int;
  optional int64 a_bigint;
  optional float a_float;
  optional double a_double;
  optional fixed_len_byte_array(5) a_decimal (DECIMAL(10,2));
  optional binary a_varchar (UTF8);
  optional binary a_char (UTF8);
  optional binary a_string (UTF8);
  optional int64 a_timestamp;
  optional binary a_binary;
  optional boolean a_boolean;
}
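
A quick way to confirm the missing annotation (a minimal sketch, assuming the schema object from the snippet above; the 1.8.1 metadata converter predates TIMESTAMP_MICROS and silently drops the annotation while decoding the footer):

    // Hypothetical check; needs:
    // import org.apache.parquet.schema.OriginalType;
    // import org.apache.parquet.schema.PrimitiveType;
    PrimitiveType ts = schema.getType("a_timestamp").asPrimitiveType();
    System.out.println(ts.getPrimitiveTypeName()); // INT64 under both versions
    System.out.println(ts.getOriginalType());      // null on 1.8.1, TIMESTAMP_MICROS on 1.10.1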

With Hive 3 (parquet-hadoop-bundle 1.10.1):

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.10.1</version>
        </dependency>
        <!-- hive3: no parquet-hadoop 1.8.1 pin and no exclusion, so
             parquet-avro pulls in parquet-hadoop 1.10.1 transitively -->
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.10.1</version>
        </dependency>
    </dependencies>

Using fileFooter.getFileMetaData().getSchema() returns the correct schema:

message hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record {
  optional binary _hoodie_commit_time (UTF8);
  optional binary _hoodie_commit_seqno (UTF8);
  optional binary _hoodie_record_key (UTF8);
  optional binary _hoodie_partition_path (UTF8);
  optional binary _hoodie_file_name (UTF8);
  optional int32 a_tinyint;
  optional int32 a_smallint;
  optional int32 a_int;
  optional int64 a_bigint;
  optional float a_float;
  optional double a_double;
  optional fixed_len_byte_array(5) a_decimal (DECIMAL(10,2));
  optional binary a_varchar (UTF8);
  optional binary a_char (UTF8);
  optional binary a_string (UTF8);
  optional int64 a_timestamp (TIMESTAMP_MICROS);
  optional binary a_binary;
  optional boolean a_boolean;
}

Here is the footer read by parquet-tools:

extra:                  hoodie_min_record_key = a_int:3
extra:                  parquet.avro.schema = {"type":"record","name":"mz_parquet_to_hudi_record","namespace":"hoodie.mz_parquet_to_hudi","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"a_tinyint","type":["null","int"],"default":null},{"name":"a_smallint","type":["null","int"],"default":null},{"name":"a_int","type":["null","int"],"default":null},{"name":"a_bigint","type":["null","long"],"default":null},{"name":"a_float","type":["null","float"],"default":null},{"name":"a_double","type":["null","double"],"default":null},{"name":"a_decimal","type":["null",{"type":"fixed","name":"fixed","namespace":"hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record.a_decimal","size":5,"logicalType":"decimal","precision":10,"scale":2}],"default":null},{"name":"a_varchar","type":["null","string"],"default":null},{"name":"a_char","type":["null","string"],"default":null},{"name":"a_string","type":["null","string"],"default":null},{"name":"a_timestamp","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},{"name":"a_binary","type":["null","bytes"],"default":null},{"name":"a_boolean","type":["null","boolean"],"default":null}]}
extra:                  writer.model.name = avro
extra:                  hoodie_max_record_key = a_int:3

file schema:            hoodie.mz_parquet_to_hudi.mz_parquet_to_hudi_record
--------------------------------------------------------------------------------
_hoodie_commit_time:    OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_commit_seqno:   OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_record_key:     OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_partition_path: OPTIONAL BINARY O:UTF8 R:0 D:1
_hoodie_file_name:      OPTIONAL BINARY O:UTF8 R:0 D:1
a_tinyint:              OPTIONAL INT32 R:0 D:1
a_smallint:             OPTIONAL INT32 R:0 D:1
a_int:                  OPTIONAL INT32 R:0 D:1
a_bigint:               OPTIONAL INT64 R:0 D:1
a_float:                OPTIONAL FLOAT R:0 D:1
a_double:               OPTIONAL DOUBLE R:0 D:1
a_decimal:              OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
a_varchar:              OPTIONAL BINARY O:UTF8 R:0 D:1
a_char:                 OPTIONAL BINARY O:UTF8 R:0 D:1
a_string:               OPTIONAL BINARY O:UTF8 R:0 D:1
a_timestamp:            OPTIONAL INT64 R:0 D:1
a_binary:               OPTIONAL BINARY R:0 D:1
a_boolean:              OPTIONAL BOOLEAN R:0 D:1

row group 1:            RC:1 TS:1528 OFFSET:4
--------------------------------------------------------------------------------
_hoodie_commit_time:     BINARY GZIP DO:0 FPO:4 SZ:142/124/0.87 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_commit_seqno:    BINARY GZIP DO:0 FPO:146 SZ:162/144/0.89 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_record_key:      BINARY GZIP DO:0 FPO:308 SZ:92/74/0.80 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_partition_path:  BINARY GZIP DO:0 FPO:400 SZ:57/39/0.68 VC:1 ENC:BIT_PACKED,RLE,PLAIN
_hoodie_file_name:       BINARY GZIP DO:0 FPO:457 SZ:412/401/0.97 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_tinyint:               INT32 GZIP DO:0 FPO:869 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_smallint:              INT32 GZIP DO:0 FPO:942 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_int:                   INT32 GZIP DO:0 FPO:1015 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_bigint:                INT64 GZIP DO:0 FPO:1088 SZ:90/75/0.83 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_float:                 FLOAT GZIP DO:0 FPO:1178 SZ:73/55/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_double:                DOUBLE GZIP DO:0 FPO:1251 SZ:91/75/0.82 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_decimal:               FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:1342 SZ:80/60/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_varchar:               BINARY GZIP DO:0 FPO:1422 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_char:                  BINARY GZIP DO:0 FPO:1489 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_string:                BINARY GZIP DO:0 FPO:1556 SZ:67/49/0.73 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_timestamp:             INT64 GZIP DO:0 FPO:1623 SZ:95/75/0.79 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_binary:                BINARY GZIP DO:0 FPO:1718 SZ:72/54/0.75 VC:1 ENC:BIT_PACKED,RLE,PLAIN
a_boolean:               BOOLEAN GZIP DO:0 FPO:1790 SZ:60/40/0.67 VC:1 ENC:BIT_PACKED,RLE,PLAIN

It seems Hive 2 can't read timestamp-micros from parquet.avro.schema.
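
Note that the logical type is still recoverable on 1.8.1 from the footer's key/value metadata, since the writer embeds the full Avro schema there (a hypothetical workaround sketch, not what this PR implements):

    // Hypothetical sketch; assumes the fileFooter object from the earlier snippet
    // and import org.apache.avro.Schema;
    String avroJson = fileFooter.getFileMetaData()
        .getKeyValueMetaData()
        .get("parquet.avro.schema");
    Schema avroSchema = new Schema.Parser().parse(avroJson);
    // prints ["null",{"type":"long","logicalType":"timestamp-micros"}]
    System.out.println(avroSchema.getField("a_timestamp").schema());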

So maybe we should shade parquet-hadoop 1.10.1 in hadoop-mr-bundle; I found a related revert PR: #6930

@cdmikechen @xicm @xushiyan @danny0405

@danny0405 self-assigned this Jun 6, 2023
@danny0405 added the engine:hive (Hive integration) and issue:version-compatibility (Version compatibility issues) labels Jun 6, 2023
@hudi-bot (Collaborator) commented Jun 6, 2023

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405 merged commit 2294c52 into apache:master Jun 6, 2023