[HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp #8955

Zouxxyy · 2023-06-13T10:27:02Z

Change Logs

Fixes two failures that occur when Hive queries tables containing timestamp fields.

Queries that read no columns, such as count(*)

-- spark
create table test_ts_tbl(
  id int, 
  ts1 timestamp)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_tbl
SELECT 1,
cast ('2021-12-25 12:01:01' as timestamp);

-- hive
select count(*) from test_ts_tbl; -- error

In this scenario, HoodieColumnProjectionUtils.getReadColumnNames(conf) is empty, so baseSchema is never initialized, and an NPE is thrown when calling:

public ArrayWritable getCurrentValue() throws IOException, InterruptedException {
  GenericRecord record = parquetRecordReader.getCurrentValue();
  // baseSchema is still null here when the query reads no columns
  return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, baseSchema, true);
}
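
The failure mode can be illustrated without Hive or Hudi. The following is a minimal sketch, not the code from this PR: the class `ProjectionReaderSketch`, its fields, and the use of a `List<String>` in place of an Avro schema are all hypothetical stand-ins. It shows a schema that is only set when the projection is non-empty, and one possible null-safe fallback; the actual fix may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a reader whose projected schema is set only
// when the query reads at least one column.
final class ProjectionReaderSketch {
    private final List<String> fileSchema; // all columns stored in the file
    private List<String> baseSchema;       // stays null for an empty projection

    ProjectionReaderSketch(List<String> fileSchema, List<String> readColumns) {
        this.fileSchema = fileSchema;
        if (!readColumns.isEmpty()) {
            this.baseSchema = readColumns;
        }
    }

    // Unguarded access: throws NullPointerException for count(*)-style queries.
    int unsafeFieldCount() {
        return baseSchema.size();
    }

    // Guarded access: falls back to the file schema when nothing was projected.
    int safeFieldCount() {
        List<String> schema = baseSchema != null ? baseSchema : fileSchema;
        return schema.size();
    }

    public static void main(String[] args) {
        ProjectionReaderSketch reader = new ProjectionReaderSketch(
            Arrays.asList("id", "ts1"), Collections.<String>emptyList());
        System.out.println("safe field count: " + reader.safeFieldCount());
        try {
            reader.unsafeFieldCount();
        } catch (NullPointerException e) {
            System.out.println("unguarded access failed for an empty projection");
        }
    }
}
```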

Complex fields (array, map, struct) containing timestamps are currently not supported

-- spark
create table test_ts_tbl(
  id int, 
  ts1 array<timestamp>, 
  ts2 map<string, timestamp>, 
  ts3 struct<province:timestamp, city:string>)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_tbl
SELECT 1,
array(cast ('2021-12-25 12:01:01' as timestamp)),
map('key', cast ('2021-12-25 12:01:01' as timestamp)),
struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');

-- hive
select * from test_ts_tbl; -- error

Impact

Fixes the two problems described above.

Risk level (write none, low, medium or high below)

low

Documentation Update

None

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

danny0405 · 2023-06-13T10:30:32Z

@xicm Hi, can you help with the review?

Zouxxyy · 2023-06-13T11:30:17Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieColumnProjectionUtils.java

  public static boolean supportTimestamp(Configuration conf) {
    List<String> readCols = Arrays.asList(getReadColumnNames(conf));
    if (readCols.isEmpty()) {
-      return getIOColumnTypes(conf).contains("timestamp");


@xicm I think this should return false directly. What do you think?

xicm · 2023-06-14

I agree with you. @cdmikechen, do you have any other concerns?

danny0405 · 2023-06-14

When can readCols be empty?

Zouxxyy · 2023-06-14

As far as I know, it is empty for queries such as count(*), which don't need to read any columns.

danny0405 · 2023-06-14

In that case, can the timestamp be read correctly anyway?

Zouxxyy · 2023-06-14

Yes.
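
The behaviour agreed on in this thread can be sketched in plain Java. This is only an illustration under stated assumptions, not the Hudi implementation: the class `TimestampSupportSketch` is hypothetical, column types are modelled as Hive-style type strings rather than Hive type objects, and checking only the projected columns is an assumption. The helper name `typeContainsTimestamp` is borrowed from a commit message in this PR, but its body here is invented for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, String-based model of the check discussed in the review thread.
final class TimestampSupportSketch {

    // Matches the word "timestamp" used as a type, at any nesting level.
    // The negative lookahead skips struct field names such as "timestamp:int".
    private static final Pattern TIMESTAMP_TYPE =
        Pattern.compile("\\btimestamp\\b(?!\\s*:)", Pattern.CASE_INSENSITIVE);

    static boolean typeContainsTimestamp(String typeName) {
        return TIMESTAMP_TYPE.matcher(typeName).find();
    }

    // readCols: projected column names; allCols/allTypes: parallel lists
    // describing every column of the table.
    static boolean supportTimestamp(List<String> readCols,
                                    List<String> allCols,
                                    List<String> allTypes) {
        if (readCols.isEmpty()) {
            // count(*)-style query: no column is read, so no timestamp handling is needed.
            return false;
        }
        for (String col : readCols) {
            int idx = allCols.indexOf(col);
            if (idx >= 0 && typeContainsTimestamp(allTypes.get(idx))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("id", "ts1", "ts3");
        List<String> types = Arrays.asList(
            "int", "array<timestamp>", "struct<province:timestamp,city:string>");
        System.out.println(supportTimestamp(Collections.<String>emptyList(), cols, types));
        System.out.println(supportTimestamp(Arrays.asList("id"), cols, types));
        System.out.println(supportTimestamp(Arrays.asList("id", "ts3"), cols, types));
    }
}
```

A real implementation would more likely walk the parsed type tree than match on type strings; the regular expression is used here only to keep the sketch short.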

cdmikechen · 2023-06-14T13:02:17Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/avro/HoodieAvroParquetReader.java

 public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {

  private final ParquetRecordReader<GenericData.Record> parquetRecordReader;
  private Schema baseSchema;


In my original PR (HUDI-83), I didn't declare the baseSchema variable and didn't modify the getCurrentValue method.
Would there be any problem, or would the NPE be avoided, if we didn't declare baseSchema?

Zouxxyy · 2023-06-14

I have tested this: baseSchema needs to be used in getCurrentValue; otherwise the resulting fields will be null, as in #7173 (comment).

cdmikechen · 2023-06-14

@Zouxxyy
I'm a bit confused. I remember testing several scenarios against Hive when I first made the changes (about a year ago), including count(*) and queries on specific fields.
I don't know whether some later feature or PR has affected this, so I'll run another test later this week. Although we added a separate class to handle timestamp types, my original intention was to rely on the native Hive or Hadoop methods as much as possible for other fields; otherwise it would be costly for us to maintain.

Zouxxyy · 2023-06-15

@cdmikechen Have you tested select id, ts1 from test_ts_1? It returns null if baseSchema is not used.
Below is my full test; feel free to try it.

-- spark-sql
create table test_ts_1(
  id int,
  ts1 timestamp)
using hudi
tblproperties(
  type='mor',
  primaryKey='id'
);

INSERT INTO test_ts_1
SELECT 1,
cast ('2021-12-25 12:01:01' as timestamp);

create table test_ts_2(
  id int,
  ts1 array<timestamp>,
  ts2 map<string, timestamp>,
  ts3 struct<province:timestamp, city:string>)
using hudi
tblproperties(
  type='mor',
  primaryKey='id'
);

INSERT INTO test_ts_2
SELECT 1,
array(cast ('2021-12-25 12:01:01' as timestamp)),
map('key', cast ('2021-12-25 12:01:01' as timestamp)),
struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');

-- hive
select * from test_ts_1;
select id from test_ts_1;
select ts1 from test_ts_1;
select id, ts1 from test_ts_1;
select count(*) from test_ts_1;

select * from test_ts_2;
select id from test_ts_2;
select ts1 from test_ts_2;
select id, ts1 from test_ts_2;
select count(*) from test_ts_2;

CC @danny0405 @xicm

Zouxxyy · 2023-06-14T16:41:20Z

@hudi-bot run azure

danny0405 · 2023-06-15T02:42:23Z

@hudi-bot run azure

hudi-bot · 2023-06-15T05:41:47Z

CI report:

6c16144 UNKNOWN
9bad575 Azure: FAILURE Azure: FAILURE Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

danny0405

I'm okay with this change. @Zouxxyy, can you file a follow-up fix if your tests run into problems?

CTTY · 2023-07-06T22:03:16Z

I'm seeing this failure when running the unit tests. The test seems to have been added by this PR. Error message:

Error:  Failures: 
Error:    TestHoodieAvroUtils.testGenerateProjectionSchema:453 expected: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]> but was: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [_hoodie_commit_time, non_pii_col, _hoodie_partition_path, _row_key, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]>

It looks like only the column order differs, but could you help me understand whether this is a valid failure or whether we should fix the test?

Zouxxyy · 2023-07-07T01:55:54Z

Are you testing with Java 17? https://github.com/apache/hudi/pull/9136/files#top
It seems that the iteration order of items in a set has changed in Java 17. If we need to support Java 17, we can change the test case like this:

from

    assertEquals("Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: "
            + "[non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col,"
            + " _hoodie_commit_seqno, _hoodie_file_name, timestamp]",
        assertThrows(HoodieException.class, () ->
            HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage());

to

    assertTrue(assertThrows(HoodieException.class, () ->
        HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field")))
        .getMessage().contains("Field fake_field not found in log schema. Query cannot proceed!"));
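
The reasoning behind the relaxed assertion can be sketched as follows. `HashSet` does not guarantee any iteration order, so a message that embeds a rendered set is not stable across JVMs. The class `OrderInsensitiveAssertSketch` and its helpers are hypothetical and only mimic the shape of the message; two `LinkedHashSet` instances with different insertion orders are used to simulate, deterministically, two JVMs that iterate the same set differently.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helpers that mimic the shape of the error message in the failing test.
final class OrderInsensitiveAssertSketch {
    static final String STABLE_PART =
        "Field fake_field not found in log schema. Query cannot proceed!";

    // The rendered set at the end depends on the set's iteration order.
    static String errorMessage(Set<String> schemaFields) {
        return STABLE_PART + " Derived Schema Fields: " + schemaFields;
    }

    // Order-insensitive check: only the fixed part of the message is compared.
    static boolean matchesStablePart(String message) {
        return message.contains(STABLE_PART);
    }

    public static void main(String[] args) {
        Set<String> orderA = new LinkedHashSet<>(Arrays.asList("_row_key", "timestamp"));
        Set<String> orderB = new LinkedHashSet<>(Arrays.asList("timestamp", "_row_key"));
        String a = errorMessage(orderA);
        String b = errorMessage(orderB);
        System.out.println("full messages equal: " + a.equals(b));
        System.out.println("stable part matches: "
            + (matchesStablePart(a) && matchesStablePart(b)));
        System.out.println("field sets equal: " + orderA.equals(orderB));
    }
}
```

If the field list itself needs to be verified, comparing the fields as sets rather than as a rendered string keeps the check independent of iteration order.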

Zouxxyy added 2 commits June 13, 2023 18:26

[HUDI-6367] Fix NPE in HoodieAvroParquetReader

a8c5325

update

fc4b139

danny0405 self-assigned this Jun 13, 2023

danny0405 added engine:hive Hive integration issue:stability labels Jun 13, 2023

danny0405 reviewed Jun 13, 2023

Zouxxyy commented Jun 13, 2023

danny0405 added the area:schema Schema evolution and data types label Jun 14, 2023

Zouxxyy added 5 commits June 14, 2023 11:46

update

de5ed6c

for checkstyle

6c16144

for checkstyle

02ddd51

add typeContainsTimestamp

ccd6825

fix checkstyle

f1437ab

danny0405 reviewed Jun 14, 2023

cdmikechen reviewed Jun 14, 2023

for comment

9bad575

Zouxxyy changed the title ~~[HUDI-6367] Fix NPE in HoodieAvroParquetReader~~ [HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp Jun 14, 2023

danny0405 approved these changes Jun 15, 2023

danny0405 merged commit 8bbda17 into apache:master Jun 15, 2023

CTTY mentioned this pull request Jul 18, 2023

[HUDI-6509] Add GitHub CI for Java 17 #9136

Merged

4 tasks

6 participants
