@Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented Jun 13, 2023

Change Logs

Fix the following two scenarios, which fail when using Hive to query tables containing timestamp fields:

  1. No query field, like count(*)
-- spark
create table test_ts_tbl(
  id int, 
  ts1 timestamp)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_tbl
SELECT 1,
cast ('2021-12-25 12:01:01' as timestamp);

-- hive
select count(*) from test_ts_tbl; -- error

In this scenario, HoodieColumnProjectionUtils.getReadColumnNames(conf) is empty, so baseSchema is never initialized, and an NPE is thrown when calling:

public ArrayWritable getCurrentValue() throws IOException, InterruptedException {
  GenericRecord record = parquetRecordReader.getCurrentValue();
  return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, baseSchema, true);
}

  2. Complex fields containing timestamps are currently not supported
-- spark
create table test_ts_tbl(
  id int, 
  ts1 array<timestamp>, 
  ts2 map<string, timestamp>, 
  ts3 struct<province:timestamp, city:string>)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_tbl
SELECT 1,
array(cast ('2021-12-25 12:01:01' as timestamp)),
map('key', cast ('2021-12-25 12:01:01' as timestamp)),
struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');

-- hive
select * from test_ts_tbl; -- error
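
The second scenario comes down to finding timestamps nested inside arrays, maps, and structs. A minimal sketch of such a recursive check, assuming a toy Type model and a hypothetical containsTimestamp helper (neither is Hudi's or Hive's actual API):

```java
import java.util.Arrays;
import java.util.List;

// Toy type model for illustration only; not Hudi's or Hive's real classes.
class NestedTimestampCheck {

    static final class Type {
        final String kind;          // e.g. "int", "string", "timestamp", "array", "map", "struct"
        final List<Type> children;  // element type, key/value types, or struct field types

        Type(String kind, Type... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Returns true if the type is a timestamp or contains one at any nesting depth.
    static boolean containsTimestamp(Type type) {
        if ("timestamp".equals(type.kind)) {
            return true;
        }
        for (Type child : type.children) {
            if (containsTimestamp(child)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Type ts = new Type("timestamp");
        Type str = new Type("string");
        // Mirrors the column types of test_ts_tbl in scenario 2.
        System.out.println(containsTimestamp(new Type("array", ts)));
        System.out.println(containsTimestamp(new Type("map", str, ts)));
        System.out.println(containsTimestamp(new Type("struct", ts, str)));
        System.out.println(containsTimestamp(new Type("int")));
    }
}
```

The point of the sketch is only that a flat check on the top-level column type is not enough once timestamps can appear inside complex types.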

Impact

Fix the above two problems

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@danny0405
Contributor

@xicm Hi, can you help with the review?

@danny0405 danny0405 self-assigned this Jun 13, 2023

public static boolean supportTimestamp(Configuration conf) {
  List<String> readCols = Arrays.asList(getReadColumnNames(conf));
  if (readCols.isEmpty()) {
    return getIOColumnTypes(conf).contains("timestamp");

Contributor Author

@xicm I think it should return false directly here. What do you think?

Contributor

Agreed. @cdmikechen, do you have any other concerns?

Contributor

When can readCols be empty?

Contributor Author

> When can readCols be empty?

As far as I know, this happens for queries such as count(*), which don't need to read any columns.

Contributor

In that case, can the timestamp be read correctly anyway?

Contributor Author

> In that case, can the timestamp be read correctly anyway?

yes
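
The behaviour agreed on in this thread can be sketched as follows. This is a simplified stand-in, not the actual Hudi method: the hypothetical signature takes the projected column names and a name-to-type map directly, whereas the real supportTimestamp reads its inputs from a Hadoop Configuration. The handling of the non-empty case is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SupportTimestampSketch {

    // Hypothetical signature: projected column names plus a column-name -> Hive-type map.
    static boolean supportTimestamp(List<String> readCols, Map<String, String> columnTypes) {
        if (readCols.isEmpty()) {
            // count(*) projects no columns, so no timestamp conversion is needed.
            return false;
        }
        // Assumed behaviour for the non-empty case: true if any projected column has a
        // type that mentions timestamp, including nested types such as array<timestamp>.
        for (String col : readCols) {
            String type = columnTypes.get(col);
            if (type != null && type.contains("timestamp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("id", "int");
        types.put("ts1", "timestamp");
        System.out.println(supportTimestamp(Collections.emptyList(), types));
        System.out.println(supportTimestamp(Arrays.asList("id"), types));
        System.out.println(supportTimestamp(Arrays.asList("id", "ts1"), types));
    }
}
```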

@danny0405 danny0405 added the area:schema Schema evolution and data types label Jun 14, 2023

public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {

  private final ParquetRecordReader<GenericData.Record> parquetRecordReader;
  private Schema baseSchema;

Contributor

@cdmikechen cdmikechen Jun 14, 2023

In my original PR (HUDI-83) I didn't declare the baseSchema variable and didn't modify the getCurrentValue method.
I would like to know whether there is any problem, or whether the NPE goes away, if we don't declare baseSchema.

Contributor Author

> In my original PR (HUDI-83) I didn't declare the baseSchema variable and didn't modify the getCurrentValue method. I would like to know whether there is any problem, or whether the NPE goes away, if we don't declare baseSchema.

I have tested this: baseSchema needs to be used in getCurrentValue; otherwise the result fields will be null, as in #7173 (comment).
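
The failure mode from the PR description (a schema field that is only assigned when columns are projected) and one way to avoid it can be sketched with a toy reader. The class and method names are hypothetical and the schema is reduced to a list of field names; this is an illustration under those assumptions, not the actual change made to HoodieAvroParquetReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SchemaInitSketch {

    // Always assigned in the constructor, so later reads never see null.
    private final List<String> baseSchema;

    SchemaInitSketch(List<String> fileSchema, List<String> readCols) {
        if (readCols.isEmpty()) {
            // No projection (e.g. count(*)): fall back to the full file schema
            // instead of leaving the field unset.
            this.baseSchema = fileSchema;
        } else {
            // Keep only the projected fields, in file-schema order.
            List<String> projected = new ArrayList<>();
            for (String field : fileSchema) {
                if (readCols.contains(field)) {
                    projected.add(field);
                }
            }
            this.baseSchema = projected;
        }
    }

    List<String> schema() {
        return baseSchema;
    }

    public static void main(String[] args) {
        List<String> fileSchema = Arrays.asList("id", "ts1");
        System.out.println(new SchemaInitSketch(fileSchema, Collections.emptyList()).schema());
        System.out.println(new SchemaInitSketch(fileSchema, Arrays.asList("ts1")).schema());
    }
}
```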

Contributor

@cdmikechen cdmikechen Jun 14, 2023

@Zouxxyy
I'm a bit confused. I remember testing several scenarios against Hive when I first made the changes (about a year ago), including count(*) and queries on specific fields.
I don't know whether a later feature or PR has affected this, so I'll run another test later this week. Although we added a separate class to handle timestamp types, my original intention was to rely on the native Hive or Hadoop methods as much as possible for other fields; otherwise it would be costly for us to maintain.

Contributor Author

@Zouxxyy Zouxxyy Jun 15, 2023

@cdmikechen Have you tested select id, ts1 from test_ts_1? It returns null if baseSchema is not used.
Below is my full test; feel free to try it.

-- spark-sql
create table test_ts_1(
  id int, 
  ts1 timestamp)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_1
SELECT 1,
cast ('2021-12-25 12:01:01' as timestamp);

create table test_ts_2(
  id int, 
  ts1 array<timestamp>, 
  ts2 map<string, timestamp>, 
  ts3 struct<province:timestamp, city:string>)
using hudi
tblproperties(
  type='mor', 
  primaryKey='id'
);

INSERT INTO test_ts_2
SELECT 1,
array(cast ('2021-12-25 12:01:01' as timestamp)),
map('key', cast ('2021-12-25 12:01:01' as timestamp)),
struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');

-- hive
select * from test_ts_1;
select id from test_ts_1;
select ts1 from test_ts_1;
select id, ts1 from test_ts_1;
select count(*) from test_ts_1;

select * from test_ts_2;
select id from test_ts_2;
select ts1 from test_ts_2;
select id, ts1 from test_ts_2;
select count(*) from test_ts_2;

CC @danny0405 @xicm

@Zouxxyy Zouxxyy changed the title from "[HUDI-6367] Fix NPE in HoodieAvroParquetReader" to "[HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp" Jun 14, 2023
@Zouxxyy
Contributor Author

Zouxxyy commented Jun 14, 2023

@hudi-bot run azure

@danny0405
Contributor

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Contributor

@danny0405 danny0405 left a comment

I'm okay with this change. @Zouxxyy, can you file a follow-up fix if your tests encounter problems?

@danny0405 danny0405 merged commit 8bbda17 into apache:master Jun 15, 2023
@CTTY
Contributor

CTTY commented Jul 6, 2023

I'm seeing this failure when running the unit tests; the test seems to have been added by this PR. Error message:

Error:  Failures: 
Error:    TestHoodieAvroUtils.testGenerateProjectionSchema:453 expected: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]> but was: <Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: [_hoodie_commit_time, non_pii_col, _hoodie_partition_path, _row_key, _hoodie_record_key, pii_col, _hoodie_commit_seqno, _hoodie_file_name, timestamp]>

It looks like only the column order differs. Could you help me understand whether this is a valid failure or whether we should fix the test?

@Zouxxyy
Contributor Author

Zouxxyy commented Jul 7, 2023

> It looks like only the column order differs. Could you help me understand whether this is a valid failure or whether we should fix the test?

Are you testing with Java 17? https://github.com/apache/hudi/pull/9136/files#top
It seems that the iteration order of items in a set has changed in Java 17. If we need to support Java 17, we can change the test case like this:

from

    assertEquals("Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: "
            + "[non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col,"
            + " _hoodie_commit_seqno, _hoodie_file_name, timestamp]",
        assertThrows(HoodieException.class, () ->
            HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage());

to

    assertTrue(assertThrows(HoodieException.class, () ->
        HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field")))
        .getMessage().contains("Field fake_field not found in log schema. Query cannot proceed!"));
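
If the field list itself should still be verified, another option is to compare it as a set rather than as a string. A minimal sketch using only the JDK; parseDerivedFields is a hypothetical helper, the message format is taken from the error output above, and the field lists are shortened for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OrderInsensitiveMessageCheck {

    // Extracts the names between the last '[' and the last ']' of the message as a set.
    // Assumes the message ends with a bracketed, comma-separated field list.
    static Set<String> parseDerivedFields(String message) {
        int start = message.lastIndexOf('[');
        int end = message.lastIndexOf(']');
        String body = message.substring(start + 1, end);
        return new HashSet<>(Arrays.asList(body.split(",\\s*")));
    }

    public static void main(String[] args) {
        String prefix = "Field fake_field not found in log schema. Query cannot proceed! "
            + "Derived Schema Fields: ";
        String expected = prefix + "[non_pii_col, _hoodie_commit_time, _row_key]";
        String actual = prefix + "[_hoodie_commit_time, non_pii_col, _row_key]";
        // The strings differ, but the field sets are equal regardless of iteration order.
        System.out.println(expected.equals(actual));
        System.out.println(parseDerivedFields(expected).equals(parseDerivedFields(actual)));
    }
}
```

Comparing sets keeps the check on which fields are reported without depending on how a particular JDK orders them.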

@CTTY CTTY mentioned this pull request Jul 18, 2023

Labels

area:schema Schema evolution and data types
engine:hive Hive integration
issue:stability

Projects

Status: ✅ Done
