[HUDI-6367] Fix NPE in HoodieAvroParquetReader and support complex schema with timestamp #8955
Conversation
@xicm Hi, can you help with the review?
public static boolean supportTimestamp(Configuration conf) {
  List<String> readCols = Arrays.asList(getReadColumnNames(conf));
  if (readCols.isEmpty()) {
    return getIOColumnTypes(conf).contains("timestamp");
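For context, a dependency-free sketch of the behavior under discussion. The real method takes a Hadoop `Configuration`; the `List` parameters here are hypothetical stand-ins for `getReadColumnNames(conf)` and `getIOColumnTypes(conf)`, and the early `return false` reflects the reviewers' proposal, not the merged code.

```java
import java.util.List;

// Hypothetical sketch: stand-in lists replace the conf-backed lookups.
public class SupportTimestampSketch {

    static boolean supportTimestamp(List<String> readCols, List<String> ioColumnTypes) {
        if (readCols.isEmpty()) {
            // Proposed change from the review: nothing is projected (e.g. count(*)),
            // so no timestamp column will actually be read; return false directly.
            return false;
        }
        return ioColumnTypes.contains("timestamp");
    }

    public static void main(String[] args) {
        // count(*)-style query: no read columns configured
        System.out.println(supportTimestamp(List.of(), List.of("int", "timestamp")));            // false
        // projected query whose IO column types include a timestamp
        System.out.println(supportTimestamp(List.of("id", "ts1"), List.of("int", "timestamp"))); // true
    }
}
```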
@xicm I think it should return false directly here, what do you think?
Agreed. @cdmikechen, do you have any other concerns?
When can readCols be empty?
> When can readCols be empty?
As far as I know, in cases such as count(*), which doesn't need to read any columns.
In that case, can the timestamp still be read correctly?
> In that case, can the timestamp still be read correctly?

Yes.
public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {

  private final ParquetRecordReader<GenericData.Record> parquetRecordReader;
  private Schema baseSchema;
In my original PR HUDI-83 I didn't declare the baseSchema variable and didn't modify the getCurrentValue method. In fact, I would like to know whether there is any problem, or any NPE, if we don't declare baseSchema.
> In my original PR HUDI-83 I didn't declare the baseSchema variable and didn't modify the getCurrentValue method. In fact, I would like to know whether there is any problem, or any NPE, if we don't declare baseSchema.
I have tested it: baseSchema needs to be used in getCurrentValue; otherwise the result field will be null, as in #7173 (comment).
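A hypothetical, dependency-free illustration of why getCurrentValue needs the base schema: if values were copied purely by projected column position while the underlying reader emits fields in the file's base-schema order, the positions diverge and fields come back null. The Map-based record and `projectByName` helper below are stand-ins (not Hudi API) for an Avro record and a name-based lookup via the base schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a record modelled as name -> value stands in for an
// Avro GenericData.Record read with the full base schema.
public class BaseSchemaProjectionSketch {

    static Object[] projectByName(Map<String, Object> record, List<String> projectedCols) {
        Object[] out = new Object[projectedCols.size()];
        for (int i = 0; i < projectedCols.size(); i++) {
            out[i] = record.get(projectedCols.get(i)); // name-based lookup, not positional
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new LinkedHashMap<>();
        rec.put("_hoodie_commit_time", "20231225120101"); // base schema puts meta columns first
        rec.put("id", 1);
        rec.put("ts1", "2021-12-25 12:01:01");
        // select id, ts1 from test_ts_1 -> both values resolved, no nulls
        System.out.println(Arrays.toString(projectByName(rec, List.of("id", "ts1"))));
    }
}
```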
@Zouxxyy
I'm a bit confused. I remember doing scenario testing against Hive when I first made the changes (about a year ago), including count(*) and specified fields.
I don't know whether some subsequent new feature or PR has affected this; I think I'll do another test later this week. Although we have added a separate class to handle timestamp types, my original intention was to use the original Hive or Hadoop methods as much as possible for other fields, otherwise it would be costly for us to maintain subsequently.
@cdmikechen Have you ever tested select id, ts1 from test_ts_1? It will return null if you don't use baseSchema.
Below is my full test; feel free to try it:
-- spark-sql
create table test_ts_1(
id int,
ts1 timestamp)
using hudi
tblproperties(
type='mor',
primaryKey='id'
);
INSERT INTO test_ts_1
SELECT 1,
cast ('2021-12-25 12:01:01' as timestamp);
create table test_ts_2(
id int,
ts1 array<timestamp>,
ts2 map<string, timestamp>,
ts3 struct<province:timestamp, city:string>)
using hudi
tblproperties(
type='mor',
primaryKey='id'
);
INSERT INTO test_ts_2
SELECT 1,
array(cast ('2021-12-25 12:01:01' as timestamp)),
map('key', cast ('2021-12-25 12:01:01' as timestamp)),
struct(cast ('2021-12-25 12:01:01' as timestamp), 'test');
-- hive
select * from test_ts_1;
select id from test_ts_1;
select ts1 from test_ts_1;
select id, ts1 from test_ts_1;
select count(*) from test_ts_1;
select * from test_ts_2;
select id from test_ts_2;
select ts1 from test_ts_2;
select id, ts1 from test_ts_2;
select count(*) from test_ts_2;

CC @danny0405 @xicm
@hudi-bot run azure

1 similar comment

@hudi-bot run azure
danny0405 left a comment:
I'm okay with this change. @Zouxxyy, can you file a follow-up fix if your tests encounter problems?
I'm seeing this failure when running unit tests; this test seems to have been added by this PR. Error message:

It looks like only the column order is wrong, but could you help me understand whether this is a valid failure or whether we should fix the test?
Are you testing Java 17? https://github.com/apache/hudi/pull/9136/files#top changed the assertion from

assertEquals("Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: "
    + "[non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col,"
    + " _hoodie_commit_seqno, _hoodie_file_name, timestamp]",
    assertThrows(HoodieException.class, () ->
        HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage());

to

assertTrue(assertThrows(HoodieException.class, () ->
    HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field")))
    .getMessage().contains("Field fake_field not found in log schema. Query cannot proceed!"));
Change Logs

Fix the following two scenarios when using Hive to query tables containing timestamp fields:

count(*): in this scenario, HoodieColumnProjectionUtils.getReadColumnNames(conf) will be empty, baseSchema will not be initialized, and an NPE will be thrown when calling:

Impact

Fix the above two problems.
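The count(*) NPE path can be sketched without any Hudi/Hadoop dependencies. The `NpeGuardSketch` class and its `String` schema field are hypothetical stand-ins for the reader and its Avro `Schema`; the null guard shows the shape of the fix, not the exact merged code.

```java
// Hypothetical sketch: with a count(*) query no read columns are configured,
// so the schema field is never initialized, and dereferencing it would throw.
public class NpeGuardSketch {

    private String baseSchema; // stand-in for the Avro Schema field; stays null for count(*)

    NpeGuardSketch(String[] readCols) {
        if (readCols.length > 0) {
            baseSchema = "projected(" + String.join(",", readCols) + ")";
        }
    }

    // Guarded accessor: falls back instead of calling a method on a null schema.
    String describeCurrentValue() {
        return baseSchema == null ? "raw-record" : baseSchema;
    }

    public static void main(String[] args) {
        System.out.println(new NpeGuardSketch(new String[0]).describeCurrentValue());             // raw-record
        System.out.println(new NpeGuardSketch(new String[]{"id", "ts1"}).describeCurrentValue()); // projected(id,ts1)
    }
}
```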
Risk level (write none, low, medium or high below)
low
Documentation Update
None
Contributor's checklist