[HUDI-5308] Hive3 query returns null when the where clause has a partition field #7355
|
@hudi-bot run azure |
| // Make the HDFS dataset non-hoodie and run the same query; Checks for interoperability with non-hoodie tables | ||
| // Delete Hoodie directory to make it non-hoodie dataset | ||
| executeCommandStringInDocker(ADHOC_1_CONTAINER, "hdfs dfs -rm -r " + hdfsPath + "/.hoodie", true); | ||
|
|
We need meta table when we query hive now, this case doesn't fit anymore.
meta table when we query hive now
By meta table, do you mean the metadata table?
| HoodieTableConfig tableConfig = metaClient.getTableConfig(); | ||
| addProjectionToJobConf(realtimeSplit, jobConf, metaClient.getTableConfig().getPreCombineField()); | ||
| // add partition fields to hive job conf | ||
| HoodieRealtimeInputFormatUtils.addProjectionField(jobConf, metaClient.getTableConfig().getPartitionFields()); |
The root cause of this issue is that Hudi writes partition fields into the parquet file, whereas parquet files written by Hive do not contain them.
The problem may also be solved on the Hive side via apache/hive#3742.
Can we reuse line 72?
|
@hudi-bot run azure |
|
Which version of Hive has this issue ? |
I tested with Hive 3.1.2; I believe Hive 2.x has this issue too. |
|
cc @xiarixiaoyao , could you help the review please? |
|
Is the problematic table partitioning in hive style: |
Both Hive-style and non-Hive-style partitioning return null. |
I'm skeptical about the use case, because Hive queries have been supported for a long time. Can you share your Hive table properties? |
|
|
steps to reproduce
|
|
@xicm I have not come across this issue before. I have queried MOR partitioned table with Hive and Presto with partition predicates and it returns the correct results for me. I even ran the steps that you shared - #7355 (comment) - but I could not reproduce (screenshot for _ro and _rt tables below). Am I missing something? I think querying by partition is the most common thing to do and if query returned null then it would have surfaced in earlier versions too. What version of Hudi are you using? Can you also share how you connect to Hive and any config that's being set, e.g. |
|
@codope Thanks for your testing. It's so confusing. I just use |
Oh ok. I am using Hive 2.3.1. But, you should try connecting using |
|
@hudi-bot run azure |
|
So is it because an incorrect Hive server version is used? |
yes, partition query returns null with hive3. |
So the fix is only necessary for old hive server and it is actually not a bug? |
The fix is necessary for Hive 3.1.2; for Hive 2, as codope tested, partition queries work. |
|
5308.patch.zip |
| Path inputPath = ((FileSplit)split).getPath(); | ||
| FileSystem fs = inputPath.getFileSystem(job); | ||
| Option<Path> tablePath = TablePathUtils.getTablePath(fs, inputPath); | ||
| HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(job).setBasePath(tablePath.get().toString()).build(); |
Does this code work here:
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(jobConf).setBasePath(realtimeSplit.getBasePath()).build();
| public static void addProjectionField(Configuration conf, Option<String[]> fieldName) { | ||
| if (fieldName.isPresent()) { | ||
| List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList()); | ||
| Arrays.stream(fieldName.get()).forEach(field -> { |
Can you explain what this value represents?
conf.get(serdeConstants.LIST_COLUMNS)
| String[] fieldOrders = fieldOrdersSet.toArray(new String[0]); | ||
| List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(",")) | ||
| .filter(fn -> !partitioningFields.contains(fn)).collect(Collectors.toList()); | ||
| List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(",")).collect(Collectors.toList()); |
Does the removal of partition fields affect the behavior of Hive 2.x?
...adoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
| } | ||
|
|
||
| if (!readColNames.contains(fieldName)) { | ||
| if (!Arrays.asList(readColNames.split(",")).contains(fieldName)) { |
Nice catch
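To illustrate why the changed condition matters, here is a minimal sketch contrasting a substring test on the comma-separated column list with a whole-name test. The class name, method names, and sample column names are hypothetical and chosen only for illustration; they are not taken from the Hudi code base.

```java
import java.util.Arrays;

public class ReadColumnCheck {
    // Substring test on the raw CSV string: matches any column whose name
    // merely contains the field name as a substring.
    static boolean containsSubstring(String readColNames, String fieldName) {
        return readColNames.contains(fieldName);
    }

    // Whole-name test: split the CSV and compare complete column names.
    static boolean containsColumn(String readColNames, String fieldName) {
        return Arrays.asList(readColNames.split(",")).contains(fieldName);
    }

    public static void main(String[] args) {
        String readColNames = "user_id,ts"; // hypothetical projected columns
        System.out.println(containsSubstring(readColNames, "id")); // true
        System.out.println(containsColumn(readColNames, "id"));    // false
    }
}
```

With the substring test, a field such as `id` would be treated as already projected because `user_id` contains it, so it would never be added to the projection.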
|
|
||
| public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo, Option<String> preCombineKeyOpt) { | ||
| public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo) { | ||
| // Need this to do merge records in HoodieRealtimeRecordReader |
addRequiredProjectionFields -> addVirtualKeysProjection
danny0405
left a comment
+1
…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.



Change Logs
Hudi writes partition fields into the parquet file, whereas parquet files written by Hive do not contain them. As a result, a query returns null when the where clause contains a partition field.
This PR adds the partition field IDs to hive.io.file.readcolumn.ids.
Details:
When Hive runs a query, it constructs a FilterPredicate. The FilterPredicate has a ValueInspector, which is initialized from hive.io.file.readcolumn.ids. If partition fields appear in the where clause and hive.io.file.readcolumn.ids does not contain them, the FilterPredicate won't have a ValueInspector, will not set isKnown, and getCurrentRecord will return null.
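The idea of adding partition field IDs to hive.io.file.readcolumn.ids can be sketched roughly as follows. This is not the PR's implementation: a plain `Map<String, String>` stands in for the Hadoop job configuration, the helper `addPartitionFieldIds` is hypothetical, and the sketch assumes the `columns` property (the value behind `serdeConstants.LIST_COLUMNS`) lists the partition field at the position used as its ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionProjectionSketch {
    static final String LIST_COLUMNS = "columns";
    static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

    // Hypothetical helper: append the position of each partition field in the
    // column list to the comma-separated read-column IDs, skipping duplicates
    // and fields that are not present in the column list.
    static void addPartitionFieldIds(Map<String, String> conf, String[] partitionFields) {
        List<String> columns = Arrays.asList(conf.getOrDefault(LIST_COLUMNS, "").split(","));
        String existing = conf.getOrDefault(READ_COLUMN_IDS, "");
        List<String> ids = existing.isEmpty()
            ? new ArrayList<>()
            : new ArrayList<>(Arrays.asList(existing.split(",")));
        for (String field : partitionFields) {
            int pos = columns.indexOf(field);
            String id = String.valueOf(pos);
            if (pos >= 0 && !ids.contains(id)) {
                ids.add(id);
            }
        }
        conf.put(READ_COLUMN_IDS, String.join(",", ids));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // stand-in for the job conf
        conf.put(LIST_COLUMNS, "id,name,ts,dt");    // hypothetical table columns
        conf.put(READ_COLUMN_IDS, "0,1");           // query projects id and name
        addPartitionFieldIds(conf, new String[] {"dt"});
        System.out.println(conf.get(READ_COLUMN_IDS)); // 0,1,3
    }
}
```

Comparing whole IDs in a list, rather than searching the raw comma-separated string, avoids treating `1` as already present when only `10` is.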
Impact
HoodieParquetInputFormat.java
HoodieParquetRealtimeInputFormat.java
Risk level (write none, low medium or high below)
low
If medium or high, explain what verification was done to mitigate the risks.