[HUDI-5308] Hive3 query returns null when the where clause has a partition field #7355

xicm · 2022-12-01T10:19:30Z

Change Logs

We add partition fields to parquet file, while parquet file of hive doesn't, this will result returning null when the where clause has a partition field.

This pr add partition field id to hive.io.file.readcolumn.ids.

Details:
When hive queries, hive constructs a FilterPredicate. The FilterPredicate has a ValueInspector, which is initialized from hive.io.file.readcolumn.ids. If partition fields are in where clause and hive.io.file.readcolumn.ids does not contain partition fields, FilterPredicate won't have a ValueInspector, FilterPredicate will not set isKnown, and getCurrentRecord will return null.

Some hive code

    public void addBinary(Binary value) {
        ValueInspector[] var2 = this.valueInspectors;
        int var3 = var2.length;

        for(int var4 = 0; var4 < var3; ++var4) {
            ValueInspector valueInspector = var2[var4];
            valueInspector.update(value);  // isKnown is updated here
        }

        this.delegate.addBinary(value);
    }
            


    new ValueInspector() {
         public void updateNull() {
              this.setResult(false);
         }

          public void update(float value) {
               this.setResult(value == target);
          }
   };

      protected final void setResult(boolean result) {
            if (this.isKnown) {
                throw new IllegalStateException("setResult() called on a ValueInspector whose result is already known! Did you forget to call reset()?");
            } else {
                this.result = result;
                this.isKnown = true;
            }
        }




    public T getCurrentRecord() {
        boolean keep = IncrementallyUpdatedFilterPredicateEvaluator.evaluate(this.filterPredicate);
        IncrementallyUpdatedFilterPredicateResetter.reset(this.filterPredicate);
        return keep ? this.delegate.getCurrentRecord() : null;
    }

    public static boolean evaluate(IncrementallyUpdatedFilterPredicate pred) {
        Preconditions.checkNotNull(pred, "pred");
        return pred.accept(INSTANCE);
    }

    public boolean visit(ValueInspector p) {
        if (!p.isKnown()) { // default value of isKnown is false
            p.updateNull();
        }

        return p.getResult();
    }

Impact

HoodieParquetInputFormat.java
HoodieParquetRealtimeInputFormat.java

Risk level (write none, low medium or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

xicm · 2022-12-02T05:11:17Z

@hudi-bot run azure

xicm · 2022-12-15T10:10:14Z

hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieSanity.java

-    // Make the HDFS dataset non-hoodie and run the same query; Checks for interoperability with non-hoodie tables
-    // Delete Hoodie directory to make it non-hoodie dataset
-    executeCommandStringInDocker(ADHOC_1_CONTAINER, "hdfs dfs -rm -r " + hdfsPath + "/.hoodie", true);
-


We need meta table when we query hive now, this case doesn't fit anymore.

meta table when we query hive now

For meta table do you mean the metadata table?

xicm · 2022-12-15T10:19:55Z

...adoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java

    HoodieTableConfig tableConfig = metaClient.getTableConfig();
    addProjectionToJobConf(realtimeSplit, jobConf, metaClient.getTableConfig().getPreCombineField());
+    // add partition fields to hive job conf
+    HoodieRealtimeInputFormatUtils.addProjectionField(jobConf, metaClient.getTableConfig().getPartitionFields());


The root cause of this issue is we add partition fields to parquet file, while parquet file of hive doesn't,
the problem may be solved by hive via apache/hive#3742.

Can we reuse line 72?

xicm · 2023-01-03T07:14:35Z

@hudi-bot run azure

bvaradar · 2023-03-06T18:14:10Z

Which version of Hive has this issue ?

xicm · 2023-03-07T01:29:08Z

Which version of Hive has this issue ?

I test with hive 3.1.2, I believe hive 2.x has this issue too.

danny0405 · 2023-04-19T03:28:48Z

cc @xiarixiaoyao , could you help the review please?

danny0405 · 2023-04-24T03:58:20Z

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

xicm · 2023-04-24T04:26:45Z

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

Hive style and none hive style all return null.

danny0405 · 2023-04-24T07:16:09Z

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

Hive style and none hive style all return null.

I'm skeptical about the use case, because Hive queries have been supported for a long time, can you share you hive table properties ?

xicm · 2023-04-24T09:19:28Z

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `test_partition`(            |
|   `_hoodie_commit_time` string,                    |
|   `_hoodie_commit_seqno` string,                   |
|   `_hoodie_record_key` string,                     |
|   `_hoodie_partition_path` string,                 |
|   `_hoodie_file_name` string,                      |
|   `id` bigint,                                     |
|   `name` string,                                   |
|   `ts` bigint)                                     |
| PARTITIONED BY (                                   |
|   `part` string)                                   |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'path'='hdfs://xxxxxxx/tmp/hoodie/test_partition')  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://xxxxxxxxxx/tmp/hoodie/test_partition' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'hoodie.datasource.hive_sync.enable'='true',     |
|   'hoodie.datasource.hive_sync.metastore.uris'='thrift://xxxxxxxx:9083',  |
|   'hoodie.datasource.hive_sync.mode'='HMS',        |
|   'hoodie.datasource.hive_sync.partition_fields'='part',  |
|   'hoodie.datasource.hive_sync.table'='test_partition',  |
|   'hoodie.embed.timeline.server'='false',          |
|   'preCombineField'='ts',                          |
|   'primaryKey'='id',                               |
|   'spark.sql.create.version'='3.1.1',              |
|   'spark.sql.sources.provider'='hudi',             |
|   'spark.sql.sources.schema.numPartCols'='1',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"ts","type":"long","nullable":true,"metadata":{}},{"name":"part","type":"string","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='part',     |
|   'transient_lastDdlTime'='1682327384',            |
|   'type'='mor')                                    |
+----------------------------------------------------+

xicm · 2023-04-24T09:21:05Z

xicm · 2023-04-24T09:21:43Z

xicm · 2023-04-24T09:24:44Z

steps to reproduce

create table in spark sql

create table test_partition (
  id bigint,
  name string,
  ts bigint,
  part string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  'hoodie.embed.timeline.server' = 'false',
  'hoodie.datasource.hive_sync.enable'='true',
  'hoodie.datasource.hive_sync.mode'='HMS',
  'hoodie.datasource.hive_sync.table'='test_partition',
  'hoodie.datasource.hive_sync.metastore.uris'='thrift://xxxxxxxxxx',
  'hoodie.datasource.hive_sync.partition_fields'='part'
)
partitioned by (part)
location '/tmp/hoodie/test_partition';

insert a record

insert into test_partition partition (part)
select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as part;

query in hive

select * from test_partition_ro where part = '2021-12-09';

codope · 2023-04-28T12:36:18Z

@xicm I have not come across this issue before. I have queried MOR partitioned table with Hive and Presto with partition predicates and it returns the correct results for me. I even ran the steps that you shared - #7355 (comment) - but I could not reproduce (screenshot for _ro and _rt tables below). Am I missing something?

I think querying by partition is the most common thing to do and if query returned null then it would have surfaced in earlier versions too. What version of Hudi are you using? Can you also share how you connect to Hive and any config that's being set, e.g. hive.input.format. For reference, I simply used beeline as:

beeline -u jdbc:hive2://hiveserver:10000 \
>   --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
>   --hiveconf hive.stats.autogather=false

Screenshot below

xicm · 2023-04-28T13:12:07Z

@codope Thanks for your testing. It's so confusing. I just use hive cli without any other conf. My hive version is 3.1.2, I don't have hive2, what's your hive version? Is this a matter of version differences？

codope · 2023-04-28T13:20:55Z

@codope Thanks for your testing. It's so confusing. I just use hive cli without any other conf. My hive version is 3.1.2, I don't have hive2, what's your hive version? Is this a matter of version differences？

Oh ok. I am using Hive 2.3.1. But, you should try connecting using hive2 even for Hive version 3.1.2. That's the latest Hive server interface version for both the 2.3.x and 3.1.x lines.

xicm · 2023-04-28T23:49:04Z

@hudi-bot run azure

danny0405 · 2023-04-30T03:08:41Z

So it is because the incorrect hive server version is used ?

xicm · 2023-04-30T05:26:59Z

So it is because the incorrect hive server version is used ?

yes, partition query returns null with hive3.

danny0405 · 2023-05-02T08:05:11Z

So it is because the incorrect hive server version is used ?

yes, partition query returns null with hive3.

So the fix is only necessary for old hive server and it is actually not a bug?

xicm · 2023-05-02T08:21:34Z

So the fix is only necessary for old hive server and it is actually not a bug?

The fix is necessary for hive3.1.2, for hive2, as codepe tested , partition query is ok.

danny0405 · 2023-05-03T09:19:24Z

5308.patch.zip
I'm not that familiar with Hive so just created a patch to refactor the tests.

danny0405 · 2023-05-03T09:20:11Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java

+    Path inputPath = ((FileSplit)split).getPath();
+    FileSystem fs =  inputPath.getFileSystem(job);
+    Option<Path> tablePath = TablePathUtils.getTablePath(fs, inputPath);
+    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(job).setBasePath(tablePath.get().toString()).build();


Does this code work here:

HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(jobConf).setBasePath(realtimeSplit.getBasePath()).build();

danny0405 · 2023-05-03T09:21:04Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

+  public static void addProjectionField(Configuration conf, Option<String[]> fieldName) {
+    if (fieldName.isPresent()) {
+      List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+      Arrays.stream(fieldName.get()).forEach(field -> {


Can you explain what this value represents

conf.get(serdeConstants.LIST_COLUMNS

danny0405 · 2023-05-03T09:22:08Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java

    String[] fieldOrders = fieldOrdersSet.toArray(new String[0]);
-    List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(","))
-        .filter(fn -> !partitioningFields.contains(fn)).collect(Collectors.toList());
+    List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(",")).collect(Collectors.toList());


Does the removal of partition fiels affect the behavior of Hive 2.x?

...adoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java

danny0405 · 2023-05-05T02:33:18Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java

    }

-    if (!readColNames.contains(fieldName)) {
+    if (!Arrays.asList(readColNames.split(",")).contains(fieldName)) {


danny0405 · 2023-05-05T02:51:01Z

hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java


-  public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo, Option<String> preCombineKeyOpt) {
+  public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo) {
    // Need this to do merge records in HoodieRealtimeRecordReader


addRequiredProjectionFields -> addVirtualKeysProjection

hudi-bot · 2023-05-10T05:17:20Z

CI report:

f94f624 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

danny0405

+1

…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.

xicm changed the title ~~[HUDI-5308] The hive query returns null when the where condition has a partition field~~ [HUDI-5308] Hive query returns null when the where condition has a partition field Dec 3, 2022

xicm changed the title ~~[HUDI-5308] Hive query returns null when the where condition has a partition field~~ [HUDI-5308] Hive3 query returns null when the where condition has a partition field Dec 3, 2022

xicm changed the title ~~[HUDI-5308] Hive3 query returns null when the where condition has a partition field~~ [HUDI-5308] Hive3 query returns null when the where clause has a partition field Dec 3, 2022

nsivabalan added priority:blocker Production down; release blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 5, 2022

codope added priority:critical Production degraded; pipelines stalled engine:hive Hive integration and removed priority:blocker Production down; release blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022

xicm mentioned this pull request Dec 8, 2022

[HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true #6986

Closed

4 tasks

xicm changed the title ~~[HUDI-5308] Hive3 query returns null when the where clause has a partition field~~ [HUDI-5308] [testing]Hive3 query returns null when the where clause has a partition field Dec 13, 2022

xicm commented Dec 15, 2022

View reviewed changes

xicm changed the title ~~[HUDI-5308] [testing]Hive3 query returns null when the where clause has a partition field~~ [HUDI-5308] Hive3 query returns null when the where clause has a partition field Dec 15, 2022

xicm changed the title ~~[HUDI-5308] Hive3 query returns null when the where clause has a partition field~~ [HUDI-5308] Hive query returns null when the where clause has a partition field Dec 15, 2022

xicm commented Dec 15, 2022

View reviewed changes

danny0405 added release-0.13.1 release-0.14.0 labels Apr 26, 2023

xicm force-pushed the HUDI-5308 branch from 6c5e047 to 136ee7b Compare April 28, 2023 03:10

xicm force-pushed the HUDI-5308 branch from 136ee7b to ae6ad71 Compare April 28, 2023 07:36

xicm force-pushed the HUDI-5308 branch from ae6ad71 to e371363 Compare April 28, 2023 13:07

xicm changed the title ~~[HUDI-5308] Hive query returns null when the where clause has a partition field~~ [HUDI-5308] Hive3 query returns null when the where clause has a partition field Apr 28, 2023

danny0405 reviewed May 3, 2023

View reviewed changes

danny0405 reviewed May 5, 2023

View reviewed changes

...adoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java Show resolved Hide resolved

danny0405 reviewed May 5, 2023

View reviewed changes

yihua removed the release-0.13.1 label May 6, 2023

xicm force-pushed the HUDI-5308 branch from 94e2ecd to 3abe97c Compare May 8, 2023 09:43

[HUDI-5308] Partition query in hive3 returns null

2aed142

xicm force-pushed the HUDI-5308 branch from 3abe97c to 2aed142 Compare May 9, 2023 01:40

remove unused import

f94f624

danny0405 approved these changes May 10, 2023

View reviewed changes

danny0405 merged commit 15cd052 into apache:master May 10, 2023

yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023

[HUDI-5308] Hive3 query returns null when the where clause has a part…

bd45556

…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.

yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023

[HUDI-5308] Hive3 query returns null when the where clause has a part…

1893eb7

…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.

yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023

[HUDI-5308] Hive3 query returns null when the where clause has a part…

30d497a

…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.

Zouxxyy mentioned this pull request Sep 28, 2023

[HUDI-6899] Fix hive read schema evolution table with partition field #9798

Closed

4 tasks

xicm mentioned this pull request Jan 29, 2024

[HUDI-6787] Integrate the new file group reader with Hive query engine #10422

Merged

4 tasks

[HUDI-5308] Hive3 query returns null when the where clause has a partition field #7355

[HUDI-5308] Hive3 query returns null when the where clause has a partition field #7355

Uh oh!

Conversation

xicm commented Dec 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

Uh oh!

xicm commented Dec 2, 2022

Uh oh!

xicm Dec 15, 2022

Choose a reason for hiding this comment

Uh oh!

danny0405 Apr 26, 2023

Choose a reason for hiding this comment

Uh oh!

xicm Dec 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danny0405 May 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xicm commented Jan 3, 2023

Uh oh!

bvaradar commented Mar 6, 2023

Uh oh!

xicm commented Mar 7, 2023

Uh oh!

danny0405 commented Apr 19, 2023

Uh oh!

danny0405 commented Apr 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xicm commented Apr 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danny0405 commented Apr 24, 2023

Uh oh!

xicm commented Apr 24, 2023

Uh oh!

xicm commented Apr 24, 2023

Uh oh!

xicm commented Apr 24, 2023

Uh oh!

xicm commented Apr 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codope commented Apr 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xicm commented Apr 28, 2023

Uh oh!

codope commented Apr 28, 2023

Uh oh!

xicm commented Apr 28, 2023

Uh oh!

danny0405 commented Apr 30, 2023

Uh oh!

xicm commented Apr 30, 2023

Uh oh!

danny0405 commented May 2, 2023

Uh oh!

xicm commented May 2, 2023

Uh oh!

danny0405 commented May 3, 2023

Uh oh!

danny0405 May 3, 2023

Choose a reason for hiding this comment

Uh oh!

danny0405 May 3, 2023

Choose a reason for hiding this comment

Uh oh!

danny0405 May 3, 2023

Choose a reason for hiding this comment

xicm commented Dec 1, 2022 •

edited

Loading

xicm Dec 15, 2022 •

edited

Loading

danny0405 May 3, 2023 •

edited

Loading

danny0405 commented Apr 24, 2023 •

edited

Loading

xicm commented Apr 24, 2023 •

edited

Loading

xicm commented Apr 24, 2023 •

edited

Loading

codope commented Apr 28, 2023 •

edited

Loading