Skip to content

Conversation

@xicm
Copy link
Contributor

@xicm xicm commented Dec 1, 2022

Change Logs

We add partition fields to parquet file, while parquet file of hive doesn't, this will result returning null when the where clause has a partition field.

This pr add partition field id to hive.io.file.readcolumn.ids.

Details:
When hive queries, hive constructs a FilterPredicate. The FilterPredicate has a ValueInspector, which is initialized from hive.io.file.readcolumn.ids. If partition fields are in where clause and hive.io.file.readcolumn.ids does not contain partition fields, FilterPredicate won't have a ValueInspector, FilterPredicate will not set isKnown, and getCurrentRecord will return null.

Some hive code

    public void addBinary(Binary value) {
        ValueInspector[] var2 = this.valueInspectors;
        int var3 = var2.length;

        for(int var4 = 0; var4 < var3; ++var4) {
            ValueInspector valueInspector = var2[var4];
            valueInspector.update(value);  // isKnown is updated here
        }

        this.delegate.addBinary(value);
    }
            


    new ValueInspector() {
         public void updateNull() {
              this.setResult(false);
         }

          public void update(float value) {
               this.setResult(value == target);
          }
   };

      protected final void setResult(boolean result) {
            if (this.isKnown) {
                throw new IllegalStateException("setResult() called on a ValueInspector whose result is already known! Did you forget to call reset()?");
            } else {
                this.result = result;
                this.isKnown = true;
            }
        }




    public T getCurrentRecord() {
        boolean keep = IncrementallyUpdatedFilterPredicateEvaluator.evaluate(this.filterPredicate);
        IncrementallyUpdatedFilterPredicateResetter.reset(this.filterPredicate);
        return keep ? this.delegate.getCurrentRecord() : null;
    }

    public static boolean evaluate(IncrementallyUpdatedFilterPredicate pred) {
        Preconditions.checkNotNull(pred, "pred");
        return pred.accept(INSTANCE);
    }

    public boolean visit(ValueInspector p) {
        if (!p.isKnown()) { // default value of isKnown is false
            p.updateNull();
        }

        return p.getResult();
    }

Impact

HoodieParquetInputFormat.java
HoodieParquetRealtimeInputFormat.java

Risk level (write none, low medium or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xicm
Copy link
Contributor Author

xicm commented Dec 2, 2022

@hudi-bot run azure

@xicm xicm changed the title [HUDI-5308] The hive query returns null when the where condition has a partition field [HUDI-5308] Hive query returns null when the where condition has a partition field Dec 3, 2022
@xicm xicm changed the title [HUDI-5308] Hive query returns null when the where condition has a partition field [HUDI-5308] Hive3 query returns null when the where condition has a partition field Dec 3, 2022
@xicm xicm changed the title [HUDI-5308] Hive3 query returns null when the where condition has a partition field [HUDI-5308] Hive3 query returns null when the where clause has a partition field Dec 3, 2022
@nsivabalan nsivabalan added priority:blocker Production down; release blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 5, 2022
@codope codope added priority:critical Production degraded; pipelines stalled engine:hive Hive integration and removed priority:blocker Production down; release blocker release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@xicm xicm changed the title [HUDI-5308] Hive3 query returns null when the where clause has a partition field [HUDI-5308] [testing]Hive3 query returns null when the where clause has a partition field Dec 13, 2022
// Make the HDFS dataset non-hoodie and run the same query; Checks for interoperability with non-hoodie tables
// Delete Hoodie directory to make it non-hoodie dataset
executeCommandStringInDocker(ADHOC_1_CONTAINER, "hdfs dfs -rm -r " + hdfsPath + "/.hoodie", true);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need meta table when we query hive now, this case doesn't fit anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta table when we query hive now

For meta table do you mean the metadata table?

@xicm xicm changed the title [HUDI-5308] [testing]Hive3 query returns null when the where clause has a partition field [HUDI-5308] Hive3 query returns null when the where clause has a partition field Dec 15, 2022
@xicm xicm changed the title [HUDI-5308] Hive3 query returns null when the where clause has a partition field [HUDI-5308] Hive query returns null when the where clause has a partition field Dec 15, 2022
HoodieTableConfig tableConfig = metaClient.getTableConfig();
addProjectionToJobConf(realtimeSplit, jobConf, metaClient.getTableConfig().getPreCombineField());
// add partition fields to hive job conf
HoodieRealtimeInputFormatUtils.addProjectionField(jobConf, metaClient.getTableConfig().getPartitionFields());
Copy link
Contributor Author

@xicm xicm Dec 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause of this issue is we add partition fields to parquet file, while parquet file of hive doesn't,
the problem may be solved by hive via apache/hive#3742.

Copy link
Contributor

@danny0405 danny0405 May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse line 72?

@xicm
Copy link
Contributor Author

xicm commented Jan 3, 2023

@hudi-bot run azure

@bvaradar
Copy link
Contributor

bvaradar commented Mar 6, 2023

Which version of Hive has this issue ?

@xicm
Copy link
Contributor Author

xicm commented Mar 7, 2023

Which version of Hive has this issue ?

I test with hive 3.1.2, I believe hive 2.x has this issue too.

@danny0405
Copy link
Contributor

cc @xiarixiaoyao , could you help the review please?

@danny0405
Copy link
Contributor

danny0405 commented Apr 24, 2023

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

@xicm
Copy link
Contributor Author

xicm commented Apr 24, 2023

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

Hive style and none hive style all return null.

@danny0405
Copy link
Contributor

Is the problematic table partitioning in hive style: par1=val1 ? I was astonished that the partition path queries for Hive return nulls.

Hive style and none hive style all return null.

I'm skeptical about the use case, because Hive queries have been supported for a long time, can you share you hive table properties ?

@xicm
Copy link
Contributor Author

xicm commented Apr 24, 2023

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `test_partition`(            |
|   `_hoodie_commit_time` string,                    |
|   `_hoodie_commit_seqno` string,                   |
|   `_hoodie_record_key` string,                     |
|   `_hoodie_partition_path` string,                 |
|   `_hoodie_file_name` string,                      |
|   `id` bigint,                                     |
|   `name` string,                                   |
|   `ts` bigint)                                     |
| PARTITIONED BY (                                   |
|   `part` string)                                   |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'path'='hdfs://xxxxxxx/tmp/hoodie/test_partition')  |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://xxxxxxxxxx/tmp/hoodie/test_partition' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'hoodie.datasource.hive_sync.enable'='true',     |
|   'hoodie.datasource.hive_sync.metastore.uris'='thrift://xxxxxxxx:9083',  |
|   'hoodie.datasource.hive_sync.mode'='HMS',        |
|   'hoodie.datasource.hive_sync.partition_fields'='part',  |
|   'hoodie.datasource.hive_sync.table'='test_partition',  |
|   'hoodie.embed.timeline.server'='false',          |
|   'preCombineField'='ts',                          |
|   'primaryKey'='id',                               |
|   'spark.sql.create.version'='3.1.1',              |
|   'spark.sql.sources.provider'='hudi',             |
|   'spark.sql.sources.schema.numPartCols'='1',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"ts","type":"long","nullable":true,"metadata":{}},{"name":"part","type":"string","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='part',     |
|   'transient_lastDdlTime'='1682327384',            |
|   'type'='mor')                                    |
+----------------------------------------------------+

@xicm
Copy link
Contributor Author

xicm commented Apr 24, 2023

image

@xicm
Copy link
Contributor Author

xicm commented Apr 24, 2023

image

@xicm
Copy link
Contributor Author

xicm commented Apr 24, 2023

steps to reproduce

  1. create table in spark sql
create table test_partition (
  id bigint,
  name string,
  ts bigint,
  part string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  'hoodie.embed.timeline.server' = 'false',
  'hoodie.datasource.hive_sync.enable'='true',
  'hoodie.datasource.hive_sync.mode'='HMS',
  'hoodie.datasource.hive_sync.table'='test_partition',
  'hoodie.datasource.hive_sync.metastore.uris'='thrift://xxxxxxxxxx',
  'hoodie.datasource.hive_sync.partition_fields'='part'
)
partitioned by (part)
location '/tmp/hoodie/test_partition';
  1. insert a record
insert into test_partition partition (part)
select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as part;
  1. query in hive
select * from test_partition_ro where part = '2021-12-09';

@codope
Copy link
Member

codope commented Apr 28, 2023

@xicm I have not come across this issue before. I have queried MOR partitioned table with Hive and Presto with partition predicates and it returns the correct results for me. I even ran the steps that you shared - #7355 (comment) - but I could not reproduce (screenshot for _ro and _rt tables below). Am I missing something?

I think querying by partition is the most common thing to do and if query returned null then it would have surfaced in earlier versions too. What version of Hudi are you using? Can you also share how you connect to Hive and any config that's being set, e.g. hive.input.format. For reference, I simply used beeline as:

beeline -u jdbc:hive2://hiveserver:10000 \
>   --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat \
>   --hiveconf hive.stats.autogather=false

Screenshot below
Screenshot 2023-04-28 at 6 21 16 PM

@xicm
Copy link
Contributor Author

xicm commented Apr 28, 2023

@codope Thanks for your testing. It's so confusing. I just use hive cli without any other conf. My hive version is 3.1.2, I don't have hive2, what's your hive version? Is this a matter of version differences?

@xicm xicm changed the title [HUDI-5308] Hive query returns null when the where clause has a partition field [HUDI-5308] Hive3 query returns null when the where clause has a partition field Apr 28, 2023
@codope
Copy link
Member

codope commented Apr 28, 2023

@codope Thanks for your testing. It's so confusing. I just use hive cli without any other conf. My hive version is 3.1.2, I don't have hive2, what's your hive version? Is this a matter of version differences?

Oh ok. I am using Hive 2.3.1. But, you should try connecting using hive2 even for Hive version 3.1.2. That's the latest Hive server interface version for both the 2.3.x and 3.1.x lines.

@xicm
Copy link
Contributor Author

xicm commented Apr 28, 2023

@hudi-bot run azure

@danny0405
Copy link
Contributor

So it is because the incorrect hive server version is used ?

@xicm
Copy link
Contributor Author

xicm commented Apr 30, 2023

So it is because the incorrect hive server version is used ?

yes, partition query returns null with hive3.

@danny0405
Copy link
Contributor

So it is because the incorrect hive server version is used ?

yes, partition query returns null with hive3.

So the fix is only necessary for old hive server and it is actually not a bug?

@xicm
Copy link
Contributor Author

xicm commented May 2, 2023

So the fix is only necessary for old hive server and it is actually not a bug?

The fix is necessary for hive3.1.2, for hive2, as codepe tested , partition query is ok.

@danny0405
Copy link
Contributor

5308.patch.zip
I'm not that familiar with Hive so just created a patch to refactor the tests.

Path inputPath = ((FileSplit)split).getPath();
FileSystem fs = inputPath.getFileSystem(job);
Option<Path> tablePath = TablePathUtils.getTablePath(fs, inputPath);
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(job).setBasePath(tablePath.get().toString()).build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code work here:

HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder().setConf(jobConf).setBasePath(realtimeSplit.getBasePath()).build();

public static void addProjectionField(Configuration conf, Option<String[]> fieldName) {
if (fieldName.isPresent()) {
List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
Arrays.stream(fieldName.get()).forEach(field -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this value represents

conf.get(serdeConstants.LIST_COLUMNS

String[] fieldOrders = fieldOrdersSet.toArray(new String[0]);
List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(","))
.filter(fn -> !partitioningFields.contains(fn)).collect(Collectors.toList());
List<String> fieldNames = fieldNameCsv.isEmpty() ? new ArrayList<>() : Arrays.stream(fieldNameCsv.split(",")).collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the removal of partition fiels affect the behavior of Hive 2.x?

}

if (!readColNames.contains(fieldName)) {
if (!Arrays.asList(readColNames.split(",")).contains(fieldName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch


public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo, Option<String> preCombineKeyOpt) {
public static void addRequiredProjectionFields(Configuration configuration, Option<HoodieVirtualKeyInfo> hoodieVirtualKeyInfo) {
// Need this to do merge records in HoodieRealtimeRecordReader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addRequiredProjectionFields -> addVirtualKeysProjection

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@danny0405 danny0405 merged commit 15cd052 into apache:master May 10, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…ition field (apache#7355)

* Partition query in hive3 returns null for Hive 3.x.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…ition field (apache#7355)

* Partition query in hive3 returns null for Hive 3.x.
yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023
…ition field (apache#7355)

* Partition query in hive3 returns null for Hive 3.x.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine:hive Hive integration priority:critical Production degraded; pipelines stalled release-0.14.0

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

7 participants