[HUDI-314] Fix multi partition keys error when querying a realtime table #978
Conversation
  String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
  List<String> partitioningFields =
-     partitionFields.length() > 0 ? Arrays.stream(partitionFields.split(",")).collect(Collectors.toList())
+     partitionFields.length() > 0 ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
If there are multiple partition keys, they should also be separated by a ",". Can you paste an example of multiple partition fields that you noticed being passed with a "/"?
Here is the example code I used:
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF on a Seq (pre-imported in spark-shell)

val tableName = "hudi_multi_partitions_test"
val tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
val tableType = "MERGE_ON_READ"

val inputDF2 = Seq(
  ("100", "event_name_897", "2015-01-01T23:52:39.340396Z", "type1", "2015", "01", "01"),
  ("101", "event_name_236", "2015-01-01T22:14:58.597216Z", "type2", "2015", "01", "01"),
  ("104", "event_name_764", "2015-02-01T12:15:00.512679Z", "type1", "2015", "01", "01"),
  ("105", "event_name_675", "2015-02-01T13:51:42.248818Z", "type2", "2015", "01", "01"),
  ("106", "event_name_337", "2015-02-01T13:51:42.248818Z", "type2", "2015", "03", "16"),
  ("107", "event_name_452", "2015-02-01T13:51:42.248818Z", "type2", "2015", "03", "16"),
  ("108", "event_name_234", "2015-02-01T13:51:42.248818Z", "type2", "2015", "03", "16"),
  ("199", "event_name_011", "2015-02-01T13:51:42.248818Z", "type2", "2015", "03", "16")
).toDF("_row_key", "event_name", "timestamp", "event_type", "year", "month", "day")

inputDF2.write.format("org.apache.hudi")
  .option("hoodie.insert.shuffle.parallelism", "2")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year,month,day")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.ComplexKeyGenerator")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "year,month,day")
  .mode(SaveMode.Append)
  .save(tablePath)
If you then run a query against this realtime table in Hive, it fails with:
Caused by: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.avro.Schema.validateName(Schema.java:1083) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema.access$200(Schema.java:79) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:372) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:367) ~[avro-1.7.7.jar:1.7.7]
at org.apache.hudi.common.util.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:166) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.addPartitionFields(AbstractRealtimeRecordReader.java:305) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:328) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:103) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:233) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
Also, in the Hive log you can see a snippet of the printed job configuration like this:
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem, yarn.nodemanager.windows-container.memory-limit.enabled=false, yarn.nodemanager.remote-app-log-dir=/var/log/hadoop-yarn/apps, mapreduce.reduce.shuffle.retry-delay.max.ms=60000, io.map.index.interval=128, partition_columns=year/month/day
The last entry is partition_columns=year/month/day, and partitionFields is read from exactly that value via String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
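A minimal standalone Java sketch (not Hudi code; the value "year/month/day" is taken from the log line above) showing why splitting on "," fails here:

import java.util.Arrays;
import java.util.List;

public class PartitionSplitDemo {
  public static void main(String[] args) {
    // Value Hive places in the JobConf under "partition_columns"
    String partitionFields = "year/month/day";

    // Splitting on "," leaves a single element, "year/month/day", which Avro
    // then rejects as an illegal field name (see the stack trace above)
    List<String> wrong = Arrays.asList(partitionFields.split(","));
    System.out.println(wrong); // [year/month/day]

    // Splitting on "/" recovers the three partition columns
    List<String> right = Arrays.asList(partitionFields.split("/"));
    System.out.println(right); // [year, month, day]
  }
}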
Thanks @zhedoubushishi for the detailed information. This looks fine to me. Can you also add an integration test for a MOR table supporting multi-partition keys? You can look at the class ITTestHoodieSanity.java; currently, it does not have a test case for MOR tables. You can also look at HoodieJavaApp.java, which is called by ITTestHoodieSanity.java, to get an idea. Let me know if you need any help and we can help you on this.
BTW, good catch. Thanks for your contribution :)
Thanks for your suggestions. I have already worked on adding MOR tests, but the tests won't pass because of some exceptions when querying the realtime table, so I am trying to fix those errors first. (There is another related PR: #972.)
@zhedoubushishi: Added a comment in #972. Regarding this PR, if you are stuck, share the code in this PR and we can help point you in the right direction.
Balaji.V
+1
> @zhedoubushishi: Added a comment in #972. Regarding this PR, if you are stuck, share the code in this PR and we can help point you in the right direction.
> Balaji.V

Testing on a MOR table is added in this PR.
vinothchandar
left a comment
@bvaradar as well to chime in..
umehrot2
left a comment
Good catch, Wenning! Maybe we can publish tests for MOR tables in this PR itself?
Test is added.
@bvaradar please review and merge.
bvaradar
left a comment
Looks clean and great. Awesome job :)
Jira: https://jira.apache.org/jira/browse/HUDI-314
Hive uses a slash to join multiple partition keys, e.g. "year/month/day".
I also checked different versions of Hive; it seems that all of them use a slash, and I did not see any reason a comma should be used here.
Manual testing shows this fix works.
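For illustration, a self-contained sketch of the parsing logic after this change, simplified from the diff above (the class and method names here are made up for the demo; the real logic lives in Hudi's realtime record reader initialization):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionFieldsParser {

  // Hive joins multiple partition column names with "/", not ",",
  // so the partition_columns value must be split on "/"
  static List<String> parsePartitionFields(String partitionFields) {
    return partitionFields.length() > 0
        ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
        : Collections.emptyList();
  }

  public static void main(String[] args) {
    System.out.println(parsePartitionFields("year/month/day")); // [year, month, day]
    System.out.println(parsePartitionFields(""));               // []
  }
}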