Conversation

@xicm (Contributor) commented Oct 18, 2022

Change Logs

If hoodie.datasource.write.drop.partition.columns=true, the log reader does not add partition values to the record, so a query with a partition column in the WHERE clause returns empty results.

This PR adds the partition values extracted from the file path back to the record.
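
The fix relies on deriving the partition path relative to the table base path from the log file's location (the Spark-side change calls getRelativePartitionPath for this). A minimal, self-contained sketch of that idea, using made-up paths rather than the PR's actual code:

```java
import org.apache.hadoop.fs.Path;

// Illustrative sketch: derive a log file's partition path relative to the
// table base path, so the dropped partition column values can be re-attached
// to the records read from that log file.
public class RelativePartitionPathExample {

  static String relativePartitionPath(Path basePath, Path logFilePath) {
    String base = basePath.toUri().getPath();
    String parent = logFilePath.getParent().toUri().getPath();
    // non-partitioned table: the log file sits directly under the base path
    return parent.equals(base) ? "" : parent.substring(base.length() + 1);
  }

  public static void main(String[] args) {
    Path base = new Path("/tmp/hudi_table");
    Path log = new Path("/tmp/hudi_table/dt=2022-10-18/.f1_20221018.log.1");
    System.out.println(relativePartitionPath(base, log)); // prints dt=2022-10-18
  }
}
```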

Impact

hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java

Risk level (write none, low, medium, or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@YannByron (Contributor) commented:

@xicm it would be better to have some UTs to show the effect of these changes.

@xicm (Contributor, Author) commented Oct 19, 2022

> @xicm it would be better to have some UTs to show the effect of these changes.

Hi @YannByron, I added a UT, please review when you are free. :)

@xicm (Contributor, Author) commented Oct 21, 2022

@hudi-bot run azure

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs reader-core labels Oct 21, 2022
Pair.of(tableConfig.getRecordKeyFieldProp(), tableConfig.getPartitionFieldProp()));
}
this.partitionName = partitionName;
this.partitionValues = getPartitionValues();
Contributor review comment:

It's not a good practice to call non-static methods from the ctor, since it relies implicitly on the ordering of the initialization. Let's instead make this method static and pass it all values it needs.
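
A minimal sketch of the pattern being suggested, with illustrative names rather than the reader's actual fields:

```java
// Compute derived state in a static helper that receives everything it needs
// explicitly, instead of calling an instance method from the constructor
// (which implicitly depends on field-initialization order).
class ReaderStateExample {
  private final String partitionName;
  private final String[] partitionValues;

  ReaderStateExample(String partitionName) {
    this.partitionName = partitionName;
    this.partitionValues = splitPartitionValues(partitionName);
  }

  private static String[] splitPartitionValues(String partitionName) {
    return (partitionName == null || partitionName.isEmpty())
        ? new String[0]
        : partitionName.split("/");
  }
}
```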

if (this.hiveStylePartition) {
return Option.of(Arrays.stream(partitionValues)
.map(partition -> partition.split("="))
.filter(partition -> partition.length == 2)
Contributor review comment:

Why do we filter on a specific length here?
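
For context, a hive-style partition path encodes each column as a name=value segment, so a well-formed segment splits on "=" into exactly two parts; a rough illustration with a made-up path (not the PR's code):

```java
public class HiveStylePartitionSplitExample {
  public static void main(String[] args) {
    // two hive-style segments: "country=US" and "dt=2022-10-18"
    for (String segment : "country=US/dt=2022-10-18".split("/")) {
      String[] parts = segment.split("=");
      // length == 2 only for a well-formed "name=value" segment; a segment
      // without "=" (e.g. "2022" in a non-hive-style path) would be skipped
      if (parts.length == 2) {
        System.out.println(parts[0] + " -> " + parts[1]);
      }
    }
  }
}
```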

String[] partitionFields = this.partitionFields.get();
String[] partitionValues = this.partitionValues.get();

if (partitionFields.length == partitionValues.length) {
Contributor review comment:

This condition should be asserted on during init
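
A rough sketch of what asserting this during initialization could look like, as a plain Java check (Hudi may prefer its own validation utilities):

```java
final class PartitionSpecValidation {
  private PartitionSpecValidation() {}

  // Fail fast at construction time instead of silently skipping the partition
  // columns later when the field and value counts don't match.
  static void validate(String[] partitionFields, String[] partitionValues) {
    if (partitionFields.length != partitionValues.length) {
      throw new IllegalArgumentException(String.format(
          "Partition fields and values differ in length: %d fields vs %d values",
          partitionFields.length, partitionValues.length));
    }
  }
}
```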

// Append is not supported in LocalFileSystem. HDFS needs to be setup.
Configuration conf = new Configuration();
// lower heartbeat interval for fast recognition of DN
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, dfsBaseDir.getAbsolutePath());
Contributor review comment:

I don't think we really need HDFS for this test. What's the idea behind doing it on HDFS?


if (partitionFields.length == partitionValues.length) {
for (int i = 0; i < partitionValues.length; i++) {
record.put(schema.getField(partitionFields[i]).pos(), partitionValues[i]);
Contributor review comment:

We can't just blindly inject strings into the record; we need to make sure we coerce them to the type of the column.
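
A minimal sketch of the kind of coercion being asked for, assuming the value arrives as a string parsed from the partition path and the target type comes from the record's Avro schema (the helper name is hypothetical, not the PR's code):

```java
import org.apache.avro.Schema;

final class PartitionValueCoercionExample {
  private PartitionValueCoercionExample() {}

  // Convert the string parsed from the partition path into the type the Avro
  // schema declares for that column before putting it into the record.
  static Object coerce(Schema fieldSchema, String value) {
    switch (fieldSchema.getType()) {
      case STRING:  return value;
      case INT:     return Integer.parseInt(value);
      case LONG:    return Long.parseLong(value);
      case FLOAT:   return Float.parseFloat(value);
      case DOUBLE:  return Double.parseDouble(value);
      case BOOLEAN: return Boolean.parseBoolean(value);
      default:
        throw new IllegalArgumentException(
            "Unsupported partition column type: " + fieldSchema.getType());
    }
  }
}
```

The call site would then look roughly like record.put(schema.getField(partitionFields[i]).pos(), PartitionValueCoercionExample.coerce(schema.getField(partitionFields[i]).schema(), partitionValues[i])); nullable columns would additionally need the non-null branch of their union schema unwrapped first.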

@xicm (Contributor, Author) commented Oct 25, 2022

@alexeykudinkin Thanks for your advice. I updated the PR, please review again when you are free.

@xicm xicm requested a review from alexeykudinkin October 27, 2022 03:23
@alexeykudinkin (Contributor) left a comment:

Thanks for addressing the feedback @xicm!

What I've been thinking about in the background is that ideally we'd want to leverage the logic already implemented in SparkParsePartitionUtil, for example by making it more generic (i.e. engine-agnostic).

switch (newSchema.getType()) {
case NULL:
case BOOLEAN:
if (oldSchema.getType() == Schema.Type.STRING) {
Contributor review comment:

Appreciate your approach of extending existing functionality, but I think in this case we actually shouldn't be doing that -- the conversions we're doing here aren't "canonical" (there's no permitted conversion from string to int in Spark, for example), and as such we'd rather keep this conversion limited in scope so it applies only to partition values, where it's both inevitable and actually sensible (since we know we're not trying to convert some random string to an int, but an int that was previously converted to a string).

Contributor review comment:

Also, I'd suggest you check out Spark3ParsePartitionUtil.

@xicm xicm changed the title [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true [WIP][HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true Oct 31, 2022
@xicm (Contributor, Author) commented Nov 2, 2022

Hi @alexeykudinkin,
I created a new model, HoodieLogFileWithPartition, which extends HoodieLogFile (is this a good approach?), initialized partitionValues in buildSplits in MergeOnReadSnapshotRelation, and finally passed partitionValues to AbstractHoodieLogRecordReader.

Please review when you are available. If any part of the implementation is unreasonable, let me know and I will fix it. Thanks. :)

@xicm xicm changed the title [WIP][HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true Nov 2, 2022
@xicm (Contributor, Author) commented Nov 2, 2022

@hudi-bot run azure

@hudi-bot (Collaborator) commented Nov 2, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xicm xicm requested a review from alexeykudinkin November 2, 2022 05:43
}

@SuppressWarnings("unchecked")
protected HoodieMergedLogRecordScanner(FileSystem fs, String basePath, List<String> logFilePaths, Schema readerSchema,
Contributor review comment:

Let's avoid copying the whole ctor. Instead, generalize the existing one and redirect this one to it.
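
A minimal sketch of the delegation pattern being suggested, with a made-up (and much shorter) parameter list than the real scanner constructor:

```java
import java.util.List;

class ScannerCtorExample {
  private final String basePath;
  private final List<String> logFilePaths;
  private final Object[] partitionValues;

  // Existing constructor, generalized to accept the new argument.
  ScannerCtorExample(String basePath, List<String> logFilePaths, Object[] partitionValues) {
    this.basePath = basePath;
    this.logFilePaths = logFilePaths;
    this.partitionValues = partitionValues;
  }

  // Old signature kept as-is; redirects instead of duplicating the body.
  ScannerCtorExample(String basePath, List<String> logFilePaths) {
    this(basePath, logFilePaths, new Object[0]);
  }
}
```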

// Use scanV2 method.
private boolean useScanV2 = false;

private Object[] partitionValues;
Contributor review comment:

Let's create an object akin to the PartitionPath we have in Spark.
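
A sketch of the kind of small value object being suggested; the name and fields are illustrative, not an existing Hudi class:

```java
import java.util.Arrays;

// Immutable holder pairing the relative partition path with the values parsed
// from it, instead of passing a bare Object[] through the scanner builder.
final class PartitionSpec {
  private final String partitionPath;
  private final Object[] values;

  PartitionSpec(String partitionPath, Object[] values) {
    this.partitionPath = partitionPath;
    this.values = values.clone();
  }

  String getPartitionPath() {
    return partitionPath;
  }

  Object[] getValues() {
    return values.clone();
  }

  @Override
  public String toString() {
    return partitionPath + " -> " + Arrays.toString(values);
  }
}
```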


val partitionedBaseFile = baseFile.map { file =>
val filePath = getFilePath(file.getFileStatus.getPath)
PartitionedFile(getPartitionColumnsAsInternalRow(file.getFileStatus), filePath, 0, file.getFileLen)
Contributor review comment:

Let's also update getPartitionColumnsAsInternalRow to reuse parsePartitionColumnValues instead

}

HoodieMergeOnReadFileSplit(partitionedBaseFile, logFiles)
if (shouldExtractPartitionValuesFromPartitionPath && logFiles.nonEmpty) {
Contributor review comment:

I don't think we actually need this -- log files are always co-located with the base file, and as such we can extract partition values from there.

logRecordScannerBuilder.withPartition(
getRelativePartitionPath(new Path(tableState.tablePath), logFiles.head.getPath.getParent))

if (logFiles.head.isInstanceOf[HoodieLogFileWithPartition]) {
Contributor review comment:

Please check my comment below


package org.apache.hudi.common.model;

public class HoodieLogFileWithPartition extends HoodieLogFile {
Contributor review comment:

Please check my comment below regarding this class

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope removed the release-0.12.2 Patches targetted for 0.12.2 label Dec 7, 2022
@xicm (Contributor, Author) commented Dec 8, 2022

The log reader issue is caused by setting "hoodie.datasource.write.drop.partition.columns=true", and the purpose of that setting is to work around a Hive query problem: a Hive3 query returns null when the WHERE clause contains a partition field, and setting "hoodie.datasource.write.drop.partition.columns=true" avoids that.
I created a new PR #7355 to solve the Hive query problem, so I think we can close this PR.


Labels

priority:high Significant impact; potential bugs

Projects

Status: Awaiting Triage
Archived in project


6 participants