[HUDI-4365] Fixing URL-encoding in Bulk Insert row-writing path #6049

alexeykudinkin · 2022-07-05T20:58:37Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

Currently, when doing a bulk insert with partition-path values that contain slashes, the data is laid out incorrectly: the partition path is not URL-encoded, even though URL-encoding is set to true.

This fix is purely duct tape until the issue is properly addressed by HUDI-3993.
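
For readers unfamiliar with the symptom, here is a minimal sketch of why an unencoded slash changes the layout. It is not Hudi's implementation: `java.net.URLEncoder` is assumed as a stand-in for Hudi's own escaping routine, and the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from the Hudi codebase.
class PartitionPathEncodingSketch {

  // Returns the relative directory a record would land in under the table base path.
  static String partitionDir(String partitionValue, boolean urlEncode) {
    // URLEncoder is only a stand-in for Hudi's own escaping logic (assumption).
    return urlEncode
        ? URLEncoder.encode(partitionValue, StandardCharsets.UTF_8)
        : partitionValue;
  }

  // Counts how many directory levels the relative path spans.
  static int depth(String relativePath) {
    return relativePath.split("/").length;
  }

  public static void main(String[] args) {
    String value = "2022/07/05";
    String raw = partitionDir(value, false);
    String encoded = partitionDir(value, true);
    // Without encoding, each slash becomes an extra directory level.
    System.out.println(raw + " spans " + depth(raw) + " levels");
    System.out.println(encoded + " spans " + depth(encoded) + " level(s)");
  }
}
```

With encoding enabled, the whole value stays in a single directory name; without it, the slashes are interpreted as path separators and the partition is split across nested directories.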

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

alexeykudinkin · 2022-07-06T23:18:54Z

@hudi-bot run azure

codope · 2022-07-09T13:05:06Z

...-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java

+          boolean hiveStylePartitioningEnabled = writeConfig.isHiveStylePartitioningEnabled();
+
+          partitionPath = KeyGenUtils.handlePartitionPathDecoration(partitionPathField,
+              partitionPathValue == null ? null : partitionPathValue.toString(), shouldURLEncodePartitionPath, hiveStylePartitioningEnabled);


Consider passing Option.of(partitionPathValue) instead of null

Yeah, I did that initially, but then decided to optimize it out, given that this is a hot path.
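
The trade-off discussed in this thread can be sketched as follows. This is a hypothetical illustration, not the body of `KeyGenUtils.handlePartitionPathDecoration`: `java.util.Optional` stands in for Hudi's `Option`, `URLEncoder` stands in for Hudi's escaping, and the placeholder used for missing values is an assumption.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch; method names and the placeholder value are assumptions.
class NullableVsOptionalSketch {

  // Placeholder for whatever the real code substitutes for a missing value (assumption).
  static final String MISSING_VALUE_PLACEHOLDER = "default";

  // Hot-path variant: accepts a nullable String, so no wrapper object is created per record.
  static String decorateNullable(String field, String value, boolean urlEncode, boolean hiveStyle) {
    String v = (value == null) ? MISSING_VALUE_PLACEHOLDER : value;
    if (urlEncode) {
      v = URLEncoder.encode(v, StandardCharsets.UTF_8);
    }
    // Hive-style partitioning prefixes the value with "<field>=".
    return hiveStyle ? field + "=" + v : v;
  }

  // Reviewer-suggested shape: same result, but every call site wraps the value first.
  static String decorateOptional(String field, Optional<String> value, boolean urlEncode, boolean hiveStyle) {
    return decorateNullable(field, value.orElse(null), urlEncode, hiveStyle);
  }

  public static void main(String[] args) {
    Object raw = "2022/07/05";
    // Mirrors the null-check-then-toString pattern from the diff above.
    String asString = (raw == null) ? null : raw.toString();
    System.out.println(decorateNullable("dt", asString, true, true));
    System.out.println(decorateOptional("dt", Optional.ofNullable(asString), true, true));
  }
}
```

Both variants produce the same string; the nullable one simply avoids constructing a wrapper for every row, which is the optimization the author refers to.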

codope · 2022-07-09T13:05:53Z

...atasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java

+      if (shouldURLEncodePartitionPath || isHiveStylePartitioned) {
+        sqlContext.udf().register(
+            partitionPathDecorationUDFName,
+            (UDF1<String, String>) partitionPathValue ->


So, I assume this UDF registration would be gone after HUDI-3993?

codope · 2022-07-09T13:06:22Z

hudi-common/src/test/java/org/apache/hudi/common/testutils/FileSystemTestUtils.java

  public static final String FILE_SCHEME = "file";
  public static final String COLON = ":";
-  public static final Random RANDOM = new Random();
+  public static final Random RANDOM = new Random(0xDEED);
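
As a small illustration of what the fixed seed in this diff buys (the related commit is titled "Fixed seeds"), the sketch below shows that two `Random` instances created with the same seed produce the same sequence. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of seeded vs. unseeded test randomness.
class SeededRandomSketch {

  // Draws the first n bounded ints from a generator created with the given seed.
  static int[] firstInts(long seed, int n) {
    Random random = new Random(seed);
    int[] out = new int[n];
    for (int i = 0; i < n; i++) {
      out[i] = random.nextInt(1000);
    }
    return out;
  }

  public static void main(String[] args) {
    // Same seed, same sequence: a failing test can be replayed with identical data.
    System.out.println(Arrays.equals(firstInts(0xDEED, 5), firstInts(0xDEED, 5)));
  }
}
```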



alexeykudinkin · 2022-07-13T00:41:49Z

@hudi-bot run azure

alexeykudinkin · 2022-07-13T01:13:08Z

@hudi-bot run azure

alexeykudinkin · 2022-07-13T21:00:02Z

@hudi-bot run azure

TengHuo · 2022-07-14T08:09:20Z

...atasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java

+            DataTypes.StringType);
+
+        rowDatasetWithRecordKeysAndPartitionPath =
+            rows.withColumn(HoodieRecord.RECORD_KEY_METADATA_FIELD, functions.col(recordKeyFields).cast(DataTypes.StringType))


In ComplexKeyGenerator.getRecordKey(Row row), we set prefixFieldName to true in the call RowKeyGeneratorHelper.getRecordKeyFromRow(row, getRecordKeyFields(), recordKeySchemaInfo, true), so with ComplexKeyGenerator the record key is prefixed with the record key field name.

As I understand it, we use withColumn on recordKeyFields here, so RECORD_KEY_METADATA_FIELD gets the same value as the original recordKeyFields column, with no prefix even when the key generator is ComplexKeyGenerator. Will that cause a problem?

@TengHuo That's a good point. It can be a problem if you mix write operation types or switch the row-writing config for a table. I would suggest filing another JIRA ticket to make the behavior consistent across write paths. I don't deem it a blocker, but it would be good to keep it consistent.

okay, got it

Then I think there will be a duplicate-data issue if users upgrade from 0.10 or an older version when they have configured only one column as the record key and use ComplexKeyGenerator. It is the same issue as when upgrading from 0.10 to 0.11.
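
The duplicate-data concern raised in this thread can be sketched in plain Java. The snippet assumes, for illustration only, that the prefixed key has the form `<field>:<value>` and models the table as a map keyed by record key; all names are hypothetical and nothing here calls Hudi code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how two key formats for the same row defeat upsert de-duplication.
class RecordKeyPrefixSketch {

  // Key with the field-name prefix (assumed format "<field>:<value>").
  static String prefixedKey(String field, String value) {
    return field + ":" + value;
  }

  // Key as produced by copying the column value directly, with no prefix.
  static String plainKey(String value) {
    return value;
  }

  // Simulates upserts into a table indexed by record key and returns the row count.
  static int rowCountAfterUpserts(String... recordKeys) {
    Map<String, String> table = new HashMap<>();
    for (String key : recordKeys) {
      table.put(key, "payload"); // same key overwrites, different key inserts
    }
    return table.size();
  }

  public static void main(String[] args) {
    // The same logical row written once per key format ends up stored twice.
    System.out.println(rowCountAfterUpserts(prefixedKey("id", "42"), plainKey("42")));
    // Written twice with a consistent key format, it is stored once.
    System.out.println(rowCountAfterUpserts(prefixedKey("id", "42"), prefixedKey("id", "42")));
  }
}
```

In this model, mixing the two formats for one logical row yields two entries, which is the kind of duplication described above when write paths disagree on the key format.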

hudi-bot · 2022-07-14T08:19:47Z

CI report:

be896fe Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codope

LGTM
@alexeykudinkin Can you please rebase? CI instability was fixed recently.

alexeykudinkin · 2022-07-20T22:43:33Z

Closing in favor of #5523

codope reviewed Jul 9, 2022


codope self-assigned this Jul 9, 2022

codope added priority:blocker Production down; release blocker writer-core labels Jul 9, 2022

Alexey Kudinkin added 3 commits July 12, 2022 15:10

Duct-taped URL-encoding handling in bulk-inserting seq

25dbc1a

Fixed seeds

cfe2527

Fixed HoodieDatasetBulkInsertHelper to properly decorate partition-…

1155e72

…path as well; Cleaned up `BulkInsertDataInternalWriterHelper` to leverage the same sequence for partition-path handling

alexeykudinkin force-pushed the ak/blk-ins-url-enc-fix branch from 48dc9f1 to 06d50fd Compare July 12, 2022 22:11

alexeykudinkin mentioned this pull request Jul 13, 2022

[SUPPORT]'hoodie.datasource.write.hive_style_partitioning':'true' does not take effect in hudi-0.11.1 & spark 3.2.1 #6070

Closed

Tidying up

be896fe

alexeykudinkin force-pushed the ak/blk-ins-url-enc-fix branch from 06d50fd to be896fe Compare July 14, 2022 07:00

TengHuo mentioned this pull request Jul 14, 2022

[HUDI-4384] fix hive style partition and record key prefix missing in bulk_insert #6085

Closed

5 tasks

TengHuo reviewed Jul 14, 2022


codope approved these changes Jul 16, 2022


codope removed the priority:blocker Production down; release blocker label Jul 20, 2022

alexeykudinkin closed this Jul 20, 2022
