-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-4365] Fixing URL-encoding in Bulk Insert row-writing path #6049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@hudi-bot run azure |
| boolean hiveStylePartitioningEnabled = writeConfig.isHiveStylePartitioningEnabled(); | ||
|
|
||
| partitionPath = KeyGenUtils.handlePartitionPathDecoration(partitionPathField, | ||
| partitionPathValue == null ? null : partitionPathValue.toString(), shouldURLEncodePartitionPath, hiveStylePartitioningEnabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider passing Option.of(partitionPathValue) instead of null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, done that initially but then decided to optimize it out given that this is a hot-path
| if (shouldURLEncodePartitionPath || isHiveStylePartitioned) { | ||
| sqlContext.udf().register( | ||
| partitionPathDecorationUDFName, | ||
| (UDF1<String, String>) partitionPathValue -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I assume this UDF registration would be gone after HUDI-3993?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
| public static final String FILE_SCHEME = "file"; | ||
| public static final String COLON = ":"; | ||
| public static final Random RANDOM = new Random(); | ||
| public static final Random RANDOM = new Random(0xDEED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
…path as well; Cleaned up `BulkInsertDataInternalWriterHelper` to leverage the same sequence for partition-path handling
48dc9f1 to
06d50fd
Compare
|
@hudi-bot run azure |
2 similar comments
|
@hudi-bot run azure |
|
@hudi-bot run azure |
06d50fd to
be896fe
Compare
| DataTypes.StringType); | ||
|
|
||
| rowDatasetWithRecordKeysAndPartitionPath = | ||
| rows.withColumn(HoodieRecord.RECORD_KEY_METADATA_FIELD, functions.col(recordKeyFields).cast(DataTypes.StringType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ComplexKeyGenerator.getRecordKey(Row row), we setup prefixFieldName as true in method RowKeyGeneratorHelper.getRecordKeyFromRow(row, getRecordKeyFields(), recordKeySchemaInfo, true), so the record key will have a prefix, which is record key field name, when we use ComplexKeyGenerator.
As I understand here, we use withColumn here for recordKeyFields, then we will get the same value in RECORD_KEY_METADATA_FIELD as the original recordKeyFields, so no prefix when key generator is ComplexKeyGenerator. Will it cause problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TengHuo That's a good point. It can be a problem if you mix the write operation type or switch row-writing config for a table. I would suggest filing another JIRA ticket to keep it consistent across. I don't deem it to be a blocker but would be good to keep it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, got it
then think it will have duplicate data issue if user upgrade from 0.10 or older version when they only setup one column as record key and use ComplexKeyGenerator . The same issue as upgrading from 0.10 to 0.11.
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@alexeykudinkin Can you please rebase? CI instability was fixed recently.
|
Closing in favor of #5523 |
Tips
What is the purpose of the pull request
Currently when doing bulk-insert using partition paths with slashes in it's being laid out incorrectly missing URL-encoding for the partition path, even though it's set to true.
This fix is purely a duct-tape until it's properly addressed by HUDI-3993
Brief change log
See above
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.