@alexeykudinkin
Contributor

Tips

What is the purpose of the pull request

Currently, when doing a bulk-insert with partition-path values that contain slashes, the data is laid out incorrectly: the partition path is not URL-encoded, even though URL encoding is set to true.

This fix is purely a stopgap until the issue is properly addressed by HUDI-3993.
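
To illustrate the layout problem described above, here is a minimal sketch of partition-path decoration. It is not the code from this PR: the class and method names are hypothetical, `java.net.URLEncoder` stands in for whatever escaping Hudi applies internally (the two may differ on some characters, such as spaces), and the `default` fallback for null values is a placeholder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PartitionPathSketch {
  // Hypothetical placeholder for null or empty values; not taken from the PR.
  static final String DEFAULT_PARTITION = "default";

  // Hypothetical helper: optionally URL-encodes the value, then optionally
  // prepends "field=" for hive-style partitioning.
  public static String decorate(String field, String value, boolean urlEncode, boolean hiveStyle) {
    String v = (value == null || value.isEmpty()) ? DEFAULT_PARTITION : value;
    if (urlEncode) {
      try {
        // Standard-library stand-in for the real escaping; '/' becomes "%2F".
        v = URLEncoder.encode(v, "UTF-8");
      } catch (UnsupportedEncodingException e) {
        throw new IllegalStateException(e);
      }
    }
    return hiveStyle ? field + "=" + v : v;
  }

  public static void main(String[] args) {
    // Without encoding, each slash creates a nested directory level.
    System.out.println(decorate("dt", "2022/07/12", false, false)); // 2022/07/12
    // With encoding, the value stays a single path segment.
    System.out.println(decorate("dt", "2022/07/12", true, false));  // 2022%2F07%2F12
    System.out.println(decorate("dt", "2022/07/12", true, true));   // dt=2022%2F07%2F12
  }
}
```

Without encoding, a value such as `2022/07/12` spans three directory levels instead of one, which is the incorrect layout this PR addresses.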

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin
Contributor Author

@hudi-bot run azure

boolean hiveStylePartitioningEnabled = writeConfig.isHiveStylePartitioningEnabled();

partitionPath = KeyGenUtils.handlePartitionPathDecoration(partitionPathField,
partitionPathValue == null ? null : partitionPathValue.toString(), shouldURLEncodePartitionPath, hiveStylePartitioningEnabled);
Member

Consider passing Option.of(partitionPathValue) instead of null

Contributor Author

Yeah, I did that initially, but then decided to optimize it out given that this is a hot path.
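
The trade-off discussed in this thread can be sketched as follows. This is only an illustration: `java.util.Optional` stands in for Hudi's own `Option` type, and both method names are hypothetical. The two variants return the same result, but the wrapper-based one may allocate an extra object per non-null value, which can matter when it runs once per row.

```java
import java.util.Optional;

class NullVersusOptionalSketch {
  // Wrapper-based variant: creates an Optional for every non-null value.
  public static String toStringViaOptional(Object value) {
    return Optional.ofNullable(value).map(Object::toString).orElse(null);
  }

  // Plain null check, as in the reviewed snippet: same result, no wrapper object.
  public static String toStringViaNullCheck(Object value) {
    return value == null ? null : value.toString();
  }

  public static void main(String[] args) {
    System.out.println(toStringViaOptional(42));    // 42
    System.out.println(toStringViaNullCheck(42));   // 42
    System.out.println(toStringViaOptional(null));  // null
    System.out.println(toStringViaNullCheck(null)); // null
  }
}
```

Whether the allocation is actually eliminated by the JIT depends on the runtime, so any real saving would need to be measured.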

if (shouldURLEncodePartitionPath || isHiveStylePartitioned) {
sqlContext.udf().register(
partitionPathDecorationUDFName,
(UDF1<String, String>) partitionPathValue ->
Member

So, I assume this UDF registration would be gone after HUDI-3993?

Contributor Author

Correct

public static final String FILE_SCHEME = "file";
public static final String COLON = ":";
- public static final Random RANDOM = new Random();
+ public static final Random RANDOM = new Random(0xDEED);
Member

+1
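
The thread does not say why the seed was fixed; assuming the goal is reproducible test data, the sketch below shows the effect of seeding `java.util.Random`. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class SeededRandomSketch {
  // Returns the first `count` draws from a Random created with the given seed.
  public static int[] firstDraws(long seed, int count) {
    Random random = new Random(seed);
    int[] draws = new int[count];
    for (int i = 0; i < count; i++) {
      draws[i] = random.nextInt(100);
    }
    return draws;
  }

  public static void main(String[] args) {
    // Two generators built from the same seed produce the same sequence.
    boolean same = Arrays.equals(firstDraws(0xDEED, 5), firstDraws(0xDEED, 5));
    System.out.println(same); // true
  }
}
```

An unseeded `new Random()` gives a different sequence on each run, which can make failures harder to reproduce.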

@codope codope self-assigned this Jul 9, 2022
@codope codope added priority:blocker Production down; release blocker writer-core labels Jul 9, 2022
Alexey Kudinkin added 3 commits July 12, 2022 15:10
…path as well;

Cleaned up `BulkInsertDataInternalWriterHelper` to leverage the same sequence for partition-path handling
@alexeykudinkin alexeykudinkin force-pushed the ak/blk-ins-url-enc-fix branch from 48dc9f1 to 06d50fd Compare July 12, 2022 22:11
@alexeykudinkin
Contributor Author

@hudi-bot run azure

DataTypes.StringType);

rowDatasetWithRecordKeysAndPartitionPath =
rows.withColumn(HoodieRecord.RECORD_KEY_METADATA_FIELD, functions.col(recordKeyFields).cast(DataTypes.StringType))
Contributor

In ComplexKeyGenerator.getRecordKey(Row row), prefixFieldName is set to true in the call RowKeyGeneratorHelper.getRecordKeyFromRow(row, getRecordKeyFields(), recordKeySchemaInfo, true), so with ComplexKeyGenerator the record key is prefixed with the record key field name.

As I understand it, withColumn here copies recordKeyFields as-is, so RECORD_KEY_METADATA_FIELD gets the same value as the original record key column, with no prefix, even when the key generator is ComplexKeyGenerator. Will this cause a problem?

Member

@TengHuo That's a good point. It can be a problem if you mix write operation types or switch the row-writing config for a table. I would suggest filing a separate JIRA ticket to make this consistent across write paths. I don't consider it a blocker, but it would be good to keep it consistent.

Contributor

Okay, got it.

Then I think there will be a duplicate-data issue if users upgrade from 0.10 or an older version when they have configured only one column as the record key and use ComplexKeyGenerator. It is the same issue as upgrading from 0.10 to 0.11.
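
The concern in this thread can be made concrete with a small sketch. It is an illustration only: the class and method names are hypothetical, and the `field:value` format with `,` between fields is an assumption about the prefixed key layout, not something confirmed in this conversation.

```java
import java.util.HashSet;
import java.util.Set;

class RecordKeyFormatSketch {
  // Prefixed form: "field:value" pairs joined by commas (separators assumed).
  public static String prefixedKey(String[] fields, String[] values) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(fields[i]).append(':').append(values[i]);
    }
    return sb.toString();
  }

  // Plain form: the column value cast to a string, with no field-name prefix.
  public static String plainKey(String value) {
    return value;
  }

  public static void main(String[] args) {
    String viaKeyGenerator = prefixedKey(new String[] {"id"}, new String[] {"42"});
    String viaColumnCast = plainKey("42");

    // The same logical record ends up under two different keys.
    Set<String> keys = new HashSet<>();
    keys.add(viaKeyGenerator);
    keys.add(viaColumnCast);
    System.out.println(viaKeyGenerator); // id:42
    System.out.println(viaColumnCast);   // 42
    System.out.println(keys.size());     // 2
  }
}
```

If one write path stores the prefixed form and another stores the plain form, an upsert keyed on the record key would not match the existing row and would insert a second copy, which is the duplicate-data scenario raised above.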

Member

@codope codope left a comment

LGTM
@alexeykudinkin Can you please rebase? CI instability was fixed recently.

@codope codope removed the priority:blocker Production down; release blocker label Jul 20, 2022
@alexeykudinkin
Contributor Author

Closing in favor of #5523
