Conversation

@xicm (Contributor) commented Oct 18, 2022

Change Logs

If hoodie.datasource.write.drop.partition.columns=true, the log reader does not add partition values to the record, so a query with a partition column in the WHERE clause returns empty results.

This PR adds the partition values extracted from the file path back to the record.
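
The fix relies on deriving the partition path relative to the table base path from the log file's location (the Spark-side change calls getRelativePartitionPath for this). A minimal, self-contained sketch of that idea, using made-up paths rather than the PR's actual code:

```java
import org.apache.hadoop.fs.Path;

// Illustrative sketch: derive a log file's partition path relative to the
// table base path, so the dropped partition column values can be re-attached
// to the records read from that log file.
public class RelativePartitionPathExample {

  static String relativePartitionPath(Path basePath, Path logFilePath) {
    String base = basePath.toUri().getPath();
    String parent = logFilePath.getParent().toUri().getPath();
    // non-partitioned table: the log file sits directly under the base path
    return parent.equals(base) ? "" : parent.substring(base.length() + 1);
  }

  public static void main(String[] args) {
    Path base = new Path("/tmp/hudi_table");
    Path log = new Path("/tmp/hudi_table/dt=2022-10-18/.f1_20221018.log.1");
    System.out.println(relativePartitionPath(base, log)); // prints dt=2022-10-18
  }
}
```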

Impact

hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java

Risk level (write none, low, medium, or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@YannByron (Contributor) commented:

@xicm it would be better to have some UTs to show the effect of these changes.

@xicm (Contributor, Author) commented Oct 19, 2022

> @xicm it would be better to have some UTs to show the effect of these changes.

Hi @YannByron, I added a UT, please review when you are free. :)

@xicm (Contributor, Author) commented Oct 21, 2022

@hudi-bot run azure

@nsivabalan nsivabalan added priority:high Significant impact; potential bugs reader-core labels Oct 21, 2022
Pair.of(tableConfig.getRecordKeyFieldProp(), tableConfig.getPartitionFieldProp()));
}
this.partitionName = partitionName;
this.partitionValues = getPartitionValues();
Contributor review comment:

It's not a good practice to call non-static methods from the ctor, since it relies implicitly on the ordering of the initialization. Let's instead make this method static and pass it all values it needs.
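
A minimal sketch of the pattern being suggested, with illustrative names rather than the reader's actual fields:

```java
// Compute derived state in a static helper that receives everything it needs
// explicitly, instead of calling an instance method from the constructor
// (which implicitly depends on field-initialization order).
class ReaderStateExample {
  private final String partitionName;
  private final String[] partitionValues;

  ReaderStateExample(String partitionName) {
    this.partitionName = partitionName;
    this.partitionValues = splitPartitionValues(partitionName);
  }

  private static String[] splitPartitionValues(String partitionName) {
    return (partitionName == null || partitionName.isEmpty())
        ? new String[0]
        : partitionName.split("/");
  }
}
```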

if (this.hiveStylePartition) {
return Option.of(Arrays.stream(partitionValues)
.map(partition -> partition.split("="))
.filter(partition -> partition.length == 2)
Contributor review comment:

Why do we filter on a specific length here?
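
For context, a hive-style partition path encodes each column as a name=value segment, so a well-formed segment splits on "=" into exactly two parts; a rough illustration with a made-up path (not the PR's code):

```java
public class HiveStylePartitionSplitExample {
  public static void main(String[] args) {
    // two hive-style segments: "country=US" and "dt=2022-10-18"
    for (String segment : "country=US/dt=2022-10-18".split("/")) {
      String[] parts = segment.split("=");
      // length == 2 only for a well-formed "name=value" segment; a segment
      // without "=" (e.g. "2022" in a non-hive-style path) would be skipped
      if (parts.length == 2) {
        System.out.println(parts[0] + " -> " + parts[1]);
      }
    }
  }
}
```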

String[] partitionFields = this.partitionFields.get();
String[] partitionValues = this.partitionValues.get();

if (partitionFields.length == partitionValues.length) {
Contributor review comment:

This condition should be asserted on during init
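
A rough sketch of what asserting this during initialization could look like, as a plain Java check (Hudi may prefer its own validation utilities):

```java
final class PartitionSpecValidation {
  private PartitionSpecValidation() {}

  // Fail fast at construction time instead of silently skipping the partition
  // columns later when the field and value counts don't match.
  static void validate(String[] partitionFields, String[] partitionValues) {
    if (partitionFields.length != partitionValues.length) {
      throw new IllegalArgumentException(String.format(
          "Partition fields and values differ in length: %d fields vs %d values",
          partitionFields.length, partitionValues.length));
    }
  }
}
```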

// Append is not supported in LocalFileSystem. HDFS needs to be setup.
Configuration conf = new Configuration();
// lower heartbeat interval for fast recognition of DN
conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, dfsBaseDir.getAbsolutePath());
Contributor review comment:

I don't think we really need HDFS for this test. What's the idea behind doing it on HDFS?


if (partitionFields.length == partitionValues.length) {
for (int i = 0; i < partitionValues.length; i++) {
record.put(schema.getField(partitionFields[i]).pos(), partitionValues[i]);
Contributor review comment:

We can't just blindly inject strings into the record; we need to make sure we coerce them to the type of the column.
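
A minimal sketch of the kind of coercion being asked for, assuming the value arrives as a string parsed from the partition path and the target type comes from the record's Avro schema (the helper name is hypothetical, not the PR's code):

```java
import org.apache.avro.Schema;

final class PartitionValueCoercionExample {
  private PartitionValueCoercionExample() {}

  // Convert the string parsed from the partition path into the type the Avro
  // schema declares for that column before putting it into the record.
  static Object coerce(Schema fieldSchema, String value) {
    switch (fieldSchema.getType()) {
      case STRING:  return value;
      case INT:     return Integer.parseInt(value);
      case LONG:    return Long.parseLong(value);
      case FLOAT:   return Float.parseFloat(value);
      case DOUBLE:  return Double.parseDouble(value);
      case BOOLEAN: return Boolean.parseBoolean(value);
      default:
        throw new IllegalArgumentException(
            "Unsupported partition column type: " + fieldSchema.getType());
    }
  }
}
```

The call site would then look roughly like record.put(schema.getField(partitionFields[i]).pos(), PartitionValueCoercionExample.coerce(schema.getField(partitionFields[i]).schema(), partitionValues[i])); nullable columns would additionally need the non-null branch of their union schema unwrapped first.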

@xicm (Contributor, Author) commented Oct 25, 2022

@alexeykudinkin Thanks for your advice. I updated the PR, please review again when you are free.

@xicm xicm requested a review from alexeykudinkin October 27, 2022 03:23
@alexeykudinkin (Contributor) left a comment:

Thanks for addressing the feedback @xicm!

What I've been thinking about in the background is that ideally we'd want to leverage the logic already implemented in SparkParsePartitionUtil, for example by making it more generic (i.e. engine-agnostic).

switch (newSchema.getType()) {
case NULL:
case BOOLEAN:
if (oldSchema.getType() == Schema.Type.STRING) {
Contributor review comment:

Appreciate your approach of extending existing functionality, but I think in this case we actually shouldn't be doing that -- the conversions we're doing here aren't "canonical" (there's no permitted conversion from string to int in Spark, for example), and as such we'd rather keep this conversion limited in scope so it applies only to partition values, where it's both inevitable and actually sensible (since we know we're not trying to convert some random string to an int, but an int that was previously converted to a string).

Contributor review comment:

Also, I'd suggest you check out Spark3ParsePartitionUtil.

@xicm xicm changed the title [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true [WIP][HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true Oct 31, 2022
@xicm (Contributor, Author) commented Nov 2, 2022

Hi @alexeykudinkin,
I created a new model, HoodieLogFileWithPartition, which extends HoodieLogFile (is this a good approach?), initialized partitionValues in buildSplits in MergeOnReadSnapshotRelation, and finally passed partitionValues to AbstractHoodieLogRecordReader.

Please review when you are available. If any part of the implementation is unreasonable, let me know and I will fix it. Thanks. :)

@xicm xicm changed the title [WIP][HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true [HUDI-5047] Add partition value in HoodieLogRecordReader when hoodie.datasource.write.drop.partition.columns=true Nov 2, 2022
@xicm (Contributor, Author) commented Nov 2, 2022

@hudi-bot run azure

@hudi-bot (Collaborator) commented Nov 2, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xicm xicm requested a review from alexeykudinkin November 2, 2022 05:43
}

@SuppressWarnings("unchecked")
protected HoodieMergedLogRecordScanner(FileSystem fs, String basePath, List<String> logFilePaths, Schema readerSchema,
Contributor review comment:

Let's avoid copying the whole ctor. Instead, generalize the existing one and redirect this one to it.
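
A minimal sketch of the delegation pattern being suggested, with a made-up (and much shorter) parameter list than the real scanner constructor:

```java
import java.util.List;

class ScannerCtorExample {
  private final String basePath;
  private final List<String> logFilePaths;
  private final Object[] partitionValues;

  // Existing constructor, generalized to accept the new argument.
  ScannerCtorExample(String basePath, List<String> logFilePaths, Object[] partitionValues) {
    this.basePath = basePath;
    this.logFilePaths = logFilePaths;
    this.partitionValues = partitionValues;
  }

  // Old signature kept as-is; redirects instead of duplicating the body.
  ScannerCtorExample(String basePath, List<String> logFilePaths) {
    this(basePath, logFilePaths, new Object[0]);
  }
}
```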

// Use scanV2 method.
private boolean useScanV2 = false;

private Object[] partitionValues;
Contributor review comment:

Let's create an object akin to the PartitionPath we have in Spark.
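
A sketch of the kind of small value object being suggested; the name and fields are illustrative, not an existing Hudi class:

```java
import java.util.Arrays;

// Immutable holder pairing the relative partition path with the values parsed
// from it, instead of passing a bare Object[] through the scanner builder.
final class PartitionSpec {
  private final String partitionPath;
  private final Object[] values;

  PartitionSpec(String partitionPath, Object[] values) {
    this.partitionPath = partitionPath;
    this.values = values.clone();
  }

  String getPartitionPath() {
    return partitionPath;
  }

  Object[] getValues() {
    return values.clone();
  }

  @Override
  public String toString() {
    return partitionPath + " -> " + Arrays.toString(values);
  }
}
```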


val partitionedBaseFile = baseFile.map { file =>
val filePath = getFilePath(file.getFileStatus.getPath)
PartitionedFile(getPartitionColumnsAsInternalRow(file.getFileStatus), filePath, 0, file.getFileLen)
Contributor review comment:

Let's also update getPartitionColumnsAsInternalRow to reuse parsePartitionColumnValues instead

}

HoodieMergeOnReadFileSplit(partitionedBaseFile, logFiles)
if (shouldExtractPartitionValuesFromPartitionPath && logFiles.nonEmpty) {
Contributor review comment:

I don't think we actually need this -- log files are always co-located with the base file, and as such we can extract partition values from there.

logRecordScannerBuilder.withPartition(
getRelativePartitionPath(new Path(tableState.tablePath), logFiles.head.getPath.getParent))

if (logFiles.head.isInstanceOf[HoodieLogFileWithPartition]) {
Contributor review comment:

Please check my comment below


package org.apache.hudi.common.model;

public class HoodieLogFileWithPartition extends HoodieLogFile {
Contributor review comment:

Please check my comment below regarding this class

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope removed the release-0.12.2 Patches targetted for 0.12.2 label Dec 7, 2022
@xicm (Contributor, Author) commented Dec 8, 2022

The log reader issue is caused by setting "hoodie.datasource.write.drop.partition.columns=true", and the purpose of that setting is to work around a Hive query problem: a Hive3 query returns null when the WHERE clause contains a partition field, and setting "hoodie.datasource.write.drop.partition.columns=true" avoids that.
I created a new PR #7355 to solve the Hive query problem, so I think we can close this PR.


Labels

priority:high Significant impact; potential bugs

Projects

Status: Awaiting Triage
Archived in project


6 participants