
Conversation

@yuzhaojing
Contributor

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking them into sub-tasks under an umbrella JIRA.

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 4 times, most recently from c795bbe to 9443126 Compare September 7, 2021 02:07
@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 5 times, most recently from 2828a67 to 1bb7f93 Compare October 21, 2021 03:38
Contributor

I guess the unit is MB?

Contributor Author

Already fixed it.

@danny0405 danny0405 added the engine:flink Flink integration label Oct 21, 2021
*/
public class FlinkRecentDaysClusteringPlanStrategy<T extends HoodieRecordPayload<T>>
extends PartitionAwareClusteringPlanStrategy<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> {
private static final Logger LOG = LogManager.getLogger(FlinkRecentDaysClusteringPlanStrategy.class);
Contributor

Should we extend FlinkSizeBasedClusteringPlanStrategy instead?
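
For illustration, a rough sketch of that shape, mirroring how the Spark counterpart (SparkRecentDaysClusteringPlanStrategy) only overrides partition filtering; the method name and config getter are assumed from the Spark code, not taken from this PR:

    public class FlinkRecentDaysClusteringPlanStrategy<T extends HoodieRecordPayload<T>>
        extends FlinkSizeBasedClusteringPlanStrategy<T> {
      // Reuse the size-based small-file grouping; only narrow the candidate
      // partitions to the most recent ones.
      @Override
      protected List<String> filterPartitionPaths(List<String> partitionPaths) {
        int targetPartitions = getWriteConfig().getTargetPartitionsForClustering();
        return partitionPaths.stream()
            .sorted(Comparator.reverseOrder())
            .limit(targetPartitions > 0 ? targetPartitions : partitionPaths.size())
            .collect(Collectors.toList());
      }
    }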

if (writeStats.stream().mapToLong(s -> s.getTotalWriteErrors()).sum() > 0) {
  throw new HoodieClusteringException("Clustering failed to write to files:"
      + writeStats.stream().filter(s -> s.getTotalWriteErrors() > 0L).map(s -> s.getFileId()).collect(Collectors.joining(",")));
}
Contributor

writeTableMetadata is missing.
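
For context, a hedged sketch of the shape of that fix: sync the metadata table before transitioning the replacecommit instant to complete, as the Spark clustering path does (the exact writeTableMetadata signature here is an assumption):

    // Inside the clustering commit path, after validating the write statuses:
    try {
      // Keep the metadata table in sync before completing the instant.
      writeTableMetadata(table, clusteringInstant.getTimestamp(), metadata);
      table.getActiveTimeline().transitionReplaceInflightToComplete(
          HoodieTimeline.getReplaceCommitInflightInstant(clusteringInstant.getTimestamp()),
          Option.of(metadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
    } catch (IOException e) {
      throw new HoodieClusteringException(
          "Unable to transition clustering inflight to complete: " + clusteringInstant.getTimestamp(), e);
    }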


private static final Logger LOG = LogManager.getLogger(FlinkClusteringPlanActionExecutor.class);

public FlinkClusteringPlanActionExecutor(HoodieEngineContext context,
Contributor

We can use HoodieData to merge this code with SparkClusteringPlanActionExecutor; this can be done in a separate follow-up PR.

Contributor

+1

.key("clustering.schedule.enabled")
.booleanType()
.defaultValue(false) // default false for pipeline
.withDescription("Async clustering, default false for pipeline");
Contributor

Schedule the clustering plan, default false
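
With the suggested wording, the option would read roughly like this (a sketch in the FlinkOptions style; the final description text may differ):

    public static final ConfigOption<Boolean> CLUSTERING_SCHEDULE_ENABLED = ConfigOptions
        .key("clustering.schedule.enabled")
        .booleanType()
        .defaultValue(false) // default false for pipeline
        .withDescription("Schedule the clustering plan, default false");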

.key("clustering.tasks")
.intType()
.defaultValue(10)
.withDescription("Parallelism of tasks that do actual clustering, default is 10");
Contributor

Change the default value to match compaction.tasks, which is 4.
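
That is, roughly (sketch):

    public static final ConfigOption<Integer> CLUSTERING_TASKS = ConfigOptions
        .key("clustering.tasks")
        .intType()
        .defaultValue(4) // aligned with compaction.tasks
        .withDescription("Parallelism of tasks that do actual clustering, default is 4");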

* The clustering task identifier.
*/
private int taskID;

Contributor

The event should include a fileId so that events can be deduplicated on task failover/retry; take CompactionCommitEvent as a reference. Because a HoodieClusteringGroup (and thus the ClusteringCommitEvent) spans multiple input file ids, we can use the first file group id to distinguish events.
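
CompactionCommitEvent carries an instant, a fileId, the write statuses, and the task id; a clustering analogue could look roughly like this (field names are illustrative, not taken from this PR):

    public class ClusteringCommitEvent implements Serializable {
      private String instant;
      // First input file group id of the HoodieClusteringGroup; lets the
      // coordinator deduplicate events after a task failover/retry.
      private String fileId;
      private List<WriteStatus> writeStatuses;
      private int taskID;
      // constructors, getters and setters omitted
    }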

this.schema = new Schema.Parser().parse(writeConfig.getSchema());
this.readerSchema = HoodieAvroUtils.addMetadataFields(this.schema);
this.requiredPos = getRequiredPositions();

Contributor

What is requiredPos used for?

for (ClusteringOperation clusteringOp : clusteringOps) {
  try {
    Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(writeConfig.getSchema()));
    HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
Contributor

Use the readerSchema?
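
Presumably the intent is to pass it when iterating the base file records, e.g. (sketch):

    Iterator<? extends IndexedRecord> baseRecords = baseFileReader.getRecordIterator(readerSchema);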

private void doClustering(String instantTime, HoodieClusteringGroup clusteringGroup, Collector<ClusteringCommitEvent> collector) throws IOException {
List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream().map(ClusteringOperation::create).collect(Collectors.toList());
boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);

Contributor

The HoodieClusteringGroup specifies a number of output file groups, but the current code writes only one file group (or more if the parquet size hits the threshold). Can we find a way to set the parallelism of the bulk_insert writer to match that?
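
One hedged option: derive the writer parallelism from the plan itself rather than a fixed config; the pipeline wiring below is illustrative, not the code in this PR:

    // Total number of output file groups requested by the clustering plan.
    int writeParallelism = clusteringPlan.getInputGroups().stream()
        .mapToInt(HoodieClusteringGroup::getNumOutputFileGroups)
        .sum();
    dataStream
        .transform("clustering_task", TypeInformation.of(ClusteringCommitEvent.class), operatorFactory)
        .setParallelism(writeParallelism);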

conf.setInteger(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST, config.skipFromLatestPartitions);
if (config.sortColumns != null) {
  conf.setString(FlinkOptions.CLUSTERING_SORT_COLUMNS, config.sortColumns);
}
Contributor

What is CLUSTERING_SORT_COLUMNS used for?

Member

@vinothchandar vinothchandar left a comment

cc @yihua wondering if we can reuse a lot more code here?

@yihua
Contributor

yihua commented Dec 16, 2021

cc @yihua wondering if we can reuse a lot more code here?

Yes, the core clustering action should be extracted out, independent of engines, using the HoodieData abstraction. Right now Spark and Java have their own classes. I filed a ticket here: https://issues.apache.org/jira/browse/HUDI-3042
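
In that direction, the engine-agnostic executor would take roughly this shape (an illustrative sketch on top of the HoodieData abstraction; the helper is hypothetical, see HUDI-3042 for the actual refactor):

    public class ClusteringPlanActionExecutor<T extends HoodieRecordPayload, I, K, O>
        extends BaseActionExecutor<T, I, K, O, Option<HoodieClusteringPlan>> {
      // Engine-specific parallelism hides behind HoodieData, so the Flink,
      // Spark and Java engines can share the same planning logic.
      @Override
      public Option<HoodieClusteringPlan> execute() {
        // Plan generation shared across engines; hypothetical helper.
        return createClusteringPlan();
      }
    }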

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 3 times, most recently from f86df2e to d3eb4e3 Compare March 1, 2022 14:30
@yuzhaojing
Contributor Author

@hudi-bot run azure

@danny0405
Contributor

Hello, there seem to be conflicts and a compile failure. Can you fix that, @yuzhaojing?

@yuzhaojing
Contributor Author

Sure, I will fix this.

@yihua
Contributor

yihua commented Mar 10, 2022

@yuzhaojing The refactoring of the clustering action is done in #4847, in a way that puts engine-agnostic clustering logic in BaseCommitActionExecutor and ClusteringPlanActionExecutor. You can reuse that code to reduce duplication in this PR.

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 3 times, most recently from e5838e1 to 777c0ac Compare March 17, 2022 12:36
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
Contributor

This class can be removed.

Contributor Author

fixed.

this.rowType = rowType;
this.gComputer = createSortCodeGenerator().generateNormalizedKeyComputer("SortComputer");
this.gComparator = createSortCodeGenerator().generateRecordComparator("SortComparator");
}
Contributor

Maybe we can make gComputer and gComparator local variables.
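
That is, something along these lines (a sketch; the SortOperator constructor is assumed):

    SortCodeGenerator codeGen = createSortCodeGenerator();
    // Generated once where they are used; no need to keep them as fields.
    GeneratedNormalizedKeyComputer computer = codeGen.generateNormalizedKeyComputer("SortComputer");
    GeneratedRecordComparator comparator = codeGen.generateRecordComparator("SortComparator");
    return new SortOperator(computer, comparator);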

Contributor Author

fixed.

* @param record Generic record.
* @param columns Names of the columns to get values.
* @return Column value if a single column, or concatenated String values by comma.
*/
Contributor

Is this change still necessary?

@xushiyan xushiyan added area:table-service Table services and removed clustering labels May 18, 2022
@XuQianJin-Stars
Contributor

Hi @yuzhaojing, please rebase this PR. The CI error is fixed.

@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 3 times, most recently from e1689c4 to cf4355d Compare May 21, 2022 15:06
@yuzhaojing

This comment was marked as resolved.

@yuzhaojing yuzhaojing closed this May 21, 2022
@yuzhaojing yuzhaojing reopened this May 21, 2022
@yuzhaojing yuzhaojing force-pushed the HUDI-2207 branch 2 times, most recently from eb878d4 to 9f98d28 Compare May 22, 2022 03:35
@yuzhaojing
Contributor Author

@hudi-bot run azure

@yuzhaojing yuzhaojing closed this May 23, 2022
@yuzhaojing yuzhaojing reopened this May 23, 2022
@yuzhaojing
Contributor Author

@hudi-bot run azure

<artifactId>flink-avro</artifactId>
<version>${flink.version}</version>
<scope>compile</scope>
</dependency>
Contributor

Why this change?

Contributor Author

HoodieClusteringGroup is an Avro model used in ClusteringPlanEvent.

Contributor

But I guess you should not depend on flink-avro; there are already many model classes in the hudi-flink code paths.

Contributor Author

Fixed: use ClusteringGroupInfo instead of HoodieClusteringGroup.
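
For reference, the plain-Java carrier could look roughly like this (a sketch; the merged ClusteringGroupInfo may differ in detail):

    public class ClusteringGroupInfo implements Serializable {
      private List<ClusteringOperation> operations;
      private int numOutputGroups;

      public static ClusteringGroupInfo create(HoodieClusteringGroup group) {
        // Convert the Avro slices into plain ClusteringOperation POJOs so the
        // Flink event does not need flink-avro for serialization.
        List<ClusteringOperation> ops = group.getSlices().stream()
            .map(ClusteringOperation::create).collect(Collectors.toList());
        return new ClusteringGroupInfo(ops, group.getNumOutputFileGroups());
      }
      // constructor and getters omitted
    }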

<groupId>org.apache.flink</groupId>
<artifactId>flink-avro</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
Contributor

Why this change?

Contributor Author

ditto.

@danny0405
Contributor

We may need to resolve the conflict.

@yuzhaojing
Contributor Author

We may need to resolve the conflict.

fixed.


private void updateTableMetadata(HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> table,
                                 HoodieCommitMetadata commitMetadata,
                                 HoodieInstant hoodieInstant) {
Contributor

updateTableMetadata seems unused now.

Contributor

@danny0405 danny0405 left a comment

+1, thanks for the contribution @yuzhaojing. We may need to fix the clustering plan scheduling issue in a follow-up PR.

@yuzhaojing
Contributor Author

+1, thanks for the contribution @yuzhaojing. We may need to fix the clustering plan scheduling issue in a follow-up PR.

Sure, I will fix the issue in a follow-up PR to support clustering plan scheduling in the coordinator.

@yuzhaojing
Contributor Author

Thanks for the suggestion, Danny!

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@yuzhaojing yuzhaojing merged commit 18635b5 into apache:master May 24, 2022