[HUDI-3045] New clustering regex match config to choose partitions when building clustering plan#4346

Merged
yihua merged 6 commits into apache:master from zhangyue19921010:SparkRegexMatchPartitionsClusteringPlanStrategy
Jan 12, 2022

@zhangyue19921010 zhangyue19921010 commented Dec 17, 2021

What is the purpose of the pull request

https://issues.apache.org/jira/browse/HUDI-3045

Adds a new ClusteringPlanStrategy that uses a regex to choose partitions when building the clustering plan.
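
A minimal sketch of the idea, assuming partitions are plain relative path strings and that a path is kept only when it fully matches the pattern. `RegexPartitionFilterSketch` and `filterPartitionPaths` are hypothetical names used for illustration, not the code in this PR:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not Hudi's actual implementation.
class RegexPartitionFilterSketch {

  // Keeps only the partition paths that fully match the regex.
  // Assumption: a null or empty pattern means "no filtering".
  static List<String> filterPartitionPaths(List<String> partitionPaths, String regex) {
    if (regex == null || regex.isEmpty()) {
      return partitionPaths;
    }
    Pattern pattern = Pattern.compile(regex);
    return partitionPaths.stream()
        .filter(path -> pattern.matcher(path).matches())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> partitions = Arrays.asList("2021/12/15", "2021/12/16", "2022/01/01");
    // Only the December 2021 partitions match the pattern.
    System.out.println(filterPartitionPaths(partitions, "2021/12/.*"));
  }
}
```

Compiling the pattern once and reusing it across all paths avoids recompiling the regex for every partition.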

@zhangyue19921010 changed the title from "[HUDI-3045]new ClusteringPlanStrategy to use regex choose partitions when building clustering plan." to "[HUDI-3045]new ClusteringPlanStrategy to use regex choose partitions when building clustering plan" on Dec 17, 2021
@zhangyue19921010 changed the title from "[HUDI-3045]new ClusteringPlanStrategy to use regex choose partitions when building clustering plan" to "[HUDI-3045] New ClusteringPlanStrategy to use regex choose partitions when building clustering plan" on Dec 17, 2021
@zhangyue19921010
Contributor Author

@hudi-bot run azure

@yihua yihua self-assigned this Dec 17, 2021
@vinothchandar (Member) left a comment

Thanks for the good contribution. Could we just add this as a config instead of a new clustering strategy? I think this could be useful even more broadly.

@zhangyue19921010
Contributor Author

Ack! Will do it asap :)

@zhangyue19921010
Contributor Author

Thanks a lot for your attention, @yihua and @vinothchandar. I have just added a new config that does the regex pattern match.
PTAL :)

@zhangyue19921010 changed the title from "[HUDI-3045] New ClusteringPlanStrategy to use regex choose partitions when building clustering plan" to "[HUDI-3045] New clustering regex match config to choose partitions when building clustering plan" on Dec 28, 2021
@yihua (Contributor) left a comment

LGTM overall. Left a couple of nits.

      .withDocumentation("Files smaller than the size specified here are candidates for clustering");

  public static final ConfigProperty<String> PARTITION_REGEX_PATTERN = ConfigProperty
      .key(CLUSTERING_STRATEGY_PARAM_PREFIX + "cluster.partition.regex.pattern")
Contributor

nit: partition.regex.pattern?

Contributor Author

Changed.
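
For illustration, a sketch of how the renamed key might be set by a user. The prefix value `hoodie.clustering.plan.strategy.` is an assumption about `CLUSTERING_STRATEGY_PARAM_PREFIX`, which this thread does not spell out, and `ClusteringRegexConfigSketch` is a hypothetical helper rather than part of Hudi:

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Hudi.
class ClusteringRegexConfigSketch {

  // Assumed value of CLUSTERING_STRATEGY_PARAM_PREFIX (not shown in the thread).
  static final String ASSUMED_PREFIX = "hoodie.clustering.plan.strategy.";

  // Key suffix after the rename suggested in the review.
  static final String PARTITION_REGEX_PATTERN_KEY = ASSUMED_PREFIX + "partition.regex.pattern";

  // Builds write properties carrying the partition regex.
  static Properties withPartitionRegex(String regex) {
    Properties props = new Properties();
    props.setProperty(PARTITION_REGEX_PATTERN_KEY, regex);
    return props;
  }

  public static void main(String[] args) {
    Properties props = withPartitionRegex("2021/12/.*");
    System.out.println(PARTITION_REGEX_PATTERN_KEY + "=" + props.getProperty(PARTITION_REGEX_PATTERN_KEY));
  }
}
```

Dropping the `cluster.` segment keeps the key from repeating the word "clustering" that the prefix already carries.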


  @Test
  public void testFilterPartitionPaths() {
    PartitionAwareClusteringPlanStrategy sg = new DummyPartitionAwareClusteringPlanStrategy(table, context, hoodieWriteConfig);
Contributor

nit: better variable naming here?

Contributor Author

Changed. Thanks a lot for your review :)

@yihua yihua merged commit 9fe28e5 into apache:master Jan 12, 2022
@YuweiXiao
Contributor

@yihua I feel it would be better to add a new option in ClusteringPlanPartitionFilterMode rather than doing regex in place.

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
…en building clustering plan (apache#4346)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
@yihua
Contributor

yihua commented Feb 2, 2022

> @yihua I feel it would be better to add a new option in ClusteringPlanPartitionFilterMode rather than doing regex in place.

Yes, that could allow more flexible filtering. @zhangyue19921010 @YuweiXiao would either of you like to take a stab at this before the 0.11.0 release?
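
A rough sketch of what a mode-based design could look like, with the regex behaviour as one selectable option instead of an inline check. `PartitionFilterModeSketch`, its values, and its `filter` method are hypothetical and do not reflect the actual contents of `ClusteringPlanPartitionFilterMode`:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-in for a partition filter mode enum; the values and
// the method below are illustrative and are not taken from Hudi.
enum PartitionFilterModeSketch {
  // Keeps every partition, ignoring the regex argument.
  NONE {
    @Override
    List<String> filter(List<String> partitionPaths, String regex) {
      return partitionPaths;
    }
  },
  // Keeps only the partitions whose path fully matches the regex.
  REGEX {
    @Override
    List<String> filter(List<String> partitionPaths, String regex) {
      Pattern pattern = Pattern.compile(regex);
      return partitionPaths.stream()
          .filter(path -> pattern.matcher(path).matches())
          .collect(Collectors.toList());
    }
  };

  abstract List<String> filter(List<String> partitionPaths, String regex);
}
```

With this shape, the plan strategy would look up the configured mode once and delegate to it, so further filters could be added as new enum values without touching the strategy itself.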

@zhangyue19921010
Contributor Author

Sure, I'll pick it up.

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…en building clustering plan (apache#4346)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…en building clustering plan (apache#4346)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>