Skip to content

Conversation

@chenkovsky
Copy link
Contributor

What changes were proposed in this pull request?

We can add randomUUID as an suffix to solve it

Why are the changes needed?

currently, we cannot guarantee application id is really unique. this may lead to data issue.

Does this PR introduce any user-facing change?

No

How was this patch tested?

test locally

}

def appUniqueIdWithUUID(appUiqueId: String): String = {
appUiqueId + "-" + UUID.randomUUID()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add config to allow user to add UUID suffix? IMO, not all cases should add UUID suffix. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to move appUniqueIdWithUUID(appUiqueId: String) to CelebornConf. Additionally, we can simplify the UUID by removing the dashes to create a shorter version, then we can use celebornConf.appUniqueIdWithUUIDSuffix

Suggested change
appUiqueId + "-" + UUID.randomUUID()
def appUniqueIdWithUUIDSuffix(appId: String): String = {
if (clientApplicationUUIDSuffixEnabled) {
appId + "-" + UUID.randomUUID().toString().replaceAll("-", "")
} else {
appId
}
}

if (lifecycleManager == null) {
celebornAppId = FlinkUtils.toCelebornAppId(lifecycleManagerTimestamp, jobID);
celebornAppId =
Utils.appUniqueIdWithUUID(
Copy link
Member

@SteNicholas SteNicholas Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does flink app id need to add UUID suffix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you think there's no application id colision problem for flink?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. cc @reswqa, @RexXiong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. cc @reswqa, @RexXiong.

Currently Flink Appid is already random and concat with current timestamp, but I think it's fine if we add flag for UUID suffix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't expect that flink app id(JobId + timestamp) might conflict. Did you actually suffer from this problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SteNicholas please review it again, I reverted the change.

.createWithDefault(0.4)

val CLIENT_APPLICATION_UUID: ConfigEntry[Boolean] =
buildConf("celeborn.client.application.uuid.enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfer celeborn.client.application.uuidSuffix.enabled

def clientExcludeReplicaOnFailureEnabled: Boolean =
get(CLIENT_EXCLUDE_PEER_WORKER_ON_FAILURE_ENABLED)
def clientMrMaxPushData: Long = get(CLIENT_MR_PUSH_DATA_MAX)
def clientApplicationUUID: Boolean = get(CLIENT_APPLICATION_UUID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clientApplicationUUID -> clientApplicationUUIDSuffixEnabled

}

def appUniqueIdWithUUID(appUiqueId: String): String = {
appUiqueId + "-" + UUID.randomUUID()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to move appUniqueIdWithUUID(appUiqueId: String) to CelebornConf. Additionally, we can simplify the UUID by removing the dashes to create a shorter version, then we can use celebornConf.appUniqueIdWithUUIDSuffix

Suggested change
appUiqueId + "-" + UUID.randomUUID()
def appUniqueIdWithUUIDSuffix(appId: String): String = {
if (clientApplicationUUIDSuffixEnabled) {
appId + "-" + UUID.randomUUID().toString().replaceAll("-", "")
} else {
appId
}
}

buildConf("celeborn.client.application.uuid.enabled")
.categories("client")
.version("0.6.0")
.doc("When `true`, add uuid suffix to application id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.doc("When `true`, add uuid suffix to application id")
.doc("Whether to add UUID suffix for application id for unique. When `true`, add UUID suffix for unique application id.")

if (lifecycleManager == null) {
celebornAppId = FlinkUtils.toCelebornAppId(lifecycleManagerTimestamp, jobID);
LOG.info("CelebornAppId: {}", celebornAppId);
String applicationId = FlinkUtils.toCelebornAppId(lifecycleManagerTimestamp, jobID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change. Meanwhile, remove line 87~89.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change. Meanwhile, remove line 87~89.

ok, let's revert it.
should we describe that the flag is only for spark and mr in doc ?

synchronized (CelebornTierMasterAgent.class) {
if (lifecycleManager == null) {
celebornAppId = FlinkUtils.toCelebornAppId(lifecycleManagerTimestamp, jobID);
String applicationId = FlinkUtils.toCelebornAppId(lifecycleManagerTimestamp, jobID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only a minor comment

buildConf("celeborn.client.application.uuidSuffix.enabled")
.categories("client")
.version("0.6.0")
.doc("Whether to add UUID suffix for application id for unique. When `true`, add UUID suffix for unique application id.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration only takes effect when using the Spark and MR engines, so we must specify this in doc.

buildConf("celeborn.client.application.uuidSuffix.enabled")
.categories("client")
.version("0.6.0")
.doc("Whether to add UUID suffix for application id for unique. When `true`, add UUID suffix for unique application id. Currently, this only applies to Spark and MapPartition.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapPartition?

Copy link
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update.

@RexXiong RexXiong closed this in eb49ed7 Oct 17, 2024
@RexXiong
Copy link
Contributor

Thanks, merge to main(v0.6.0)

zaynt4606 pushed a commit to zaynt4606/celeborn that referenced this pull request Oct 21, 2024
### What changes were proposed in this pull request?

We can add randomUUID as an suffix to solve it

### Why are the changes needed?

currently, we cannot guarantee application id is really unique. this may lead to data issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
test locally

Closes apache#2810 from chenkovsky/feature/uuid_appid.

Authored-by: Chongchen Chen <[email protected]>
Signed-off-by: Shuang <[email protected]>
@mridulm
Copy link
Contributor

mridulm commented Nov 3, 2024

Circling back to this - is this a case where multiple application attempts were involved ? (in which case, using appId + attemptId is the right approach) - or when different application;s application id actually collided?

Given spark (and MR ?) use application id + attempt id for a few things already (like event files), I am trying to understand why UUID was necessary.

@chenkovsky
Copy link
Contributor Author

Circling back to this - is this a case where multiple application attempts were involved ? (in which case, using appId + attemptId is the right approach) - or when different application;s application id actually collided?

Given spark (and MR ?) use application id + attempt id for a few things already (like event files), I am trying to understand why UUID was necessary.

Hi, @mridulm . In our cluster, most of time, applicationId is unique, but id collision will take place occasionally. when it happens, it's very hard to find the root cause of data missing problem. For spark or MR, user can also override appId, add uuid suffix is also a fool-proof design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants