Conversation

@yorksity
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

It solves a problem that occurs in long-running tasks, such as streaming tasks: after enough RDDs have been created, the RDD id counter overflows and turns negative.

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions github-actions bot added the CORE label Nov 24, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Nov 26, 2022

What problem does it cause? There is no detail here or in the JIRA.

      return id
    }
    this.synchronized {
      id = nextRddId.getAndIncrement()
Contributor

Can we avoid the duplicate call of `nextRddId.getAndIncrement()`?

    }
    this.synchronized {
      id = nextRddId.getAndIncrement()
      if (id < 0) {
Contributor

Instead of this `if (id < 0)` condition, the else branch of the previous `if (id >= 0)` can be used.

    id = nextRddId.getAndIncrement()
    if (id < 0) {
      nextRddId = new AtomicInteger(0)
      id = nextRddId.getAndIncrement()
Contributor
@vinodkc vinodkc Nov 27, 2022

Can you try this?

nextRddId = new AtomicInteger(1)
id = nextRddId

Contributor Author

Thank you for your review. I updated the code.

private[spark] def newShuffleId(): Int = nextShuffleId.getAndIncrement()

- private val nextRddId = new AtomicInteger(0)
+ private var nextRddId = new AtomicInteger(0)
@schlosna schlosna Nov 28, 2022

What happens when the RDD ID overflows within a SparkContext? Are there tests that cover these cases?

Curious if it would be better to switch to an AtomicLong and just modulo max int?

Suggested change
- private var nextRddId = new AtomicInteger(0)
+ private[spark] def newRddId(): Int = (nextRddId.getAndIncrement() % Integer.MAX_VALUE).toInt
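As a rough illustration of the reviewer's AtomicLong-plus-modulo idea, here is a standalone Java sketch; the class and method names are hypothetical, not Spark's actual code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class RddIdGenerator {
    // A long counter would take roughly 292 years to overflow even at one
    // billion ids per second, so the modulo never sees a negative input.
    private final AtomicLong nextRddId = new AtomicLong(0);

    public int newRddId() {
        return (int) (nextRddId.getAndIncrement() % Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        RddIdGenerator gen = new RddIdGenerator();
        System.out.println(gen.newRddId()); // 0
        System.out.println(gen.newRddId()); // 1
    }
}
```

Note that ids still repeat once the modulo wraps, which is the duplicate-id concern raised later in the thread.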

Contributor Author

> What happens when the RDD ID overflows within a SparkContext?

When BlockManager generates a BlockId, BlockId only supports positive RDD ids, so BlockId generation fails.

> Switch to an AtomicLong?

The scope of influence is very broad; that may require extensive discussion.
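For context on the overflow itself, a minimal Java demonstration of how `AtomicInteger.getAndIncrement()` wraps silently past `Integer.MAX_VALUE`:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OverflowDemo {
    public static void main(String[] args) {
        // Start just below the limit to show the wrap immediately.
        AtomicInteger nextRddId = new AtomicInteger(Integer.MAX_VALUE);
        System.out.println(nextRddId.getAndIncrement()); // 2147483647
        System.out.println(nextRddId.getAndIncrement()); // -2147483648 (overflow)
    }
}
```

There is no exception on overflow; the counter simply goes negative, which is what eventually breaks BlockId generation.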

Contributor

You don't need to make this a var - see below.

@mridulm
Contributor

mridulm commented Nov 28, 2022

Agree with @srowen, what is the issue here?

Also, is it applicable to Dataset? (already long)

Additionally, if the RDD id is wrapping around, it can cause RDD id conflicts - we have to think through the impact of that as well.

@yorksity
Contributor Author

Updated details in the JIRA.

When BlockManager generates a BlockId, BlockId only supports positive RDD ids, so BlockId generation fails:

object BlockId {
  val RDD = "rdd_([0-9]+)_([0-9]+)".r
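A small standalone sketch (hypothetical class name) showing why a negative RDD id breaks this pattern - the digits-only regex simply fails to match:

```java
import java.util.regex.Pattern;

public class BlockIdCheck {
    // Same shape as Spark's RDD block-name pattern: digits only.
    static final Pattern RDD = Pattern.compile("rdd_([0-9]+)_([0-9]+)");

    static boolean isRddBlock(String name) {
        return RDD.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRddBlock("rdd_42_0"));  // true
        System.out.println(isRddBlock("rdd_-5_0")); // false: '-' is not a digit
    }
}
```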

Contributor
@mridulm mridulm left a comment

Thanks for the details, I can see how this can happen for very long-running and high-frequency streaming applications.

Unfortunately, this also means we can end up with multiple RDDs with the same id - which will require some thinking.

I think moving to Long might be a better idea - though that would be fairly disruptive.

+CC @Ngone51, @tgravescs, @cloud-fan


Comment on lines +2576 to +2586
private[spark] def newRddId(): Int = {
  var id = nextRddId.getAndIncrement()
  if (id >= 0) {
    return id
  }
  this.synchronized {
    nextRddId = new AtomicInteger(0)
    id = nextRddId.getAndIncrement()
  }
  id
}
Contributor

Something like this would remove the need for synchronization, etc.

Suggested change
- private[spark] def newRddId(): Int = {
-   var id = nextRddId.getAndIncrement()
-   if (id >= 0) {
-     return id
-   }
-   this.synchronized {
-     nextRddId = new AtomicInteger(0)
-     id = nextRddId.getAndIncrement()
-   }
-   id
- }
+ private[spark] def newRddId(): Int = {
+   nextRddId.getAndUpdate { i =>
+     var nextValue = i + 1
+     if (nextValue < 0) {
+       nextValue = 0
+     }
+     nextValue
+   }
+ }
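The lock-free wraparound in this suggestion can be sketched in plain Java, since `getAndUpdate` comes from `java.util.concurrent`; the class name and `start` constructor are hypothetical, added so the wrap is easy to exercise:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WrappingIdGenerator {
    private final AtomicInteger nextRddId;

    public WrappingIdGenerator(int start) {
        this.nextRddId = new AtomicInteger(start);
    }

    // getAndUpdate runs the function in a CAS retry loop, so no explicit
    // lock is needed; the counter wraps to 0 instead of going negative.
    public int newRddId() {
        return nextRddId.getAndUpdate(i -> {
            int next = i + 1;
            return next < 0 ? 0 : next;
        });
    }

    public static void main(String[] args) {
        WrappingIdGenerator gen = new WrappingIdGenerator(Integer.MAX_VALUE);
        System.out.println(gen.newRddId()); // 2147483647
        System.out.println(gen.newRddId()); // 0 (wrapped)
    }
}
```

Like the Scala suggestion, this keeps ids non-negative but still reuses old ids after the wrap, which is the conflict risk noted above.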

@mridulm
Contributor

mridulm commented Dec 12, 2022

Please update the PR details with description @yorksity

@cloud-fan
Contributor

is the problem caused by integer overflow?

@Ngone51
Member

Ngone51 commented Dec 12, 2022

> is the problem caused by integer overflow?

@cloud-fan According to the JIRA description, it is.

@Ngone51
Member

Ngone51 commented Dec 12, 2022

@yorksity Did you see this issue in the real world? If so, do you have any details about the job? Like batch or streaming? How long it takes to hit this issue? etc.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 23, 2023
@github-actions github-actions bot closed this Mar 24, 2023

8 participants