6 changes: 2 additions & 4 deletions core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala
@@ -266,17 +266,15 @@ private class DefaultPartitionCoalescer(val balanceSlack: Double = 0.10)
        numCreated += 1
      }
    }
-   tries = 0
    // if we don't have enough partition groups, create duplicates
    while (numCreated < targetLen) {
-     val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(tries)
-     tries += 1
+     val (nxt_replica, nxt_part) = partitionLocs.partsWithLocs(
+       rnd.nextInt(partitionLocs.partsWithLocs.length))

Contributor: Perhaps add a comment to explain the purpose of this change here?

      val pgroup = new PartitionGroup(Some(nxt_replica))
      groupArr += pgroup
      groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
      addPartToPGroup(nxt_part, pgroup)
      numCreated += 1
-     if (tries >= partitionLocs.partsWithLocs.length) tries = 0
    }
  }

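For context on why the change helps, here is an illustrative sketch, not the Spark implementation: the object name, the simplified partsWithLocs layout, and the duplicate-picking loop are made up for illustration, and only the 7919 seed appears in the real DefaultPartitionCoalescer. The removed code walked partsWithLocs sequentially via the tries cursor, so when the entries are clustered by location the duplicated groups almost all inherit the first location; picking a random index spreads them.

import scala.util.Random

object CoalesceSkewSketch {
  def main(args: Array[String]): Unit = {
    // Mirror the test below: 100 partitions with locations, the first 50 on locA, the rest on locB.
    val partsWithLocs: IndexedSeq[String] =
      (0 until 100).map(i => if (i < 50) "locA" else "locB")
    val targetLen = 50            // number of duplicated groups still needed
    val rnd = new Random(7919)    // same fixed seed as DefaultPartitionCoalescer

    // Old behaviour: consume partsWithLocs from index 0, wrapping around.
    val sequentialPicks = (0 until targetLen).map(t => partsWithLocs(t % partsWithLocs.length))
    // New behaviour: pick a random entry for each duplicated group.
    val randomPicks = (0 until targetLen).map(_ => partsWithLocs(rnd.nextInt(partsWithLocs.length)))

    def countByLocation(picks: Seq[String]): Map[String, Int] =
      picks.groupBy(identity).map { case (loc, hits) => loc -> hits.size }

    println(countByLocation(sequentialPicks)) // Map(locA -> 50): every duplicate lands on locA
    println(countByLocation(randomPicks))     // roughly even split between locA and locB
  }
}

The pre-fix numbers the test documents (49 and 1) differ slightly from the 50/0 of this sketch because the real coalescer runs an initial location-aware pass before duplicating; the sketch only models the duplication loop.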
43 changes: 43 additions & 0 deletions core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
}.collect()
}

test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
Contributor: This test looks to me like a good candidate for flakiness, since we are picking random numbers. Can we set the seed in order to avoid this?

Contributor Author: The test is in fact deterministic. The seed is already fixed here:

    val rnd = new scala.util.Random(7919) // keep this class deterministic

Contributor: I see, thanks, sorry, I missed it.

val numInputPartitions = 100
val numCoalescedPartitions = 50
val locations = Array("locA", "locB")

val inputRDD = sc.makeRDD(Range(0, numInputPartitions).toArray[Int], numInputPartitions)
assert(inputRDD.getNumPartitions == numInputPartitions)

val locationPrefRDD = new LocationPrefRDD(inputRDD, { (p: Partition) =>
if (p.index < numCoalescedPartitions) {
Seq(locations(0))
} else {
Seq(locations(1))
}
})
val coalescedRDD = new CoalescedRDD(locationPrefRDD, numCoalescedPartitions)

val numPartsPerLocation = coalescedRDD
.getPartitions
.map(coalescedRDD.getPreferredLocations(_).head)
.groupBy(identity)
.mapValues(_.size)

// Without the fix these would be:
// numPartsPerLocation(locations(0)) == numCoalescedPartitions - 1
// numPartsPerLocation(locations(1)) == 1

Contributor: Normally we don't write the comment this way; maybe just say we want to ensure the location preferences are distributed evenly.

Contributor Author: Done.
assert(numPartsPerLocation(locations(0)) > 0.4 * numCoalescedPartitions)
Contributor: How confident are we that the assert condition will hold? How was the fraction 0.4 chosen?

Contributor: Never mind, the result is deterministic.

Contributor Author: Added a comment about flakiness and fixed the seed.

assert(numPartsPerLocation(locations(1)) > 0.4 * numCoalescedPartitions)
}
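
For rough intuition on the 0.4 bound (the diff itself does not derive it): the 50 coalesced partitions are spread over two locations, so an even split gives about 25 per location; requiring more than 0.4 * 50 = 20 per location leaves slack for the seeded randomness, while the pre-fix 49/1 split misses the bound by a wide margin.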

// NOTE
// Below tests calling sc.stop() have to be the last tests in this suite. If there are tests
// running after them and if they access sc those tests will fail as sc is already closed, because
@@ -1210,3 +1240,16 @@ class SizeBasedCoalescer(val maxSize: Int) extends PartitionCoalescer with Serializable {
groups.toArray
}
}

/** Alters the preferred locations of the parent RDD using provided function. */
class LocationPrefRDD[T: ClassTag](
@transient var prev: RDD[T],
val locationPicker: Partition => Seq[String]) extends RDD[T](prev) {
override protected def getPartitions: Array[Partition] = prev.partitions

override def compute(partition: Partition, context: TaskContext): Iterator[T] =
null.asInstanceOf[Iterator[T]]

override def getPreferredLocations(partition: Partition): Seq[String] =
locationPicker(partition)
}
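
A note on the stub above: compute returns null because these tests only exercise partition metadata through getPartitions and getPreferredLocations; the coalesced RDD's data is never materialized.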