@@ -53,7 +53,18 @@ import org.apache.spark.storage.{BlockObjectWriter, BlockId}
  * probably want to pass None as the ordering to avoid extra sorting. On the other hand, if you do
  * want to do combining, having an Ordering is more efficient than not having it.
  *
- * At a high level, this class works as follows:
+ * Users interact with this class in the following way:
+ *
+ * 1. Instantiate an ExternalSorter.
+ *
+ * 2. Call insertAll() with a set of records.
+ *
+ * 3. Request an iterator() back to traverse sorted/aggregated records.
+ * - or -
+ * Invoke writePartitionedFile() to create a file containing sorted/aggregated outputs
+ * that can be used in Spark's sort shuffle.
+ *
+ * At a high level, this class works internally as follows:
  *
  * - We repeatedly fill up buffers of in-memory data, using either a SizeTrackingAppendOnlyMap if
  * we want to combine by key, or a simple SizeTrackingBuffer if we don't. Inside these buffers,
@@ -65,11 +76,11 @@ import org.apache.spark.storage.{BlockObjectWriter, BlockId}
  * aggregation. For each file, we track how many objects were in each partition in memory, so we
  * don't have to write out the partition ID for every element.
  *
- * - When the user requests an iterator, the spilled files are merged, along with any remaining
- * in-memory data, using the same sort order defined above (unless both sorting and aggregation
- * are disabled). If we need to aggregate by key, we either use a total ordering from the
- * ordering parameter, or read the keys with the same hash code and compare them with each other
- * for equality to merge values.
+ * - When the user requests an iterator or file output, the spilled files are merged, along with
+ * any remaining in-memory data, using the same sort order defined above (unless both sorting
+ * and aggregation are disabled). If we need to aggregate by key, we either use a total ordering
+ * from the ordering parameter, or read the keys with the same hash code and compare them with
+ * each other for equality to merge values.
  *
  * - Users are expected to call stop() at the end to delete all the intermediate files.
  *
@@ -259,8 +270,8 @@ private[spark] class ExternalSorter[K, V, C](
  * Spill our in-memory collection to a sorted file that we can merge later (normal code path).
  * We add this file into spilledFiles to find it later.
  *
- * Alternatively, if bypassMergeSort is true, we spill to separate files for each partition.
- * See spillToPartitionedFiles() for that code path.
+ * This should not be invoked if bypassMergeSort is true. In that case, spillToPartitionedFiles()
+ * is used to write files for each partition.
  *
  * @param collection whichever collection we're using (map or buffer)
  */
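
The lifecycle described in the updated doc comment (insert records, spill sorted runs when the buffer fills, merge the runs on iteration, then call stop()) can be illustrated with a small self-contained sketch. This is not Spark's ExternalSorter, which is private[spark] and has a different constructor. `ToyExternalSorter`, `maxInMemory`, `combine`, and `ToyExternalSorterDemo` are hypothetical names invented for illustration, spills are kept in memory rather than written to files, and partitioning is left out entirely.

```scala
import scala.collection.mutable

// Hypothetical toy class for illustration only; NOT Spark's ExternalSorter API.
// Assumes key equality (==) agrees with the supplied Ordering.
final class ToyExternalSorter[K, V](maxInMemory: Int, combine: (V, V) => V)(
    implicit ord: Ordering[K]) {

  // In-memory buffer that combines values by key as records arrive.
  private val buffer = mutable.Map.empty[K, V]

  // Each entry stands in for one sorted spill file (kept in memory in this sketch).
  private val spills = mutable.ArrayBuffer.empty[Vector[(K, V)]]

  def insertAll(records: Iterator[(K, V)]): Unit =
    records.foreach { case (k, v) =>
      buffer(k) = buffer.get(k).map(combine(_, v)).getOrElse(v)
      if (buffer.size >= maxInMemory) spill()
    }

  // Sort the current buffer by key, record it as a run, and start a fresh buffer.
  private def spill(): Unit = {
    spills += buffer.toVector.sortBy(_._1)
    buffer.clear()
  }

  // Merge all spilled runs plus the remaining in-memory data in key order,
  // combining values whose keys compare equal across runs.
  def iterator: Iterator[(K, V)] = {
    val runs = (spills.toVector :+ buffer.toVector.sortBy(_._1)).map(_.iterator.buffered)
    new Iterator[(K, V)] {
      def hasNext: Boolean = runs.exists(_.hasNext)
      def next(): (K, V) = {
        val live = runs.filter(_.hasNext)
        if (live.isEmpty) throw new NoSuchElementException("no more records")
        val minKey = live.map(_.head._1).min
        // Each run holds at most one entry per key, so one next() per matching run suffices.
        val merged = live.filter(r => ord.equiv(r.head._1, minKey)).map(_.next()._2).reduce(combine)
        (minKey, merged)
      }
    }
  }

  // Mirrors the expectation that callers release intermediate state when done.
  def stop(): Unit = {
    spills.clear()
    buffer.clear()
  }
}

object ToyExternalSorterDemo {
  def run(): List[(String, Int)] = {
    val sorter = new ToyExternalSorter[String, Int](maxInMemory = 2, combine = _ + _)
    sorter.insertAll(Iterator("b" -> 1, "a" -> 2, "c" -> 3, "a" -> 4, "b" -> 5))
    val result = sorter.iterator.toList
    sorter.stop()
    result
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The merge here scans every run head on each step, which keeps the sketch short. A priority queue keyed on each run's head would be the usual choice when there are many runs.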