
Commit 8d8f44f

[Minor] Comment improvements in ExternalSorter.
1. Clearly specifies the contract/interactions for users of this class.
2. Minor fix in one doc to avoid ambiguity.
1 parent c25ca7c commit 8d8f44f

core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala

Lines changed: 19 additions & 8 deletions
@@ -53,7 +53,18 @@ import org.apache.spark.storage.{BlockObjectWriter, BlockId}
  * probably want to pass None as the ordering to avoid extra sorting. On the other hand, if you do
  * want to do combining, having an Ordering is more efficient than not having it.
  *
- * At a high level, this class works as follows:
+ * Users interact with this class in the following way:
+ *
+ * 1. Instantiate an ExternalSorter.
+ *
+ * 2. Call insertAll() with a set of records.
+ *
+ * 3. Request an iterator() back to traverse sorted/aggregated records.
+ *     - or -
+ *    Invoke writePartitionedFile() to create a file containing sorted/aggregated outputs
+ *    that can be used in Spark's sort shuffle.
+ *
+ * At a high level, this class works internally as follows:
  *
  * - We repeatedly fill up buffers of in-memory data, using either a SizeTrackingAppendOnlyMap if
  *   we want to combine by key, or an simple SizeTrackingBuffer if we don't. Inside these buffers,
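The first hunk turns the class comment into an explicit usage contract: construct the sorter, feed it records with insertAll(), then either traverse the results with iterator() or emit a shuffle file with writePartitionedFile(), and finally call stop(). Below is a minimal sketch of that contract. The constructor parameters shown (aggregator, partitioner, ordering, serializer) are an assumption about the ExternalSorter constructor of this era and do not appear in this diff; the class is private[spark], so real callers live inside Spark itself (e.g. the sort-based shuffle writer).

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.util.collection.ExternalSorter

// 1. Instantiate an ExternalSorter. Parameter names and order are assumed,
//    not shown in this diff.
val sorter = new ExternalSorter[String, Int, Int](
  aggregator = None,                          // no map-side combining in this sketch
  partitioner = Some(new HashPartitioner(4)), // bucket output records by partition
  ordering = Some(Ordering[String]),          // sort keys within each partition
  serializer = None)                          // fall back to the default serializer

// 2. Call insertAll() with a set of records (any Product2[K, V] values).
sorter.insertAll(Iterator("banana" -> 1, "apple" -> 2, "banana" -> 3))

// 3. Either traverse the sorted records directly...
sorter.iterator.foreach(println)
// ...or call writePartitionedFile(...) to produce a single file usable by the
// sort shuffle (its arguments are omitted here; they are version-specific).

// Finally, delete any intermediate spill files.
sorter.stop()
```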
@@ -65,11 +76,11 @@ import org.apache.spark.storage.{BlockObjectWriter, BlockId}
  *   aggregation. For each file, we track how many objects were in each partition in memory, so we
  *   don't have to write out the partition ID for every element.
  *
- * - When the user requests an iterator, the spilled files are merged, along with any remaining
- *   in-memory data, using the same sort order defined above (unless both sorting and aggregation
- *   are disabled). If we need to aggregate by key, we either use a total ordering from the
- *   ordering parameter, or read the keys with the same hash code and compare them with each other
- *   for equality to merge values.
+ * - When the user requests an iterator or file output, the spilled files are merged, along with
+ *   any remaining in-memory data, using the same sort order defined above (unless both sorting
+ *   and aggregation are disabled). If we need to aggregate by key, we either use a total ordering
+ *   from the ordering parameter, or read the keys with the same hash code and compare them with
+ *   each other for equality to merge values.
  *
  * - Users are expected to call stop() at the end to delete all the intermediate files.
  *
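The reworded bullet covers the merge path used by both iterator() and writePartitionedFile(): spilled files and remaining in-memory data are merged in sort order, and when aggregation is requested without a total ordering, keys are grouped by hash code and compared for equality before their values are combined. The sketch below is purely conceptual and is not Spark's implementation (the real code streams through buffered, pre-sorted spill readers rather than materializing everything); the helper name mergeByHash is made up for illustration.

```scala
// Conceptual sketch only -- not Spark's code. Inputs are assumed to be
// sorted by key hash code; keys that share a hash are compared with == so
// that only truly equal keys have their values merged.
def mergeByHash[K, C](streams: Seq[Iterator[(K, C)]],
                      mergeCombiners: (C, C) => C): Iterator[(K, C)] = {
  streams.flatten                                // materialized here for brevity
    .groupBy { case (k, _) => k.hashCode }       // bucket records by key hash
    .valuesIterator
    .flatMap { bucket =>
      // Within a bucket, keys may collide on hash yet still be unequal.
      bucket.foldLeft(List.empty[(K, C)]) { case (acc, (k, c)) =>
        acc.indexWhere { case (k2, _) => k2 == k } match {
          case -1 => (k, c) :: acc                                    // genuinely new key
          case i  => acc.updated(i, (k, mergeCombiners(acc(i)._2, c))) // equal key: merge
        }
      }.iterator
    }
}
```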
@@ -259,8 +270,8 @@ private[spark] class ExternalSorter[K, V, C](
    * Spill our in-memory collection to a sorted file that we can merge later (normal code path).
    * We add this file into spilledFiles to find it later.
    *
-   * Alternatively, if bypassMergeSort is true, we spill to separate files for each partition.
-   * See spillToPartitionedFiles() for that code path.
+   * This should not be invoked if bypassMergeSort is true. In that case, spillToPartitionedFiles()
+   * is used to write files for each partition.
    *
    * @param collection whichever collection we're using (map or buffer)
    */
