Commit 2ef37c9

jyotiska authored and rxin committed
Merge pull request #562 from jyotiska/master. Closes #562.
Added example Python code for sort

I added an example Python code for sort. Right now, PySpark has limited examples for new people willing to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.

Author: jyotiska <[email protected]>

== Merge branch commits ==

commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df
Author: jyotiska <[email protected]>
Date:   Sun Feb 9 11:00:41 2014 +0530

    Added comments in code on collect() method

commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c
Author: jyotiska <[email protected]>
Date:   Sat Feb 8 13:12:37 2014 +0530

    Updated python example code sort.py

commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb
Author: jyotiska <[email protected]>
Date:   Sat Feb 8 12:59:09 2014 +0530

    Added example python code for sort
1 parent b6d40b7 commit 2ef37c9

1 file changed: +36 -0 lines changed

python/examples/sort.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import sys
+
+from pyspark import SparkContext
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 3:
+        print >> sys.stderr, "Usage: sort <master> <file>"
+        exit(-1)
+    sc = SparkContext(sys.argv[1], "PythonSort")
+    lines = sc.textFile(sys.argv[2], 1)
+    sortedCount = lines.flatMap(lambda x: x.split(' ')) \
+        .map(lambda x: (int(x), 1)) \
+        .sortByKey(lambda x: x)
+    # This is just a demo on how to bring all the sorted data back to a single node.
+    # In reality, we wouldn't want to collect all the data to the driver node.
+    output = sortedCount.collect()
+    for (num, unitcount) in output:
+        print num
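
For readers on Python 3 with a recent PySpark, the example in this diff could be sketched roughly as follows. This is not part of the commit. The names `parse_ints`, `sort_integers`, and `main` are hypothetical. Three things differ from the committed file and are assumptions made for illustration: `split()` replaces `split(' ')` so that repeated spaces and blank lines do not raise `ValueError`, `sortByKey()` is called without arguments and relies on its default ascending order, and `take(20)` stands in for `collect()`, following the diff's own comment about not bringing all the data back to the driver.

```python
import sys


def parse_ints(line):
    """Split one line of text on whitespace and convert each token to int."""
    return [int(token) for token in line.split()]


def sort_integers(sc, path, num_partitions=1):
    """Return an RDD of the integers stored in `path`, in ascending order.

    `sc` is an existing SparkContext (hypothetical helper, not in the commit).
    """
    return (sc.textFile(path, num_partitions)
              .flatMap(parse_ints)
              .map(lambda n: (n, 1))   # sortByKey() needs (key, value) pairs
              .sortByKey()             # ascending by default
              .keys())                 # drop the placeholder value again


def main(argv):
    """Mirror the original command line: sort <master> <file>."""
    if len(argv) < 3:
        print("Usage: sort <master> <file>", file=sys.stderr)
        return 1
    # Imported lazily so the helpers above can be used without pyspark installed.
    from pyspark import SparkContext
    sc = SparkContext(argv[1], "PythonSort")
    try:
        # take() bounds how many records reach the driver, unlike collect().
        for num in sort_integers(sc, argv[2]).take(20):
            print(num)
    finally:
        sc.stop()
    return 0
```

A script entry point could call `sys.exit(main(sys.argv))`. To write the full sorted result rather than print a sample, `saveAsTextFile` on the returned RDD is one option that keeps the data off the driver.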
