Commit 675c7ae

Releasing assignment 7.
1 parent 00b4074 commit 675c7ae

2 files changed

+186
-31
lines changed

Diff for: assignment7.html

+141-31
@@ -80,30 +80,29 @@ <h1>Assignments <small>CS 489/698 Big Data Infrastructure (Winter 2017)</small><
8080
<div>
8181
<h3>Assignment 7: Inverted Indexing (Redux) <small>due 1:00pm April 3</small></h3>
8282

83-
<p><b>Note:</b> This assignment is a draft and incomplete. If you
84-
begin working on this assignment, be aware that parts can still
85-
hange. This message will be removed once the assignment is
86-
complete.</p>
87-
8883
<p>In this assignment you'll revisit the inverted indexing and boolean
8984
retrieval program in <a href="assignment3.html">assignment 3</a>. In
9085
assignment 3, your indexer program wrote postings to HDFS
9186
in <code>MapFile</code>s and your boolean retrieval program read
9287
postings from those <code>MapFile</code>s. In this assignment, you'll
9388
write postings to and read postings from HBase instead. In other
94-
words, the program logic should not change, except for the backend
89+
words, the core program logic should not change, except for the backend
9590
storage that you are using. This assignment is to be completed using
9691
MapReduce in Java.</p>
9792

98-
<p>Due to the complexities of setting up HBase in Altiscale, this
99-
assignment will be completed entirely in the Linux Student CS
100-
Environment, where we have stood up a single-node HBase cluster. Check
101-
out <a href="http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status"><code>http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status</code></a>. As
102-
a result, you won't be able to play with HBase on sizeable
103-
collections, although the assignment will still give you some
104-
experience of what developing against HBase "feels like".</p>
105-
106-
<p>For this assignment ssh
93+
<p><b>Note</b>: Due to the complexities of setting up HBase in
94+
Altiscale, we have not been able to stand up an HBase cluster yet. For
95+
now, work in the Linux Student CS environment (more details below). If
96+
we manage to get a stable HBase cluster running on Altiscale, you will
97+
complete the Altiscale portion of the assignment; otherwise, you will
98+
not be responsible for getting code to run on Altiscale.</p>
99+
100+
<p>We have stood up a single-node HBase cluster in the Linux Student
101+
CS environment. Check
102+
out <a href="http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status"><code>http://ubuntu1404-010.student.cs.uwaterloo.ca:16010/master-status</code></a>. You
103+
won't be able to play with HBase on sizeable collections, although the
104+
assignment will still give you some experience of what developing
105+
against HBase "feels like". For this assignment ssh
107106
into <code>ubuntu1404-010.student.cs.uwaterloo.ca</code> and work
108107
specifically on that host.</p>
109108

@@ -138,15 +137,15 @@ <h4 style="padding-top: 10px">HBase Word Count</h4>
138137
</pre>
139138

140139
<p>Use the <code>-config</code> option to specify the HBase config
141-
file: point to a version on the Altiscale workspace that we've
142-
prepared for you. This config file tells the program how to connect to
143-
the HBase cluster. Use the <code>-table</code> option to name the
144-
table you're inserting the word counts into. The other options should
145-
be straightforward to understand.</p>
140+
file: point to a version that we've prepared for you. This config file
141+
tells the program how to connect to the HBase cluster. Use
142+
the <code>-table</code> option to name the table you're inserting the
143+
word counts into. The other options should be straightforward to
144+
understand.</p>
146145

147-
<p><B>Note:</b> Since HBase is a shared resource across the cluster,
148-
please make your tables unique by using your username as part of the
149-
table name, per above.</p>
146+
<p><B>Note:</b> Since HBase is a shared resource, please make your
147+
tables unique by using your username as part of the table name, per
148+
above.</p>
150149

151150
<p>You should then be able to fetch the word counts from HBase:</p>
152151

@@ -168,9 +167,9 @@ <h4 style="padding-top: 10px">HBase Storage</h4>
168167

169168
<p>You will write three programs: <code>BuildInvertedIndexHBase</code>,
170169
<code>InsertCollectionHBase</code>,
171-
and <code>BooleanRetrievalHBase</code>. Feel free to use your code
172-
from assignment 3 a starting point. Note that you don't need to worry
173-
about index compression for this assignment!</p>
170+
and <code>BooleanRetrievalHBase</code>. Feel free to use code from
171+
Bespin or assignment 3 as starting points. Note that you don't need to
172+
worry about index compression for this assignment.</p>
174173

175174
<p>The <code>BuildInvertedIndexHBase</code> program is the HBase
176175
version of <code>BuildInvertedIndex</code> from the Bespin
@@ -187,9 +186,13 @@ <h4 style="padding-top: 10px">HBase Storage</h4>
187186
ca.uwaterloo.cs.bigdata2017w.assignment7.BuildInvertedIndexHBase \
188187
-config /u0/cs489/hbase-site.xml \
189188
-input data/Shakespeare.txt \
190-
-table cs489-2017w-lintool-a7-index-shakespeare -reducers 4
189+
-index cs489-2017w-lintool-a7-index-shakespeare -reducers 4
191190
</pre>
192191

192+
<p>The <code>-index</code> option specifies the name of the HBase
193+
table. If it exists already, your program should drop the table and
194+
recreate it.</p>
195+
193196
<p>The <code>InsertCollectionHBase</code> program will insert the
194197
collection into HBase, where the row key is the long offset of each
195198
line, what we've been using as the document id. The simplest
@@ -203,7 +206,7 @@ <h4 style="padding-top: 10px">HBase Storage</h4>
203206
ca.uwaterloo.cs.bigdata2017w.assignment7.InsertCollectionHBase \
204207
-config /u0/cs489/hbase-site.xml \
205208
-input data/Shakespeare.txt \
206-
-table cs489-2017w-lintool-a7-collection-shakespeare -reducers 4
209+
-index cs489-2017w-lintool-a7-collection-shakespeare -reducers 4
207210
</pre>
208211

209212
<p>The <code>BooleanRetrievalHBase</code> program is the HBase version
@@ -215,7 +218,7 @@ <h4 style="padding-top: 10px">HBase Storage</h4>
215218
<pre>
216219
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
217220
ca.uwaterloo.cs.bigdata2017w.assignment7.BooleanRetrievalHBase \
218-
-config /Users/jimmylin/workspace/hbase-1.2.4/conf/hbase-site.xml \
221+
-config /u0/cs489/hbase-site.xml \
219222
-index cs489-2017w-lintool-a7-index-shakespeare \
220223
-collection cs489-2017w-lintool-a7-collection-shakespeare \
221224
-query "outrageous fortune AND"
@@ -224,11 +227,118 @@ <h4 style="padding-top: 10px">HBase Storage</h4>
224227
<p>Note that <code>-index</code> and <code>-collection</code> specify
225228
HBase tables (the result of <code>BuildInvertedIndexHBase</code> and
226229
<code>InsertCollectionHBase</code>, respectively). You should verify
227-
that all the sample queries (from assignment 3) on both collections
228-
work.</p>
230+
that all the sample queries (from assignment 3) work. Your output
231+
should match the output of assignment 3.</p>
229232

230233
<h4 style="padding-top: 10px">Search API Endpoint</h4>
231234

235+
<p>Finally, write a search API REST endpoint in a program
236+
named <code>HBaseSearchEndpoint</code>, wrapping around
237+
the <code>BooleanRetrievalHBase</code> program above. We'll start up
238+
the server as follows:</p>
239+
240+
<pre>
241+
hadoop jar target/bigdata2017w-0.1.0-SNAPSHOT.jar \
242+
ca.uwaterloo.cs.bigdata2017w.assignment7.HBaseSearchEndpoint \
243+
-config /u0/cs489/hbase-site.xml \
244+
-index cs489-2017w-lintool-a7-index-shakespeare \
245+
-collection cs489-2017w-lintool-a7-collection-shakespeare \
246+
-port 7890
247+
</pre>
248+
249+
<p>The <code>-port</code> option specifies the port number that the
250+
service listens on. When you are testing the server, please use random
251+
port numbers, because otherwise everyone's service will collide.</p>
252+
253+
<p>The search API endpoint should behave as follows, returning
254+
JSON.</p>
255+
256+
<pre>
257+
$ curl http://localhost:7890/search?query=fair+nature+AND
258+
[
259+
{"docid": 425450, "text": " CELIA. No; when Nature hath made a fair creature, may she not by"},
260+
{"docid": 1553567, "text": " Disguise fair nature with hard-favour'd rage;"},
261+
{"docid": 5327159, "text": " Showing fair nature is both kind and tame;"}
262+
]
263+
$ curl http://localhost:7890/search?query=outrageous+fortune+AND
264+
[
265+
{"docid": 1073319, "text": " The slings and arrows of outrageous fortune"}
266+
]
267+
</pre>
268+
269+
<p>The reference solution uses Jetty, but you are welcome to use any
270+
framework for the service, so long as the API endpoint conforms to the
271+
behavior above.</p>
272+
273+
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
274+
275+
<p>Please follow these instructions carefully!</p>
276+
277+
<p>Make sure your repo has the following items:</p>
278+
279+
<ul>
280+
281+
<li>If you have any notes you wish to convey to us, put them
282+
in <code>bigdata2017w/assignment7.md</code>. Otherwise, please create
283+
an empty file&mdash;following previous assignments, this is where your
284+
grade will go.</li>
285+
286+
<li>Your implementations should go in
287+
package <code>ca.uwaterloo.cs.bigdata2017w.assignment7</code>. At the
288+
minimum, you should have <code>BuildInvertedIndexHBase</code>,
289+
<code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
290+
and <code>HBaseSearchEndpoint</code>. Feel free to include helper
291+
classes also.</li>
292+
293+
</ul>
294+
295+
<p>The following check script is provided for you:</p>
296+
297+
<ul>
298+
299+
<li><a href="assignments/check_assignment7_linux.py"><code>check_assignment7_linux.py</code></a></li>
300+
301+
</ul>
302+
303+
<p>Note that the check script does not check the behavior of your
304+
search endpoint.</p>
305+
306+
<p>When you've done everything, commit to your repo and remember to
307+
push back to origin. You should be able to see your edits in the web
308+
interface. Before you consider the assignment "complete", we would
309+
recommend that you verify everything above works by performing a clean
310+
clone of your repo and running the public check scripts.</p>
311+
312+
<p>That's it!</p>
313+
314+
<h4 style="padding-top: 10px">Grading</h4>
315+
316+
<p>The entire assignment is worth 50 points:</p>
317+
318+
<ul>
319+
320+
<li>The implementation of <code>BuildInvertedIndexHBase</code>,
321+
<code>InsertCollectionHBase</code>, <code>BooleanRetrievalHBase</code>,
322+
and <code>HBaseSearchEndpoint</code> are each worth 5 points.</li>
323+
324+
<li>Linux Student CS environment: Getting your code to run on sample
325+
queries (the same as the ones in assignment 3) is worth 10
326+
points. That is, to earn all 10 points, we should be able to run
327+
your code on the Shakespeare collection, following exactly the
328+
procedure above.</li>
329+
330+
<li>Altiscale: Getting your code to run on sample queries (the same
331+
as the ones in assignment 3) is worth 10 points. That is, to earn
332+
all 10 points, we should be able to run your code on the Wikipedia
333+
collection, following exactly the procedure above.</li>
334+
335+
<li>Another 10 points is allotted to us verifying the behavior and
336+
output of your program in ways that we will not tell you. We're
337+
giving you the "public" versions of the check scripts; we'll run a
338+
"private" version to examine your output further (i.e., think blind
339+
test cases).</li>
340+
341+
</ul>
232342

233343

234344
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
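
A minimal sketch of the retrieval and endpoint behavior that assignment7.html describes, written in Python only because the check script in this commit is Python; the assignment itself requires Java MapReduce against HBase. The two dictionaries are hypothetical stand-ins for the `-index` and `-collection` HBase tables, docid 7 and its text are invented toy data, and the function names are illustrative. It assumes queries are in postfix form with `AND` and `OR` operators, as the sample query "outrageous fortune AND" suggests.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Hypothetical in-memory stand-ins for the two HBase tables.
INDEX = {"outrageous": {1073319}, "fortune": {7, 1073319}}
COLLECTION = {
    7: "hypothetical line mentioning fortune",
    1073319: " The slings and arrows of outrageous fortune",
}


def query_from_url(url):
    """Extract the 'query' parameter; '+' in the URL decodes to a space."""
    params = parse_qs(urlsplit(url).query)
    return params["query"][0]


def evaluate_postfix(query, index):
    """Evaluate a postfix boolean query, returning sorted matching docids."""
    stack = []
    for token in query.split():
        if token in ("AND", "OR"):
            if len(stack) < 2:
                raise ValueError("operator without two operands: %r" % query)
            right, left = stack.pop(), stack.pop()
            stack.append(left & right if token == "AND" else left | right)
        else:
            # A term pushes its postings (empty if the term is unknown).
            stack.append(set(index.get(token, ())))
    if len(stack) != 1:
        raise ValueError("malformed postfix query: %r" % query)
    return sorted(stack.pop())


def search(query, index, collection):
    """Build the JSON array of {"docid", "text"} objects for a query."""
    hits = [{"docid": d, "text": collection[d]}
            for d in evaluate_postfix(query, index)]
    return json.dumps(hits)


if __name__ == "__main__":
    url = "http://localhost:7890/search?query=outrageous+fortune+AND"
    print(search(query_from_url(url), INDEX, COLLECTION))
```

In a real solution the dictionary lookups would become HBase reads, and `search` would sit behind whatever HTTP framework serves `/search`.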

Diff for: assignments/check_assignment7_linux.py

+45
@@ -0,0 +1,45 @@
#!/usr/bin/python
"""CS 489 Big Data Infrastructure (Winter 2017): Self-check script

This file can be open to students

Usage:
    run this file on 'bigdata2017w' repository with github-username
    ex) check_assignment7_public_linux.py [github-username]
"""

import sys,os,re,argparse
from subprocess import call

def check_a7(username,reducers):
    """Run assignment7 in linux environment"""
    call(["mvn","clean","package"])
    call(["hadoop","jar","target/bigdata2017w-0.1.0-SNAPSHOT.jar",
          "ca.uwaterloo.cs.bigdata2017w.assignment7.BuildInvertedIndexHBase",
          "-config", "/u0/cs489/hbase-site.xml",
          "-input", "data/Shakespeare.txt",
          "-index", "cs489-2017w-{0}-a7-index-shakespeare".format(username),
          "-reducers", str(reducers)])
    call(["hadoop","jar","target/bigdata2017w-0.1.0-SNAPSHOT.jar",
          "ca.uwaterloo.cs.bigdata2017w.assignment7.InsertCollectionHBase",
          "-config", "/u0/cs489/hbase-site.xml",
          "-input", "data/Shakespeare.txt",
          "-index", "cs489-2017w-{0}-a7-collection-shakespeare".format(username),
          "-reducers", str(reducers)])
    call(["hadoop","jar","target/bigdata2017w-0.1.0-SNAPSHOT.jar",
          "ca.uwaterloo.cs.bigdata2017w.assignment7.BooleanRetrievalHBase",
          "-config", "/u0/cs489/hbase-site.xml",
          "-index", "cs489-2017w-{0}-a7-index-shakespeare".format(username),
          "-collection", "cs489-2017w-{0}-a7-collection-shakespeare".format(username),
          "-query", "outrageous fortune AND"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="CS 489/689 2017w assignment public check script for Linux")
    parser.add_argument('username',metavar='[Github Username]', help="Github username",type=str)
    parser.add_argument('-r','--reducers',help="Number of reducers to use.",type=int,default=1)
    args=parser.parse_args()
    try:
        check_a7(args.username,args.reducers)
    except Exception as e:
        print(e)
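
Since the check script above does not exercise the search endpoint, a response-shape check along the following lines might be useful when testing by hand. This is only a sketch, assuming Python 3: `validate_search_response` is a hypothetical helper, not part of the provided script, and it checks only the structure shown in the sample `curl` output (a JSON array of objects with an integer `docid` and a string `text`).

```python
import json


def validate_search_response(body):
    """Parse an endpoint response body and check its documented shape.

    Returns the list of hits, or raises ValueError if the body is not a
    JSON array of {"docid": int, "text": str} objects.
    """
    hits = json.loads(body)  # invalid JSON raises a ValueError subclass
    if not isinstance(hits, list):
        raise ValueError("top-level JSON value must be an array")
    for hit in hits:
        if not isinstance(hit, dict):
            raise ValueError("each hit must be a JSON object")
        docid = hit.get("docid")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(docid, int) or isinstance(docid, bool):
            raise ValueError("docid must be an integer")
        if not isinstance(hit.get("text"), str):
            raise ValueError("text must be a string")
    return hits


if __name__ == "__main__":
    sample = ('[{"docid": 1073319, '
              '"text": " The slings and arrows of outrageous fortune"}]')
    print(len(validate_search_response(sample)), "hit(s) with valid shape")
```

The body passed in could come from `curl` output saved to a file or from an HTTP client of your choice; fetching it is left out so the sketch stays independent of any running server.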