-
Notifications
You must be signed in to change notification settings - Fork 239
MapReduce Pattern
Attention: this Wiki hosts an outdated version of the TinkerPop framework and Gremlin language documentation.
Please visit the Apache TinkerPop website and latest documentation.
MapReduce is a data processing pattern that groups objects under a key (the map and shuffle) and then applies some function to those groupings (the reduce). Gremlin supports this pattern with the groupBy
step. We will demonstrate its use over the provided Grateful Dead song/artist graph.
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> g.loadGraphML('data/graph-example-2.xml')
==>null
The groupBy
step can take 2 or 3 function parameters. These functions are:
- key function: for the incoming object, generate a key to group it by.
- value function: before storing the object under the key, process it to yield the value to be stored.
- reduce function: when the stream is empty, apply this function to all the values stored at each key.
The traversal below will group all the song vertices by whether they are an original or cover song. NOTE: the hasNot('song_type',null)
is a filter that ensures only song vertices are analyzed. A more efficient expression would be to use an index as opposed to a linear scan and filter.
gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{it}.cap.next()
==>cover=[v[332], v[330], v[331], v[336], v[337], v[334], v[335], v[531], v[747], v[746], v[535], v[740], v[538], v[744], v[742], v[3], ...]
==>original=[v[349], v[341], v[343], v[346], v[348], v[4], v[9], v[327], v[390], v[393], v[392], v[395], v[396], v[397], v[398], v[399], ...]
It is possible to provide a “richer” value function so that the grouped songs are represented by their name as opposed to their vertex.
gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{it.name}.cap.next()
==>cover=[HIDEAWAY, TANGLED UP IN BLUE, SHELTER FROM THE STORM, GOODNIGHT IRENE, CHINESE BONES, NEIGHBORHOOD GIRLS, FOREVER YOUNG, ...]
==>original=[BLACK MUDDY RIVER, ALICE D MILLIONAIRE, ALLIGATOR, AT A SIDING, BARBED WIRE WHIPPING PARTY, BERTHA, HERE COMES SUNSHINE, ...]
The reduce function is applied to all the values generated for a key. A simple example is provided which simply counts the number of songs per song type. Before showing a full reduce, see the intermediate results of the map and shuffle.
gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{1}.cap.next()
==>cover=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
==>original=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Now, a reduce function is added where the 1s are counted.
gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{1}{it.size()}.cap.next()
==>cover=313
==>original=184
The previous example was for demonstration purposes as the more efficient way to get such count distributions is via groupCount
.
gremlin> g.V.hasNot('song_type',null).groupCount{it.song_type}.cap.next()
==>cover=313
==>original=184
However, here is a reduce-based groupBy
traversal that is difficult to achieve any other way. The traversal below provides a nested distribution. It answers the question: “How many songs was a particular artist the primary singer for, for the different song types?”
gremlin> m = g.V.hasNot('song_type',null).groupBy{it.song_type}{it.out('sung_by').next()}{it._().name.groupCount.cap.next()}.cap.next()
==>cover={Unknown=6, Weir=66, John_Fogerty=5, Boz_Scaggs=1, Pigpen_Garcia=1, All=9, Grateful_Dead=1, Garcia=74, Pigpen_Mydland=3, Etta_James=1, None=6, Lesh_Mydland=1, Neal_Cassady=1, Pigpen_Weir_Mydland=2, Weir_Garcia=1, Donna=4, Elvin_Bishop=4, Joey_Covington=2, Hall_and_Oates=2, Bob_Dylan=22, Pigpen?=1, Pigpen=29, Welnick=9, Garcia_Weir=3, Jorma_Kaukonen=4, Allman_Brothers=1, Neville_Brothers=2, Mydland=6, Stephen_Stills=2, Hornsby=2, Bo_Diddley=6, Spencer_Davis=2, Joan_Baez=10, Jon_Hendricks=2, Beach_Boys=3, Suzanne_Vega=2, Weir_Mydland=2, Pigpen_Weir=8, Neil_Young=1, Lesh=6}
==>original={Weir_Wasserman=1, Mydland_Lesh=1, Lesh_Hart_Kreutzmann=1, Weir=33, Weir_Hart_Welnick=1, Welnick_Bralove=1, Grateful_Dead=15, Garcia_Dawson=1, Garcia=71, Weir_Hart=3, Garcia_Lesh=3, Garcia_Lesh_Weir=1, Hunter=3, Weir_Bralove=1, Pigpen=7, Welnick=1, Donna_Godchaux=2, Hart=2, Mydland=12, Lesh_Pigpen=1, Hornsby=1, Keith_Godchaux=1, instrumental=1, Garcia_Kreutzmann=2, Hart_Kreutzmann=2, Weir_Mydland=1, Garcia_Weir_Lesh=1, Lesh=13, Weir_Kreutzmann=1}
The resultant “nested distribution” map can then be analyzed by walking the nested map. In the example below, we see that Jerry Garcia sung 71 of the original Grateful Dead songs.
gremlin> m.original
==>Weir_Wasserman=1
==>Mydland_Lesh=1
==>Lesh_Hart_Kreutzmann=1
==>Weir=33
==>Weir_Hart_Welnick=1
==>Welnick_Bralove=1
==>Grateful_Dead=15
==>Garcia_Dawson=1
==>Garcia=71
...
gremlin> m.original.Garcia
==>71
Moreover, to demonstrate the flexibility of this pattern, these nested maps can be sorted.
gremlin> m.original.sort{a,b -> b.value <=> a.value}[0..4]
==>Garcia=71
==>Weir=33
==>Grateful_Dead=15
==>Lesh=13
==>Mydland=12
Finally, note that such nesting can happen indefinitely by simply repeating this pattern within itself.