by Alexey Grigorev
map
reduce (= fold = accumulate)
(define (+1 el) (+ el 1))
(map +1 (list 1 2 3))
$\Rightarrow$ (list 2 3 4)
(reduce + 0 (list 2 3 4))
$\Rightarrow$ 9
(reduce + 0 (map +1 (list 1 2 3)))
$\Rightarrow$ 9
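The same pipeline can be sketched in Python, assuming the built-in `map` and `functools.reduce` as stand-ins for the Scheme functions. The helper name `add_one` is chosen here for illustration only.

```python
from functools import reduce
from operator import add


def add_one(el):
    """Counterpart of the Scheme (+1 el) helper."""
    return el + 1


# map applies add_one to every element of the list
mapped = list(map(add_one, [1, 2, 3]))

# reduce folds the mapped list with +, starting from 0
total = reduce(add, mapped, 0)

print(mapped)
print(total)
```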
(list 1 2 3 4) $\Rightarrow$ (list 1 2) and (list 3 4)
Process the parts separately, then reduce them together:
(define res1 (reduce + 0 (map +1 (list 1 2))))
(reduce + res1 (map +1 (list 3 4)))
is the same as
(reduce + 0 (map +1 (list 1 2 3 4)))
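A minimal Python sketch of the same split, assuming `functools.reduce` in place of the Scheme `reduce`. The names `add_one` and `map_reduce` are illustrative. Both ways of combining the halves give the same answer as processing the whole list, which relies on + being associative.

```python
from functools import reduce
from operator import add


def add_one(el):
    return el + 1


def map_reduce(chunk, initial=0):
    """Map add_one over one chunk, then fold it with +, starting from initial."""
    return reduce(add, map(add_one, chunk), initial)


data = [1, 2, 3, 4]
first, second = data[:2], data[2:]

# Process the first half, then use its result as the start value for the second
res1 = map_reduce(first)
chained = map_reduce(second, res1)

# Or reduce both halves independently and combine the partial results
combined = add(map_reduce(first), map_reduce(second))

# Reference: process the whole list at once
whole = map_reduce(data)
```

The second variant is the one that parallelizes: the two halves do not depend on each other, so they could run on different machines.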
reduce: the function must be additive (associative), so that partial results can be combined
map function: (in_key, in_val)
reduce function
apply map to each key-value pair
apply reduce to each group
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.
Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
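The word-count pseudocode above can be simulated on a single machine. This is only a sketch: the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the two sample documents are assumptions made for illustration, and a real framework would distribute the map, grouping, and reduce steps across machines.

```python
from collections import defaultdict


def map_fn(input_key, doc):
    """Emit (word, 1) for every word in the document."""
    for word in doc.split():
        yield word, 1


def reduce_fn(output_key, values):
    """Sum all counts emitted for one word."""
    return output_key, sum(values)


def run_mapreduce(docs):
    groups = defaultdict(list)
    # Map phase: apply map_fn to each (key, value) pair
    for key, doc in docs.items():
        for word, count in map_fn(key, doc):
            # Group the intermediate values by their key
            groups[word].append(count)
    # Reduce phase: apply reduce_fn to each group
    return dict(reduce_fn(word, counts) for word, counts in groups.items())


# Hypothetical input documents
docs = {"doc1": "to be or not to be", "doc2": "to do"}
counts = run_mapreduce(docs)
print(counts)
```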
... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)
LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds = '2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
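What the multi-insert query computes can be sketched in plain Python: the join is evaluated once, and each joined row feeds both summaries. The sample rows and school names below are made up for illustration, and the in-memory dictionaries merely stand in for the Hive tables.

```python
from collections import Counter

# Hypothetical sample rows, invented for this sketch
status_updates = [
    {"userid": 1, "status": "hello", "ds": "2009-03-20"},
    {"userid": 2, "status": "hi", "ds": "2009-03-20"},
    {"userid": 1, "status": "old", "ds": "2009-03-19"},
]
profiles = {
    1: {"school": "school_a", "gender": 0},
    2: {"school": "school_b", "gender": 1},
}

gender_summary = Counter()
school_summary = Counter()

# One pass over the join output feeds both aggregations
for update in status_updates:
    profile = profiles.get(update["userid"])
    if profile is None or update["ds"] != "2009-03-20":
        continue  # no matching profile, or a different partition
    gender_summary[profile["gender"]] += 1
    school_summary[profile["school"]] += 1

print(dict(gender_summary))
print(dict(school_summary))
```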
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) as cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
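The slides do not show the contents of `top10.py`. The following is a purely hypothetical sketch of what such a reducer script might look like, assuming Hive streams rows to the script as tab-separated lines on standard input and that all rows for one school arrive next to each other. The function name `top10` and its parameter `n` are invented for this example.

```python
import heapq
from itertools import groupby


def top10(lines, n=10):
    """Yield the n highest-count (school, meme, cnt) rows for each school.

    Assumes tab-separated rows, with all rows of one school adjacent.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for school, group in groupby(rows, key=lambda r: r[0]):
        # Pick the n rows with the largest count within this school
        best = heapq.nlargest(n, group, key=lambda r: int(r[2]))
        for _, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"


# Hypothetical input, as the script might receive it line by line
sample = ["a\tx\t3\n", "a\ty\t5\n", "b\tz\t1\n"]
result = list(top10(sample, n=1))
```

Run as a script, it would read from `sys.stdin` and print each yielded line.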