by Alexey Grigorev
map
reduce = fold = accumulate
(define (+1 el) (+ el 1))
(map +1 (list 1 2 3)) $\Rightarrow$
(list 2 3 4)
(reduce + 0 (list 2 3 4)) $\Rightarrow$
9
(reduce + 0 (map +1 (list 1 2 3))) $\Rightarrow$
9
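The Scheme expressions above can be mirrored in Python as a minimal sketch. The name `plus_one` is an illustrative stand-in for `+1`. Python's `functools.reduce` takes the initial value as its last argument rather than its second.

```python
from functools import reduce
from operator import add

def plus_one(el):
    """Counterpart of the Scheme (+1 el) helper."""
    return el + 1

# map applies the function to every element
mapped = list(map(plus_one, [1, 2, 3]))

# reduce folds the mapped list with +, starting from 0
total = reduce(add, mapped, 0)

print(mapped)  # [2, 3, 4]
print(total)   # 9
```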
(list 1 2 3 4) $\Rightarrow$ (list 1 2) and (list 3 4), then reduce them together
(define res1 (reduce + 0 (map +1 (list 1 2))))
(reduce + res1 (map +1 (list 3 4)))
is the same as
(reduce + 0 (map +1 (list 1 2 3 4)))
both evaluate to 14
for reduce, the function must be additive (associative), so that partial results can be combined
map function
(in_key, in_val)
reduce function
map to each key-value pair
reduce to each group
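The earlier split example, where the list is cut into two halves that are mapped and reduced one after the other, can be sketched in Python as follows. The helper name `map_reduce_sum` is hypothetical. The second call continues folding from the partial result `res1`, which only works because `+` is associative.

```python
from functools import reduce
from operator import add

def map_reduce_sum(chunk, initial=0):
    """Add 1 to every element of chunk, then fold the results with +."""
    return reduce(add, map(lambda el: el + 1, chunk), initial)

data = [1, 2, 3, 4]
left, right = data[:2], data[2:]

res1 = map_reduce_sum(left)             # partial result for [1, 2]
combined = map_reduce_sum(right, res1)  # continue folding from res1
whole = map_reduce_sum(data)            # the unsplit computation

print(res1, combined, whole)  # 5 14 14
```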
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
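The word-count pseudocode above can be turned into a runnable single-process sketch. This is only an illustration of the data flow, not how a real framework schedules the work. The driver `run_word_count`, the names `map_fn` and `reduce_fn`, and the sample documents are assumptions. Unlike the pseudocode, `reduce_fn` returns the key together with the count so the result can be stored in a dictionary.

```python
from collections import defaultdict

def map_fn(input_key, doc):
    """Emit (word, 1) for every whitespace-separated word in doc."""
    for w in doc.split():
        yield (w, 1)

def reduce_fn(output_key, output_vals):
    """Sum the counts collected for one word."""
    res = 0
    for v in output_vals:
        res += v
    return (output_key, res)

def run_word_count(docs):
    """Apply map to each key-value pair, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, doc in docs.items():
        for out_key, val in map_fn(key, doc):
            groups[out_key].append(val)
    return dict(reduce_fn(k, vals) for k, vals in groups.items())

# Hypothetical input: document name -> document text
docs = {"doc1": "to be or not to be", "doc2": "to do"}
counts = run_word_count(docs)
print(counts)
```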
... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.






[Abouzeid, Azza et al 2009]
status_updates(userid int, status string, ds string)
profiles(userid int, school string, gender int)
LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid and a.ds = '2009-03-20')
     ) subq1
INSERT OVERWRITE TABLE gender_summary
    PARTITION (ds='2009-03-20')
    SELECT subq1.gender, count(1)
    GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
    PARTITION (ds='2009-03-20')
    SELECT subq1.school, count(1)
    GROUP BY subq1.school
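To make the multi-insert query above concrete, here is a rough in-memory Python sketch of what it computes: one pass over the joined rows for a given day feeds two separate summaries. The sample rows, the `summarize` function, and the dictionary-based join are all illustrative assumptions and say nothing about how Hive executes the query.

```python
from collections import Counter

# Hypothetical sample data
status_updates = [  # (userid, status, ds)
    (1, "hello", "2009-03-20"),
    (2, "lunch", "2009-03-20"),
    (1, "bye", "2009-03-19"),
    (3, "news", "2009-03-20"),
]
profiles = {  # userid -> (school, gender)
    1: ("MIT", 0),
    2: ("MIT", 1),
    3: ("CMU", 1),
}

def summarize(updates, profiles, ds):
    """Join updates with profiles on userid and count by gender and school."""
    gender_summary, school_summary = Counter(), Counter()
    for userid, status, row_ds in updates:
        if row_ds != ds or userid not in profiles:
            continue  # join condition: matching userid and partition date
        school, gender = profiles[userid]
        gender_summary[gender] += 1  # feeds gender_summary
        school_summary[school] += 1  # feeds school_summary
    return gender_summary, school_summary

gender_summary, school_summary = summarize(status_updates, profiles, "2009-03-20")
print(dict(gender_summary))
print(dict(school_summary))
```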
REDUCE subq2.school, subq2.meme, subq2.cnt
    USING 'top10.py' AS (school, meme, cnt)
FROM (SELECT subq1.school, subq1.meme, count(1) as cnt
      FROM (MAP b.school, a.status
            USING 'meme_extractor.py'
            AS (school, meme)
            FROM status_updates a JOIN profiles b
            ON (a.userid = b.userid)
           ) subq1
      GROUP BY subq1.school, subq1.meme
      DISTRIBUTE BY school, meme
      SORT BY school, meme, cnt desc
     ) subq2
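The query above hands rows to external scripts. The source does not show `top10.py`, so the following is only a guess at what such a script might look like. It relies on the fact that Hive, by default, streams rows to the script as tab-separated lines on standard input and reads tab-separated lines back. The function name `top_n` and the choice to rank with `heapq.nlargest` are assumptions.

```python
import heapq
from collections import defaultdict

def top_n(lines, n=10):
    """Yield up to n 'school<TAB>meme<TAB>cnt' rows per school, highest cnt first."""
    by_school = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        school, meme, cnt = line.split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        for cnt, meme in heapq.nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"

# Hypothetical input rows, as they might arrive on stdin
sample = ["MIT\tcat\t3\n", "MIT\tdog\t5\n", "CMU\tcat\t1\n"]
for row in top_n(sample, n=1):
    print(row)
```

Used as an actual streaming script, the loop would iterate over `sys.stdin` instead of the `sample` list.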