Archive for November 17th, 2009
Map Reduce Using R
There’s been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It’s become the go-to approach for analytical computation on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you’re an R programmer, you’ve probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions, the “apply” functions, designed to evaluate a function over a series of data values (the “map” step) and collate and condense the results (the “reduce” step).
In fact, you can almost boil it down to a single line of R code:
sapply(map(data), reduce)
where map is a function that, when applied to a data set data, splits the data into a list, with each list element collecting the values that share a common key, and reduce is a function that processes each element of that list to produce a single value from all the data mapped to that key.
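To make that one-liner concrete, here is a minimal sketch using a small made-up data set. The names map_fn and reduce_fn and the data values are purely illustrative assumptions: the "map" step groups values by key with split(), and the "reduce" step sums each group.

```r
# Illustrative data: one row per (key, value) pair. The values are made up.
data <- data.frame(
  key   = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# "Map" step (hypothetical helper): group the values by key into a list.
map_fn <- function(d) split(d$value, d$key)

# "Reduce" step (hypothetical helper): condense each group to a single value.
reduce_fn <- function(values) sum(values)

# The one-liner from the text, with the illustrative helpers plugged in.
result <- sapply(map_fn(data), reduce_fn)
print(result)
```

The result is a named vector with one entry per key. Swapping sum for mean, length, or any other summary function changes the "reduce" step without touching the "map" step.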
It’s not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The MapReduce package is available from any CRAN mirror.
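For a rough feel of the parallel version, here is a sketch that assumes the parallel package bundled with current versions of R. It is not the API of the MapReduce package mentioned above, and the data and variable names are made up for illustration. Each group produced by the "map" step is reduced in a separate worker process.

```r
library(parallel)

# Illustrative data: one row per (key, value) pair. The values are made up.
data <- data.frame(
  key   = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# "Map" step: group the values by key into a list.
groups <- split(data$value, data$key)

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# "Reduce" step: sum each group, spreading the groups across worker processes.
partial <- mclapply(groups, sum, mc.cores = n_cores)

# Collate the per-key results into a single named vector.
par_result <- unlist(partial)
print(par_result)
```

This only spreads the work across cores on one machine. Distributing it over a cluster of networked machines, as Hadoop does, takes considerably more infrastructure.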


