
Map Reduce Using R

From Revolutions

There’s been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It’s the go-to approach for performing analytical computations on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you’re an R programmer, you’ve probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions, the “apply” functions, designed to evaluate a function over a series of data values (the “map” step) and to collate and condense the results (the “reduce” step).
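To make that concrete, here’s a tiny illustration (mine, not from the original post) using nothing but base R: lapply plays the map role, evaluating a function over each input value, and Reduce collapses the mapped results into a single answer.

# Map step: apply a function to each element of the input
squares <- lapply(1:10, function(x) x^2)

# Reduce step: collapse the mapped values into one result
Reduce(`+`, squares)
# [1] 385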

In fact, you can almost boil it down to a single line of R code:

sapply(map(data), reduce)

where map is a function that, when applied to a data set data, splits the data into a list, each element of which collects the values sharing a common key, and reduce is a function that processes each list element to condense all the values mapped to a given key into a single result.
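Here’s a concrete sketch of that pattern using the built-in iris data set as stand-in input (the map and reduce definitions are illustrative, not from the post): split() does the mapping, grouping each Sepal.Length value under its Species key, and sapply() applies the reduce function to every group.

# Map: group values by key (here, species)
map <- function(data) split(data$Sepal.Length, data$Species)

# Reduce: condense each key's values to a single number
reduce <- mean

sapply(map(iris), reduce)
#     setosa versicolor  virginica
#      5.006      5.936      6.588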

It’s not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The MapReduce package is available from any CRAN mirror.
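As a rough sketch of how those map and reduce steps can be spread across workers, here’s the same species computation using mclapply from the parallel package that ships with modern R; this is a generic stand-in for the idea, not the Open Data Group package’s actual interface, which isn’t documented here.

library(parallel)

# Map: group sepal lengths by species key
groups <- split(iris$Sepal.Length, iris$Species)

# Reduce in parallel: each key group is handled by a worker
# (mc.cores > 1 requires a Unix-alike; use mc.cores = 1 on Windows)
results <- mclapply(groups, mean, mc.cores = 2)
simplify2array(results)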

Written by mattalcock

November 17, 2009 at 12:00 pm

Posted in Data Analysis
