Finding Delta

Discovering gems of wisdom from massive data sets

Archive for November 2009

Map Reduce Using R

From Revolutions

There’s been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It’s the go-to algorithm for performing analytical computations on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you’re an R programmer, you’ve probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions, the “apply” functions, designed to evaluate a function over a series of data values (the “map” step) and collate and condense the results (the “reduce” step).

In fact, you can almost boil it down to a single line of R code:

sapply(map(data), reduce)

where map is a function that, when applied to a data set data, splits the data into a list, with each list element collecting the values that share a common key, and reduce is a function that processes each list element to produce a single value from all the data mapped to that key.
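For readers who don't use R, here is a minimal Python sketch of the same idea. The names map_step, reduce_step and key_fn are made up for illustration and don't come from Hadoop or from the R MapReduce package. It groups values by key, then collapses each group to a single value, all on one machine with no parallelism.

```python
from collections import defaultdict

def map_step(records, key_fn):
    """Split records into groups, one list per key (the "map" step)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return groups

def reduce_step(groups, reduce_fn):
    """Collapse each key's list of values to a single value (the "reduce" step)."""
    return {key: reduce_fn(values) for key, values in groups.items()}

# Illustrative data: key each word by its first letter, then count per key.
fruit = ["apple", "avocado", "banana", "blueberry", "cherry"]
groups = map_step(fruit, key_fn=lambda w: w[0])
counts = reduce_step(groups, len)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

Swapping len for sum, max or any other function that takes a list changes the reduce step without touching the grouping.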

It’s not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The MapReduce package is available from any CRAN mirror.


Written by mattalcock

November 17, 2009 at 12:00 pm

Posted in Data Analysis

Business Intelligence And Data Warehousing On A Budget

Business intelligence suites and products offer a lot, and their vendors offer great service and system support. However, these suite and vendor solutions can be extremely expensive! I really don’t think you need an expensive ETL suite or business intelligence product to run a rewarding data warehouse and analytical plant. I run a very large plant with the following simple components and open source technologies.

• A Linux/Unix scheduling system.
• A general script to load delimited data into a db table.
• A general script to run a stored procedure (proc).
• A general script to extract delimited data from another db via proc or table extract.
• A general script to extract delimited data from the web.

Procs on a scratch db that sits alongside your main data warehouse db can be used to transform the data and load it into the main warehouse.
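As a rough idea of how small the "load delimited data into a db table" script can be, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The function name load_delimited, the table scratch_sales and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3

def load_delimited(conn, table, columns, lines, delimiter=","):
    """Insert delimited rows into an existing table and return the row count.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration; the values themselves are bound as parameters.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders
    )
    # Skip blank lines so a trailing newline does not produce an empty row.
    rows = [row for row in csv.reader(lines, delimiter=delimiter) if row]
    with conn:  # commit on success, roll back on error
        conn.executemany(sql, rows)
    return len(rows)

# Illustrative run against an in-memory scratch database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scratch_sales (region TEXT, amount REAL)")
data = io.StringIO("north|120.5\nsouth|80\nnorth|40\n")
loaded = load_delimited(conn, "scratch_sales", ["region", "amount"], data, delimiter="|")
total_north = conn.execute(
    "SELECT SUM(amount) FROM scratch_sales WHERE region = 'north'"
).fetchone()[0]
print(loaded, total_north)
```

In a real plant the lines argument would be an open file handed over by the scheduler, and the connection would point at the scratch db rather than at memory.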

I’d recommend an open source stack of the following:

Scheduler: cron/puppet
ETL Scripts: Python (Perl would also work well)
DB Storage: MySQL
Data Analysis: Excel, R, Python

Obviously this will not solve everybody’s needs. However, with the correct schema architecture, this warehouse would scale for the majority of businesses at very little cost to build and maintain.

In future posts I aim to outline why these agile techniques help you build a plant for your own needs without the extraordinarily high yearly BI toolset costs.

Let me know if this appeals to you and I’ll create more detailed posts to follow…

Written by mattalcock

November 11, 2009 at 6:32 pm

Social Brand Comparison Infographic

Ionz have built an infographic that is a simplified map of your personality in relation to the universe of people who have participated in their online graphic. It shows how you, as a brand, relate to the norm. Good insight into how your behaviours and choices compare to our society’s average.

See more at


Written by mattalcock

November 6, 2009 at 11:29 am

Posted in Uncategorized

Google Spell Checker Using Probability Theory

Peter Norvig outlines how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. Here is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast (see code below).

The big.txt file that is referenced here consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed this spelling checker with, but I was surprised at how effective this relatively small seed was.

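import re, collections

def words(text):
    # Lowercase the text and split it into alphabetic words.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word occurrences; a word not seen before starts from a count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# file() exists only in Python 2; open() works in both Python 2 and 3.
NWORDS = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Every string one edit (delete, transpose, replace, insert) away from word.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Known words that are two edits away from word.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    # The subset of words that appear in the training data.
    return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then known words one edit away, then two edits away.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # Among the candidates, pick the one seen most often in training.
    return max(candidates, key=NWORDS.get)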
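To try the idea without downloading big.txt, here is a cut-down sketch that trains on one made-up sentence. It is a simplification of the corrector above, not Norvig's code: it counts with collections.Counter instead of a defaultdict, looks only one edit away, and SAMPLE_TEXT and COUNTS are invented for illustration.

```python
import re
from collections import Counter

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall('[a-z]+', text.lower())

def edits1(word):
    """Every string one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word at most one edit away from word."""
    def known(candidates):
        return set(w for w in candidates if w in counts)
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: counts[w])

# Hypothetical tiny corpus standing in for big.txt.
SAMPLE_TEXT = "the spelling of a word matters; spelling checks catch a spilling error"
COUNTS = Counter(words(SAMPLE_TEXT))

print(correct('speling', COUNTS))  # spelling
print(correct('teh', COUNTS))      # the
print(correct('xyzzy', COUNTS))    # xyzzy (nothing known within one edit)
```

With so little training text the corrector only knows about ten words, which is why the size of the seed file matters so much.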

See more details, test results and further work at Peter Norvig’s site.

Written by mattalcock

November 6, 2009 at 10:24 am

How Statistics Can Fool

A fantastic and very funny video from Peter Donnelly at TED on statistics and how they can often be misused or misunderstood. A good insight into how basic statistics can reveal patterns in complex data sets like DNA sequences.

Written by mattalcock

November 5, 2009 at 2:56 pm

Posted in Statistics

Training To Deal With Mega-Scale Data

From Revolutions….

In a New York Times article (sub. req.) published on the weekend, IBM and Google expressed doubts that the students graduating from US universities today have the chops to deal with the multi-terabyte datasets that are becoming commonplace online and in domains like bioscience and astronomy. From the article:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

The article reveals how Google and IBM are promoting internet-scale research at places like the University of Washington and Purdue. But a curious omission from the article is any mention of the open-source technologies that are spurring innovation in processing and analyzing these data sets. Tools like Hadoop, for processing internet-scale data sets, and R, for analyzing the processed data (most likely in some parallelized form), and other open-source projects not yet conceived, are going to be critical in this endeavour.

Written by mattalcock

November 5, 2009 at 12:29 pm

Posted in Data Analysis
