There’s been a lot of buzz recently around the MapReduce algorithm and its famous open-source implementation, Hadoop. It’s the go-to algorithm for performing any kind of analytical computation on very large data sets. But what is the MapReduce algorithm, exactly? Well, if you’re an R programmer, you’ve probably been using it routinely without even knowing it. As a functional language, R has a whole class of functions — the “apply” functions — designed to evaluate a function over a series of data values (the “map” step) and collate and condense the results (the “reduce” step).
In fact, you can almost boil it down to a single line of R code:
sapply(map(data), reduce)
where map is a function which, when applied to a data set data, splits the data into a list, with each list element collecting values with a common key assignment, and reduce is a function that processes each element of the list to create a single value from all the data mapped to each key value.
It’s not quite that simple, of course: one of the strengths of Hadoop is that it provides the infrastructure for distributing these map and reduce computations across a vast cluster of networked machines. But R has parallel programming tools too, and the Open Data Group has created a package to implement the MapReduce algorithm in parallel in R. The MapReduce package is available from any CRAN mirror.
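The shape of the pattern is easy to sketch outside R too. Here is a minimal, single-machine illustration in Python — a toy word count, not the Open Data Group package or Hadoop itself — just to make the map (group values by key) and reduce (condense each group) steps concrete:

```python
from collections import defaultdict

def map_step(records):
    """The 'map' step: emit (word, 1) pairs and group them by key (the word)."""
    groups = defaultdict(list)
    for record in records:
        for word in record.split():
            groups[word].append(1)
    return groups

def reduce_step(groups):
    """The 'reduce' step: condense each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}

data = ["the cat sat", "the cat ran"]
counts = reduce_step(map_step(data))
# counts == {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real MapReduce deployment the groups would be partitioned across machines, but the key/value contract between the two steps is exactly this.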
Business intelligence suites offer a lot, and vendors offer great service and system support. However, these vendor products can be extremely expensive! I really don’t think you need an expensive ETL suite or business intelligence product to run a rewarding data warehouse and analytical plant. I run a very large plant with the following simple components and open-source technologies:
• A Linux/Unix scheduling system.
• A general script to load delimited data into a db table.
• A general script to run a proc.
• A general script to extract delimited data from another db via proc or table extract.
• A general script to extract delimited data from the web.
Procs on a scratch db that sits alongside your main data warehouse db can be used to transform the data and load it into the main warehouse.
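To illustrate the “general loader” idea, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for MySQL, and the table name, file name, and sample data are all invented for the example:

```python
import csv
import os
import sqlite3
import tempfile

def load_delimited(conn, path, table, delimiter="|"):
    """Load every row of a delimited file into an existing table."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    placeholders = ", ".join("?" * len(rows[0]))
    # Note: a production script would validate the table name and column count.
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.commit()

# Demo with an in-memory database and a throwaway data file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount TEXT)")
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("north|120\nsouth|95\n")
load_delimited(conn, tmp.name, "sales")
os.unlink(tmp.name)
total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
# total == 2
```

The same script, pointed at MySQL via a driver like MySQLdb and parameterized by file, table, and delimiter, covers a surprising share of day-to-day ETL.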
I’d recommend an open source stack of the following:
ETL Scripts: Python (Perl would also work well)
DB Storage: MySQL
Data Analysis: Excel, R, Python
Obviously this will not solve everybody’s needs; however, with the correct schema architecture this warehouse would scale for the majority of businesses at very little cost to build and maintain.
In future posts I aim to outline how these agile techniques help you build a plant for your own needs without the extraordinarily high yearly BI toolset costs.
Let me know if this appeals to you and I’ll create more detailed posts to follow…
Ionz has built an infographic that is a simplified map of your personality in relation to the universe of people who have participated in their online graphic. It shows how you as a brand relate to the norm — a good insight into how your behaviours and choices compare to our society’s average.
See more at http://www.ionz.com.br/
Peter Norvig outlines how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling. Here is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast (see code below).
The big.txt file that is referenced here consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed this spelling checker with, but I was surprised at how effective this relatively small seed was.
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Count word occurrences; unseen words get a pseudo-count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit away: deletions, transpositions, replacements, insertions.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
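To see the candidate-generation machinery in isolation, here is a self-contained sketch of the one-edit step with a tiny hand-built word-count model standing in for the big.txt counts (the counts here are made up purely for illustration):

```python
def edits1(word):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# A toy model: word counts stand in for word probabilities.
toy_counts = {'spelling': 10, 'spewing': 2}
candidates = [w for w in edits1('speling') if w in toy_counts]
best = max(candidates, key=toy_counts.get)
# best == 'spelling': both words are one edit away, but 'spelling' is more frequent
```

The full corrector is the same idea with a million-word training file and a fallback to two-edit candidates.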
See more details, test results and further work at Peter Norvig’s site: http://norvig.com/spell-correct.html
A fantastic and very funny video from Peter Donnelly at TED on statistics and how they can often be misused or misunderstood. A good look at how basic statistics can reveal patterns in complex data sets like DNA sequences.
In a New York Times article (sub. req.) published on the weekend, IBM and Google expressed doubts that the students graduating from US universities today have the chops to deal with the multi-terabyte datasets that are becoming commonplace online and in domains like bioscience and astronomy. From the article:
For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.
The article reveals how Google and IBM are promoting internet-scale research at places like the University of Washington and Purdue. But a curious omission from the article is any mention of the open-source technologies which are spurring the innovation in processing and analyzing these data sets. Tools like Hadoop, for processing internet-scale data sets, and R, for analyzing the processed data (most likely in some parallelized form), as well as other open-source projects not yet conceived, are going to be critical in this endeavour.
Google Fellow Jeff Dean gave a keynote talk at LADIS 2009 on “Designs, Lessons and Advice from Building Large Distributed Systems”. Slides (PDF) are available.
Some of this talk is similar to Jeff’s past talks but with updated numbers. Let me highlight a few things that stood out:
A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and “the entire [written] works of humankind, from the beginning of recorded history, in all languages” has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world’s written information on disk, and still have plenty of room for logs and other data sets. Certainly no shortage of storage at Google.
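The arithmetic behind those figures is easy to reproduce. The sketch below restates the post’s estimates; the 500k-server count, 16 GB RAM, 2 TB disk, and 3x replication are the assumptions stated above, not confirmed Google numbers:

```python
servers = 500_000          # assumed fleet size (a low-end estimate, per the post)
ram_gb = 16                # RAM per server
disk_tb = 2                # disk per server
replication = 3            # copies kept of each block

ram_pb = servers * ram_gb / 1_000_000      # GB -> PB of aggregate memory
disk_pb = servers * disk_tb / 1_000        # TB -> PB of raw disk
usable_disk_pb = disk_pb / replication     # what remains after 3x replication

# ram_pb == 8.0, disk_pb == 1000.0, usable_disk_pb ≈ 333.3
```

Against the ~2 PB Internet Archive crawl and the ~50 PB estimate for all written works, the headroom is obvious.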
Jeff says, “Things will crash. Deal with it!” He then notes that Google’s datacenter experience is that, in just one year, 1-5% of disks fail, 2-4% of servers fail, and each machine can be expected to crash at least twice. Worse, as Jeff notes briefly in this talk and expanded on in other talks, some of the servers can have slowdowns and other soft failure modes, so you need to track not just up/down states but whether the performance of the server is up to the norm. As he has said before, Jeff suggests adding plenty of monitoring, debugging, and status hooks into your systems so that, “if your system is slow or misbehaving,” you can quickly figure out why and recover. From the application side, Jeff suggests apps should always “do something reasonable even if it is not all right” on a failure, because it is better to give users limited functionality than an error page.
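The “do something reasonable” advice maps onto a familiar coding pattern: catch the failure at the service boundary and degrade to a generic result instead of erroring out. A minimal sketch — the function names and fallback data here are invented for illustration, not from the talk:

```python
def fetch_recommendations(user_id, fetch_fn, fallback=None):
    """Return personalized results, degrading to a generic list on failure.

    Limited functionality beats an error page.
    """
    try:
        return fetch_fn(user_id)
    except Exception:
        # In production you would also log/alert here, feeding the
        # monitoring hooks Jeff recommends.
        return fallback if fallback is not None else ["popular-item-1", "popular-item-2"]

def flaky_backend(user_id):
    # Stand-in for a backend that is slow, overloaded, or down.
    raise TimeoutError("recommendation service timed out")

result = fetch_recommendations(42, flaky_backend)
# result == ["popular-item-1", "popular-item-2"]
```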
Jeff emphasizes the importance of back-of-the-envelope calculations on performance, “the ability to estimate the performance of a system design without actually having to build it.” To help with this, on slide 24, Jeff provides “numbers everyone should know” with estimates of times to access data locally from cache, memory, or disk and remotely across the network. On the next slide, he walks through an example of estimating the time to render a page with 30 thumbnail images under several design options. Jeff stresses the importance of having an at least high-level understanding of the operation and performance of every major system you touch, saying, “If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!” and later adding, “Think about how much data you’re shuffling around.”
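The thumbnail exercise works out roughly as follows. The latency constants below are ballpark versions of the “numbers everyone should know” (about 10 ms per disk seek, about 30 MB/s sequential disk read), and the 256 KB thumbnail size is my assumption rather than a figure from the slide:

```python
DISK_SEEK_MS = 10.0            # ~10 ms per disk seek
DISK_READ_MB_PER_S = 30.0      # ~30 MB/s sequential read from disk
THUMB_KB = 256                 # assumed thumbnail size
N_THUMBS = 30

# Time to seek to and read one thumbnail.
read_ms = (THUMB_KB / 1024) / DISK_READ_MB_PER_S * 1000

serial_ms = N_THUMBS * (DISK_SEEK_MS + read_ms)   # one disk, reads back to back
parallel_ms = DISK_SEEK_MS + read_ms              # reads fanned out across disks

# serial_ms ≈ 550 ms, parallel_ms ≈ 18 ms
```

The point of the exercise is not the exact numbers but that a thirty-fold design difference falls out of a few lines of arithmetic, before any code is written.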
Jeff makes an insightful point that, when designing for scale, you should design for the expected load, ensure it still works at x10, but not worry about scaling to x100. The problem here is that x100 scale usually calls for a different and often more complicated solution than what you would implement for x1; a x100 solution can be unnecessary, wasteful, slower to implement, and have worse performance at a x1 load. I would add that you learn a lot about where the bottlenecks will be at x100 scale when you are running at x10 scale, so it often is better to start simpler, learn, then redesign rather than jumping into a more complicated solution that might be a poor match for the actual load patterns.
The talk covers BigTable, which was discussed in previous talks but now has some statistics updated, and then goes on to talk about a new storage and computation system called Spanner. Spanner apparently automatically moves and replicates data based on usage patterns, optimizes the resources of the entire cluster, uses a hierarchical directory structure, allows fine-grained control of access restrictions and replication on the data, and supports distributed transactions for applications that need it (and can tolerate the performance hit). I have to say, the automatic replication of data based on usage sounds particularly cool; it has long bothered me that most of these data storage systems create three copies for all data rather than automatically creating more than three copies of frequently accessed head data (such as the last week’s worth of query logs) and then disposing of the extra replicas when they are no longer in demand. Jeff says they want Spanner to scale to 10M machines and an exabyte (1k petabytes) of data, so it doesn’t look like Google plans on cutting their data center growth or hardware spend any time soon.