The Net Takeaway: SQL and Hadoop








SQL and Hadoop · 11/20/2008 12:23 PM, Database Analysis

I don’t know why there is so much confusion over the role of MapReduce-oriented systems like Hadoop vs. SQL-oriented databases. It’s actually pretty simple.

There are 2 things people want to do with databases: Select and Aggregate/Report, aka Process.

The Select portion is filtering: finding specific data points based on attributes like time, category, etc. The Aggregate/Report is the most common form of data processing: once you have all those rows, you want to do something with them.

So, how do we tell databases to do these 2 things? For the past 30 years, we’ve used a language called SQL, “Structured Query Language”, to access the data. SQL worked best when the data was organized in “relational tables”. SQL as a language has some cool features, including the ability to create tables, modify and insert data, and return aggregations in a set-oriented fashion. It’s also wordy and cannot easily deal with any world other than sets of textual relational tables.
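Both operations fit in a single set-oriented statement. Here’s a minimal sketch using Python’s built-in sqlite3 module (the table and rows are invented for illustration):

```python
import sqlite3

# An in-memory table of made-up sales rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("books", 10.0), ("books", 15.0), ("music", 7.5)],
)

# Select (the WHERE filter) plus Aggregate/Report (GROUP BY + SUM) in one query.
rows = conn.execute(
    "SELECT category, SUM(amount) FROM sales "
    "WHERE amount > 5 GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('books', 25.0), ('music', 7.5)]
```

The WHERE clause is the Select; the GROUP BY with SUM() is the Aggregate/Report.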

While some programmers immediately get what SQL can do, others find it to be “YAL”, “Yet Another Language”. Object-oriented databases and other “persistent storage” systems have popped up to help these programmers treat the database as just another portion of their program, by “integrating” persistent storage systems into their current programming approach. Python has “pickling”, Perl used the DBM tied hash, etc.
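As a tiny illustration of that “just another part of the program” style, a couple of lines of Python pickling round-trip an ordinary object with no query language involved (the object contents are invented):

```python
import pickle

# Serialize an ordinary Python object to bytes, then restore it;
# the structure round-trips exactly, no SQL anywhere.
profile = {"user": "alice", "visits": [3, 7, 2]}
blob = pickle.dumps(profile)
restored = pickle.loads(blob)
print(restored == profile)  # True
```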

MapReduce is a programming concept that’s been around for a while in the object-oriented world, but has recently become more popular as scripting languages rise and as processors become more parallel. The MapReduce paradigm basically forces/allows the programmer to pick a way to split a task across various “compute groups”, have those groups compute something, and then fold it all back up at the end. This approach maps nicely to the way many modern languages treat data, so having the database handle the heavy lifting is a nice touch.
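The paradigm can be sketched in plain Python with a word count — explicit map, shuffle, and reduce phases. In a real Hadoop job each phase would run across machines, but the shape is the same (the sample documents are made up):

```python
from functools import reduce
from itertools import groupby

docs = ["the quick fox", "the lazy dog", "the fox"]

# Map: each document independently emits (word, 1) pairs;
# these could run on separate compute groups.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the pairs by key, as the framework does between phases.
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: fold each group's values back up into a single count.
counts = {k: reduce(lambda a, b: a + b, vals) for k, vals in grouped.items()}
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```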

Therefore, if you think about it, both Hadoop and SQL databases are doing the same thing: Selecting some data (the Map phase) and Processing it (the Reduce phase).

So, why the sturm und drang? Several things; I’ll mention a few here:

There are efforts underway to put a pretty face on the MapReduce systems. Facebook has contributed Hive; a Hadoop variant called CloudBase looks really nice in its SQL support; and other approaches sit in the “not-SQL-but-still-easier-than-raw-MapReduce” language area: Microsoft has created Dryad for its cloud systems, and Yahoo! Research has a language called Pig.

Some database players have also started to combine MapReduce engines for processing with SQL/relational engines for the storage layer. Examples include Greenplum, which has offered a parallelized PostgreSQL for a few years now (and open-sourced its now-abandoned BI-oriented PostgreSQL, Bizgres), and Aster Data, which is less well known but well regarded for high-capacity database systems.

Look, there is no shortage of distracting things and buzzwords here: when you parallelize, you can distribute across the “cloud”; you can run your analyses in the cloud using “Software as a Service (SaaS)”; yadda yadda yadda.

At the end of the day, ask what you are trying to solve with your program: If it’s massive processing of data, then a Hadoop solution may be your best bet. If the reporting and storage aspects are relatively simple, just persistent storage and simple sums of reasonable size data, then a relational database will be easier to get going with.

And yes, these will eventually converge such that you won’t have to decide which tool to use: all of the major database systems will have a SQL layer with multiple engines and a controller which optimizes which engine to use for which query; you will also have the ability to use direct MapReduce or SQL, as you see fit.

But we aren’t there yet. So, don’t just assume that Hadoop is the answer to all data processing problems: if you aren’t processing the data, it’s really the wrong tool. And don’t just assume that an Oracle “grid” or a Teradata box is the only way to solve your massive data processing problem. You might be surprised how easily Hadoop can meet your needs.

Some things to watch:
Data Mining in Hadoop
Hama: matrix libraries with an emphasis on compute-intensive operations like inversion… all within Hadoop.
Mahout: Mahout’s goal is to build scalable, Apache-licensed machine learning libraries. Initially, the project aims to build out the ten machine learning libraries detailed in using Hadoop.

SQL vs. Hadoop articles
A dime a dozen. Here’s a recent one: The Commoditization of Massive Data Analysis. At the end of the day, almost every article is by either
1) a traditional DB guy who doesn’t understand the fuss because Hadoop can’t do basic SQL or other relational stuff out of the box, and so misses the sea change of easily accessible parallelized processing, or
2) a Hadoop lover who never understood how SQL can simplify data queries (because it’s yet another language to learn) and sees all data as something to process, not as a valuable resource in its own right.

So, read each SQL vs. Hadoop article with a grain of salt, including this one.

Trying Hadoop
I have played with Hadoop on the Amazon EC2 and S3 systems. Basically, you can launch a ready-to-go Linux image with Hadoop preinstalled, and you pay for use (pennies per month based on my game playing, maybe a buck or two a month on larger data tests). See and for some docs… but it was pretty easy. My next experiment will be playing with Python and Hadoop, thanks to articles like and
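For a taste of what Python-on-Hadoop looks like, here’s a sketch in the style of Hadoop Streaming, which runs any program that reads records on stdin and writes “key, tab, value” lines on stdout. The function names and the local test harness are mine; the `sorted()` call stands in for Hadoop’s shuffle/sort between the phases:

```python
def mapper(lines):
    # Emit one "word<TAB>1" record per word, as a streaming mapper would on stdout.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    # Hadoop delivers records sorted by key; sum each consecutive run of a word.
    current, total = None, 0
    for record in sorted_records:
        word, count = record.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Locally, sorted() plays the role of the framework's shuffle/sort phase.
print(list(reducer(sorted(mapper(["the fox", "the dog"])))))
# ['dog\t1', 'fox\t1', 'the\t2']
```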

More soon…

(Thanks to Bob Muenchen of R for SAS and SPSS Users fame for corrections.)
