The Net Takeaway: Page 4


Danny Flamberg's Blog
Danny has been marketing for a while, and his articles and work reflect great understanding of data driven marketing.

Eric Peterson the Demystifier
Eric gets metrics, analytics, interactive, and the real world. His advice is worth taking...

Geeking with Greg
Greg Linden created Amazon’s recommendation system, so imagine what he can write about…

Ned Batchelder's Blog
Ned just finds and writes interesting things. I don't know how he does it.

R at LoyaltyMatrix
Jim Porzak tells of his real-life use of R for marketing analysis.


Another fine Mesh... · 12/02/2008 11:26 AM, Tech

Another fine mesh you’ve gotten me into, with apologies to Abbott and Costello.

For the past few years, I’ve been able to link my various machines together with a collection of tools.

Now, these 3 tools give you remote control of another desktop and the ability to move files, up to 2gb at a time. (There are many file splitters/combiners if you need to move more than that at a time.) (BTW, I’m still looking for a good way to compare 2 directories and see what’s missing from each: a free dir-sync that doesn’t force me to copy but simply lets me choose what to transfer. Suggestions welcome below, either standalone or as part of a free file manager.)

But a new one has been growing in popularity: Microsoft Live Mesh. This is MS’s first move into the cloud computing world as an “end user” offering, and it’s a pretty impressive first step. After signing in with a LiveID and installing the product, you basically get an icon in the tray and the ability to share any folder into the “mesh” (you control who has access to it, etc.). You can have multiple machines share that folder (or any number of folders), and you have a “web desktop” to add or get to your files from any machine.

There is a 5gb limit, but interestingly, that’s only for the web desktop side. You can actually share as much as you want between machines, but you only access 5gb of it on the web desktop.

But wait, there’s more. You also get Remote Desktop access to the Windows boxes you have added to the mesh. And all this works through most every firewall, etc.

(BTW, MS has a few offerings in this space that you should play with. Besides Mesh, they also have SkyDrive, which is an online 5gb, no installable client, just online drive space (shareable if you wish, but 50mb per file limit!), aka… and FolderShare, which is a pretty clever and cult-popular syncing solution between 2 machines (no cloud storage). So, you’ve got Mesh (beta) which has sync and cloud, SkyDrive which is just cloud storage, and FolderShare which is just sync. FolderShare will be replaced by LiveSync, which will sync up to 20 “libraries” each with 20K files each… and I suspect we’ll see them all converge in the future. More good details at the)

So, slam dunk, right? Well, there are some differences between the “best of breed” solution I mentioned above, and this one. Some notes:

Some issues I’ve run into:

To their credit, the Mesh team is unbelievably responsive to concerns. They respond to bug reports, they post on the forum… it’s unlike any other part of MS. Gold star to them.

So, in the end, does it replace the motley crew I mention above? Well, Mesh is clearly still beta. But it does have lots of promise, and has MS behind it. So, (duh), I’ll keep playing with both sets, wasting precious memory and confusing myself as to which tool I last did what in. But I’ll sacrifice for you, my readers, in the interest of science.

Feel free to comment with your findings as you try these tools, or suggest others that you think are better.

PS: Why didn’t I mention all the other “sync” and other shared web storage offerings? B/c either a) they don’t work as well as DropBox or b) they don’t offer as many features as Mesh or c) they offer lots of syncing/versioning that I just don’t need.

But I will mention that I expect to see more “integrated” plays (ala Mesh) like Gladinet which I haven’t tried, but includes Remote Desktop (VNC or RDP), a virtual drive via a drive letter into the cloud (ala JungleDisk, but subsuming S3, Skydrive, Picasa, and others), web-desktop integration (double click a doc to load Zoho, for example), on demand file sharing, a shared favorites folder… and in the future, a web desktop for online access to all this. LifeHacker gave this a positive quick overview. Sounds too good to be true, but even if this one fails, there will be others which will include all this… and the kitchen sink.

PPS: I often hear “Oh, you can’t trust MS or Google or ____ with your data”. OK, folks, get a grip. If you really think MS is actually looking at any of the files or cares about them, you should take off your tin-foil hat every once in a while. When you have 5gb of files per user, you are doing everything you can to keep the thing running. You don’t have time to search every node looking for “account numbers” or other silly stuff. Yes, it’s POSSIBLE that some jerk could discover something personal. If that bothers you, don’t put personal stuff on the cloud. Buy your own storage and use your own VPN. But for the rest of you, this should be the least of your worries.


* * *


Thoughts while raking leaves · 11/29/2008 07:34 PM, Analysis Trivial

Comments? [1]

* * *


SQL and Hadoop · 11/20/2008 12:23 PM, Database Analysis

I don’t know why there is so much confusion over the role of MapReduce oriented databases like Hadoop vs. SQL oriented databases. It’s actually pretty simple.

There are 2 things people want to do with databases: Select and Aggregate/Report, aka Process.

The Select portion is filtering: finding specific data points based on attributes like time, category, etc. The Aggregate/Report is the most common form of data processing: once you have all those rows, you want to do something with them.

So, how do we tell databases to do these 2 things? For the past 30 years, we’ve used a language called SQL, “Structured Query Language”, to access the data. SQL worked best when the data was organized in “relational tables”. SQL as a language has some cool features, including the ability to create tables, modify and insert data, and return aggregations in a set-oriented fashion. It’s also over 30 years old, is wordy, and cannot easily deal with any world other than sets of textual relational tables.
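That set-oriented Select-and-Aggregate pattern is easy to see in miniature; here’s a toy example run through SQLite from Python (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])

# Select (the WHERE filter) and Aggregate (SUM per group) in one
# set-oriented statement -- SQL's core strength.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE amount > 40 GROUP BY region ORDER BY region"
).fetchall()
```

Two clauses, and the engine decides how to find and fold the rows; that declarative style is exactly what some programmers love and others experience as “YAL”.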

While some programmers immediately get what SQL can do, others find it to be “YAL”, “Yet Another Language”. Object-oriented databases and other “persistent storage” systems have popped up to help these programmers treat the database as just another portion of their program, by “integrating” persistent storage systems into their current programming approach. Python has “pickling”, Perl used the DBM tied hash, etc.

MapReduce is a programming concept that’s been around for a while in the object-oriented world, but has recently become more popular as scripting languages rise and as processors become more parallel. The MapReduce paradigm basically forces/allows the programmer to pick a way to split a task across various “compute groups”, have those groups compute something, and then fold it all back up at the end. This approach maps nicely to the way many modern languages treat data, so having the database handle the heavy lifting is a nice touch.
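The classic illustration is word counting: the map step runs independently over chunks of input, and the reduce step folds the partial results back together. A single-process Python sketch of the idea (real Hadoop distributes these phases across machines, which is the whole point):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: each document independently emits (word, 1) pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: fold all the emitted pairs up into per-word totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["the cat", "the dog"]))
```

Because each document is mapped independently, the map work can be split across any number of “compute groups” without coordination.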

Therefore, if you think about it, both Hadoop and SQL databases are doing the same thing: Selecting some data (the Map phase) and Processing it (the Reduce phase).

So, why the sturm und drang? A couple of things; I’ll mention a few here:

There are efforts underway to put a pretty face on the MapReduce systems. Facebook has contributed Hive; there’s also a Hadoop variant called CloudBase which looks really nice in its SQL support. Other approaches fall in the “not-SQL-but-still-easier-than-raw-MapReduce” language area: Microsoft has created Dryad for their cloud systems, and Yahoo! Research has a language called Pig.

Some database players have also started to combine MapReduce engines for processing with SQL/relational engines for the storage layer: Greenplum, who has had a parallelized PostgreSQL for a few years now (and open-sourced their now-abandoned Bizgres BI-oriented PostgreSQL), and AsterData, who is less well known but is regarded for high-capacity database systems.

Look, there is no shortage of distracting things and buzzwords here: when you parallelize, you can distribute across the “cloud”, you can run your analyses in the cloud using “Software as a Service (SaaS)”, yadda yadda yadda.

At the end of the day, ask what you are trying to solve with your program: If it’s massive processing of data, then a Hadoop solution may be your best bet. If the reporting and storage aspects are relatively simple, just persistent storage and simple sums of reasonable size data, then a relational database will be easier to get going with.

And yes, these will eventually converge such that you won’t have to decide which tool to use: all of the major database systems will have a SQL layer with multiple engines and a controller which optimizes which engine to use for which query; you will also have the ability to use direct MapReduce or SQL, as you see fit.

But we aren’t there yet. So, don’t just assume that Hadoop is the answer to all data processing problems: if you aren’t processing the data, it’s really the wrong tool. And don’t just assume that an Oracle “grid” or a Teradata box is the only way to solve your massive data processing. You might be surprised how easily Hadoop can solve your needs.

Some things to watch:
Data Mining in Hadoop
Hama: matrix libraries with an emphasis on compute-intensive operations like inversion… all within Hadoop
Mahout : Mahout’s goal is to build scalable, Apache licensed machine learning libraries. Initially, we are interested in building out the ten machine learning libraries detailed in using Hadoop.

SQL vs. Hadoop articles
A dime a dozen. Here’s a recent one: The Commoditization of Massive Data Analysis. At the end of the day, almost every article is either by
1) a traditional DB guy who doesn’t understand the fuss b/c Hadoop can’t seem to do basic SQL or other relational stuff out of the box, and so doesn’t see the sea change from easy-to-access parallelized processing, or
2) Hadoop lovers who never understood how SQL can simplify data queries (b/c it’s yet another language to learn) and see all data as something to process, not as a valuable resource in its own right.

So, read each SQL vs. Hadoop article with a grain of salt, including this one.

Trying Hadoop
I have played with Hadoop on the Amazon EC2 and S3 systems. Basically, you can plop in a ready-to-go Linux image with Hadoop installed, and you pay for use (pennies per month based on my game playing, maybe a buck or two a month on larger data tests). See and for some docs… but it was pretty easy. My next experiment will be playing with Python and Hadoop thanks to articles like and
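For a taste of the Python route: Hadoop Streaming runs any program that reads lines on stdin and writes tab-separated key/value lines on stdout, so a word-count mapper and reducer can each be a few lines. A sketch (written as plain functions rather than stdin/stdout scripts so it’s easy to test; it assumes Streaming’s behavior of sorting mapper output by key before the reducer sees it):

```python
def mapper(lines):
    """Mapper: emit 'word<TAB>1' for every word (the Streaming convention)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reducer: input arrives grouped by key, so totals can be streamed."""
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

out = list(reducer(sorted(mapper(["the cat sat", "the dog"]))))
```

In a real Streaming job, the sort-and-group step between the two functions is what Hadoop does for you across the cluster.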

More soon…

(Thanks to Bob Muenchen of R for SAS and SPSS Users fame for corrections.)

Comments? [1]

* * *


Great site for Math, Accounting, Common Sense · 11/18/2008 03:54 PM, Analysis

One of the better sites for understanding things like what we mean when we say average, math, statistics, accounting, basic business, etc.

Highly recommended, whether you are an analyst, you want to be an analyst, or you want to debunk your analyst and impress people at parties…

Read at least one article there a day in the new year.


* * *


Scales, or how to think about a weighty subject · 11/18/2008 01:36 PM, Tech

Scales are a fun thing. There is a ton of research on how these things work, mostly in Psychometrics.

For example, does it matter if the scale has 5 points or 7 points? Should you anchor the endpoints (Very Dissatisfied 1 2 3 4 5 Very Satisfied)? How about the midpoint? I won’t even go into whether you should treat these as categorical or continuous or ordinals, nor when a scale is a Likert vs. not.

Survey analysis, psychometrics, all sorts of studies have given some answers, but as in so many cases, the answers tend to be “it depends”.

But there are some cases where scale choices were just dumb.

For example, the Vista Windows Experience score. This is a tool that you can run to see if your PC is ready for Vista, and it’s also built into Vista. It basically takes 4 components of performance, picks the lowest, and makes that your score.

So, my new PC scored a 5.5, and I was upset. For what I paid, I expected more than the middle of the range. And MS didn’t provide any guides as to what was a good or bad score. I tried various tweaks, but still, I was hampered by my memory speed.

Running out of options, I looked for configs of people who had scored higher… and discovered that lo and behold, the range is from 1 to 5.9 (not 6, 5.9).

Hmm. So there is no mention of this on the tool, and only a few sites mention it. How many rules of information design, usability, and psychometrics does this break? Lots.

Why did MS choose this odd range? They were leaving some headroom. Of course, this is a silly reason. A score is either a sum of attributes, or it’s a scaled representation of a state. So, even though you got, say, 20 out of 37 right, we might give you a scaled score of 7 out of 10. The scale is a conversion of units. So, it will get to 10… and then 11 (requisite Spinal Tap reference).

So, why not just make the scale 1 to 5 now? Or 1 to 10? As soon as you hit 10, you either rescale (ie, make a new scale) or just extend it again and now the scale is 1 to 13 or whatever. So, leaving headroom keeps the endpoints arbitrary, and makes comparisons really strange. Also, assigning score by picking the lowest seems strange, compared to how we actually make composites (ie, see factor analysis).
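The unit-conversion point is just arithmetic; a throwaway sketch of a linear version (real scaled scores often run through nonlinear norm tables, which is how 20/37 could come out as a 7):

```python
def scale_score(raw, raw_max, scale_max, scale_min=0.0):
    """Linearly convert a raw score in [0, raw_max] onto a chosen scale."""
    return scale_min + (scale_max - scale_min) * (raw / raw_max)

# 20 out of 37 right, expressed on a 0-10 scale:
converted = round(scale_score(20, 37, 10), 1)
```

The scale endpoints are a free choice; the conversion is trivial. Which is exactly why capping at 5.9 “for headroom” buys nothing except confusion.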

This really doesn’t help. Arbitrary scales confuse people, and they make comparisons difficult. I happen to know MS has a bunch of really sharp psychometricians on their staff (Hello, Maria!). Perhaps they should start understanding more about how scales work instead of picking arbitrary limits, confusing users, and demonstrating yet again just how messed up Vista really is.


* * *


Start to Understand Search and Display Ad Interactions... · 11/05/2008 11:33 AM, Marketing Analysis

Great post by the smart Matt Lillig from the Yahoo! Analytics Team on the Yahoo Search Marketing Blog… It’s short, but packed full of ideas. Well worth reading.

Measuring Search and Display for Success.

If you’d like more detail, Matt has a post on his own blog with some implementation tips at Tracking Display Ads Is Easy With Yahoo’s Full Analytics

I’ve worked on more advanced versions of what Matt describes, and it’s amazing how these 2 approaches influence each other. Handling search with an eye on what targeting and creative you’re running in display can substantially increase your ROI, and recognizing from search-term specificity where a person is in their buying cycle, then having display ads reflect that, can really shrink time-to-buy and increase impact. This is just one example; getting these two groups (media buyers and the “direct search” teams) in organizations and agencies to talk to each other and coordinate can be tough, but once you get over that hurdle, it will change how you use both approaches.

Bonus: check the comments in the YSMBlog for additional thoughts from Mitch Spolan, who acts like a salesguy but roars with the heart of an analyst. Remember, not every thinker is in the analyst or research group; folks across your company can help you think about how to interpret your data and recommend effective approaches…

Comments? [1]

* * *


Enterprise Class? · 10/27/2008 07:00 PM, Analysis

Nice post by Aaron Gray of Webtrends:

What is Enterprise Class?

Worth reading; it mentions some of the same issues I’ve brought up around What Web Analytics is Missing….


* * *


Let down by databases... · 10/16/2008 11:02 AM, Tech Analysis

I’m mad… but mostly at myself. Here’s the story.

I was given a translation table linking 2 types of userIDs for some data we collected. Each system anonymized differently, so we needed the lookup to join the data. We’ll call this the Lookup table; at 13 million rows, it’s not tiny for your average analyst.

We also have various slices or segments of users we want to examine. These slices range from a 120k list… to a 19 million line list.

Varying table sizes, need for a join… sounds like a database problem to me! Just to set the stage, the IDs in question for both systems are 32-character alpha-nums.

I’ve already written the query in my head:

select ID2 
from lookup lkup, segmentlist seglst
where seglst.dID=lkup.dID 

(Yes, I still write my joins in the where clause; much more readable to me than those “inner join on” of the modern SQL)

Should be easy.

12 days later.

We’ve gone from MS Access (whose idea was that?) to SQLite to MySQL. We’ve played with indexes and configuration files, tweaking memory and caches all over the place. We’ve turned off everything else on the box so that this database is the only thing running, and even made just 1 query at a time. Some of these queries are running 3 days at a time! People say that joining varchars/text is slower than joining on numbers, but come on…

We’re looking now into Oracle XE and the MS SQL Server developer edition: both are limited to 1gb ram, 1 proc and 4gb of data, but they may be more powerful than MySQL.

Why? Well, it seems that none of these databases allow you to override the “avoid indexes when reading lots of rows” rule. It varies from DB to DB, but the rule of thumb is that if you have to read more than 10% of a table, you should skip the index and row-scan.

Now, to be fair, there is some good research around why this is so (see, for example, the PDF In Defense of Full-Table Scans by Jeff Maresh). But they all have a flaw: they don’t know how many rows really need to be read, so they guess… and when they guess wrong, the user suffers. And they don’t always take the hint.

In my case, I know that we should pick the smaller table, walk it, and use the index… because there (sadly for me) won’t be lots of matches. Because I know this, the index makes more sense: we don’t need to read every row looking for the match, we can just pop into the index, see if the row is there (it won’t be) and move on.

But no matter what type of index I tried (multiple single column indexes, combined or clustered indexes, indexes using a foreign key), the databases still wanted to do full table scans on both. This basically becomes a looped nested join, which in effect means “read 1 row from first table, read the entire second table looking for the match, read next row from first table, etc.”. While this might make sense if there are lots of overlaps… it didn’t fit what I needed.
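The difference in miniature: the nested-loop plan rereads the big table once per probe row, while an index (here a plain Python dict standing in for one) answers each probe in a single lookup. A toy sketch, with made-up IDs:

```python
def nested_loop_join(probe_ids, big_table):
    """What the databases insisted on: rescan big_table for every probe row."""
    matches = []
    for pid in probe_ids:
        for row_id, value in big_table:   # full scan, once per probe
            if row_id == pid:
                matches.append((pid, value))
    return matches

def index_join(probe_ids, big_table):
    """What I wanted: build the 'index' once, then one lookup per probe."""
    index = dict(big_table)               # hash index over the lookup table
    return [(pid, index[pid]) for pid in probe_ids if pid in index]

big = [("a1", 1), ("b2", 2), ("c3", 3)]
probes = ["b2", "zz"]
```

When most probes miss (as mine did), the index version does almost no work per row, which is exactly why the optimizer’s “skip the index” guess was so painful here.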

So, struggling over these almost 2 weeks, limited in my hardware choices, I rethought the problem.

Grabbing my copy of Beginning Python by the awesome Magnus Lie Hetland (highly, highly recommended), I wrote my own “read and merge” program.

That one that took 3 days, the merging of the 19 million to the 13 million row lookup? 15 minutes.

I did all the files in under 2 hours, including qa.

I basically read the lookup table into a Dictionary, and the “segment” table into a List. On my Windows laptop, it appears that a List can hold even 19 million items… but the Dictionary gave MemoryErrors after only 5 million or so. I wound up writing a version which read the entire segment into memory, but then chunked through the lookup table in 2-million-row chunks. Even with all that junk, it still flew.

Lessons learned?

1) Unless you have a reasonably powered box with a db that is configured for these kinds of data warehouse queries, you may find that stock/off the shelf databases will struggle under these loads. Lots of page churning and disk I/O will spread your query out.

2) Indexes aren’t everything… but they help. Oracle has Bitmap indexes which allow nice index merging (when you can force a query to use them) and I look forward to that dribbling down to MySQL (and other open source dbs)

3) Sometimes, a database is the obvious answer. It’s also sometimes the wrong answer (and I hate to admit that).

What about Hadoop, Bigtable, etc.? Supposedly these are the answers to big data problems… ’cept that they are still programmer tools, requiring Java expertise. Also, there’s still no SQL on these things, though Facebook’s Hive (PDF) may be the most helpful thing when it gets more accessible (well, either that or MS’s Dryad, which uses a SQL variant (and will probably move to LINQ)).

What about SAS or SPSS? I don’t have SAS on my box, so it’s out. And SPSS, even in version 17, lacks the speed that I’ve had in previous database work around complex joins… so I skipped it.

There are also columnar store databases, which might have helped in this case, but few are open source and most require more hardware than I had available to me.

And yes, before you jump on me, I’m sure there are some config tricks I didn’t think about… but I didn’t think I should have to get so tricky. It was a 1 column 19 million row table joining to a 2 column 13 million row table.

To analysts out there: learn a programming language. Pick either Ruby or Python (Ruby is more like Perl, so your perl friends can help out; Python is becoming fully integrated into SPSS) and don’t be afraid to use it to munch files as converters or even simple joiners.

Also, one trick suggested to me: skip the import stage if you are using Oracle, and use their External Table feature. It wouldn’t have sped up the joins, but it lets you skip the imports, which can be a nightmare in their own right.

So, I’m still mad that the databases let me down. This should have been an easy task. But it’s clear that the database world is so busy adding overhead by being all things to all people (we handle transactions! We handle complex multi-table joins in views! We handle BLOBs and Full Text Indexing) that the simple stuff gets washed away. And that’s really sad.

Also, I should probably request a beefier box sometime soon. So, I’m mad at the bad economy making it hard to allocate HW effectively.

But I’m mostly mad at myself. For making everything look like a nail to my database hammer, and for trying to solve the general situation (merge files!) with the DB instead of just writing a one-off point solution to just solve the problem (merge these files!). And wasting 12 days of my (and others’) time.

Lesson learned. I hope.

PS: Useful tools:
Notepad++ (PortableApps Edition)
PortableDB (Nice portable MySQL)
Python (though the Activestate one is not bad as well)
HeidiSQL (PortableApps Edition, MySQL query tool)
SQLiteMan (query tool for SQLite databases, which also are used by Firefox and Chrome for various storage things)
AstroGrep (Fantastic multi-file searcher and replacer, for fixing quotes/tabs/etc. Free, open source)


* * *


One of the most amazing videos... · 10/10/2008 03:03 PM, Trivial Personal

Ok, it is of a watch, so if you don’t love watches, then you may not be as impressed. But it’s a pretty amazing watch.

Reverso Gyrotourbillon 2
Uploaded by jaegerlecoultre

A Gyrotourbillon basically is a rotating spring to counteract the effects of gravity on the movement. A watch will lose time more or less depending on its angle during the day. The gyrotourbillon’s constant rotation eliminates that specific impact.

For more info, see Jaeger LeCoultre’s page on the watch.

And no, I don’t have one. But someday…

Comments? [1]

* * *


Nice interview with Bob Muenchen, author of R for SPSS and SAS Users · 09/30/2008 06:11 PM, Analysis

I’ve mentioned before how helpful Bob’s work has been as I struggle with, I mean, learn R. This interview paints a nice picture of a guy who saw a need and decided to just fill it.

As part of the interview, he talks about commercial versions of R, the role of textbooks in modern learning, and integration of R with other packages.

The original version of his document on how to use R if you know SAS or SPSS is online at, but you’ll want to get the full expanded book when it’s released in October, give or take.


* * *



powered by Textpattern 4.0.4 (r1956)