OTHER PLACES OF INTEREST
Danny Flamberg's Blog
Danny has been marketing for a while, and his articles and work reflect great understanding of data driven marketing.
Eric Peterson the Demystifier
Eric gets metrics, analytics, interactive, and the real world. His advice is worth taking...
Geeking with Greg
Greg Linden created Amazon's recommendation system, so imagine what can write about...
Ned Batchelder's Blog
Ned just finds and writes interesting things. I don't know how he does it.
R at LoyaltyMatrix
Jim Porzak tells of his real-life use of R for marketing analysis.
HOW DID YOU GET HERE?
Sometimes, in this world of magical analytics and open source software, the most basic stuff doesn’t get done. Today, we’ll talk about Tables. Yes, the simple process of making a nice looking table for a presentation is still a manual process of pasting into Excel and manually fiddling. When I can cluster gigs of data but I can’t get a good looking table of the results, something seems wrong.
Why is this so hard? Well, easy tables are easy but it can get more complex than you think.
1) Univariate or Bivariate
You can think of these as the classic list or the “2×2” crosstabs we see. These are very popular in that they are easy to create, pretty easy to understand, and have been presented in every Stats 101 class. The most common thing to see inside these tables are counts, but you could also have means of some measures, %age of row/col/table, running total, etc. Your table could be made of 2 Independent Variables, or 1 IV and 1 Dependent Variable. The stats for 2×2 tables are very well known; almost everyone can rattle off “Chi-Square Expected and Observed” though there are others that are used as well.
This can also be extended to more than 2×2: you can have 3×6, etc, but these are just more complex extension of the 2×2 case. In each one, we only have 2 variables we are examining, though each may have many levels.
Most every tool can do this, including Excel (Pivot Tables rock!).
2) More than 2 variables
Now, this gets more tricky. Let’s take Gender, Purchased vs. Not-Purchased, and Presence of Children (Y/N). If we run tables in most packages, we’d get something like this:
Children = Y
Children = N
That is, the tool just prints a 2×2 (or 3×3 or whatever) table filtered for each level of the 3rd variable. If you have a 4th variable, you can usually get the tool to run a bunch of 2×2s for each level of the 3rd and 4th variable combined. The tools can, of course, give all the usual stats for the 2×2 so you can figure out which are useful, and you can change the order of variables in the tables command to see different things in the 2×2… but this isn’t really what you wanted.
In fact, what you may have wanted might have been something like this…
In fact, you probably wanted the Y cell merged across, and the N cell merged across to make things look nicer, but my HTML isn’t so great. I don’t feel bad, however, because neither can most tools.
These types of tables, where you put additional splits layered on top of bivariates, are often called stub and banner tables or just banner tables or tabs in the market research world. And you’ve seen them in tons of reports, market research output, and even hand made them in Excel.
So, it seems like it should be easy, right? Well, I’ll run down the tools in a moment, but it’s really sad. Pretty much none of the open source options work, and even the commercial ones aren’t much better.
Ok, but it could be fixed, right? Well, there is one other rub…
3) Sample Weighting and Stratification
I mentioned that one of the most popular uses of these tables, these “tabs” or “banner reports”, is in market research. And most folks just assume that their survey sample represents the population and just do their counts. But more advanced researchers know that they have to weight the statistics to account for response bias. If you know that Females make up 50.7% of the US Popl (see http://quickfacts.census.gov/qfd/states/00000.html) and you only have 30% in your sample, you have to weight up their responses. This is easy in some cases, but some stats become very complex, especially if you have stratified sampling (Wikipedia explains it pretty well at http://en.wikipedia.org/wiki/Stratified_sampling).
So, not only would your tool need to display the tables better, but it should also handle the necessary statistics to display properly weighted counts, percentages, and analyses. In R, Thomas Lumley’s survey package does the stats, but even this package doesn’t display banners or tables very well.
So, just how bad is it? Well, let’s see.
Here are just 13 of the many ways to make tables in R: table, xtabs, ftable, ctab, summary (from hmisc), contingency.tables (from Deducer), VCD’s structable, aggregate, epi’s stat.table, rreport (but exports only Latex), xtable, gmodel’s CrossTable, ecodist’s crosstab.
Here are all the ways to make a true banner table: (cricket chirp, cricket chirp)
So, that’s painful. Another problem is the lack of graphical output. Since all of R’s table commands spit out text formatted tables, you can’t just copy and paste them into Excel (or other spreadsheet tools) to reformat them. This is a huge stopper to productivity. Instead, what you really want is either a) formatting control in the program to create a graphically appealing, copy and pastable table, or b) direct output to a system which facilitates this, like Openoffice, Excel, or HTML. Some R commands output to Latex (via sweave), but for the average analyst, this is unusable (I love academics, but come on, asking analysts to use Latex is just wrong! If sweave is the best we can do, then we are all in deep trouble.).
I’ll talk more about this below. Ok, we love R, but what about other options?
SPSS: If you can afford it, the wonderful CTABLEs aka Custom Tables module is really nice. Besides a great gui:
they also have the ability to treat a variable as a measure or a dimension as you wish. You can build the table WYSIWYG with as many layers as make sense, and put multiple measures in each cell with various constraints and conditions. You can combine levels on the fly and recalc the counts, which is fantastic. Gold star to SPSS for this one. If you use SPSS, you really should be using Custom Tables
Systat: Can only do 2×2 with the filter header. No banners at all.
Minitab: Usual 2×2 with filter header, no banners at all.
Statistica: Does offer stub and banner tables, but not much control over them… While it’s not SPSS, it’s also cheaper
Stata: Stata’s table commands are all text output based, and don’t really offer a banner table.
SAS: SAS has proc tabulate and proc report, and these start to get you to stub and banner… but require some coding, and are still text output. That being said, they are pretty far along, and so are 3rd after SPSS and Statistica.
Spotfire S-Plus: Same as R, just the 2×2 with filter header.
So, what do real market research folks do? Most use SPSS and SAS, or settle for the small number of overpriced “tabulation” programs still on the market which just make tables. Programs like Wincross, Quantum (the classic solution for many large survey houses, now owned by SPSS as SPSS-MR Quantum), Uncle Tabulation, and Marketsight [SAAS] are sort of helpful, but not cheap. For example, Wincross costs $2500 per user!
Are there any open source solutions? cCount is a DOS program which requires a compiler to run Quantum scripts. And that’s it.
I’m disappointed by what I’ve learned. If you want to make banner tables, you literally have to use Excel and hand construct them, or buy SPSS. There has to be another way.
I think R will be the solution. It’s current myriad of table commands stink, tis true, but I’ve started to put together an approach that combines the best of those table-oriented commands, the amazing magic of reshape, and some HTML/word/excel output to create good looking tables. It’s not there yet, but I’ll keep plugging away. In addition, the useR!2010 R Conference has some good posters and talks about similar problems of needing higher quality output, so we’ll see what comes about.
* * *