OTHER PLACES OF INTEREST
Danny Flamberg's Blog
Danny has been marketing for a while, and his articles and work reflect great understanding of data driven marketing.
Eric Peterson the Demystifier
Eric gets metrics, analytics, interactive, and the real world. His advice is worth taking...
Geeking with Greg
Greg Linden created Amazon's recommendation system, so imagine what he can write about...
Ned Batchelder's Blog
Ned just finds and writes interesting things. I don't know how he does it.
R at LoyaltyMatrix
Jim Porzak tells of his real-life use of R for marketing analysis.
HOW DID YOU GET HERE?
In looking over how little web analytics tools have changed over the last few years, I’ve been struck by one of the things that has held the tools back: the adherence, not (just) to the page view, but to the session.
Aka the visit, the session has caused no end of explanations. It’s one of those things, like unique visitors, that creates confusion for non-analytics folks. I always dread that part of the conversation where we explain that the “30 minute timeout” doesn’t mean that we chop sessions off after 30 minutes.
When you think about “processing” your own logs or tracking, what’s the first compute intensive task that comes to mind? “Sessionizing”. (Counting Unique Users gets partial credit, but that’s something databases do pretty well on their own, no special coding required).
In fact, I even start to wonder if it’s really relevant these days, at least the way we currently define it. It made sense back when we started all this stuff, but perhaps our definition, like so many things, needs some modernization to reflect current needs.
Define a Session
First, let’s all get on the same page (view. Hah!). Every web analytics tool these days follows the WAA definition, found in this PDF of Web Analytic Definitions:
A visit is an interaction, by an individual, with a web site consisting of one or more requests for an analyst-definable unit of content (i.e. “page view”). If an individual has not taken another action (typically additional page views) on the site within a specified time period, the visit session will terminate.
As you all know, we usually use 30 minutes as a timeout. (For those newbies out there, basically, we stitch all the requests together by some ID (cookie, whatever) and “walk along them”, starting a session with the first time we see the ID, and when we don’t see any activity from the ID for 30 minutes, we “close” the session. If the user pokes the site in a measurable way every 29 minutes for a day, we can have a 24 hour session totally legit.)
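For the newbies, that stitching logic can be sketched in a few lines of Python. This is a minimal sketch, not any particular tool's implementation; the tuple layout and the 30-minute timeout are illustrative:

```python
from datetime import timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """Group (visitor_id, timestamp) hits into sessions using an
    inactivity timeout. `hits` must be sorted by timestamp."""
    sessions = {}   # visitor_id -> list of sessions (each a list of timestamps)
    last_seen = {}  # visitor_id -> timestamp of that visitor's last hit
    for visitor_id, ts in hits:
        prev = last_seen.get(visitor_id)
        if prev is None or ts - prev > TIMEOUT:
            # no activity for 30+ minutes: "close" the old session
            # and open a new one
            sessions.setdefault(visitor_id, []).append([])
        sessions[visitor_id][-1].append(ts)
        last_seen[visitor_id] = ts
    return sessions
```

Note that, exactly as described above, a visitor who pokes the site every 29 minutes keeps a single session open indefinitely: the gap never exceeds the timeout, so the session never closes.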
The Problem with Sessions
What’s the problem? A session is the wrong level of data tracking. Just like Page Views are what the tools want but we are all moving to Events, Sessions roll up lots of really important information (and by roll up, I mean either don’t track, or make it near impossible to get out of the tool).
Back in the day, when sessions were invented in a world where there was no Facebook or Twitter, the average website visit or session looked something like this:
We might see that most users are spending 30 seconds or less on each page, and a site visit may be 7-10 minutes in total; some a bunch longer, most shorter. But, it’s all pretty contiguous: the user enters, they read or shop or play a game or whatever, and then they leave.
But anyone looking at their data these days might see the following session:
What happened in this second session? Why did we have 7 minutes on a page, when we are used to much less? It’s hard to tell. You see, in that 2nd visit, our tool sessionized the entire thing, so we have what appears to be a user spending 7 minutes on a page that others spend usually 30 seconds on… and then converting with a coupon code from an affiliate that wasn’t present as part of the session campaign tracking. The session was attributed to the SEM vendor; why don’t we see the click from the affiliate? They actually closed the sale, but they aren’t getting credited. In some ways, this is the opposite of “last click” attribution: It’s “first click in the session” attribution!
So, we dig in with our web analytics tool. Did this user leave the site during that time, or open another tab to hunt around? Well, I can’t tell: since it was sessionized, most every tool throws away external referrers that are not part of a session initiation. That is, if it’s not the first page in a session, the tool doesn’t keep the referrer, even if it’s external. When it’s in the middle of a session, the referrer is, by our session definition, the previous page in the session… even when we kind of know it’s not.
Now, when I go to my logs (as sports fans know it, “Let’s go to the tape!”), it becomes more obvious what happened. User saw a product page, then popped a new tab/browser, hunted for a better price, didn’t find it (we rock!). They then hunted for a coupon code on an affiliate (which we recognize as one that does lots of good SEM/SEO work, hence the suspicion that the user was searching), clicked on that link… and since that was less than 30 minutes of dead time, the tool plops that marketing query string into the middle of the previously open session, we lose the referrer and potentially the query string… and we miss what really happened.
Most tools will throw away that mid-session query string and other mid-session data, unless we take special efforts to tuck it in somewhere. The campaign code query string won’t show up in the usual marketing reporting places, for example. We can make it an event in GA or Coremetrics, or dedicate a “track em’ all!” eVar in Omniture, but at the end of the day, the tools mostly assume a session has one source, and it’s on the first request of the session.
(Now, to be fair, Omniture does have a workaround of sorts for their conversion variables (including s.campaigns): http://blogs.omniture.com/2008/08/19/conversion-variables-part-ii/ points out that in the admin console, you can set First, Last, and Linear allocation in-session approaches. That’s pretty good… but it doesn’t solve the issues with referrers, nor other “in-session” questions like order effects of which marketing came first, etc. And, if you want to have more than one of these types of attributions (to compare First- vs. Last-in-session, for example), you need to fork out more variables, and Omniture only uses s.campaigns in all its “fancy” marketing dashes. So, workaround, but not a solution, imho.)
Do Sessions Really Answer Our Business Questions?
In fact, sessions kind of peanut butter over lots of interesting behaviors when we realize that a) our sites are part of a larger internet, where users hop between sites at the drop of a mouse, and b) modern large sites are often made up of multiple business interests (merchandisers of toys vs. dresses, business units of B2B arm vs. B2C arm, etc.) and their “sub-sections” need to be treated like mini-sites on their own.
This leads to some tough questions to answer with modern tools. Some users don’t care about “the site”, they care about their section: what drives traffic there? How do external-to-deep-drops work for them? How about internal promotions? How can they track time-per-visit “in their section” (vs. total site visit)? Others pay for clicks, and they want to account for all of them: sessionizing can hide away the interaction of how these multiple influencers drive your business. Besides the “view through” and earlier-in-the-funnel clicks, what about all those that happen in the session?
In fact, so many of the questions I get are intra-session questions: How do people use this “Section” of my site? Does “any one of these pages or types of pages drive certain behaviors”? Do people use multiple marketing channels in a session? Are people clicking into my site multiple times in a row from social as people point out different parts of my new products or new pages? How do I track internal promotions with the same powerful tools given for external marketing?
(If you got this far, I hope it’s clear that I’m not talking about last click attribution across sessions, I’m talking about multiple marketing drivers impacting the same session: people looking for pricing breaks by clicking on offers, clicking on coupons, trying aggregators, whatnot: they may “leave and come back” to your site 3-5 times in a session to see if a coupon code or cookie makes a difference. And guess what: that’s all 1 session to your tool.)
This session issue, btw, also affects reports like “exit pages”, “entry pages”, anything that is a by-product of the sessionizing experience. If we need to wait 30 minutes for a session to end, then an exit-and-then-an-entry within 20-30 minutes is still part of the same session… when we may really want to understand what’s going on here, the tool has already got its answer for us (hammer, meet nail).
So, if a user comes back and forth like that, hopping out to compare prices, coming back, then hopping out to find free shipping or discount codes, should we consider all of that behavior one session? Point, meet counterpoint:
So, yes, like every analyst, I want to solve the cross-session attribution problem. But I also spend time trying to explain why my affiliate vendor reports, on first glance, seem so much higher than what my web analytics tool would report based on campaign codes. Ignoring the usual slop of bad redirectors, yadda, yadda, we often find that it’s not just multiple channels that add up to drive online behaviors, but that they add up even in the same session... and my web analytics tool is not helping me understand this.
So, My Proposal
We can still do Visits, or Sessions and not lose history… but we should also “sub-sessionize” on arbitrary boundaries for the problems at hand. By arbitrary, I mean using boundaries that answer your business questions or your need for a deeper level of analysis.
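As a sketch of what “sub-sessionize on arbitrary boundaries” could look like in practice, here’s one way to cut an already-sessionized visit wherever a new marketing touch shows up. The hit fields `campaign` and `referrer_is_external` are hypothetical names for illustration; substitute however your logging stores them:

```python
def sub_sessionize(hits, is_boundary):
    """Split one session's ordered hits into sub-sessions, starting a
    new sub-session whenever is_boundary(hit) is true (e.g. the hit
    carries a campaign query string or an external referrer)."""
    subs = []
    for hit in hits:
        if not subs or is_boundary(hit):
            subs.append([])
        subs[-1].append(hit)
    return subs

# Illustrative boundary rule: any hit carrying a campaign code, or
# arriving via an external referrer, opens a new sub-session.
def campaign_or_external(hit):
    return bool(hit.get("campaign")) or hit.get("referrer_is_external", False)
```

The boundary function is the whole point: it’s a pluggable definition, so one analysis can split on campaign codes, another on external referrers, another on section changes within the site, all against the same underlying hit data.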
Yes, these “adjusted sessions” will have bigger counts than the usual sessions, and they will cause pv/session and other ratios to change. We aren’t throwing away the old stuff, so no worries; just consider these new metrics a different unit of analysis, and all those worries go away. You can now analyze and optimize processes in your site that sum up to the big picture, processes you couldn’t see under the old metrics.
The point is a commonly overlooked one: Web analytics isn’t about analyzing your web site. It’s about analyzing how your web site drives your goals. Don’t get stuck into trying to cram your business into how web analytics decided things would be 10 years ago. Think instead about how you can bend the tool to talk your business language, and around your needs.
Some benefits of “sub-sessions”: we can use the same technology and approaches to handle fractional attribution, pathing, etc, and understand what’s really driving conversions both across but now also within a session. If a bunch of my SEM budget is starting a session, but I’m also paying my affiliates in the same session, well, that just feels like a waste of money. And in most web analytic tools, I’d never see it (sure, I can use other tools to track this, but a web analytic tool seems like a good place to handle this, right?)
Is this happening to me?
How to tell? Besides going to your logs, one way is to set a tag/variable to always look for the campaign code info and tuck it into a page-level variable or event. Then count these up. If you get a different count here than you do from the various marketing-dashboard visit-level tools, then you know you have this “session-absorption” problem. If it’s pennies, don’t worry about it.
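Here’s a rough sketch of that comparison, assuming you’ve already tucked hit-level campaign codes into a page-level variable or event somewhere. The `campaign` field is a hypothetical stand-in for however your tagging stores it:

```python
def absorption_check(sessions):
    """Compare hit-level campaign touches against what first-touch
    session reporting would count. `sessions` is a list of sessions,
    each an ordered list of hit dicts with an optional 'campaign' key."""
    hit_level = 0      # every campaign code seen on any hit
    session_level = 0  # what the marketing report counts: first hit only
    for session in sessions:
        codes = [h["campaign"] for h in session if h.get("campaign")]
        hit_level += len(codes)
        if session and session[0].get("campaign"):
            session_level += 1
    return hit_level, session_level
```

If `hit_level` comes back noticeably larger than `session_level`, mid-session campaign touches are being absorbed, which is exactly the “session-absorption” problem described above. If the gap is pennies, don’t worry about it.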
But I think you’ll find that customers are doing a lot of things inside your session that you wish you could break apart. In some ways, this is just a recursion: Take the same tech and approaches used for “last click, first click, any click” attribution across session and run it across these “adjusted sessions” to show how multiple channels are being used to “seal the deal”.
It’s the forest for the trees: while standing in “Website Meadow” in the center of the forest and staring out to the horizon to see how those “first touches” impact your customers way at the edge of the forest, you may be missing the fact that down at the treeline in this meadow, you have too many last touches munching away at your sandals.
(BTW: I too thought that Omniture’s Linear Allocation or Participation concepts would solve these in-session problems, but they only apply to pages as of this writing, not other data types: http://blogs.omniture.com/2009/01/13/participation-inside-omniture-sitecatalyst/. And the Cross-Visit-Participation plugin is really for cross-visit attribution, not in-session, though it could be used here, with the same problems as mentioned for the allocation issues above. But excellent Omniture-Fu you had in thinking those would apply! Call me for a job!)
* * *
We have sort of gotten used to airlines charging us now for services that used to be included… while not lowering the original price. They used high fuel prices as the original excuse, but when fuel prices went down, the prices didn’t roll back, did they… hmmm.
But in the past few weeks, I’ve discovered lots of interesting things about hotels that make me not want to trust them any more than I trust the airlines. And the “middlemen”, the modern travel tools, actually are destroying more goodwill than they are creating.
Agent vs. Aggregator
Some sites are travel agents. Travelocity, Expedia, the original ones: they either pre-buy some inventory and sell it to us, the travelers, or when you are ready to buy, they quickly buy the room and turn around and sell it to us. Could we have gotten that room at that price ourselves just by going to the hotel directly? Sure; in fact, we could have gotten a better rate… but the convenience of the tool made us book through them. This seems fine… til you want to change.
You see, you didn’t make the reservations. Hotels.com, Expedia, whatever made the reservation. You can ask them to help you with the change, but they are the owners of the reservation. So watch how this works.
You get a room at ComfortSuites at 5:30pm on hotels.com. At 5:35pm, however, you need to cancel it. You call the hotel, but they tell you that they can’t help you: the reservation is made at hotels.com, so they own it, not you. So you call hotels.com (or waste hours chatting with them or sending email with a turnaround time of 4 hours to get a form reply) to cancel… After 30 minutes of hold, they tell you that they can’t cancel the reservation, it’s all already paid for. In fact, you’ve paid hotels.com and they’ve paid the hotel. So unless the hotel refunds the money, hotels.com won’t pay you. But wait, didn’t we just learn that the hotel won’t do anything because hotels.com owns the reservation and you don’t?
At the end of the day, both sides can point the finger at each other and you, the consumer, are out of the money. Even worse: change fees are now mandatory on almost every change, and they still don’t guarantee you a seat on another flight; hotels can charge 1 night’s stay for any change… and they can pick the most expensive night if you have multiple rates. It just goes on and on. You agree to all this, btw, the moment you pay: it’s all there in the clickwrap you agreed to. You did read all that, didn’t you?
Note the complete imbalance: the planes can overbook and then bump you. The hotel can be out of rooms even though you have a “reservation”, and they can force you to stay at another “sister” location. You rarely get much if any compensation for this… but when you try to change things on them, well…
Ok, enough ranting. You know you have to travel, so what can you do to mitigate this? Kayak.com is still an affiliate play, linking you back to the branded sites of the provider, as is the wonderful Farecast, now Bing Travel. But the big guys: Priceline, Expedia, Orbitz, Travelocity… well, I’d say use them to find a deal, but book it directly with the company you are traveling with or staying with. Otherwise, you are just setting yourself up for pain the moment anything changes. And if there’s one thing that’s constant these days, it’s change.
* * *
No question, now is a golden time in New York for data guys and gals. There are some amazing companies doing some amazing things out here. Sometimes, you even think you are in the Valley of Sili… well, until you almost get hit by a cab as you cross the street to buy a tiny bag of no-name potato chips for $4.
Wanted to update you all on why I’ve been so quiet since the summer.
I was lucky enough to be called by two friends, each of whom wound up at Citibank, and they are the kind of people who make things happen. Citi is undergoing a huge revival: they’ve repaid their TARP money, dumped all their risky debt into a holding company, and focused on becoming a “Digital Bank”. Now, that’s easy to say, but Citi is actually putting their money where their mouth is (but in a bank-safe way, of course). This includes everything from changing the web site to rolling up tablet and smartphone apps to integrating social and working to supply a financial-services aspect to many of the ways we work online.
Oh, and they are doing this across 80 or so countries.
So, I’ve joined as a Director of Digital Analytics for North America. This is in the Global Consumer Marketing and Internet group (GCMIO), in the “Decision Management” arm (basically, Analytics), and involves optimizing our digital touchpoints (sites, apps, marketing, mobile, even ATMs) as well as showing how digital can integrate with other marketing (including some great Media Mix Modeling work). (BTW, titles don’t work the way you expect in a bank: lots of Vice Presidents, fewer Directors… more like the way agencies tend to title.)
Look, I’m as surprised as you that I’m now a banker (formally, an “Officer of the Bank”). But there were some great things to consider:
We are located in Long Island City (aka, LIC) just across the river from Manhattan. It’s not really on Long Island, but is instead the first subway stop in Queens on the E line. We get a great view of the city… which we stare at longingly wondering why there are no restaurants, Starbucks, or Duane Reades outside our 50 story building.
Why not a startup? Truthfully, the final decision came down to some rapidly growing startups vs. some big brands. But one thing I find I really enjoy is teaching folks about how to use digital data, how to think about targeting, testing, and the fun that can happen as you move into Agile approaches… and for all its knowledge about international currency approaches and risk assessment and credit usage, Citi still needed some pushes into new ways to think about using digital data. To be able to work with the guys who manage zillions of dollars worth of transactions and see how we can turn that power onto digital data… well, that seemed cool. I’m sure I’ll wind up with a startup in the next go round, but for now, I’m having fun.
And why so quiet? Well, to be fair, Citi is still a bank, and has some “banky” type restrictions. One of them is lack of access to some communication sites and experiences from work, including the major webmail sites. You can see why they’d be concerned if folks were using Hotmail to send out customer financial info. So, that does make it more difficult to be everpresent online… but I’ve been finding ways to get back into collaborating with others online during the day (thanks, iPhone and iPad!).
So, look for changes in Citi.com, in how you might interact with your Citi credit card or bank account in multiple ways online, and think about how you’d want to work with your bank and your money online… and if your current bank or credit card isn’t doing what you want, why not let me know, and give Citi a try?
Oh, and if you are an analytics person, either stats/modeling or web-digital analytics, and you are looking to try a new challenge… no matter when you read this, we have openings. We have offices in every major country. And we’ll give you a chance to start fixing all those things you don’t like about banking online. What more could you ask? Here are some links to try:
Citi GCMIO jobs and Citi Omniture jobs
Here’s one open RIGHT NOW: Lead our online testing and optimization work, including landing page and site page optimization. Testing is going to be part of everything we do, but it has to be done right in the early stages to show real impact. It can range from marketing testing all the way through to site functionality, and everything in between. You can be the one to make it happen. Click HERE for the DM Director Digital Testing and Optimization job.
* * *
While I’ve been between full time gigs, I’ve had the chance to do some consulting with a couple of companies, and I wanted to describe a bit of what I’m seeing.
One of the more interesting companies I’ve worked with recently is Rocket Fuel, Inc.
The founders include George John, engineering lead behind Yahoo!‘s BT (behavioral targeting), Richard Frankel, product lead behind most of Yahoo!‘s targeting products (and an early employee of NetGravity; if you remember that one, you know how long he’s been in this game), and Abhinav Gupta, who built many of the analytic data systems powering BT at Yahoo!.
They very quickly added talent including another co-worker of mine, Jarvis Mak, who built up their analytics and client services team, and he was kind enough to allow me to help out in their NYC office.
What does Rocket Fuel do? We all know that you can use various data points to buy ad inventory off of the exchanges, but just throwing data at the problem isn’t enough. Instead, you need to decide which data is most predictive to drive the behavior of interest. There are pre-packaged “BT” categories from many vendors, but those are built to be general models. The RF team has built the next generation of those models, using many more data points but also including much more flexibility in their approach. In effect, they build a custom configured model for each campaign designed to optimize the advertiser’s specific behavioral goal and optimize the spend across multiple exchanges and tactics. Their approach also allows for very rapid updating of the models: Not just rapid rescoring, but literally rapidly rebuilding to take into account the most recent behavioral data they’ve encountered. That might range from just coefficient updates to actual model vs. model comparisons to account for new variables. Finally, they are hitting moving targets: as the exchanges move to real-time-bidding, they can optimize the offer in real time to recognize changes in the market.
There were a couple of things they are doing these days which exemplify the modern data-marketing approach:
They keep ALL the data. Every impression was stored with every piece of data they could link to it. They did a nice job of using HBase for some processing with a Hadoop/Hive system for others. Some modeling was experimented with in R or MATLAB, but the heavy duty stuff quickly productionized to Hadoop. And because all the data was there, pretty much any question I had could be answered using all the data that occurred during a campaign without too much waiting for the query to finish. Yes, Hive is a queue oriented system, but these guys had some great UDFs which reduced some of the MR phases, especially around joins. Knowing that all the data is available, it became kind of fun to be able to kick off queries around, say, every cookie who received more than X of our ads over the last month to see just which cookies we were seeing too many of, and start to dig into why. While I’ve had this elsewhere, often the databases were just not designed to deal with this volume… Rocket Fuel was built around it.
It’s not enough to just be an automated system when dealing with marketing. The recommendation guys learned this long ago: If you just do recommendations without recognizing the difficulty of interfacing with content systems and supply/inventory systems, you don’t have much success; similarly, most of the successful email marketing companies for the largest brands (e-Dialog, Responsys) have built great service teams on top of their powerful tech. While Rocket Fuel’s success is driven a lot by the effectiveness of their models, different clients need different levels of help with everything from conversion pixel placement to how to understand their results.
Honestly, I was surprised. I expected to find a magic machine pumping out predictive models that automatically scored everything and had great performance and used APIs on the exchanges to just get ads up and out to the right people. The founders would be modeling and coding, and the rest would just be automated. Instead, Jarvis Mak has had to build a good team across the country of account reps who assist with campaign design and delivery, and analysts who assist with strategic consulting, pre-and in-campaign forecasting and evaluation, and post-campaign recommendations. Together, they help the marketers move from tactical to more strategic targeting. There are also people working on deals to get access to content pools at pre-negotiated pricing or that aren’t easily available in the exchanges, and people working on dealing with trafficking issues (can you believe this is still a problem after all of this time?) or data quality. All the tech we have today doesn’t eliminate the need for these services, so if you are building your startup and assume it will just work and mint money while you sleep, well, if it involves something online, you should assume it will need more care and feeding from humans than you ever imagined. And as I’ve chatted with other data-centric companies small and large, I’m seeing a similar pattern. Sure, the tech does a lot… but you still need the people.
They experimented. A lot. One advantage of being a small company (well, growing fast, and so not as small as they used to be) is that they can really try to innovate in things like targeting for brand impact or targeting for social-network growth. Even within a client’s campaign, they have the ability to throw in extra variables into small holdout groups to see how their presence impacts model performance; if they win, they migrate to more of the campaign; if not, no harm done.
Their business model is a mix of services, and that in itself is an experiment. Should they have an API to allow others to use them as a full DSP? Should they expose more external reporting to advertisers who want more visibility? Should they add more people to the client services team? They have different clients with different relationships, to see which is most effective. They haven’t had to pivot really hard, but they’ve certainly taken on certain client relationships that they recognize were not scalable, and minimized them. But without experimenting with the business model, it can be hard to know what’s both profitable and scalable.
That agile nature was also applied to the back-end systems. I’ve mentioned the rapid model updating, but they also were rapidly rolling out improvements to all sorts of systems. For example, their Hive install leveraged the Cloudera distro for things like the Hue user interface… but they had built some really clever custom functions which eliminated the need to do some joins, speeding up both the analyst queries as well as the data feeds for some of the R&D models. These functions were written as rapidly iterating agile projects, so they met analysts’ needs almost immediately. Having real analytic-system coders who are familiar with Hadoop and its family is a completely different (and much more enjoyable) experience than dealing with the usual db programmer with 20 years of PL/SQL on simple, small transactional data.
Rocket Fuel was very, very cloud based for basic business apps. From using Google apps to manage communication and basic documents, SugarSync to manage file sharing, wikis to manage knowledge sharing, Salesforce to manage the sales process, nothing major for day-to-day work had to be stored on any local storage. Instead, much of the day-to-day work could run off of light and cheap boxes with Chrome. Now, of course folks used MS Office and Project and whatever when they had to… but I was surprised at how rarely that came up. Interestingly, when I asked about putting some of the data or ops systems in a cloud, they mentioned that in their constant cost-benefit evaluation, having the hardware under their control allowed them to reduce latency in a way that the cloud vendors couldn’t meet at scale. I suspect that will change in the future, but it jibes with what I’ve heard from others: cloud PaaS/SaaS are great for proof-of-concept, but when you really need speed, you need to control your HW.
As a startup, they skimped on some things… but they made coming to work pretty nice. From incredible daily lunches in Redwood Shores to a fully stocked fridge and pantry in NYC, from poker-and-RockBand afternoons to kayak breaks, they recognized that they had to move fast but could have fun at the same time. Yes, it’s one thing to say “Google does that too!”... but the Goog has a zillion dollars; it’s another to say “every startup starts out doing that stuff to get talent”... but how many keep doing that even as they grow well past fitting the entire company in 1 room?
So, Rocket Fuel was a pretty cool place. But they are in a tough industry. Everyone in this space seems to tell the same story of data-driven marketing: “We use data to optimize the ad by showing the right creative to the right person at the right time at the right price”. When you get deeper into it, you can start to differentiate, but the average marketer can easily get confused around the differences between X+1, RocketFuel, DataXu, Interclick, Turn, or the zillion others.
Also, there are still massive inventory issues: lots of exchanges and places to buy, but still limitations on some of the “good stuff” or high quality inventory. For example, many advertisers like expandables (ads which can grow outside of their original unit size), but those are hard to get on the exchanges, so the full power of the algorithms is not always brought to bear (for any of these companies). Agencies building their own trading desks and linking those to the publisher private exchanges may also remove some of the better inventory, but that remains to be seen; the agency trading desks are very new, and they vary in sophistication.
Another issue: part of the value in this ad network business is an arbitrage play. If you can buy inventory for $0.01, but your models show that it’s being shown to a person worth $0.05 to your advertiser, then you can pocket the difference. This falls apart if inventory rises in price (for certain types of creative, or high quality or focused content, the inventory can be expensive) or if advertisers don’t want to pay more for targeting (some advertisers can afford to pay more than others), and also falls apart if the arbitrager can’t show the additional value to defend the higher price. All players in the exchanges have this as part of their playbook, though the better players are trying to build their business on more than just this pricing play. The AdExchanger blog is a good place to see how folks talk about the exchanges and how the online ad business is changing.
Now, many of you know that I’m also a big fan of X+1 (warning: they have an autorunning talker so turn down the volume if you visit).
There are similarities and differences between the companies, beyond the fact that X+1 is East Coast and Rocket Fuel is West Coast (like rappers, everyone has a favorite side of the country). I very much like how X+1 tries to optimize the entire experience, from ad-selection all the way through to landing page optimization (like an Optimost light). I think that’s very compelling, since it uses all the data available to make the entire experience consistent, and also gives some additional attribution capabilities. Rocket Fuel has chosen to focus on the ad optimization side only, and to their credit, does a great job of it. Also, X+1 offers a full DSP suite, while Rocket Fuel has tended towards more internal management of campaigns. Talent at both companies is top notch, and it’s hard to pick one over the other on that front. If you are considering using an audience optimization service, these should both be on your short list.
BTW, Rocket Fuel careers and X+1 careers are both hiring, and if you like working with big data, big optimization problems, and big clients… either of these are great choices, and tell them I sent you!
* * *
Wow, that was a long break, wasn’t it?
I wanted to update folks that I’m no longer with Barnes and Noble. First off, before anyone jumps to any conclusions, I enjoyed my time there and have lots of respect for the great work being done with the nook and the site experience on BN.com, as well as the impressive advances in marketing for both the site and the ebook ecosystem. There are some smart folks there working very hard to compete against Amazon, and if you think your job is tough… imagine going up against Amazon’s core business each and every day.
I said it back then, and I’ll say it again: very few other companies have such a large database of media choices by so many customers (I can’t say the real number, but many, many millions) over such a long time (over 10 years!).
So, why leave? Well, I had worked pretty hard on centralizing the data, in the belief that a full view of the customer would help improve marketing, customer experience, and actually using the data to change the business. Metrics are nice, but it takes some effort to move beyond that and actually use the data as part of changing your business.
And we did achieve a lot:
But things change. As the company changed, it reorganized, and decided that some tactical needs should outweigh the longer-term delivery of my approach. The drive to centralize became a drive to split analytics resources across the different business needs: Marketing wanted to own its analytics resources, the ebooks team had some great questions and wanted dedicated support, and the site teams wanted more dedicated resources of their own. But as we let each group optimize its own silo, the original vision fell by the wayside in the drive to get more tactically focused: evolving a shared customer record and the tools to leverage it, and using that data across channels to enhance the customer experience and create long-term, satisfied, profitable customers.
Strategically, there is no right answer for centralization vs. decentralization. If 3 legs of the stool are needed to hold it up, then making sure each one gets exactly what it needs is a compelling case. And if you’ve seen the stock recently, it’s hard to argue that some things didn’t need to change in the company. I also can’t complain about the need to fix some of those tactical issues: the way Barnes & Noble shows up in SEO and SEM these days is light years ahead of where it was when I started, for example.
So, given those necessary changes, BN and I agreed that where they wanted things to go didn’t quite fit into where I wanted to be, and we agreed to part ways.
That doesn’t mean I’ve stopped believing they will do great things… on the contrary, analysts like Basia Fabian, Vince Ovlia, and Ana Kravitz will continue to turn raw data into useful insights. And smart cats like Emmy Davis in on-site Search, Kristina Stern in Search Marketing, and Jerram Betts in SEO will continue to change how BN helps you find what you are looking for in media (and beyond).
The “make the data dance” baton, in some ways, has been passed to Marc Parish who runs all retention and database marketing over at BN (including that classic Membership program), and he’s so good, he even taught me a thing or two! (wow, what an ego I have… 8-) ) Make sure to catch Marc’s talks at Strata and other big data conferences around the country. And there are a bunch of other talented folks working together across NYC and Palo Alto.
As for me, what’s next? Well, I’ve filled my dance card with some consulting gigs, which have been fun, and I’m on the verge of the next great adventure (more on that in a bit). There are amazing startups doing some really cool stuff, and NYC has been surprising me with the technical and data sophistication of some of the folks I’ve been talking to. It’s not just media out here these days!
So, keep using your nook (esp. once you’ve put a newer version of Android on it), keep visiting your local B&N, and keep checking out BN.com.
And keep checking out NetTakeaway.com! Expect a backlog of posts to start showing up soon…
* * *
Sometimes, in this world of magical analytics and open source software, the most basic stuff doesn’t get done. Today, we’ll talk about Tables. Yes, the simple process of making a nice looking table for a presentation is still a manual process of pasting into Excel and manually fiddling. When I can cluster gigs of data but I can’t get a good looking table of the results, something seems wrong.
Why is this so hard? Well, easy tables are easy, but things get more complex than you might think.
1) Univariate or Bivariate
You can think of these as the classic list or the “2×2” crosstabs we see. These are very popular in that they are easy to create, pretty easy to understand, and have been presented in every Stats 101 class. The most common thing to see inside these tables are counts, but you could also have means of some measures, percentage of row/column/table, running total, etc. Your table could be made of 2 Independent Variables, or 1 IV and 1 Dependent Variable. The stats for 2×2 tables are very well known; almost everyone can rattle off “Chi-Square Expected and Observed” though there are others that are used as well.
This can also be extended to more than 2×2: you can have 3×6, etc., but these are just more complex extensions of the 2×2 case. In each one, we only have 2 variables we are examining, though each may have many levels.
Most every tool can do this, including Excel (Pivot Tables rock!).
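To make the 2×2 case concrete, here is a minimal sketch of computing the crosstab counts and the chi-square statistic by hand in Python. The visitor records are made up for illustration:

```python
from collections import Counter

# Hypothetical visitor records: (gender, purchased) pairs, made up for illustration.
records = [
    ("F", "Y"), ("F", "N"), ("F", "Y"), ("F", "N"), ("F", "Y"),
    ("M", "N"), ("M", "N"), ("M", "Y"), ("M", "N"), ("M", "N"),
]

counts = Counter(records)                      # observed cell counts
rows = sorted({g for g, _ in records})         # ["F", "M"]
cols = sorted({p for _, p in records})         # ["N", "Y"]
n = len(records)

row_tot = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_tot = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_sq = sum(
    (counts[(r, c)] - row_tot[r] * col_tot[c] / n) ** 2
    / (row_tot[r] * col_tot[c] / n)
    for r in rows for c in cols
)
```

This is exactly the “Expected and Observed” arithmetic the tools report for you; with the toy data above it works out to about 1.67.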
2) More than 2 variables
Now, this gets more tricky. Let’s take Gender, Purchased vs. Not-Purchased, and Presence of Children (Y/N). If we run tables in most packages, we’d get something like this:
Children = Y
Children = N
That is, the tool just prints a 2×2 (or 3×3 or whatever) table filtered for each level of the 3rd variable. If you have a 4th variable, you can usually get the tool to run a bunch of 2×2s for each level of the 3rd and 4th variable combined. The tools can, of course, give all the usual stats for the 2×2 so you can figure out which are useful, and you can change the order of variables in the tables command to see different things in the 2×2… but this isn’t really what you wanted.
In fact, what you may have wanted might have been something like this…
In fact, you probably wanted the Y cell merged across, and the N cell merged across to make things look nicer, but my HTML isn’t so great. I don’t feel bad, however, because neither can most tools.
These types of tables, where you put additional splits layered on top of bivariates, are often called stub and banner tables or just banner tables or tabs in the market research world. And you’ve seen them in tons of reports, market research output, and even hand made them in Excel.
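The computation behind a banner table isn’t hard; it’s the layout that tools refuse to do. As a rough sketch, here is what a stub-and-banner layout amounts to: one stub variable down the side, two banner variables nested across the top. The records are hypothetical, purely for illustration:

```python
from collections import Counter
from itertools import product

# Hypothetical records: (gender, children, purchased), made up for illustration.
records = [
    ("F", "Y", "Yes"), ("F", "Y", "No"), ("F", "N", "Yes"),
    ("M", "Y", "No"),  ("M", "N", "No"), ("M", "N", "Yes"),
]

counts = Counter(records)
genders, children, purchased = ["F", "M"], ["Y", "N"], ["Yes", "No"]

# Banner across the top: Children nested within Gender.
banner = list(product(genders, children))   # (F,Y) (F,N) (M,Y) (M,N)

# Stub (Purchased) down the side, one banner column per Gender/Children combo.
header = "Purchased | " + "  ".join(f"{g}/{c}" for g, c in banner)
lines = [header]
for p in purchased:
    cells = "    ".join(str(counts[(g, c, p)]) for g, c in banner)
    lines.append(f"{p:>9} | {cells}")
table = "\n".join(lines)
print(table)
```

That nesting of banner variables across the top, with merged group headers, is the part almost no package will render for you.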
So, it seems like it should be easy, right? Well, I’ll run down the tools in a moment, but it’s really sad. Pretty much none of the open source options work, and even the commercial ones aren’t much better.
Ok, but it could be fixed, right? Well, there is one other rub…
3) Sample Weighting and Stratification
I mentioned that one of the most popular uses of these tables, these “tabs” or “banner reports”, is in market research. And most folks just assume that their survey sample represents the population and just do their counts. But more advanced researchers know that they have to weight the statistics to account for response bias. If you know that females make up 50.7% of the US population (see http://quickfacts.census.gov/qfd/states/00000.html) and you only have 30% in your sample, you have to weight up their responses. This is easy in some cases, but some stats become very complex, especially if you have stratified sampling (Wikipedia explains it pretty well at http://en.wikipedia.org/wiki/Stratified_sampling).
So, not only would your tool need to display the tables better, but it should also handle the necessary statistics to display properly weighted counts, percentages, and analyses. In R, Thomas Lumley’s survey package does the stats, but even this package doesn’t display banners or tables very well.
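The simple (non-stratified) version of that weighting is easy to sketch: each respondent gets a weight of population share divided by sample share for their group, so weighted counts recover the population proportions. The 50.7% vs. 30% figures below are the ones from the example above:

```python
# Post-stratification weighting sketch: re-weight a skewed sample so gender
# proportions match the population figures quoted above. Sample is made up.
population_share = {"F": 0.507, "M": 0.493}   # target proportions
sample = ["F"] * 30 + ["M"] * 70              # 30% female sample, as in the example

n = len(sample)
sample_share = {g: sample.count(g) / n for g in population_share}

# Each respondent's weight = population share / sample share for their group.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# A weighted count then recovers the population proportion:
# 30 respondents * (0.507 / 0.30) weight = 50.7 out of 100.
weighted_f = sum(weights[g] for g in sample if g == "F")
```

This is the easy case; once strata and complex designs enter, you want the variance math done properly too, which is what Lumley’s survey package handles.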
So, just how bad is it? Well, let’s see.
Here are just 13 of the many ways to make tables in R: table, xtabs, ftable, ctab, summary (from Hmisc), contingency.tables (from Deducer), vcd’s structable, aggregate, Epi’s stat.table, rreport (but exports only LaTeX), xtable, gmodels’ CrossTable, ecodist’s crosstab.
Here are all the ways to make a true banner table: (cricket chirp, cricket chirp)
So, that’s painful. Another problem is the lack of graphical output. Since all of R’s table commands spit out text formatted tables, you can’t just copy and paste them into Excel (or other spreadsheet tools) to reformat them. This is a huge drag on productivity. Instead, what you really want is either a) formatting control in the program to create a graphically appealing, copy-and-pastable table, or b) direct output to a system which facilitates this, like OpenOffice, Excel, or HTML. Some R commands output to LaTeX (via Sweave), but for the average analyst, this is unusable (I love academics, but come on, asking analysts to use LaTeX is just wrong! If Sweave is the best we can do, then we are all in deep trouble.).
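Option b) isn’t even hard: a counts table rendered as a plain HTML table opens in a browser and pastes cleanly into a spreadsheet. A minimal sketch, with made-up counts:

```python
# Minimal HTML table writer: render a dict of counts as an HTML table that
# can be opened in a browser or pasted into Excel. Counts are made up.
counts = {("F", "Yes"): 3, ("F", "No"): 2, ("M", "Yes"): 1, ("M", "No"): 4}
rows, cols = ["F", "M"], ["Yes", "No"]

html = ["<table border='1'>"]
html.append("<tr><th></th>" + "".join(f"<th>{c}</th>" for c in cols) + "</tr>")
for r in rows:
    cells = "".join(f"<td>{counts[(r, c)]}</td>" for c in cols)
    html.append(f"<tr><th>{r}</th>{cells}</tr>")
html.append("</table>")
page = "\n".join(html)
# with open("table.html", "w") as f: f.write(page)
```

Nothing fancy, but it shows how low the bar is: the hard part for the stats packages is apparently not the HTML, it’s deciding to emit anything other than monospaced text.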
I’ll talk more about this below. Ok, we love R, but what about other options?
SPSS: If you can afford it, the wonderful CTABLES, aka Custom Tables, module is really nice. Besides a great GUI,
they also have the ability to treat a variable as a measure or a dimension as you wish. You can build the table WYSIWYG with as many layers as make sense, and put multiple measures in each cell with various constraints and conditions. You can combine levels on the fly and recalc the counts, which is fantastic. Gold star to SPSS for this one. If you use SPSS, you really should be using Custom Tables.
Systat: Can only do 2×2 with the filter header. No banners at all.
Minitab: Usual 2×2 with filter header, no banners at all.
Statistica: Does offer stub and banner tables, but not much control over them… While it’s not SPSS, it’s also cheaper.
Stata: Stata’s table commands are all text output based, and don’t really offer a banner table.
SAS: SAS has proc tabulate and proc report, and these start to get you to stub and banner… but they require some coding, and are still text output. That being said, they are pretty far along, and so come in third after SPSS and Statistica.
Spotfire S-Plus: Same as R, just the 2×2 with filter header.
So, what do real market research folks do? Most use SPSS and SAS, or settle for the small number of overpriced “tabulation” programs still on the market which just make tables. Programs like Wincross, Quantum (the classic solution for many large survey houses, now owned by SPSS as SPSS-MR Quantum), Uncle Tabulation, and Marketsight [SAAS] are sort of helpful, but not cheap. For example, Wincross costs $2500 per user!
Are there any open source solutions? cCount is a DOS program which requires a compiler to run Quantum scripts. And that’s it.
I’m disappointed by what I’ve learned. If you want to make banner tables, you literally have to use Excel and hand construct them, or buy SPSS. There has to be another way.
I think R will be the solution. Its current myriad of table commands stinks, ’tis true, but I’ve started to put together an approach that combines the best of those table-oriented commands, the amazing magic of reshape, and some HTML/Word/Excel output to create good looking tables. It’s not there yet, but I’ll keep plugging away. In addition, the useR!2010 R Conference has some good posters and talks about similar problems of needing higher quality output, so we’ll see what comes about.
* * *
Looking over my “Questions to Answer” list, one kept coming back to haunt me. I was attacking pieces of it, but I realized that it’s a huge gap in the web analytics world, and I want to get people thinking about it.
So much attention has been focused on Marketing Attribution: users see my marketing across multiple channels, so I have to combine them in a weighted fashion to “divvy out credit”, to decide what combination of marketing is most effective for me. Simple versions are the “Last Click” attribution we all know and love (well, put up with). More advanced models look at a combination of metrics (a mix of First Click, Average, and Last Click attribution) or use statistical models to weight it out. Some just give you tools (the Atlas Engagement Mapping model, for example) to let you choose your weights, but do not actually optimize the weights for you. And some folks out there say the actual weights don’t matter: let a model optimize on your behalf and don’t worry about it (media mix companies come to mind).
Now, those are all interesting, but now that I am client-side, I realize that e-commerce sites have the exact same problem on the site. That is, I have a search function, I have product pages, I have a home page, and other functions and pages. How do I attribute an eventual conversion to these features? How do I decide where I need more investment, and what features/pages are doing fine?
In fact, I do want to give credit to marketing sources and provide ROI. But I also want to understand what on my site is contributing effectively to driving conversion, and what is merely assisting.
For all the attention to the marketing attribution problem, there appears to be little attention shown to the problem of Internal Site Attribution.
How do most tools address this? The classic “Multiple Whole Attribution” approach, which sucks. They simply give total basket/sales credit to every page which was in the session leading to the sale. No weighting, no adjustment, and when added all up, it sums to huge multiples of the actual money generated due to double (and quadruple and quintuple etc.) counting.
How might we solve this? One way is to simply take all the techniques tried for marketing attribution and apply them to your internal site experience (see lists above). Categorizing your pages helps. So, you could say that a user is exposed to the home page, some product category pages, a search results page or two, some product detail pages, and then some cross-sells via the cart on the way out. Just like a user is exposed to display and search ads, you can try to tease out the interactive impact of these various impressions.
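Even the simplest of those techniques beats Multiple Whole Attribution. As a sketch, here is a linear (“even split”) version applied to on-site page types: each order’s value is divided evenly across the page types seen in that session, so the credits sum to actual revenue instead of a multiple of it. Sessions and order values are made up for illustration:

```python
from collections import defaultdict

# Linear internal-site attribution sketch: instead of giving every page type
# full credit for a sale (Multiple Whole Attribution), split each order's
# value evenly across the page types seen in that session. Data is made up.
sessions = [
    (["home", "search", "product", "cart"], 40.0),  # (page types seen, order value)
    (["home", "product", "cart"], 30.0),
]

credit = defaultdict(float)
for pages, value in sessions:
    unique = set(pages)
    for page in unique:
        credit[page] += value / len(unique)

# Credits now sum to the actual revenue (70.0), not a multiple of it.
total = sum(credit.values())
```

It’s crude (it ignores order and interaction effects), but unlike whole attribution it at least conserves the money, which makes “where should I invest?” comparisons meaningful.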
I call out to the various web analytics companies working so hard on the external marketing attribution problem: lots of competition in that space; lots of vacuum in internal site attribution. Marketing, esp. search marketing, is indeed important. After all, you spend money on that stuff, so you need to see its ROI. But if you are in e-commerce, I’d say you spend a pretty good amount of time and money on the site itself, including content acquisition and editing, and product merchandising and management. Shouldn’t you get some sense of the ROI for this? Should you invest in better on-site search, or simply lower your prices? Does that cool Flash configurator help, or is it really the combination of users who use it AND visit your support forums?
Web analytic guys, time to help clear this one up.
And yes, for those following, this is indeed part of my What Web Analytics is Missing complaints, bridging “Understand my Site” and “Understand my Business”. This is an area ripe for the picking, one that any site manager who has to “defend the site” will be ecstatic to see solved. If you are looking to differentiate your analytic tool, this would be a good way to do it.
* * *
It’s old news to some, but new to others. Coremetrics licensed Asterdata’s high speed analytic processing database systems a few months ago, and I was lucky enough to see some coming attractions based on the tech changes.
I am not able to share details of what I saw, but I can say that having Asterdata on the back end is really starting to open up possibilities for them. Like many of these systems, you stop thinking in terms of what is possible given the constraints of the database, and instead say “what if I just open up the flexibility to the user, and assume the database can scale up to meet it?”.
Folks who come from ROLAP and MOLAP backgrounds on the big 3 (Oracle, MS’s SQL Server, IBM’s DB2) all seem stuck in a mindset of “what queries can we handle given that we need tons of indexes, temp space, and denormalized fact tables?”. Asterdata, Greenplum, Netezza, etc. all change this mindset into “just write the SQL and we’ll make the query work”. (Yes, it’s not your eyes: all 3 of these sites look almost identical). The rise of parallelization and columnar data stores, and the recent addition of map/reduce frameworks and cloud capability into these systems, can provide massive speedups for ongoing flexible reporting and, more importantly, the ability to drive a wide variety of ad-hoc analytic queries at speed.
What was Coremetrics using before? Well, I can point you to this Coremetrics press release from 2000, where they licensed technology from EMC, Oracle, and Sun Microsystems, and one could assume that some of that tech has stayed around all these years, upgraded faithfully over time, just like at every other enterprise.
If you are interested in keeping up with this new world of analytically enhanced databases, the Monash Research DBMS2 site is, without question, the best source for information about these companies. Every post is full of interesting database goodies, technical enough to go below the marketing, but business savvy enough to understand what market needs each company is meeting and missing. Highly recommended.
As Coremetrics allows me to speak publicly about what I am seeing, I’ll point out some of what I like and some of what is still missing. My hope for them is that they manage to embrace the flexibility this new platform offers instead of staying constrained to point fixes on current capability. What I’ve seen so far is very promising… but only when it’s in our hands will we know if it truly opens new doors for us.
* * *
I think I was way behind on the news, but I’m pleased to give congrats to Stéphane Hamel on the acquisition of his wonderful WASP tool by iPerceptions, announced on October 14, 2009.
I continue to recommend the tool, and I’m very pleased for Stéphane Hamel. Now, even more reason to give it a shot, as we watch what iPerceptions does with it.
* * *
One of the funnest parts of any web analytics role is instrumentation: the tagging of the various parts of the site (Whee.). While what I mention below may not have happened to me, every one of them has happened to someone I’ve worked with:
Yes, QA can catch some of this, but with new pages and new capabilities (AJAX pages, iPhone apps, widgets and in-page apps, etc.) and a gazillion new tags (ad networks, ad validators, 3rd party trackers like ComScore, social tracking, buzz tracking, etc.) coming to the market, it’s harder and harder to keep track of it all.
There are a couple of ways people are attacking it. One is the “piggybacking” approach, where one of your tags is “1st call” and it cascades the call down to other tags. So, your page really has only that one tag, and you plop the other tags on a management page at that vendor’s site. Not bad, but each vendor likes to say “Oh, we can have other tags piggyback on us, but our tag has to be first call”. This, of course, is a Prisoner’s Dilemma, and so gets us nowhere.
Another approach is third parties which try to help out with the problem. Maxamine, now part of Accenture, is one company which can help you validate, organize, and manage your instrumentation and tagging. On the other end of the “company size” spectrum, smaller players like TagMan act as neutral tag aggregators, letting you load all your tags with them and then controlling which fire when. And tools like WASP can help you go through your site to verify that the tags are at least present; your vendor may also have similar tools.
But I wanted to point your attention to an interesting new idea, one that the amazing John Graham-Cumming is working on. If his name rings a bell, it’s because he wrote one of the early and best antispam filters, POPFile, which really leveraged Bayesian approaches to spamfighting. So, anything he chooses to spend time on is probably worth looking at.
One of his latest projects is working with the JSHub open source tag consolidator approach. The site is fine, but his blog explains it much better: What is JSHub?. Basically, since so much of the tagging experience is the same (use JS to create an image call with data in the query string), he proposes consolidating all that duplicative stuff and using a standard approach to defining what data you need. After all, the data is somewhat consistent from tag to tag; it’s what each vendor can do with it which is their real story.
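The duplicated part he’s describing is genuinely small. Here is a language-neutral sketch of that common core: page data gets URL-encoded into the query string of a 1×1 image (“beacon”) request. The field names and endpoint below are hypothetical, not JSHub’s actual schema or any vendor’s real API:

```python
from urllib.parse import urlencode

# Sketch of the common core of a tracking tag: page data encoded into the
# query string of an image ("beacon") request. The endpoint and field names
# are hypothetical, purely for illustration.
def beacon_url(endpoint, data):
    """Build the pixel-request URL a tag would fire."""
    return endpoint + "?" + urlencode(data)

data = {
    "page": "/products/widget",
    "referrer": "https://www.example.com/",
    "visitor_id": "abc123",
}
url = beacon_url("https://collect.example-vendor.com/pixel.gif", data)
# A consolidator collects this payload once and hands it to each vendor's
# endpoint, instead of every tag on the page re-collecting the same data.
```

Every vendor’s tag is some variation on those few lines, which is exactly why consolidating the collection step and standardizing the payload makes sense.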
I look forward to more vendors joining up into this fully open approach to allow more tag consolidation. This will make it easier on both the sites and the users: Sites will have more control and management over the tag forests sprouting up, and users will have better experiences controlling what’s tracking them and having faster page load times.
Liam Clancy and Fiann O’Hagan have a good idea with JSHub, and I encourage all of us who have to deal with tags on sites to take a look at it. It won’t solve everything, sure, but it’s a good step in the right direction.
* * *