The Net Takeaway: Let down by databases...

Let down by databases... · 10/16/2008 12:02 PM, Tech Analysis

I’m mad… but mostly at myself. Here’s the story.

I was given a translation table linking 2 types of userIDs for some data we collected. Each system anonymized differently, so we needed the lookup to join the data. We’ll call this the Lookup table; at 13 million rows, it’s not tiny for your average analyst.

We also have various slices or segments of users we want to examine. These slices range from a 120k list… to a 19 million line list.

Varying table sizes, need for a join… sounds like a database problem to me! Just to set the stage, the IDs in question for both systems are 32-character alphanumerics.

I’ve already written the query in my head:


select ID2 
from lookup lkup, segmentlist seglst
where seglst.dID=lkup.dID 

(Yes, I still write my joins in the where clause; it’s much more readable to me than the “inner join on” syntax of modern SQL.)

Should be easy.

12 days later.

We’ve gone from MS Access (whose idea was that?) to SQLite to MySQL. We’ve played with indexes and configuration files, tweaking memory and caches all over the place. We’ve turned off everything else on the box so that this database is the only thing running, and even ran just one query at a time. Some of these queries are running 3 days at a time! People say that joining on varchars/text is slower than joining on numbers, but come on…

We’re now looking into Oracle XE and the MS SQL Server Developer edition: both are limited to 1 GB of RAM, 1 processor, and 4 GB of data, but they may be more powerful than MySQL.

Why? Well, it seems that none of these databases allow you to override the “avoid indexes when reading lots of rows” rule. It varies from DB to DB, but the rule of thumb is that if you have to read more than 10% of a table, you should skip the index and row-scan.

Now, to be fair, there is some good research around why this is so (see, for example, the PDF In Defense of Full-Table Scans by Jeff Maresh). But they all have a flaw: they don’t know how many rows really need to be read, so they guess… and when they guess wrong, the user suffers. And they don’t always take the hint.

In my case, I know that we should pick the smaller table, walk it, and use the index… because there (sadly for me) won’t be lots of matches. Because I know this, the index makes more sense: we don’t need to read every row looking for the match; we can just pop into the index, see if the row is there (it won’t be), and move on.

But no matter what type of index I tried (multiple single-column indexes, combined or clustered indexes, indexes using a foreign key), the databases still wanted to do full table scans on both. This basically becomes a nested loop join, which in effect means “read one row from the first table, read the entire second table looking for the match, read the next row from the first table, and so on.” While this might make sense if there are lots of overlaps… it didn’t fit what I needed.
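
To make the cost difference concrete, here’s a rough Python sketch of the two strategies: the nested loop the databases kept falling back to, and the “walk one table, probe the other” plan I wanted (with a dictionary standing in for the index). The file names and the tab-delimited layout are assumptions for illustration; this is not the script I actually ran.

# Two ways to do the same join. Assumes one dID per line in the segment file
# and tab-delimited "dID<TAB>ID2" rows in the lookup file (layout is illustrative).

def nested_loop_join(segment_path, lookup_path):
    # What the databases kept doing: for every segment row, rescan the whole lookup.
    matches = []
    with open(segment_path) as seg:
        for seg_line in seg:
            seg_id = seg_line.strip()
            with open(lookup_path) as lkup:      # full re-read per segment row
                for line in lkup:
                    d_id, id2 = line.rstrip("\n").split("\t")
                    if d_id == seg_id:
                        matches.append(id2)
                        break
    return matches

def probe_join(segment_path, lookup_path):
    # What I wanted: build the "index" once, then one cheap probe per segment row.
    # (Needs enough RAM to hold the lookup; the chunked variant below works around that.)
    lookup = {}
    with open(lookup_path) as lkup:              # one pass over the lookup
        for line in lkup:
            d_id, id2 = line.rstrip("\n").split("\t")
            lookup[d_id] = id2
    matches = []
    with open(segment_path) as seg:
        for seg_line in seg:
            id2 = lookup.get(seg_line.strip())   # probe; usually a miss in my data
            if id2 is not None:
                matches.append(id2)
    return matches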

So, struggling over these almost 2 weeks, limited in my hardware choices, I rethought the problem.

Grabbing my copy of Beginning Python by the awesome Magnus Lie Hetland (highly, highly recommended), I wrote my own “read and merge” program.

The one that took 3 days, merging the 19 million row segment against the 13 million row lookup? 15 minutes.

I did all the files in under 2 hours, including QA.

I basically read the lookup table into a Dictionary, and the “segment” table into a List. On my Windows laptop, it appears that a List can hold even 19 million items… but the Dictionary gave MemoryErrors after only 5 million or so. I wound up writing a version that read the entire segment into memory, but then chunked through the lookup table in 2 million row chunks. Even with all that junk, it still flew.
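
For the curious, here’s a minimal sketch of that chunked merge. The file names, the tab delimiter, and the exact chunk size are assumptions for illustration; the real script also did some QA along the way.

CHUNK_SIZE = 2000000   # lookup rows to hold in the dictionary at a time

def write_matches(segment_ids, chunk, out):
    # Walk the whole segment and probe the current slice of the lookup table.
    for seg_id in segment_ids:
        id2 = chunk.get(seg_id)
        if id2 is not None:
            out.write(id2 + "\n")

def merge(segment_path, lookup_path, out_path):
    # The whole segment sits in a plain list (even 19 million IDs fit),
    # while the lookup dictionary is kept to CHUNK_SIZE entries at a time.
    with open(segment_path) as f:
        segment_ids = [line.strip() for line in f]
    with open(lookup_path) as lkup, open(out_path, "w") as out:
        chunk = {}
        for line in lkup:
            d_id, id2 = line.rstrip("\n").split("\t")
            chunk[d_id] = id2
            if len(chunk) >= CHUNK_SIZE:
                write_matches(segment_ids, chunk, out)
                chunk = {}
        if chunk:
            write_matches(segment_ids, chunk, out)   # last partial chunk

# Example: merge("segment_19m.txt", "lookup_13m.tsv", "matched_id2.txt")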

Lessons learned?

1) Unless you have a reasonably powered box with a DB configured for these kinds of data-warehouse queries, you may find that stock/off-the-shelf databases struggle under these loads. Lots of page churning and disk I/O will drag your query out.

2) Indexes aren’t everything… but they help. Oracle has bitmap indexes, which allow nice index merging (when you can force a query to use them), and I look forward to that dribbling down to MySQL (and other open source DBs).

3) Sometimes, a database is the obvious answer. It’s also sometimes the wrong answer (and I hate to admit that).

What about Hadoop, Bigtable, etc.? Supposedly these are the answers to big data problems… ’cept that they are still programmer tools, requiring Java expertise. Also, there’s still no SQL on these things, though Facebook’s Hive (PDF) may be the most helpful of the bunch once it gets more accessible (well, either that or MS’s Dryad, which uses a SQL variant and will probably move to LINQ).

What about SAS or SPSS? I don’t have SAS on my box, so it’s out. And SPSS, even in version 17, can’t match the speed I’ve gotten from databases on complex joins in the past… so I skipped it.

There are also columnar store databases, which might have helped in this case, but few are open source and most require more hardware than I had available to me.

And yes, before you jump on me, I’m sure there are some config tricks I didn’t think about… but I didn’t think I should have to get so tricky. It was a 1-column, 19 million row table joining to a 2-column, 13 million row table.

To the analysts out there: learn a programming language. Pick either Ruby or Python (Ruby is more like Perl, so your Perl friends can help out; Python is becoming fully integrated into SPSS) and don’t be afraid to use it to munch files as a converter or even a simple joiner.
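
As a taste of the kind of file munching I mean, here’s a throwaway converter: strip a tab-delimited export down to the one ID column you actually need. The file names and column position are made up for illustration.

# Throwaway converter: keep only the first (ID) column of a tab-delimited file.
with open("raw_export.tsv") as src, open("ids_only.txt", "w") as dst:
    for line in src:
        dst.write(line.rstrip("\n").split("\t")[0] + "\n")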

Also, one trick suggested to me: skip the import stage if you are using Oracle, and use their External Table feature. It wouldn’t have sped up the joins, but it lets you skip the imports, which can be a nightmare in their own right.

So, I’m still mad that the databases let me down. This should have been an easy task. But it’s clear that the database world is so busy adding overhead by trying to be all things to all people (We handle transactions! We handle complex multi-table joins in views! We handle BLOBs and full-text indexing!) that the simple stuff gets washed away. And that’s really sad.

Also, I should probably request a beefier box sometime soon. So, I’m mad at the bad economy for making it hard to allocate HW effectively.

But I’m mostly mad at myself. For making everything look like a nail to my database hammer, and for trying to solve the general situation (merge files!) with the DB instead of just writing a one-off point solution to solve the problem at hand (merge these files!). And for wasting 12 days of my (and others’) time.

Lesson learned. I hope.

(PS: Useful tools:
Notepad++ (PortableApps Edition)
PortableDB (Nice portable MySQL)
Python (though the Activestate one is not bad as well)
HeidiSQL (PortableApps Edition, MySQL query tool)
SQLite
SQLiteMan (query tool for SQLite databases, which are also used by Firefox and Chrome for various storage things)
AstroGrep (Fantastic multi-file searcher and replacer, for fixing quotes/tabs/etc. Free, open source)
)
