Deprecated: Function set_magic_quotes_runtime() is deprecated in /home/mwexler/public_html/tp/textpattern/lib/txplib_db.php on line 14
The Net Takeaway: Full Outer Joins in SPSS


Danny Flamberg's Blog
Danny has been marketing for a while, and his articles and work reflect great understanding of data driven marketing.

Eric Peterson the Demystifier
Eric gets metrics, analytics, interactive, and the real world. His advice is worth taking...

Geeking with Greg
Greg Linden created Amazon's recommendation system, so imagine what can write about...

Ned Batchelder's Blog
Ned just finds and writes interesting things. I don't know how he does it.

R at LoyaltyMatrix
Jim Porzak tells of his real-life use of R for marketing analysis.






Full Outer Joins in SPSS · 01/31/2008 12:27 PM, Analysis

In SQL, there is the concept of linking tables by shared keys. For the most part, think the classic case of a customer ID linking customer address to purchase data. In SQL, it’s very easy to create these links as you select, and even though they can have unexpected side effects (either expanding or filtering what rows you get back), it’s one of the foundations of “Relational” databases and a very powerful tool.

I’m going to talk about what these joins are (using Oracle and SQL for examples), then show how you get this same helpful effect in SPSS (way at the bottom). If you know what these are, or don’t have patience, jump down there… but you’ll miss links to Oracle v8 and v9+ tricks, as well as a link to how to do full outer joins in MS Access, so skip at your own risk! Ok, back to our main story.

As part of the “Relational Algebra” (or “Relational Calculus”, depending on how advanced you are), there are different types of joins. The basic one is key to key, keeping only the matches. This is called an “equijoin”.

(I like old style SQL, so I don’t use ANSI joins. Sorry)

Select t1.customer, t2.address
from customerlist t1, addresslist t2
where t1.customerID=t2.customerID

This returns only customers who have given their address, leaving out customers with no address.

Now, let’s say you want all the customers, and you don’t mind if there are Nulls in the address column for those who haven’t given them. This is a Left Outer Join (left is arbitrary, think of the long customer list as the “left side”). Oracle v8 had a weird way of doing this; v9 made it (sort of) cleaner:

v8: The (+) means keep all the IDs.

Select t1.customer, t2.address
from customerlist t1, addresslist t2
where t1.customerID=t2.customerID (+) 

v9: ANSI new syntax (doesn’t like joins in the Where clause)

Select t1.customer, t2.address
from customerlist t1 left outer join addresslist t2
on t1.customerID=t2.customerID

Now, what if you wanted a list of customers who had addresses, customers without addresses, and addresses without customers? Why this last? You may want to do an append via some 3rd party to find those names and have a completed list, so you need everything from both tables, but joined where appropriate.

This is called a Full Outer Join. In Oracle up to v8, you couldn’t do this directly and had to use syntax trickery. Full Outer Joins in Oracle9i shows both the workarounds in v8, and the new syntax in v9 and above.

The workaround involves two outer join queries combined by a UNION operator. Basically, you say “give me customers with and without addresses (match ‘em where I’ve got ‘em) on one hand, give me addresses with and without customers (match ‘em where I’ve got ‘em) on the other, and then plop them on top of each other to make one long list”. The extra trick is the UNION instead of the UNION ALL. Why? Well, your match ‘ems will be in both joins, so you will have dupes unless you let SQL take care of them with the UNION.

MS Access also needs a simulated full outer join which is described her as two UNION ALLs (could probably be implemented as 1 union with some editing).

But how about SPSS? Unlike SAS, SPSS has no “Proc SQL”, so there is no direct way to write the query. Instead, we have to fake it with Match Files. BTW, this is a topic worth a Knowledgebase article on, but the SPSS Knowledge Base has only one (1!) article on “outer join”, which doesn’t really even address the issue: Can SPSS perform an anti-join merge of 2 or more files as in Clementine? Searching on Merge is more helpful, but only slightly. Here is one on “Cartesian Join”, which is really just every row to every row (not really a join at all, but technically correct): Cartesian File Merge

Here’s how you do it.

Here’s an example problem: I have a list of IDs, and I want to know who did something on 3 different months, say visited a web site. There are 2 ways this problem can be presented:

1) Open each file, add a flag column (MayFlag, JuneFlag, whatever). If this is based on the filename, you can even script it. Sort it before saving by the ID
2) Then, take your master list, Match Files one after another.

or, the harder one…

At the end, we want a list of IDs who visited at least once (ATLO) and a flag for each month they visited. In this version, we won’t know anything about those who didn’t visit. Because I want all the IDs, I have to do full outer joins for each new file or month I include.

This syntax will do the first 2 files for us. Just keep repeating for each additional file.

* Create fake data.
DATA LIST LIST /id(A8) JuneFlag(F1).
One 1
Five 1
Six 1
Seven 1


SAVE OUTFILE='C:\Junk\test2.sav'   /COMPRESSED.

DATA LIST LIST /id(A8) MayFlag(F1).
One 1
Two 1
Three 1
Four 1

SAVE OUTFILE='C:\Junk\test1.sav'   /COMPRESSED.

* Ok, just like we talked about.
* Do 2 left outer joins, then the "Union" which is really
*    just another match files and some dedupe syntax.
* BTW, this assumes that your visit files are already aggregated so
* each ID is once per file; if you have more than one, aggregate
* the files first before the joins.

* 1st Left Outer Join.
* I have "test1" open already, so it's my "Left" in the Left outer Join.
* Test2, my key file, will "lose" some its records in this join.
 /by id.

SAVE OUTFILE='C:\Junk\step1.sav'

* 2nd Left Outer Join: Now, take 2nd table, and make it the left side, and repeat...
GET File='C:\Junk\test2.sav'.

 /by id.

SAVE OUTFILE='C:\Junk\step2.sav'

* But wait... ID One is in both sets!  Horrors!.
* Have to "Union" them.
* Merge the two files (add cases), and then dedupe.
* I still have "step2.sav" open, so I'll just add step1 back in...



* not needed, but always good to save your work!.
SAVE OUTFILE='C:\Junk\step3.sav'   /COMPRESSED.

* Identify Duplicate Cases; standard SPSS generated syntax.
  /BY id
VARIABLE LABELS  PrimaryLast 'Indicator of each last matching case as Primary'.
VALUE LABELS  PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.

Select If PrimaryLast=1.


* Done.

Wasn’t that fun?

Sure would be nice if SPSS added a SQL type access to our files to allow us to treat it like a database. Its easy to dislike SQL for many things, but when it works, its like magic. SPSS is making us jump through lots of hoops, including multiple time-consuming sorts, to get the same result. I hope that, as SPSS adds more multi-threading, they include a multi-threaded SORT in the future.

BTW, to turn off logging?

SET Printback=Off TFit=Both TLook=None.

* * *


  Textile Help
Please note that your email will be obfuscated via entities, so its ok to put a real one if you feel like it...

powered by Textpattern 4.0.4 (r1956)