When you have a dataframe, you reference specific columns like this:
dataframe$column with the “$” sign.
If you don’t want to keep typing it, try
attach(dataframe) = assume “dataframe” precedes each variable name
detach(dataframe) = turns off the attach, nice way to end things.
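A minimal sketch of the three ways to reach a column. The data frame `cars_df` and its values are made up for illustration, and with() is an addition of mine here (it is base R, but not something mentioned above).

```r
# Hypothetical data frame, values invented for illustration
cars_df <- data.frame(speed = c(10, 20, 30), dist = c(5, 18, 40))

# 1. Explicit reference with the $ sign
mean_dollar <- mean(cars_df$speed)

# 2. attach() makes the columns visible by bare name until detach()
attach(cars_df)
mean_attached <- mean(speed)
detach(cars_df)

# 3. with() does the same thing, but only for a single expression
mean_with <- with(cars_df, mean(speed))
```

All three give the same answer; with() saves you from forgetting the detach().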
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-frames has some more info on how to manipulate dataframes.
BTW, use ; (semicolon) to put multiple commands on one line.
rowcount = length()... of a single variable. If you have a dataframe, you need to use dim() (lowercase; R is case-sensitive) or nrow(). Remember, a dataframe is basically a list of columns, so length() of a dataframe is just the number of columns in your dataframe… each of which is a vector. It's annoying, I know.
So, each variable is a vector or array (if you took Pascal, Basic, Java, or C in college, you remember what an array is). Put them together and you have a dataframe. Now, if that dataframe is all numbers, then in effect, it's just a huge 2-d array. Since it's stored as a list, it's not exactly a matrix, but can become one with as.matrix(). Keeping this in mind makes many tricks much easier to understand.
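Here is a small sketch of the length() versus dim() difference, using a made-up all-numeric data frame (the name `df` and its columns are hypothetical).

```r
# Hypothetical all-numeric data frame: 4 rows, 3 columns
df <- data.frame(a = 1:4, b = c(2.5, 3.5, 4.5, 5.5), z = c(0, 1, 0, 1))

len_col <- length(df$a)   # number of elements in one column (the row count)
len_df  <- length(df)     # number of columns, because a data frame is a list
shape   <- dim(df)        # rows, then columns
rows    <- nrow(df)       # just the rows

m <- as.matrix(df)        # all-numeric data frame becomes a numeric matrix
```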
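The usual summary stats include mean(), var(), sd(), median(), and fivenum(). (summary(), for a numeric, gives nearly the same as fivenum(), plus the mean; its quartiles can differ slightly from fivenum()'s hinges.)
hist() for histogram, and if you have a numeric, simple.hist.and.boxplot() (from the UsingR package) gives both.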
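A quick sketch of the basic stats on a made-up vector, mainly to show what fivenum() and summary() each hand back. The vector `x` is hypothetical.

```r
# Hypothetical numeric vector
x <- c(2, 4, 4, 5, 7, 9, 10)

m  <- mean(x)
v  <- var(x)
s  <- sd(x)        # square root of var(x)
md <- median(x)

five <- fivenum(x) # min, lower hinge, median, upper hinge, max
summ <- summary(x) # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```

fivenum() returns five numbers and summary() returns six, the extra one being the mean.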
Often, barplot() will give you a strange result when you hand it raw data. It's commonly used like this:
barplot(table(x)) which graphs the summarized data, and is usually what you wanted anyway. (Note that this demonstrates the “object” nature of things: you are graphing the “results” of the table object.)
Restructure and Merge
Ah, the ever-popular transpose. SAS still wins in the “aggregate lots of rows and flip them so events are summarized rows, one per person” world. R has stack() and unstack(), but reshape() is the more powerful (and complex) option. It basically combines pieces from tapply, by, aggregate, xtabs, apply and summarise. Lots more about reshape() here. Reshape can also make tables, so consider it instead of xtabs().
merge() combines frames, and can be used to make left, right, or full outer joins. cbind() just concatenates 2 frames if they are in the same order; I would stick with merge since you should have keys on both files anyway.
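A minimal sketch of the join types with merge(), assuming two made-up frames (`people` and `orders`, both hypothetical) that share an `id` key. The all, all.x and all.y arguments control which unmatched rows survive.

```r
# Hypothetical frames sharing the key column "id"
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders <- data.frame(id = c(2, 3, 4), total = c(20, 35, 50))

inner <- merge(people, orders, by = "id")                # only ids in both: 2, 3
left  <- merge(people, orders, by = "id", all.x = TRUE)  # every person: 1, 2, 3
right <- merge(people, orders, by = "id", all.y = TRUE)  # every order: 2, 3, 4
full  <- merge(people, orders, by = "id", all = TRUE)    # everything: 1, 2, 3, 4
```

Unmatched cells come back as NA, e.g. the `total` for id 1 in the left join.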
How about dropping a column? The only way I knew was:
NewDataFrame <- OldDataFrame[,c(1:7,9:23)] to drop column 8 (OldDataFrame[,-8] does the same thing more compactly). And from “R for SAS and SPSS users”, we learn that you can do:
mysubset$q3 <- mysubset$q4 <- NULL
which drops q3 and q4 from your data.
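A short sketch of three ways to drop columns, on a made-up data frame (`df` and the q1..q4 names are hypothetical): by negative position, by name, and by assigning NULL.

```r
# Hypothetical survey data frame
df <- data.frame(q1 = 1:3, q2 = 4:6, q3 = 7:9, q4 = 10:12)

# Negative index: everything except the 3rd column
by_position <- df[, -3]

# By name: keep the columns whose names are not in the drop list
by_name <- df[, !(names(df) %in% c("q3", "q4"))]

# Assigning NULL removes the column; the assignment can be chained
in_place <- df
in_place$q3 <- in_place$q4 <- NULL
```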
Indexing is also how I drop rows, or using something more clever like
earlydata <- data[data$year<1960,] You can use more than one variable à la X[Y>=A & Y<=B]. | (pipe) is an “or”, & (ampersand) is an “and”.
R experts call pulling chunks “extraction”. There is a function called subset(), and ?subset gives good info (including its select argument for picking columns). ?Extract is also recommended.
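A sketch of row extraction with logical conditions, plus the subset() equivalent. The data frame `dat` and its values are invented for illustration.

```r
# Hypothetical data frame
dat <- data.frame(year = c(1950, 1958, 1960, 1975), value = c(10, 12, 15, 30))

early  <- dat[dat$year < 1960, ]                      # one condition
middle <- dat[dat$year >= 1955 & dat$year <= 1960, ]  # & means "and"
either <- dat[dat$year < 1955 | dat$value > 20, ]     # | means "or"

# subset() lets you use bare column names; select picks the columns to keep
early2 <- subset(dat, year < 1960, select = c(year, value))
```

Note the trailing comma inside the brackets: the condition goes in the row slot, and the empty column slot means "all columns".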
http://pj.freefaculty.org/R/Rtips.html are the best damn tips in existence.
If you just type an analysis command, basic output is given on the screen… but remember, everything is an object. Therefore, many users assign the results of an analysis to an object, which allows cleaner printing, formatting, etc.
Also, the default output for many analyses is so bare-bones as to be useless. So, get into the habit of assigning the results of a run to a variablename, and then using summary(variablename) almost immediately afterwards (or just put on same line with semicolon (;) separator).
lm(distance~stretch,data=elasticband) prints some basic stuff, but
elastic.lm <- lm(distance~stretch,data=elasticband); summary(elastic.lm)
gives much more info.
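To make the "assign, then summarize" habit concrete, here is a sketch. The `elasticband` values below are illustrative stand-ins that I typed in so the block runs on its own; don't rely on them matching the real dataset.

```r
# Illustrative stand-in values, not guaranteed to match the real elasticband data
elasticband <- data.frame(stretch  = c(46, 54, 48, 50, 44, 42, 52),
                          distance = c(148, 182, 173, 166, 109, 141, 166))

elastic.lm <- lm(distance ~ stretch, data = elasticband)
print(elastic.lm)                    # bare print: the call and the coefficients only

fit_summary <- summary(elastic.lm)   # adds std. errors, t values, R-squared, etc.
coef_table  <- coef(fit_summary)     # coefficient table as a matrix
r2          <- fit_summary$r.squared # pieces can be pulled out by name
```

Because the summary is itself an object, you can grab just the bits you need instead of reading them off the screen.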
3 basic analysis commands to play with:
summary(dataframe) gives simple attributes of all the variables (min, max, median, etc.)
cor() gives correlation matrix
lm(distance~stretch,data=elasticband) gives a linear regression (lm=linear model)
y ~ x (duh, but implicit intercept)
y ~ x+1 (explicit intercept)
y ~ 0 + x (explicit deletion of intercept or intercept set to 0; use in distance calcs or chemical creation based on time analysis)
y ~ w*x (this one FORCES the single parameters w and x, as well as the interaction! it doesn’t look it, but this is the same as y ~ w + x + w:x)
Note the I() function, which basically says “don’t use this confusing regression syntax, use normal math”. So y ~ I(w+x) fits a single predictor equal to the sum of w and x, while y ~ w+x fits two separate predictors.
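One way to see what a formula really expands to is to look at the column names of its design matrix with model.matrix(). This is a sketch on made-up data (`d` and its values are hypothetical).

```r
# Hypothetical data
d <- data.frame(y = c(1, 3, 2, 5, 4, 6),
                w = c(1, 2, 3, 1, 2, 3),
                x = c(2, 1, 4, 3, 6, 5))

cols_default  <- colnames(model.matrix(y ~ x, d))        # implicit intercept
cols_explicit <- colnames(model.matrix(y ~ x + 1, d))    # same thing, spelled out
cols_noint    <- colnames(model.matrix(y ~ 0 + x, d))    # intercept removed
cols_star     <- colnames(model.matrix(y ~ w * x, d))    # w, x and the w:x interaction
cols_I        <- colnames(model.matrix(y ~ I(w + x), d)) # intercept plus ONE summed column
```

The w*x formula produces an intercept, w, x and w:x, whereas wrapping the arithmetic in I() yields a single predictor.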
When you want a conf interval for a regression coefficient, remember that it's
b ± t*SEb; you can use qt(0.975,n-2) to get the appropriate t score (for a 95% interval in a simple regression with n observations).
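Here is a sketch of that interval computed by hand and checked against R's built-in confint(). The data are invented, and the variable names are mine.

```r
# Made-up data for illustration only
d <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1))
fit <- lm(y ~ x, data = d)

est <- coef(summary(fit))        # matrix with Estimate and Std. Error columns
b   <- est["x", "Estimate"]      # slope
seb <- est["x", "Std. Error"]    # its standard error
n   <- nrow(d)

tcrit  <- qt(0.975, n - 2)       # two-sided 95%, n - 2 residual df
manual <- c(b - tcrit * seb, b + tcrit * seb)

builtin <- confint(fit, "x", level = 0.95)  # should agree with the manual version
```

The n-2 comes from estimating two parameters (intercept and slope); with more predictors the degrees of freedom drop accordingly.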
Categoricals: Crosstabs, Freqs, and other beasts
It is actually surprising how poor R’s table and crosstab functions are compared to the rest of the system.
Crosstabs come basically from table(). While it's not a full tab system, it has lots of options. As J H Maindonald points out on page 23 of his manual: WARNING: NAs are by default ignored. The action needed to get NAs tabulated under a separate NA category depends, annoyingly, on whether or not the vector is a factor. If the vector is not a factor, specify exclude=NULL. If the vector is a factor then it is necessary to generate a new factor that includes “NA” as a level:
x <- factor(x,exclude=NULL)
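A sketch of the NA behaviour on made-up vectors. The useNA argument shown at the end is an addition of mine (it exists in current versions of table()); the exact behaviour of exclude on factors has shifted between R versions, so treat this as illustrative.

```r
# Plain (non-factor) vector with a missing value
v <- c(1, 2, NA, 2)
t_default <- table(v)                  # NA silently dropped
t_vec     <- table(v, exclude = NULL)  # NA gets its own category

# Factor with a missing value: rebuild it so NA becomes a level
f  <- factor(c("a", "b", NA, "b"))
f2 <- factor(f, exclude = NULL)
t_fac <- table(f2)

# Alternative that works the same way for both cases
t_alt <- table(f, useNA = "ifany")
```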
table() gives frequencies; if you want proportions of the vector, try
Two variables reads like table(Car$Make, Car$Type).
Visually: barplot(table(mydata$Q9, mydata$Q7), legend.text=T)
summary() = a generic function that summarizes the results of various model fitting functions; when run on a vector, gives either freq count for factors or means/etc. for numerics.
table() = table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels. Doesn’t have lots of options. Can do some labeling, and can run 2d chi-square
xtabs() = more advanced cross tab, uses formula notation.
ftable() = “flat” contingency tables.
You can also do stuff like prop.table() to see proportions, and addmargins() to include marginals. tapply() can be used to fake some of this manually, but it's a lot of work.
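A sketch pulling the base-R pieces together on a made-up two-variable data set (`survey`, `make` and `type` are hypothetical names).

```r
# Hypothetical data
survey <- data.frame(make = c("Ford", "Ford", "Honda", "Honda", "Honda", "Kia"),
                     type = c("SUV", "Sedan", "Sedan", "Sedan", "SUV", "SUV"))

tab <- table(survey$make, survey$type)  # two-way counts

p_all <- prop.table(tab)      # each cell as a share of the grand total
p_row <- prop.table(tab, 1)   # each cell as a share of its row

with_totals <- addmargins(tab)  # adds "Sum" row and column

xt   <- xtabs(~ make + type, data = survey)  # same counts via formula notation
flat <- ftable(xt)                           # "flat" layout, handy for 3+ variables
```

Pass 2 instead of 1 to prop.table() for column proportions.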
There is also the CrossTable() function in the gregmisc package (now gmodels, I think; Greg split them out, but you also need gtools and gdata for it to work).
There is also crosstab() function in the ecodist package. xtabs() in the stats package is a simple solution, but package reshape is much more flexible (and complex and powerful).
ctab() is in the catspec package by John Hendrickx. This will do multi-way tables with summary row/col statistics. http://www.xs4all.nl/%7Ejhckx/R/ctab.html is the official doc page.
There is also the CrossTable() function in the ‘gmodels’ package on CRAN, though this will only do one way and two way tables, with summary row/col statistics.
For simple count with multi-way output, try ftable(). summary() gives range, mean, quantile data for continuous variables, and you can potentially format this to your liking.
Note that contrary to postings on r-help, help.search("crosstabulation") and help.search("crosstab") do nothing. help.search("cross") gives some help… but not much.
Simulating the SPSS CROSSTABS procedure: (adaptation of Gregory R. Warnes’ CrossTable() function in the gregmisc package)
This custom function is really, really nice. To add it to R permanently, you have to edit files like the system startup file $R_HOME/library/base/R/Rprofile or the .Rprofile in the current directory or in the user’s home directory.
Another one worth looking at is http://www.stanford.edu/~agallant/files/rcode/tab.r. Thanks, rseek.org!
There is also some info in Kickstarting R
with a special function http://cran.r-project.org/doc/contrib/Lemon-kickstart/xtab.R
Interestingly enough, I posted about a multiple response question on the R-help group and got a very clever response.
Let's say you have data like this:
id att1 att2 att3
1 1 1 0
2 1 0 0
3 0 1 1
4 1 1 1
People have checked some of 3 attributes. I want to know which hang together. Look at this clever approach using crossprod() (a matrix multiplication function: crossprod(X) computes t(X) %*% X) to get the co-occurrence numbers, drop the diagonals, and then convert to percentages:
ratings <- data.frame(id = c(1,2,3,4), att1 = c(1,1,0,1), att2 = c(1,0,1,1), att3 = c(0,0,1,1)) # same values as the table above
tab <- crossprod(as.matrix(ratings[,-1])) # co-occurrence counts for each pair of attributes
tab <- tab - diag(diag(tab)) # drop the diagonal
tab # pretty, isn't it?
tab / nrow(ratings) # divide by the number of people to get proportions (x100 for percentages)
Consider str() over summary() when examining a data frame?
Task Views show combinations of packages to solve certain areas of statistical problems (AI, Econometrics, etc.)
* * *