OTHER PLACES OF INTEREST
Danny Flamberg's Blog
Danny has been marketing for a while, and his articles and work reflect great understanding of data driven marketing.
Eric Peterson the Demystifier
Eric gets metrics, analytics, interactive, and the real world. His advice is worth taking...
Geeking with Greg
Greg Linden created Amazon's recommendation system, so imagine what he can write about...
Ned Batchelder's Blog
Ned just finds and writes interesting things. I don't know how he does it.
R at LoyaltyMatrix
Jim Porzak tells of his real-life use of R for marketing analysis.
HOW DID YOU GET HERE?
(If you’ve found this via searching, you may enjoy the entire series of R articles, found via the navigation link on the right, R Statistical System. These are all in “somewhat random notes” style, but they’ve been helpful to me in the past. Feel free to ping me with updates or suggestions.)
Ok, why all this? I wanted to pull together notes for folks who are pretty savvy and need to understand the quirks of R. Yes, there are lots of docs out there (I link to some of the better ones below) and yes, these notes aren’t always well organized… but they try to focus on the “getting the job done” parts, not the stats-tutorial or stats-programmer approach most other docs take.
Oh, the “official” docs? They stink for beginners or folks with little time on their hands to trudge through them. http://cran.r-project.org/manuals.html are the “official” ones, and http://cran.r-project.org/other-docs.html are the contributed ones by users trying to make it better (but still, tough slogging ahead).
As you get more into R, these will all become more useful, but don’t worry if you get stuck on these in the early days: they are all pretty technical, both programming-wise and computational-statistics wise. But if you want to see how to work R like the masters, this is a good place to dig.
None of them are great, but these are the best of what I’ve seen in my reading. (I will expand and review them all later)
Using R for Data Analysis and Graphics with Introduction, Code and Commentary by J H Maindonald Updated Version which fed into much of this guide, credit to Maindonald!
Verzani-SimpleR.pdf has some very nice graphs, and also explains how to read the output, which is very helpful.
(There is also a Using R PDF which is the older Maindonald book, linked for historical reasons)
Quicker but still pretty good: Notes on the use of R for psychology experiments and questionnaires by Jonathan Baron and Yuelin Li
There are many contributed docs to R, and you can click here to see docs sorted by most recently updated. CRAN Other Documentation shows some commentary around some of the more general tutorials and docs, some of which I reference above.
Another place to dig is in the R Mailing lists but as I post elsewhere, folks are not nice to newbies. Be prepared to slog through some really rude responses by people who don’t remember what it was like when they were just starting out. Also, the mantra to remember: It’s Open Source, get used to it.
The Most Important Things to Know about R
I will assume you've used other programs such as SPSS, Mystat, or even Excel. I also assume Windows. Note that R will often top out around 100k rows and 20 variables because it stores everything in memory, depending on the type of data you have. You may do better with Linux than with Windows; Linux has a better memory model for the big stuff. Yes, Windows will top out at 2gb until we get to 64-bit, but if you are adventurous, the R for Windows FAQ has some hideously complex suggestions for potentially working around this issue (there is more about this near the bottom of the post).
There are 4 things to remember in working in R:
c("item 1", "item 2") means "concatenate the two objects named item 1 and item 2 into one collection". Also, "list" means something different to R than "vector" and "matrix" and "dataframe" etc., ad nauseam. But beyond the "specific meaning" aspects, which you can deal with later, you get the idea.
There aren't many good ones. Some things which help me keep my sanity include the pretty good start of JGR, which still has bugs but includes a "data editor" similar to a spreadsheet, plus color syntax highlighting and command tooltips to help your syntax. In a later entry entitled R GUIs, I review a few guis, web-front-ends, and editors.
BTW, in JGR: To run part of your syntax file when open in the handy editor, select it and
<Cmd><Enter> on the Mac, or
<Ctrl><Enter> on PC. No docs on this, but that's how it's done. Also, to get to the data editor, use the Object Explorer and then double-click on your data frame... voila. The Edit command doesn't work as of this posting. There are more tips for JGR in the R GUIs entry here.
There is online help, but it's hard.
help() is your starting place.
help(plot) gives help for the plot command.
help.search("plot") searches the help pages as you'd expect, and
apropos("plot") lists all functions with the word plot in their name. Note that some of this help is aimed at programmers, not those of us who need to know how to get something done.
help.start() will pop up a more menu driven approach, but still not all that helpful. Basically, help.start() starts the browser version of the help files.
example(command1) prints an example of the use of the command. This is especially useful for graphics commands. Try, for example,
example(contour), example(dotchart), and example(image).
Basically, everything in R is an object. Assume an object is either a number or word, a collection of numbers/words, or the results of a procedure. BTW: identifiers are Case Sensitive! Comments are prefixed by the
# character, a line at a time (meaning you need the # on each line).
Most stats packages lay the data in a simple fashion: column heads are variable names, and each row is a new entry. R turns this on its head a bit. You start off with columns of numbers (i.e., each variable on its own) and you merge them into a “data frame”, akin to SAS’s dataset or SPSS’s datafile.
Yes, this is annoying. It’s open-source; get used to it.
q() is the quit command. Why not just quit? Ya got me. Altogether now: It’s open source; get used to it. Anyway…
q("yes") saves everything.
If you just want to “clear the decks”, consider
rm(list=ls()) which deletes all objects.
Some tips on reading and manipulating data are in this PDF.
Getting data into the system…
myDataFrame <- read.table("c:/austpop.txt", header=T)
As you'd expect, you can play with which delimiter is used (sep=), and here, the first line holds the headers and is read in as the column names. (There's always More Than One Way To Do It, as in Perl. For example, there is also
read.csv(), etc. Check your manuals!)
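To make that concrete, here's a tiny sketch with made-up data: it writes a small comma-separated file (the file and column names are invented for the demo), then reads it back with read.table().

```r
# Write a little CSV to a temp file, then read it back.
tf <- tempfile(fileext = ".csv")
writeLines(c("name,pop", "NSW,6274", "Vic,4645"), tf)

myDataFrame <- read.table(tf, header = TRUE, sep = ",")
nrow(myDataFrame)    # 2 rows of data (the header line became the column names)
names(myDataFrame)   # "name" "pop"
```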
Not all of these wind up in a data frame format; it's pretty specific. If you aren't sure, just force it:
mydf<-as.data.frame(mydata) ## data as dataframe
(Yes, a semicolon can separate multiple commands on one line.)
Typing in data:
Pretty similar. In this case, we have a "c" function which combines numbers into a "column" (really, a vector).
t1 <- c(1,2,3,4,5)
There is a mini “spreadsheet” for editing and adding data if you wish to raw-type:
xnew <- edit(data.frame())
If you have a couple of these, then you can combine them manually into a data frame with the data.frame function:
elasticband <- data.frame(strch = c(46,54,48,50,44),
dist = c(148,182,173,166,109))
If you want to edit your stuff in a very (very!) basic data editor, the default R install has one:
elasticband <- edit(elasticband)
NOTE: You have to assign the result of the edit back to the object, or you lose all your edits. Bad thing.
(BTW: Just typing the name of the object at the prompt, like
elasticband, will print out your data frame (the dataset you read). This works for almost any object in R: type its name and it just dumps its contents. Handy.)
Besides just typing a name,
print(dataframe) will print your dataframe nicely formatted.
To see the names of the variables currently in the dataframe, use names(myDataFrame).
You can also have “row labels” with
row.names(myDataFrame). This is somewhat rare; it's basically picking one of the variables to be a "row label". Use it if you need to, but I haven't found a good use for it except in hacking tables to make output look better.
BTW, here’s a fun one: You can read from the clipboard, handy for quick grabs from Excel:
read.table() will read from the clipboard (via
file = "clipboard").
All of this (and more) are in the “official guide to R Input and Output” at http://cran.r-project.org/doc/manuals/R-data.pdf.
The R site has some info on reading in SPSS info here. It basically says “Function read.spss can read files created by the `save’ and `export’ commands in SPSS. It returns a list with one component for each variable in the saved data set. SPSS variables with value labels are optionally converted to R factors.”. This is part of the package
foreign. Packages are add-ins, which are listed by
library() and loaded by
library(foreign) (in this case). I have lots more about packages elsewhere, including R Packages
In practice, it looks like this:
MyDataSet <- read.spss("c:\\junk\\file.sav",to.data.frame=TRUE)
Yes, I did find that I needed double slashes. The data frame is the object MyDataSet.
What really threw me? Dates/Timestamps. SPSS stores date/time values as the number of seconds since October 14, 1582 (the start of the Gregorian calendar) (see http://www.childrensmercy.com/stats/data/dates.asp). So, you have to do lots of calcs to convert those dates back to something R can use.
From a post on the R-Help list, here is one way:
as.chron(ISOdate(1582, 10, 14) + mydata$SPSSDATE)
as well as this post which points out that spss.get in package Hmisc can handle SPSS dates automatically. This and additional discussion on SPSS dates is available in the Help Desk article in R News 4/1.
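For intuition, here's a base-R sketch of the same conversion, no chron needed; spss_seconds is a made-up stand-in for one of SPSS's seconds-since-1582 values.

```r
# SPSS dates count seconds from 1582-10-14; add them to that origin.
origin <- ISOdate(1582, 10, 14, 0, 0, 0, tz = "GMT")
spss_seconds <- 86400 * 10   # ten days' worth of seconds, just for the demo
origin + spss_seconds        # an ordinary R date-time, ten days after the origin
```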
After loading, check the package's overall documentation.
There are a few things to consider. Start with
dataset <- spss.get("c:\\junk\\WN User Survey Final Data.sav")
but you may want to add charfactor=T if you want to convert character variables to factors. Play with it; it may help or hurt. You can always do it later.
dataset <- spss.get("c:\\junk\\WN User Survey Final Data.sav", charfactor=T)
RODBC handles ODBC connections to databases
channel <- odbcConnect("DSN")
odbcGetInfo(channel) # Prints useful info
sqlTables(channel) #gets all the table names; is there a way to filter this?
Don’t forget to
odbcClose(channel) at the end.
sqlSave copies an R data frame to a table in the database, and
sqlFetch copies a table in the database to an R data frame.
An SQL query can be sent to the database by a call to
sqlQuery. This returns the result in an R data frame. (
sqlCopy sends a query to the database and saves the result as a table in the database.)
data1 <- sqlQuery(channel,"select * from dual")
If you need multiple lines for a long query, use the
paste() function to assemble a full query. This can also be used to create substitutions, etc.
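For example, here's a sketch of paste() assembling a long query, with a substituted value; the table and column names are invented for the demo.

```r
# paste() joins its pieces with spaces by default, so each clause
# can live on its own line and still come out as one query string.
min_age <- 18
query <- paste(
  "select id, age, region",
  "from survey_responses",
  paste("where age >=", min_age),   # numeric value substituted in
  "order by id"
)
query   # one long, space-joined SQL string
# then: data1 <- sqlQuery(channel, query)
```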
A finer level of control is attained by first calling
odbcQuery and then
sqlGetResults to fetch the results. The latter can be used within a loop to retrieve a limited number of rows
at a time, as can function sqlFetchMore.
And remember, you can read from spreadsheets via ODBC as well... but read-only; there is no writing back via ODBC! Note the use of the different connect function, odbcConnectExcel.
channel <- odbcConnectExcel("bdr.xls")
sqlTables(channel) ## list the spreadsheets
## Either of the below will read in Sheet 1:
sh1 <- sqlFetch(channel, "Sheet1")
sh1 <- sqlQuery(channel, "select * from [Sheet1$]")
(Also, there is a DBI approach (similar to Perl's DBI) which works kind of similarly. See package DBI:
To run the Windows binary package ROracle_
you'll need the client software from Oracle. You must have
$ORACLE_HOME/bin in your path in order for R to find Oracle's runtime libraries. The binary is currently not on CRAN (grrr), but only at http://stat.bell-labs.com/RS-DBI/download/index.html Note that you would do better to compile this yourself, or better yet, skip it and just use RODBC. )
Some more is in this http://cran.r-project.org/doc/manuals/R-data.pdf and is well worth a read.
If you saved it with "save" (see below), then you can use the
load() command. For example,
data(name) loads the data set attached to a package (i.e., in its search path).
data() lists the data sets in the currently loaded packages.
data() can also be used to load other things; for most purposes, I would stick with
save(), see below).
data(package = .packages(all.available = TRUE)) to list the data sets in all available packages; this is handy for seeing just what sample data you have available for testing or demo purposes.
attach(data.frame1) makes the variables in data.frame1 active and available generally, by name. You can also attach a previously saved file of objects; that's like
load(), but it only loads an object when you ask for it... kind of an "on deck" command. Yes, this is confusing; sorry bout that.
To load a collection of commands (i.e., a script or command file), try source("myscript.R").
sink("record.lis") sends all output to a file;
sink() turns this off.
You can save your entire “workspace” with
save.image(file="archive.RData") These can then be attached or loaded later.
save.image() (i.e., nothing in parens) is just a short-cut for “save my current environment”, equivalent to
save(list = ls(all=TRUE), file = ".RData"). It is also what happens when you answer "yes" to saving the workspace on exit.
A more common thing is to just save useful variable objects:
save(celsius, fahrenheit, file="tempscales.RData"). Note that these are all binary files, so they can move from platform to platform, but are UNREADABLE IN ANY OTHER SOFTWARE. You were warned.
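Here's a quick sketch of that save()/load() round trip, using a temp file so nothing litters your disk:

```r
celsius <- c(0, 20, 100)
fahrenheit <- celsius * 9/5 + 32
f <- tempfile(fileext = ".RData")
save(celsius, fahrenheit, file = f)   # binary, R-only, as warned above

rm(celsius, fahrenheit)   # pretend this is a fresh session
load(f)                   # the objects return under their original names
fahrenheit                # 32 68 212
```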
(BTW: R, by default, can save the entire workspace as the hidden file .RData. This can cause confusion later on, so be careful. This whole “saving the workspace” is painful, and in fact, the R folks suggest using a different directory for each “set” of analyses you do so you can store the whole thing in .RData and .Rhistory per directory. This is kludgy.)
WORKING WITH DATA
names(obj1) prints the names, e.g., of a matrix or data frame. (aka variable names)
List of objects: ls()
rm(object1) removes object1. To remove all objects, say rm(list=ls()).
Size of Objects: dim() or (for vectors) length().
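A quick illustration of dim() vs. length(), using a made-up data frame:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
dim(df)       # 3 rows, 2 columns
length(df)    # 2 -- a data frame is a list of its columns
length(df$a)  # 3 -- a single column is a plain vector
```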
Multiple vectors (columns) get linked into a data frame. This is pretty central to how R works. Unlike SPSS (up through v14), R can hold multiple datasets in memory at once, each named.
This data frame stuff will drive you up a wall. I’ve mentioned it elsewhere as well, so keep trying. If each variable is a vector or array or list, you can make a “list of lists”. The dataframe is the list of variables; each variable is its own list/array. (Yes, List and Array and Vector are all special terms in R, so I shouldn’t use them interchangeably. Sorry.)
The data.frame() function puts together several vectors into a dataframe, which has rows and columns like a matrix.
save(x1,file="file1") saves object x1 to file file1. To read in the file, use
q() quits the program.
q("yes") saves everything.
options(digits=3) sets digit printout to 3 decimals
Besides lots of formulas, each column (variable) is basically a vector, and so you can do all sorts of vector stuff like concatenate, subset, etc.
Now, remember, everything is an object. So, think in terms of functions on the entire object, and assume you can't use loops (you can, but they're slow). The good news is that this gives lots of interesting possibilities. For example, since
x[2] is item 2 of x, you can also do
x[x > 3] to magically see all the items with values greater than 3.
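A quick sketch of that logical-subscript magic:

```r
x <- c(1, 5, 2, 8, 3)
x[2]        # 5 -- plain positional indexing
x > 3       # FALSE TRUE FALSE TRUE FALSE -- a whole vector of tests at once
x[x > 3]    # 5 8 -- keep only the items passing the test
x[x > 3] <- 0   # and you can assign to a subset, too
x           # 1 0 2 0 3
```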
To expand a vector or whatever, just assign something out of its current range. To truncate, set the length to whatever you want or use the index:
x <- x[c(1,2,3,5)] keeps only the 4 items referenced there.
In addition, like a SQL join, calculations will expand (recycle) vectors. So, if you multiply a single number by a vector with 5 items, it's like the single value is applied across all 5 items. Similarly, the function
sapply() takes as arguments the data frame and the function that is to be applied, and applies it across the columns (like a closure?)
rep(thing, count) replicates thing count times.
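A quick sketch of recycling, rep(), and sapply() together (the little data frame is made up):

```r
c(1, 2, 3, 4, 5) * 2   # 2 4 6 8 10 -- the single value gets recycled
rep("a", 3)            # "a" "a" "a"

df <- data.frame(q1 = c(1, 2, 3), q2 = c(10, 20, 30))
sapply(df, mean)       # applies mean() to each column: q1 = 2, q2 = 20
```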
Now, a vector with names (levels) gets a special name: factor. Basically, it's like AUTORECODE in SPSS. Take a column, convert it with factor(mycolumn):
This reduces storage, and some R functions expect a "factor". Like SPSS, levels (integers) are assigned in alpha sort order of the levels, not in order of appearance. If that's a problem, assign a (..., ref="LA") to force a specific option to be the 0th or reference category. You can even just manually force the (..., levels=c("LA", "MA", etc.)) if you want
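A quick sketch of factor(), the alpha-sorted levels, and the ref= bit; ref= lives in the relevel() function:

```r
city <- c("MA", "LA", "MA", "NY")
f <- factor(city)
levels(f)       # "LA" "MA" "NY" -- alpha order, not order of appearance
as.integer(f)   # 2 1 2 3 -- the underlying integer codes

f2 <- relevel(f, ref = "MA")   # force MA to be the reference (first) level
levels(f2)      # "MA" "LA" "NY"
```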
Besides Vectors, you can have Arrays/Matrices (2 or more dimensions, all of same type), Data Frames (basically an array with each column as different type), Lists (Vectors with vectors in them, like a nested array), and Strings (basically, character vectors).
Strings prefer double quotes, and use C-style escape sequences.
BTW, all the stuff you do with numerics can be done with characters, like repeats.
paste() is a string combiner:
labs <- paste(c("X","Y"), 1:10, sep="")
c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
Every object in R has a
mode() and a
length(). Other attributes are available via the attributes() function.
as.integer(), etc. convert objects to new modes.
When you have a dataframe, you can get to its columns a couple of ways:
This whole "access" or "extraction" thing is painful. ?extract gives more details. Basically, you can use [[ ]], or $, or [ ]. You can even get help on them:
help("[["). Here are some more details.
With much help from the “Introduction to R” in the next few paras:
An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on. Here is a simple example of how to make a list:
Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))
Components are always numbered and may always be referred to as such. Thus if Lst is the name of a list with four components, these may be individually referred to as Lst[[1]], Lst[[2]], Lst[[3]] and Lst[[4]]. (If, further, Lst[[4]] is a vector subscripted array (as it is in the example) then Lst[[4]][1] is its first entry.)
If Lst is a list, then the function length(Lst) gives the number of (top level) components it has.
List components can have names; you can see how they were hand-typed in the example. So, name$component_name is another way to get to the data. Lst$name is the same as Lst[[1]] and is the string "Fred".
Additionally, one can also use the names of the list components in double square brackets, i.e., Lst[["name"]] is the same as Lst$name. This is especially useful when the name of the component to be extracted is stored in another variable, as in
x <- "name"; Lst[[x]]
It is very important to distinguish Lst[[1]] from Lst[1]. `[[...]]' is the operator used to select a single element, whereas `[...]' is a general subscripting operator. Thus the former is the first object in the list Lst, and if it is a named list the name is not included. The latter is a sublist of the list Lst consisting of the first entry only. If it is a named list, the names are transferred to the sublist.
A data frame is basically just a list with class “data.frame”.
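A tiny sketch of the list business above, using the Fred example:

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3)
Lst[[1]]          # "Fred" -- the element itself
Lst[1]            # a one-element LIST, still carrying the name "name"
Lst$name          # "Fred" again
x <- "name"
Lst[[x]]          # "Fred" -- handy when the component name sits in a variable
```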
If you have a dataframe and you don’t want to keep typing “dataframe$variable”, you can
attach(dataframename) and just use the variable names, and then
detach(dataframename) when you are done. If you are just using one dataframe, this is pretty handy.
dataframe[ROWS,COLUMNS]. So if you want all rows for column 3, try
mydataframe[,3]. If you leave out the comma, R assumes you meant the column, so mydataframe[3] is the same, for the most part, as mydataframe[,3]. Names are more annoying. If the columns have names, you can do one with mydataframe["q1"], but if you want more than one name, you have to use the c() function! mydataframe[c("q1","q3","q9")].
Now, if you like the $ approach, that’s fine… but if you want to select multiple variables, you have to recreate the dataframe, so in effect:
summary(data.frame(mydata$q1, mydata$q3)). Yes, this is annoying.
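A sketch of all these extraction flavors on a made-up dataframe:

```r
# Made-up survey-ish data, just for the demo.
mydata <- data.frame(q1 = 1:3, q2 = 4:6, q3 = 7:9)
mydata[, 3]                    # column 3 as a plain vector: 7 8 9
mydata[3]                      # column 3 as a one-column data frame
mydata["q1"]                   # one column, by name
mydata[c("q1", "q3")]          # several columns need c()
summary(mydata[c("q1", "q3")]) # beats retyping mydata$ for each variable
```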
Ok, all that mishagas aside, the
subset() function seems to be the winner:
subset(dataset,(dataset$Q4>=18 & is.na(dataset$Q8)==F))
BTW... R loves to convert character data into factors AUTOMATICALLY when you create the dataframe to save memory. This can be REALLY ANNOYING. If you don’t want this, consider:
data.frame(v1, I(v2)). The I() function basically says "interpret this as raw, no transform". More at ?data.frame and ?read.table.
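A quick sketch of I() at work:

```r
v1 <- 1:3
v2 <- c("a", "b", "c")
d <- data.frame(v1, I(v2))
class(d$v2)   # "AsIs" -- left alone, not silently turned into a factor
```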
Deleting a column (one of several ways):
iris[,5] <- NULL
see http://cran.r-project.org/doc/contrib/usingR-2.pdf page 22 for useful functions
Checking for nulls: NA is the R "null", and you aren't supposed to test for equality to it. Instead, use is.na(x).
Deduplicating: df[!duplicated(df$colname),] keeps the first row for each value of colname. You could also use
aggregate, but aggregate() is basically a wrapper for tapply, and tapply basically loops in R; duplicated() loops in C (and uses hashing).
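A quick sketch of duplicated() dropping repeat rows, with made-up data:

```r
df <- data.frame(id = c(1, 1, 2, 3, 3), score = c(10, 10, 20, 30, 31))
df[!duplicated(df$id), ]   # keeps the FIRST row for each id
```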
How to count distinct (or count unique) values in a column? Well, unique() returns the unique values in a vector, and sort(unique()) returns them sorted.
And length() tells the number of items in a vector... but gives the number of columns for a data frame (which is a list of lists, each internal list being a variable vector; annoying, huh?). dim() does rows and columns for a dataframe. So, if you are looking to count a variable, use length, but be careful, ok?
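A quick sketch of the count-distinct dance:

```r
x <- c("b", "a", "b", "c", "a")
unique(x)           # "b" "a" "c" -- in order of appearance
sort(unique(x))     # "a" "b" "c"
length(unique(x))   # 3 -- your "count distinct"
```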
Unique Lists from Lists: https://stat.ethz.ch/pipermail/r-help/2004-September/056830.html
cbind, sapply, tapply, mapply, split become your best friends.
plot() does x-y plots
pairs() does great pairwise x-y plots… very handy.
help.start() starts the browser version of the help files, or just
help(command1) prints the help available about command1.
help.search("keyword1") searches keywords for help on this topic.
apropos("topic1") finds commands relevant to topic1, whatever it is.
example(command1) prints an example of the use of the command. This is especially useful for graphics commands.
If you are a newsgroup kinda person, try the R mailing lists mentioned above.
Cat vs. print vs. format:
If one wants to display a character string with control over newlines, then one typically uses cat. If one wants to display an object, one uses print, or else converts it to a character string using format or as.character and then displays it using cat.
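A quick sketch of the difference:

```r
s <- "line one\nline two"
print(s)                            # one quoted line, with the \n shown as an escape
cat(s, "\n")                        # two plain lines -- cat honors the newline
cat(format(pi, digits = 3), "\n")   # 3.14 -- format() controls the rendering
```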
Linking to Excel? See the odbcConnectExcel notes above.
Big Data and Memory
Getting all your memory:
Rgui.exe --max-mem-size=2Gb or --max-mem-size=2000M.
But I reprint (edited) from the R FAQ for Windows:
2.9 There seems to be a limit on the memory it uses!
Indeed there is. It is set by the command-line flag --max-mem-size and defaults to the smaller of the amount of physical RAM in the machine and 1.5Gb. It can be set to any amount between 32Mb and 3Gb. Be aware though that Windows has (in most versions) a maximum amount of user virtual memory of 2Gb. Use ?Memory and ?memory.size for information about memory usage. The limit can be raised by calling memory.limit within a running R session. The executables Rgui.exe and Rterm.exe support up to 3Gb per process under suitably enabled versions of Windows (see http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx: even where this is supported it has to be specifically enabled). On such systems, the default for --max-mem-size is the smaller of the amount of RAM and 2.5Gb.
From the R-Help Mailing List:
Duncan Murdoch, Friday, March 03, 2006 says:
R can deal with big data sets, just not nearly as conveniently as it deals with ones that fit in memory. The most straightforward way is probably to put them in a database, and use RODBC or one of the database-specific packages to read the data in blocks. (You could also leave the data in a flat file and read it a block at a time from there, but the database is probably worth the trouble: other people have done the work involved in sorting, selecting, etc.) The main problem you’ll run into is that almost none of the R functions know about databases, so you’ll end up doing a lot of work to rewrite the algorithms to work one block at a time, or on a random sample of data, or whatever.
From a post at DecisionStats :
A very rough rule of thumb has been that the 2-3GB limit of the common 32bit processors can handle a dataset of up to about 50,000 rows with 100 columns (or 100,000 rows and 10 columns, etc), depending on the algorithms you deploy.
From the R-Help Mailing List:
[R] Re: suggestion on data mining book using R
Vito Ricci on Thu, 20 Jan 2005
Hi, see these links:
Brian D. Ripley, Datamining: Large Databases and
Methods, in Proceedings of "useR! 2004 – The R User
Conference", May 2004
and if looking for a book I (Vito Ricci) suggest:
Trevor Hastie , Robert Tibshirani, Jerome Friedman,
The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2001, Springer-Verlag.
B.D. Ripley, Pattern Recognition and Neural Networks
Some other docu links to check out:
http://cran.r-project.org/doc/contrib/Rossiter-RIntro-ITC.pdf (Nice, includes some good tips)
http://cran.r-project.org/doc/contrib/Vikneswaran-ED_companion.pdf (An R Companion to Experimental Design)
http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf (R for Beginners)
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf (Fitting Distributions with R)
http://cran.r-project.org/doc/contrib/Burns-unwilling_S.pdf (The unwilling S user; remember that R is the open source S, so most things will work in both systems).
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf (Practical Regression and Anova using R)
Some tips on reading and manipulating data are in this PDF.
http://cran.r-project.org/doc/contrib/Kuhnert+Venables-R_Course_Notes.zip has lots of stuff in it, including data.
http://cran.r-project.org/doc/contrib/Marthews-BeginnersRcourse.zip has a 9 page rapid fire doc to get you started, includes data.
http://cran.r-project.org/doc/contrib/Lemon-kickstart_1.6.zip has a collection of HTML docs, so a pain to use, but once unzipped, is a good dive in… Already unzipped at http://cran.r-project.org/doc/contrib/Lemon-kickstart/ if you want to see it.
http://tolstoy.newcastle.edu.au/R/e2/help/07/01/8033.html covers as.Date, strptime, and chron.
There is also a sample dataset all about the Titanic.
Hoping I helped you.
* * *