The Net Takeaway: R









R GUIs · 4027 days ago, Analysis

First off, if you found this page via a web search or bookmark, you may be much happier in the R Section of this site to see the multiple articles about R, including this one, but also about Packages, Data manipulation, etc.

The home of all R GUIs is accessible via the “R GUIs” link on the R home page. Start at the Overview link on the left nav. It includes many Linux options which I will not delve into here at this time (I used to work at Microsoft, sorry bout that). Also linked are web front ends, SOAP and other RPC type calls of R servers, and DCOM/App links for Excel, Gnumeric and a few others.

Best Choices:

This is a totally subjective list.

JGR is my current favorite. Running in Java, it includes completion, color syntax highlighting, integration of the iPlot interactive plotting packages, and nice people on the mailing list. I say start here, if for no other reason than it’s quick and easy. BTW, in the syntax editor, highlight your code and press Ctrl-Enter to run it. Not documented, but that’s how it’s done.

R-Commander is probably the most popular GUI these days. It does stay on the more “basic” side, but still is a good attempt to ease beginners into the world of R. And, after looking through the menus, there are some pretty handy shortcuts there, akin to SPSS Base offerings plus a few more. There are a bunch of plugins which extend it in different directions. Worth trying. Open Source.

SciViews-R is another popular gui. Currently in an alpha state, it could be a very powerful tool, though it feels more like an IDE than an analytic home. It is getting constant revision, and lots of attention on the mailing lists. One nice feature: A “report editor” with full HTML editing capabilities. In addition, SciViews can integrate with R-Commander and include its menus and features as well. All of this is open source.

If you just want an editor, the IDE/Script Editors page lists many of them, including Jedit. One interesting quickie is Tinn-R, basically a Notepad with some R twists and tweaks. I tend to default to Jedit, but that’s because I use it in so many other places. In addition, JGR has color syntax and code completion, so I have been using it more often as an editor as well. As I said above, however, if you want a light code editor, Tinn-R is pretty good.

Up and Comers:

RKward is an open-source GUI which looks pretty good… but is KDE and therefore strongest on Linux. Thanks to the porting of some libraries, it can run on Windows (barely), but it really needs help. Linux can handle memory better than Windows, so you might find yourself forced to use a Linux box for some analyses. You’ll want to use RKward. Once it gets better on Windows, it will be a powerhouse. See docs at

PMG is the “poor man’s gui”. I haven’t tried it yet; it requires GTK libraries (Linux will already have them, Windows will need them installed; site tells you how). It’s getting some updates, but slowly, so YMMV.

Rattle, which Togaware (an Australian Data Miner) says “(the R Analytical Tool To Learn Easily) provides a simple and logical interface for quick and easy data mining.” Don’t know much about it; it too requires the GTK libraries, which are pretty easy to install. Gets lots of love on the mailing lists, partially from its ease, and partially because it brings in lots of data mining components often ignored in more “statistics” focused packages. Lots of great content on this site about the process of actually data mining.

SimpleR GUI is a good start for a GUI. Simple-R Handbook (PDF) shows how to use it. Some very clever features and graphics. However, it hasn’t been updated in a while, so it may not work as well as you wish.

If you have money, REvolution R Enterprise is an Eclipse based IDE which is pretty nice. Get your company to pay for it. Also, Eclipse can be a bit of a pig, so a beefier box is helpful.

Couldn’t care less about:

ESS and the Emacs bindings: as a non-emacs person (go Jedit!), I find little value in using a fat and overblown editor to run an already complicated system. Talking to emacs/vim folks (they both have problems) is similar to Name that Tune in reverse: I can do that in 10 keystrokes! I can do that in 12 keystrokes! Anyway… I am really just jealous because I’ve tried a few times and still can’t learn Emacs or VI, so I derogate them to make myself feel better. Anyway. If you already love Emacs, ESS is the way to go.

Brogdar is a commercial app which sits on top of R. However, it costs around $350, and though it has some interesting features, I can’t recommend it at this point (for $350, you can start to buy entire stats packages with support, so best of luck to them).

StatPaper attempts to turn R into a Mathematica or Maple gui, intermixing graphs and commands. Not much done since 2003, so don’t hold your breath. Not even listed on the site.

Farther afield…
OpenI is an overall reporting and BI system, but it has a link to use R. This is by the astounding guys at LoyaltyMatrix [now Responsys]. See more of their coolness at R at LoyaltyMatrix

BEE is another suite of tools supporting Business Intelligence project implementation including ETL tool and OLAP server and a thin client, with R as the “analytic” engine. Not updated since 2005, which is too bad.

DecisionStudio is a “desktop BI platform”. Basically integrates open source offerings including MySql, DBDesigner, R and Tinn-R, iReport and JasperReports, all linked with Python. Last updated in March, 2006, so may be dead.

Little Seeds, starting to grow…
Biocep-R, Statistical Analysis Tools for the Cloud Computing Age includes not only a Java front end, but a whole framework for distributed computing. The “Virtual R Workbench” looks pretty impressive, but unclear how much of the other server stuff you need.

R AnalyticFlow is like the visual front end from Clementine or SAS Ent Miner. Still early, but could be the front end to keep if you like these flowchart approaches.

RedR is a cool dataflow approach, open source, still very young.

Visualization of Data
Besides a gui to help you use R, there is also a growing field of graphical depiction of data to help with analysis. This can be as simple as a scatterplot, or as sophisticated as a 3d chart allowing you to select points and see their relationships highlighted on multiple other charts simultaneously.

One of the more popular tools for this in the open source world is the GGOBI system, and R has a pretty good linkup called rggobi. Like R, it’s a whole system, so you won’t get all of its power on the first day, but it’s worth playing with to see just how easy it can be.

(Thanks to Andy Edmonds on the Web Analytics Group for reminding me that I left rggobi out.)

This isn’t exactly the same, but if you already use a package, you might find that it wraps around R. So, for example, SPSS has the R Extension available from their installer CDs in recent versions, or at SPSS Developer Central. If you use SAS/IML (which is pretty hard core, it’s only if you mess with your own matrices), you can get to R via their SAS/IML to R Interface. Other tools are also starting to wrap around R.

Gone now?
Rpad is an interesting mix of web based tool with more of an app feel. As they say on their page, “Rpad is an interactive, web-based analysis program. Rpad pages are interactive workbook-type sheets. Rpad is an analysis package, a web-page designer, and a gui designer all wrapped in one. The user doesn’t have to install anything—everything’s done through a browser.” Could become something cool in the future… though it appears to be gone now.


* * *


Understanding R · 5220 days ago, Analysis

R Notes
R Wiki

(If you’ve found this via searching, you may enjoy the entire series of R articles, found via the navigation link on the right, R Statistical System. These are all in “somewhat random notes” style, but they’ve been helpful to me in the past. Feel free to ping me with updates or suggestions.)


Ok, why all this? I wanted to pull together notes for folks who are pretty savvy and need to understand the quirks of R. Yes, there are lots of docs out there (I link to some of the better ones below) and yes, these notes aren’t always well organized… but they try to focus on the “getting the job done” parts, not the stats-tutorial or stats-programmer approach most other docs take.

Oh, the “official” docs? They stink for beginners or folks with little time on their hands to trudge through them. There are the “official” ones, and there are the contributed ones by users trying to make it better (but still, tough slogging ahead).

As you get more into R, these will all become more useful, but don’t worry if you get stuck on these in the early days: they are all pretty technical, both programming-wise and computational-statistics wise. But if you want to see how to work R like the masters, this is a good place to dig.

Best Manuals:
None of them are great, but these are the best of what I’ve seen in my reading. (I will expand and review them all later)
Using R for Data Analysis and Graphics with Introduction, Code and Commentary by J H Maindonald Updated Version which fed into much of this guide, credit to Maindonald!
Verzani-SimpleR.pdf has some very nice graphs, and also explains how to read the output, which is very helpful.

(There is also a Using R PDF which is the older Maindonald book, linked for historical reasons)

Quicker but still pretty good: Notes on the use of R for psychology experiments and questionnaires by Jonathan Baron and Yuelin Li

Fantastic guide for those who are very familiar with SAS and SPSS:
R for SAS & SPSS users PDF at the Author’s own site,

There are many contributed docs to R, and you can click here to see docs sorted by most recently updated. CRAN Other Documentation shows some commentary around some of the more general tutorials and docs, some of which I reference above.

GREAT list of tips:
RTips, aka StatsRUs. These are slowly migrating to the R Wiki Tips Section so check there also.

Don’t ever forget about the general R FAQ and the Windows specific R FAQ.

Another place to dig is in the R Mailing lists but as I post elsewhere, folks are not nice to newbies. Be prepared to slog through some really rude responses by people who don’t remember what it was like when they were just starting out. Also, the mantra to remember: It’s Open Source, get used to it.

The Most Important Things to Know about R
I will assume you’ve used other programs such as SPSS, Mystat or even Excel. I also assume Windows. Note that R will often top out at 100k rows and 20 variables b/c it stores everything in memory, depending on the type of data you have. You may do better with Linux than with Windows; Linux has a better memory model for the big stuff. Yes, Windows will top out at 2gb until we get to 64-bit, but if you are adventurous, the R for Windows FAQ has some hideously complex suggestions that none of us can easily do to potentially work around this issue (there is more about this near the bottom of the post).

There are 4 things to remember in working in R:

  1. Everything is an object. This means that your variables are objects, but so are output from analyses. Everything that can possibly be an object by some stretch of the imagination… is an object.

  2. R works in columns, not rows. We normally think of data as 1 line per person (or observation), with a collection of variables recorded per person. But R thinks of variables first, and when you line them up as columns, then you have your dataset. Even though it seems fine in theory (we analyze variables, not rows), it becomes annoying when you have to jump through hoops to pull out specific rows of data with all variables.

  3. R likes lists. If you aren’t sure how to give data to an R function, assume it will be something like this: c("item 1", "item 2") meaning “concatenate into a list the 2 objects named Item 1, Item 2”. Also, “list” is different to R from “vector” and “matrix” and “dataframe” etc. ad nauseum. But beyond the “specific meaning” aspects which you can deal with later, you get the idea.

  4. It is open source. It won’t work the way you want. It has far too many commands instead of optimizing a core set. It has multiple ways to do things, none of them really complete. People on the mailing lists revel in their power over complexity, lack of patience, and complete inability to forgive a novice. We just have to get used to it, grit our teeth, and help them become better people.
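A quick sketch of points 1 through 3, using made-up data (the names ages, person and people are just for illustration, not from any real dataset):

```r
# c() combines values into a vector; every element has the same type
ages <- c(31, 45, 27)
class(ages)          # "numeric"
length(ages)         # 3

# list() can mix types, which is why "list" and "vector" mean different things to R
person <- list(name = "Fred", ages = ages)
class(person)        # "list"

# Columns first: line up vectors of equal length and you have a data frame
people <- data.frame(id = 1:3, age = ages)

# Pulling out one row, with all variables, needs the [row, ] form
people[2, ]

# Even the output of an analysis is an object you can store and inspect later
s <- summary(ages)
class(s)
```

Note that c() really builds a vector; the “list” in point 3 is meant loosely.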

There aren’t many good GUIs. Some things which help me keep my sanity include the pretty good start of JGR, which still has bugs but includes a “data editor” similar to a spreadsheet, plus color syntax highlighting and command tooltips to help your syntax. In a later entry entitled R GUIs, I review a few guis, web-front-ends, and editors.

BTW, in JGR: To run part of your syntax file when open in the handy editor, select it and press <Cmd><Enter> on the Mac, or <Ctrl><Enter> on PC. No docs on this, but that’s how it’s done. Also, to get to the data editor, use the Object Explorer and then double-click on your data frame… voila. The Edit command doesn’t work as of this posting. There are more tips for JGR in the R GUIs entry here.

There is online help, but it’s hard. help() is your starting place. help(plot) gives help for the plot command, and apropos("plot") lists all functions with the word plot in their name. Note that some of this help is aimed at programmers, not those of us who need to know how to get something done.

help.start() will pop up a more menu driven approach, but still not all that helpful. Basically, help.start() starts the browser version of the help files.

example(command1) prints an example of the use of the command. This is especially useful for graphics commands. Try, for example, example(contour), example(dotchart), example(image), and example(persp).


Basically, everything in R is an object. Assume an object is either a number or word, a collection of numbers/words, or the results of a procedure. BTW: identifiers are Case Sensitive! Comments are prefixed by the # character, a line at a time (meaning you need the # on each line).

Most stats packages lay the data in a simple fashion: column heads are variable names, and each row is a new entry. R turns this on its head a bit. You start off with columns of numbers (i.e., each variable on its own) and you merge them into a “data frame”, akin to SAS’s dataset or SPSS’s datafile.

Yes, this is annoying. It’s open-source; get used to it.

BTW: q() is the quit command. Why not just quit? Ya got me. Altogether now: It’s open source; get used to it. Anyway… q("yes") saves everything.

If you just want to “clear the decks”, consider rm(list=ls()) which deletes all objects.

Some tips on reading and manipulating data are in this PDF.

Getting data into the system…

myDataFrame <- read.table("c:/austpop.txt", header=T)

As you’d expect, you can play with what delimiter is used (sep=), and here, the first line holds the headers, which are read as the column names. (As with PERL, there is always more than one way to do it. For example, there is also read.csv(), etc. Check your manuals!)
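Here is a small sketch of the header and sep options, using a temporary file with made-up numbers so it runs anywhere (the file contents and the names austpop and austpop2 are assumptions for illustration):

```r
# Write a small tab-delimited file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("year\tpop", "1917\t4941", "1927\t6182"), tmp)

# header = TRUE takes the column names from the first line; sep sets the delimiter
austpop <- read.table(tmp, header = TRUE, sep = "\t")
names(austpop)
str(austpop)

# read.csv() is the same reader with comma-friendly defaults
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("year,pop", "1917,4941", "1927,6182"), tmp_csv)
austpop2 <- read.csv(tmp_csv)
all.equal(austpop, austpop2)
```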

Not all of these wind up in a data frame format; it’s pretty specific. If you aren’t sure, just force it:
mydata <- read.csv('C:/data.csv'); mydf <- as.data.frame(mydata) ## data as dataframe
(yes, a semicolon can separate multiple commands.)

Typing in data:
Pretty similar. In this case, we have a “c” function which combines numbers into a “column”. t1 <- c(1,2,3,4,5)

There is a mini “spreadsheet” for editing and adding data if you wish to raw-type:
xnew <- edit(data.frame())

If you have a couple of these, then you can combine them manually into a data frame with the data.frame function:
elasticband <- data.frame(strch = c(46,54,48,50,44), dist = c(148,182,173,166,109))

If you want to edit your stuff in a very (very!) basic data editor, the default R install has one:
elasticband <- edit(elasticband)

NOTE: You have to assign the result of the edit back to the object, or you lose all your edits. Bad thing.

(BTW: Just typing the name of the object at the prompt, like elasticband at the prompt will print out your data frame (the dataset you read). This works for almost any object in R: type its name and it just dumps its contents. Handy.)

Besides just typing a name, print(dataframe) will print your dataframe nicely formatted.

To see the names of the variables currently in the dataframe, use names(myDataFrame).

You can also have “row labels” with row.names(myDataFrame). This is somewhat rare; it’s basically picking one of the variables to be a “row label”. Use it if you need to, but I haven’t found a good use for it except in hacking tables to make output look better.
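Putting the last few commands together on the elasticband data from above (the “band” row labels are hypothetical, added only to show how row names work):

```r
elasticband <- data.frame(strch = c(46, 54, 48, 50, 44),
                          dist  = c(148, 182, 173, 166, 109))

names(elasticband)       # variable (column) names
dim(elasticband)         # number of rows, then number of columns
row.names(elasticband)   # default row labels are just "1", "2", ...

# Hypothetical labels, only to show how row names are set and used
row.names(elasticband) <- paste("band", 1:5, sep = "")
elasticband["band3", ]   # pull a row by its label
```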

BTW, here’s a fun one: You can read from the clipboard, handy for quick grabs from Excel: read.table() will read from the clipboard (via file = "clipboard" or readClipboard).

All of this (and more) are in the “official guide to R Input and Output” at

The R site has some info on reading in SPSS info here. It basically says “Function read.spss can read files created by the `save’ and `export’ commands in SPSS. It returns a list with one component for each variable in the saved data set. SPSS variables with value labels are optionally converted to R factors.”. This is part of the package FOREIGN. Packages are addins which are listed by library() and loaded by library(foreign) (in this case). I have lots more about packages elsewhere, including R Packages

In practice, it looks like this:

library(foreign); MyDataSet <- read.spss("c:\\junk\\file.sav", to.data.frame=TRUE)

Yes, I did find that I needed double backslashes. The data frame is the object MyDataSet.

What really threw me? Dates/Timestamps. SPSS stores date/time values as the number of seconds since October 14, 1582 (the start of the Gregorian calendar). So you have to do lots of calcs to convert those dates back to something R can use.

From a post on the R-Help list, here is one way:
library(chron)
as.chron(ISOdate(1582, 10, 14) + mydata$SPSSDATE)
as well as this post which points out that spss.get in package Hmisc can handle SPSS dates automatically. This and additional discussion on SPSS dates is available in the Help Desk article in R News 4/1.
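If you would rather not pull in another package, the same conversion can be sketched in base R. This is only a sketch under the assumption that the SPSS value is a plain count of seconds since midnight on October 14, 1582; the helper name spss_to_posix is made up here, not part of any package.

```r
# Hypothetical helper: seconds since 1582-10-14 00:00 GMT -> POSIXct
spss_to_posix <- function(secs) {
  as.POSIXct(secs, origin = "1582-10-14", tz = "GMT")
}

# Build a test value the way SPSS would store midnight, March 3, 2006
secs <- as.numeric(difftime(as.POSIXct("2006-03-03", tz = "GMT"),
                            as.POSIXct("1582-10-14", tz = "GMT"),
                            units = "secs"))

converted <- spss_to_posix(secs)
format(converted, "%Y-%m-%d")

# Drop the time of day if you only care about the date
as.Date(converted)
```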

After loading Hmisc, type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview to see overall documentation.

There are a few things to consider. Start with
dataset <- spss.get("c:\\junk\\WN User Survey Final Data.sav")
but you may want to add the charfactor=T if you want to convert character variables to factors. Play with it, it may help or hurt. You can always do it later.
dataset <- spss.get("c:\\junk\\WN User Survey Final Data.sav", charfactor=T)

RODBC handles ODBC connections to databases
library(RODBC)
channel <- odbcConnect("DSN")
odbcGetInfo(channel)   # prints useful info
sqlTables(channel)     # gets all the table names; is there a way to filter this?

Don’t forget to close(channel) or odbcClose(channel) at the end.
Function sqlSave copies an R data frame to a table in the database, and sqlFetch copies a table in the database to an R data frame.

An SQL query can be sent to the database by a call to sqlQuery. This returns the result in an R data frame. (sqlCopy sends a query to the database and saves the result as a table in the database.)

data1 <- sqlQuery(channel,"select * from dual")

If you need multiple lines for a long query, use the paste() function to assemble a full query. This can also be used to create substitutions, etc.
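A minimal sketch of assembling a query with paste(). The table and column names (survey, id, age) are hypothetical, and nothing here touches a database; it only builds the string.

```r
# Hypothetical table and column names, for illustration only
min_age <- 18

# paste() joins its arguments with a single space by default,
# so each line of the query can sit on its own line of code
query <- paste("select id, age",
               "from survey",
               "where age >=", min_age,
               "order by id")
query

# With an open RODBC channel, this would then be run as:
# data1 <- sqlQuery(channel, query)
```

Swapping min_age for another value is the “substitution” part: the query text is rebuilt each time you call paste().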

A finer level of control is attained by first calling odbcQuery and then sqlGetResults to fetch the results. The latter can be used within a loop to retrieve a limited number of rows at a time, as can function sqlFetchMore.

And remember, you can read from spreadsheets via ODBC as well… but read only, no write back via ODBC! Note the use of the different connect, odbcConnectExcel.
library(RODBC)
channel <- odbcConnectExcel("bdr.xls")
## list the spreadsheets
sqlTables(channel)
## Either of the below will read in Sheet 1:
sh1 <- sqlFetch(channel, "Sheet1")
sh1 <- sqlQuery(channel, "select * from [Sheet1$]")

(Also, there is a DBI approach (similar to PERL DBI) which works kind of similarly. See package DBI:
library(DBI)
library(ROracle)
ora <- dbDriver("Oracle")

To run the Windows binary package you’ll need the client software from Oracle. You must have $ORACLE_HOME/bin in your path in order for R to find Oracle’s runtime libraries. The binary is currently not on CRAN (grrr). Note that you would do better to compile this yourself, or better yet, skip it and just use RODBC.)

Some more is in this and is well worth a read.

Loading DataSets
If you saved it with “save” (see below), then you can use the load() command. For example, load("thatdataset.Rdata")

data(name) loads the data attached to a package (ie, in its search path)
data() with no arguments lists the data sets available in the currently attached packages. data() can also be used to load other things; for most purposes, I would stick with load() (and save(), see below).

Use data(package = .packages(all.available = TRUE)) to list the data sets in all available packages; this is handy for seeing just what sample data you have available for testing or demo purposes.

attach(data.frame1) makes the variables in data.frame1 active and available generally… i.e., this loads a previously saved data frame. It’s like load(), but only loads when it needs it… kind of an “on deck” command. Yes, this is confusing; sorry bout that.

To load a collection of commands (i.e, a script or command file), try source("commands.R"). sink("record.lis") sends all output to a file; sink() turns this off.

You can save your entire “workspace” with save.image(file="archive.RData") These can then be attached or loaded later. save.image() (i.e., nothing in parens) is just a short-cut for “save my current environment”, equivalent to save(list = ls(all=TRUE), file = ".RData"). It is what also happens with q("yes").

A more common thing is to just save useful variable objects:
save(celsius, fahrenheit, file="tempscales.RData"). Note that these are all binary files, so they can move from platform to platform, but are UNREADABLE IN ANY OTHER SOFTWARE. You were warned.
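A round trip with save() and load(), sketched with made-up temperature values and a file in the session’s temporary directory so nothing lands in your working directory:

```r
# Made-up values for illustration
celsius <- c(25, 30, 35)
fahrenheit <- 9 / 5 * celsius + 32

# Save just these two objects to a binary .RData file
f <- file.path(tempdir(), "tempscales.RData")
save(celsius, fahrenheit, file = f)

# Remove them from the workspace, then bring them back
rm(celsius, fahrenheit)
loaded <- load(f)   # load() restores the objects and returns their names
loaded
fahrenheit
```

The objects come back under their original names, which is handy but also means load() will silently overwrite anything in your workspace with the same name.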

(BTW: R, by default, can save the entire workspace as the hidden file .RData. This can cause confusion later on, so be careful. This whole “saving the workspace” is painful, and in fact, the R folks suggest using a different directory for each “set” of analyses you do so you can store the whole thing in .RData and .Rhistory per directory. This is kludgy.)

names(obj1) prints the names, e.g., of a matrix or data frame. (aka variable names)

List of objects: ls()
rm(object1) removes object1. To remove all objects, say rm(list=ls()).
Size of Objects: dim() or (for vectors) length().

Multiple vectors (Columns) get linked into a data frame. This is pretty central to how R works. Unlike SPSS (up through v14), R can hold multiple datasets in memory at once, each named.

This data frame stuff will drive you up a wall. I’ve mentioned it elsewhere as well, so keep trying. If each variable is a vector or array or list, you can make a “list of lists”. The dataframe is the list of variables; each variable is its own list/array. (Yes, List and Array and Vector are all special terms in R, so I shouldn’t use them interchangeably. Sorry.)

The data.frame() function puts together several vectors into a dataframe, which has rows and columns like a matrix.

Quickest program:

save(x1,file="file1") saves object x1 to file file1. To read in the file, use load("file1").

q() quits the program. q("yes") saves everything.

options(digits=3) sets printed output to 3 significant digits.

Data Manipulation
Besides lots of formulas, each column (variable) is basically a vector, and so you can do all sorts of vector stuff like concatenate, subset, etc.

Now, remember, everything is an object. So, think in terms of functions on the entire object, and assume you can’t use loops (you can, but they suck). The good news is that this gives lots of interesting possibilities. For example, since x[2] is item 2 of x, you can also do x[x>3] to magically see all the items with values greater than 3.

To expand a vector or whatever, just assign something out of its current range. To truncate, set the length to whatever you want or use the index:
x <- x[c(1,2,3,5)] keeps only the 4 items referenced there.

In addition, like a SQL join, calculations will expand vectors. So, if you multiply a single value by a vector with 5 items, it’s like the single value is applied across all 5 items. Similarly, the function sapply() takes as arguments the data frame and the function that is to be applied, and applies it across the columns (like a closure?)
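The indexing, resizing and recycling ideas above, sketched on a made-up vector and a tiny made-up data frame:

```r
x <- c(5, 1, 8, 2, 9)
x[2]               # second item
x[x > 3]           # logical index: only the items greater than 3
x[c(1, 2, 3, 5)]   # several positions need c() inside the brackets

x[7] <- 4          # assigning past the end grows the vector; the gap is filled with NA
x
length(x) <- 3     # setting the length truncates it
x

10 * c(1, 2, 3, 4, 5)   # the single value is recycled across all five items

df <- data.frame(a = 1:4, b = c(2, 4, 6, 8))
sapply(df, mean)   # applies mean() to each column and returns a named vector

rep("thing", 3)    # replicate a value
```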

rep(thing, count) replicates thing count times.

Now, a vector with names (levels) gets a special name: factor. Basically, it’s like the AUTORECODE in SPSS. Take a column and convert it with factor().
This reduces storage, and some R functions expect a “factor”. Like SPSS, levels (integers) are assigned in alpha sort order of the levels, not in order of appearance. If that’s a problem, use relevel(..., ref="LA") to force a specific option to be the first, or reference, category. You can even just manually force the order with factor(..., levels=c("LA", "MA", etc.)) if you want.
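A short sketch of factors and level ordering, using a made-up vector of state codes:

```r
state <- c("MA", "LA", "NY", "LA", "MA")

# Convert the character column to a factor
f <- factor(state)
levels(f)         # sorted alphabetically, not by order of appearance
as.integer(f)     # the integer codes stored underneath (they start at 1)

# Make "NY" the reference (first) level, keeping the others in order
f2 <- relevel(f, ref = "NY")
levels(f2)

# Or spell out the full level order by hand
f3 <- factor(state, levels = c("NY", "MA", "LA"))
levels(f3)
```

The reference level matters mostly for modeling functions, which treat the first level as the baseline.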

Besides Vectors, you can have Arrays/Matrices (2 or more dimensions, all of same type), Data Frames (basically an array with each column as different type), Lists (Vectors with vectors in them, like a nested array), and Strings (basically, character vectors).

Strings prefer double quotes, and use C style escape sequences:

BTW, all the stuff you do with numerics can be done with characters, like repeats; paste() is a substring combiner:
labs <- paste(c("X","Y"), 1:10, sep="") becomes c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")

Every object in R has a mode() and a length(). Other attributes are available via the attributes() function.

as.character(), as.integer(), etc. convert objects to new modes.

When you have a dataframe, you can get to its columns a couple of ways:

This whole “access” or “extraction” thing is painful. ?Extract gives more details. Basically, you can use [] or $, or [[]]. You can even get help on them with help("[["). Here are some more details.

With much help from the “Introduction to R” in the next few paras:
An R list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a complex vector, a character array, a function, and so on. Here is a simple example of how to make a list:
Lst <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9))

Components are always numbered and may always be referred to as such. Thus if Lst is the name of a list with four components, these may be individually referred to as Lst[[1]], Lst[[2]], Lst[[3]] and Lst[[4]]. (If, further, Lst[[4]] is a vector subscripted array (as it is in the example) then Lst[[4]][1] is its first entry.)

If Lst is a list, then the function length(Lst) gives the number of (top level) components it has.

List components can have names; you can see how they were hand typed in the example. So, name$component_name is another way to get to the data. Lst$name is the same as Lst[[1]] and is the string “Fred”.

Additionally, one can also use the names of the list components in double square brackets, i.e., Lst[["name"]] is the same as Lst$name. This is especially useful when the name of the component to be extracted is stored in another variable, as in
x <- "name"; Lst[[x]]

It is very important to distinguish Lst[[1]] from Lst[1]. `[[...]]’ is the operator used to select a single element, whereas `[...]’ is a general subscripting operator. Thus the former is the first object in the list Lst, and if it is a named list the name is not included. The latter is a sublist of the list Lst consisting of the first entry only. If it is a named list, the names are transferred to the sublist.
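The difference is easiest to see by running it. This sketch reuses the Lst example from above:

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

# Three ways to reach the same component
Lst[[1]]
Lst$name
Lst[["name"]]

# [[ ]] returns the component itself; [ ] returns a shorter list
class(Lst[[1]])   # "character"
class(Lst[1])     # "list"
names(Lst[1])     # the name travels with the sublist

# Drill into a component that is itself a vector
Lst[[4]][1]

# Number of top-level components
length(Lst)
```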

A data frame is basically just a list with class “data.frame”.

If you have a dataframe and you don’t want to keep typing “dataframe$variable”, you can attach(dataframename) and just use the variable names, and then detach(dataframename) when you are done. If you are just using one dataframe, this is pretty handy.

Remember, it’s dataframe[ROWS,COLUMNS]. So if you want all rows for column 3, try mydataframe[,3]. If you leave out the comma, R assumes you meant the column, so mydataframe[3] is the same, for the most part, as mydataframe[,3]. Names are more annoying. If the columns have names, you can do one with mydataframe["q1"], but if you want more than one name, you have to use the c() function! mydataframe[c("q1","q3","q9")].

Now, if you like the $ approach, that’s fine… but if you want to select multiple variables, you have to recreate the dataframe, so in effect: summary(data.frame(mydata$q1, mydata$q3)). Yes, this is annoying.

Ok, all that mishagas aside, the subset() seems to be the winner:
subset(dataset, (dataset$Q4>=18 & is.na(dataset$Q8)==F))
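A sketch of subset() on made-up survey data, assuming the goal is to keep rows where Q4 is at least 18 and Q8 is not missing. The column names and values are invented for illustration.

```r
# Made-up survey responses
dataset <- data.frame(id = 1:5,
                      Q4 = c(17, 25, 40, 18, 33),
                      Q8 = c(1, NA, 3, 2, NA))

# Inside subset() you can use the column names directly, no dataset$ needed
adults <- subset(dataset, Q4 >= 18 & !is.na(Q8))
adults

# select= picks columns at the same time
subset(dataset, Q4 >= 18, select = c(id, Q4))
```

One nice side effect: subset() drops rows where the condition evaluates to NA, which plain bracket indexing does not.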


BTW... R loves to convert character data into factors AUTOMATICALLY when you create the dataframe to save memory. This can be REALLY ANNOYING. If you don’t want this, consider:
data.frame(v1, I(v2)) The I() function says “interpret this as raw, no transform” , basically. More at ?data.frame and ?read.table

Deleting a column: 4 examples:

data(iris)
iris[,5] <- NULL

data(iris)
iris$Species <- NULL

data(iris)
iris[,"Species"] <- NULL

or

Newdata <- subset(d2004, select=-c(concentration,stade))

or, with mydf2 holding a 20-column data frame:

### if you want to erase the third column, do:
mydf <- mydf[,-3]
### if you want to erase the first, fifth and twentieth column, do:
mydf2 <- mydf2[,-c(1,5,20)]

see page 22 for useful functions

Checking for Nulls: NA is the R “null”, and you aren’t supposed to test for equality to it. Instead, use is.na().
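Why the equality test fails, sketched on a made-up vector:

```r
x <- c(3, NA, 7)

x == NA                 # NA NA NA: comparing anything to NA gives NA
is.na(x)                # FALSE TRUE FALSE
sum(is.na(x))           # how many are missing

mean(x)                 # NA, because one value is unknown
mean(x, na.rm = TRUE)   # 5: drop the missing value first

x[!is.na(x)]            # keep only the non-missing items
```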

Getting Uniques: df[!duplicated(df$colname),] You could also use aggregate but aggregate() is basically a wrapper for tapply and tapply basically loops in R. duplicated() loops in C (and uses hashing).

How to count distinct or count unique in a column? well, unique() returns the uniques in a vector. sort(unique()) returns the sorted uniques.

And length() tells the number of items in a vector… but gives the number of columns in a dataframe (which is a list of lists, each internal list being a variable vector). Annoying, huh? dim() gives both the row and column counts for a dataframe. So, if you are looking to count a variable, use length, but be careful, ok?
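Counting distinct values and dropping duplicate rows, sketched on a made-up data frame (df, colname and value are placeholder names):

```r
df <- data.frame(colname = c("b", "a", "b", "c", "a"), value = 1:5)

unique(df$colname)            # the distinct values, in order of appearance
sort(unique(df$colname))      # the same, sorted
length(unique(df$colname))    # "count distinct": 3

# Keep the first row seen for each distinct colname
dedup <- df[!duplicated(df$colname), ]
dedup

# The length() trap: on a data frame it counts columns, not rows
length(df)    # 2
dim(df)       # rows, then columns
nrow(df)      # 5
```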

Unique Lists from Lists:

cbind, sapply, tapply, mapply, split become your best friends.

plot() does x-y plots
pairs() does great pairwise x-y plots… very handy.


help.start() starts the browser version of the help files, or just help().
help(command1) prints the help available about command1. help.search("keyword1") searches keywords for help on this topic. apropos("topic1") finds commands relevant to topic1, whatever it is. example(command1) prints an example of the use of the command. This is especially useful for graphics commands. Try, for example, example(contour), example(dotchart), example(image), and example(persp).


If you are a newsgroup kinda person, try

Tables Tips:

Cat vs. print vs. format:
If one wants to display a character string with control over
newlines then one typically uses cat. If one wants to display
an object one uses print or else converts it to a character string
using format or as.character and then display it using cat.
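A tiny sketch of the difference. The value and the message text are made up.

```r
x <- 3.14159

print(x)                          # shows the object, with the [1] index prefix
cat("pi is about", x, "\n")       # plain text; you supply the newline yourself
cat("line one\nline two\n")       # full control over where lines break

# Convert to a string with format(), then display it with cat().
msg <- paste("pi is about", format(x, digits = 3, nsmall = 2))
cat(msg, "\n")
```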

Linking to Excel?

Big Data and Memory
Getting all your memory:
Rgui.exe --max-mem-size=2Gb or --max-mem-size=2000M

But I reprint (edited) from the R FAQ for Windows:

2.9 There seems to be a limit on the memory it uses!

Indeed there is. It is set by the command-line flag --max-mem-size and defaults to the smaller of the amount of physical RAM in the machine and 1.5Gb. It can be set to any amount between 32Mb and 3Gb. Be aware though that Windows has (in most versions) a maximum amount of user virtual memory of 2Gb. Use ?Memory and ?memory.size for information about memory usage. The limit can be raised by calling memory.limit within a running R session. The executables Rgui.exe and Rterm.exe support up to 3Gb per process under suitably enabled versions of Windows (even where this is supported, it has to be specifically enabled). On such systems, the default for --max-mem-size is the smaller of the amount of RAM and 2.5Gb.

From the R-Help Mailing List:
Duncan Murdoch, Friday, March 03, 2006 says:

R can deal with big data sets, just not nearly as conveniently as it deals with ones that fit in memory. The most straightforward way is probably to put them in a database, and use RODBC or one of the database-specific packages to read the data in blocks. (You could also leave the data in a flat file and read it a block at a time from there, but the database is probably worth the trouble: other people have done the work involved in sorting, selecting, etc.) The main problem you’ll run into is that almost none of the R functions know about databases, so you’ll end up doing a lot of work to rewrite the algorithms to work one block at a time, or on a random sample of data, or whatever.
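Here is a rough sketch of the "flat file, one block at a time" idea from the quote, in base R only. Everything specific is made up for illustration: the temp file stands in for a file too big for memory, the column names id and value are invented, and a block size of 4 is silly small. The database route via RODBC is not shown.

```r
# Stand-in for a huge file: write a tiny CSV to a temp location.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), tmp, row.names = FALSE)

con <- file(tmp, open = "r")
# Read the header line once and strip the quotes write.csv() adds.
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

block_size <- 4   # rows per block; real code would use thousands
total  <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = block_size)   # continues where the last read stopped
  if (length(lines) == 0) break             # end of file
  block  <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total  <- total + sum(block$value)        # update running totals per block
  n_rows <- n_rows + nrow(block)
}
close(con)

total / n_rows    # the mean of value, computed without loading the whole file
```

A mean is easy because it only needs running totals. As the quote says, most algorithms take a lot more rewriting than this to work block by block.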

From a post at DecisionStats :
A very rough rule of thumb has been that the 2-3GB limit of the common 32bit processors can handle a dataset of up to about 50,000 rows with 100 columns (or 100,000 rows and 10 columns, etc), depending on the algorithms you deploy.


From the R-Help Mailing List:
[R] Re: suggestion on data mining book using R
Vito Ricci on Thu, 20 Jan 2005
Hi, see these links:

Brian D. Ripley, Datamining: Large Databases and
Methods, in Proceedings of “useR! 2004 – The R User
Conference”, may 2004

and if looking for a book I (Vito Ricci) suggest:

Trevor Hastie , Robert Tibshirani, Jerome Friedman,
The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2001, Springer-Verlag.

B.D. Ripley, Pattern Recognition and Neural Networks


Some other docu links to check out:
(Nice, includes some good tips)
(An R Companion to Experimental Design)
(R for Beginners)
(Fitting Distributions with R)
(The unwilling S user; remember that R is the open source S, so most things will work in both systems).
(Practical Regression and Anova using R)

Some tips on reading and manipulating data are in this PDF. has lots of stuff in it, including data. has a 9 page rapid fire doc to get you started, includes data. has a collection of HTML docs, so a pain to use, but once unzipped, is a good dive in… Already unzipped at if you want to see it. and, strptime, and chron

Sample dataset all about the Titanic:

Hoping I helped you.


* * *


Analysis with R · 5221 days ago, Analysis

When you have a dataframe, you reference specific columns like this: dataframe$column with the “$” sign.

If you don’t want to keep typing it, try
attach(dataframe) = assume “dataframe” precedes each variable name
detach(dataframe) = turns off the attach, nice way to end things. has some more info on how to manipulate dataframes.
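A small sketch of the three ways to get at a column. The data frame and its columns are made up. with() is not mentioned above, but it is a base R function that gives the same convenience as attach() without leaving anything attached.

```r
# Made-up data for the example.
dataframe <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

m_dollar <- mean(dataframe$height)   # the explicit "$" way

attach(dataframe)                    # columns are now visible by bare name
m_attached <- mean(height)
detach(dataframe)                    # always clean up after an attach

m_with <- with(dataframe, mean(height))  # same idea, scoped to one expression
```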

BTW, use ; (semicolon) to put multiple commands on one line.

Basic Stuff
rowcount = length()... of a single variable. If you have a dataframe, you need to use dim() (or nrow()). Remember, a dataframe is basically a list of vectors, so length() of a dataframe is just the number of columns in your dataframe… each of which is a vector. It’s annoying, I know.

So, each variable is a vector or array (if you took Pascal, Basic, Java, or C in college, you remember what an array is). Put them together and you have a dataframe. Now, if that dataframe is all numbers, then in effect, it’s just a huge 2-d array. Since it’s stored as a list, it’s not exactly a matrix, but can become one with as.matrix(). Keeping this in mind makes many tricks much easier to understand.
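A short sketch of the length() trap and the list-versus-matrix point. The data frame is made up.

```r
# Made-up all-numeric data frame: 4 rows, 2 columns.
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

length(df$a)   # 4: number of items in one variable
length(df)     # 2: number of columns, because df is a list of columns
dim(df)        # 4 2: rows, then columns
nrow(df)       # 4

m <- as.matrix(df)   # now it really is a 2-d array
is.list(df)          # TRUE
is.matrix(m)         # TRUE
```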

The usual summary stats include mean(), var(), sd(), median(), and fivenum(). (summary(), for a numeric, gives the same five numbers plus the mean, though its quartiles can differ slightly from fivenum()’s hinges.)
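For example, on a small made-up vector:

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # made-up numbers

mean(x)       # 6
median(x)     # 5
var(x)
sd(x)         # the square root of var(x)
fivenum(x)    # min, lower hinge, median, upper hinge, max
summary(x)    # min, 1st quartile, median, mean, 3rd quartile, max
```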

hist() for histogram (lowercase; R is case sensitive), and if you have a numeric, simple.hist.and.boxplot() gives both. That one comes from an add-on package rather than base R.

Often, barplot() will give you a strange result. It’s commonly used like this: barplot(table(x)), which graphs the summarized data, and is usually what you wanted anyway. (Note that this demonstrates the “object” nature of things: you are graphing the “results” of the table object.)

Restructure and Merge
Ah, the ever-popular transpose. SAS still wins in the “aggregate lots of rows and flip them so events are summarized rows, one per person” world. R has stack() and unstack(), but reshape() is the more powerful (and complex) option. It basically combines pieces from tapply, by, aggregate, xtabs, apply and summarise. Lots more about reshape() here. Reshape can also make tables, so consider it instead of xtabs().
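A minimal sketch of going from wide to long with the base R reshape() and stack() functions. The data frame and its score.1/score.2 columns are made up. The names matter: reshape() splits them at the "." to work out that there is one measure (score) at two times.

```r
# Made-up wide data: one row per person, one column per occasion.
wide <- data.frame(id = 1:2, score.1 = c(10, 20), score.2 = c(11, 21))

# One row per person per occasion.
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"),
                idvar = "id")
long            # columns: id, time, score

# stack() does a simpler version: all values in one column, source name in another.
stacked <- stack(wide[, c("score.1", "score.2")])
stacked         # columns: values, ind
```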

merge() combines frames, and can be used to make left, right, or full outer joins. cbind() just concatenates 2 frames if they are in the same order; I would stick with merge since you should have keys on both files anyway.
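The join types map onto merge()'s all, all.x and all.y arguments. A sketch with two made-up frames keyed on id:

```r
# Made-up frames; id 1 has no order, id 4 has no person.
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders <- data.frame(id = c(2, 3, 4), amount = c(20, 30, 40))

inner <- merge(people, orders, by = "id")                # only ids in both
left  <- merge(people, orders, by = "id", all.x = TRUE)  # keep every person
right <- merge(people, orders, by = "id", all.y = TRUE)  # keep every order
full  <- merge(people, orders, by = "id", all = TRUE)    # keep everything
```

Rows without a match get NA in the columns that come from the other frame.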

How about dropping a column? The only way I know is:
NewDataFrame <- OldDataFrame[,c(1:7,9:23)] to drop column 8. I am SURE there is a better way. And from “R for SAS and SPSS users”, we learn that you can do:

mysubset$q3 <- mysubset$q4 <- NULL

which drops q3 and q4 from your data.

This is also how I drop rows, or using something more clever like
earlydata <- data[data$year<1960,]
You can use more than one condition, a la X[Y>=A & Y<=B]. | (pipe) is an “or”, & (ampersand) is an “and”.

R experts call pulling chunks “extraction”. There is a function called subset(), and ?subset gives good info. ?Extract (capital E) is also recommended. are the best damn tips in existence.
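A sketch of row extraction both ways. The data frame yearly and its columns are made up (I avoided calling it data so it doesn't hide the data() function).

```r
# Made-up data for the example.
yearly <- data.frame(year = c(1950, 1958, 1960, 1975), sales = c(5, 8, 13, 21))

earlydata <- yearly[yearly$year < 1960, ]                        # one condition
middle    <- yearly[yearly$year >= 1955 & yearly$year <= 1970, ] # "and"
edges     <- yearly[yearly$year < 1955 | yearly$year > 1970, ]   # "or"

# subset() says the same thing without repeating the frame name.
middle2 <- subset(yearly, year >= 1955 & year <= 1970, select = c(year, sales))
```

Don't forget the trailing comma inside the brackets: the condition picks rows, and the empty slot after the comma means "all columns".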

If you just type an analysis command, basic output is given on the screen… but remember, everything is an object. Therefore, many users assign the results of an analysis to an object, which allows cleaner printing, formatting, etc.

Also, the default output for many analyses is so bare-bones as to be useless. So, get into the habit of assigning the results of a run to a variablename, and then using summary(variablename) almost immediately afterwards (or just put on same line with semicolon (;) separator).

So, lm(distance~stretch,data=elasticband) prints some basic stuff, but
elastic.lm <- lm(distance~stretch,data=elasticband); summary(elastic.lm)
gives much more info.
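The elasticband data comes from an add-on package, so here is the same pattern as a sketch on the cars dataset that ships with R (dist and speed are its two columns). Swapping datasets is my substitution, not part of the original example.

```r
# cars is built in: stopping distance (dist) against speed.
fit <- lm(dist ~ speed, data = cars)

fit                           # bare-bones: just the call and coefficients
fit_summary <- summary(fit)   # the summary is an object too
fit_summary                   # std. errors, t values, R-squared, ...

# Because everything is an object, you can pull pieces out by name.
coef(fit)
fit_summary$r.squared
```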

3 basic analysis commands to play with: summary(dataframe) gives simple attributes of all the variables (min, max, median, etc.)
cor() gives correlation matrix
lm(distance~stretch,data=elasticband) gives a linear regression (lm=linear model)

Model Syntax:
y ~ x (duh, but implicit intercept)
y ~ x+1 (explicit intercept)
y ~ 0 + x (explicit deletion of intercept or intercept set to 0; use in distance calcs or chemical creation based on time analysis)
y ~ w*x (this one FORCES the single parameters w and x, as well as the interaction! It doesn’t look it, but this is akin to y ~ w + x + w:x)

If you want actual multiplication inside a formula, wrap it in the I() function, as in y ~ I(w*x), which basically says “don’t use this confusing regression syntax, use normal math”.
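The easiest way to see what a formula really asks for is to fit it and look at the coefficient names. A sketch on made-up data (w, x and y are invented numbers):

```r
# Made-up data, just enough rows to fit four coefficients.
d <- data.frame(w = c(1, 2, 3, 4, 5, 6),
                x = c(2, 1, 4, 3, 6, 5),
                y = c(1.1, 2.3, 2.9, 4.2, 5.1, 5.8))

with_int <- names(coef(lm(y ~ x, data = d)))          # intercept plus x
no_int   <- names(coef(lm(y ~ 0 + x, data = d)))      # x only
star     <- names(coef(lm(y ~ w * x, data = d)))      # w, x and w:x
spelled  <- names(coef(lm(y ~ w + x + w:x, data = d)))
identical(star, spelled)                              # TRUE: same model

# I() switches back to ordinary arithmetic: one term, the product of w and x.
product_only <- names(coef(lm(y ~ I(w * x), data = d)))
```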

When you want a conf interval in regression, remember that it’s
b ± t*SEb; you can use qt(0.975,n-2) to get the appropriate t score for a 95% interval in a one-predictor regression.
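Here is that formula worked by hand and checked against confint(), which does the same job in one call. This is a sketch on the built-in cars dataset, which is my choice of example data.

```r
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)

b     <- coef(fit)[["speed"]]                               # the slope
se_b  <- summary(fit)$coefficients["speed", "Std. Error"]   # its standard error
tcrit <- qt(0.975, n - 2)                                   # t score for 95%

manual_ci  <- c(b - tcrit * se_b, b + tcrit * se_b)
builtin_ci <- unname(confint(fit)["speed", ])   # same interval, one call
```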

Categoricals: Crosstabs, Freqs, and other beasts

It is actually surprising how poor R’s table and crosstab functions are compared to the rest of the system.

Crosstabs come basically from table(). While it’s not a full tab system, it has lots of options. As J H Maindonald points out on page 23 of his manual: WARNING: NAs are by default ignored. The action needed to get NAs tabulated under a separate NA category depends, annoyingly, on whether or not the vector is a factor. If the vector is not a factor, specify exclude=NULL. If the vector is a factor then it is necessary to generate a new factor that includes “NA” as a level. Specify x <- factor(x,exclude=NULL)
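A sketch of the NA behaviour on a made-up vector. The useNA argument is not mentioned in the quote, but table() has it, and it saves you from caring whether the input is a factor.

```r
x <- c("a", "b", NA, "a")     # made-up data with one missing value

table(x)                      # the NA is silently dropped
table(x, exclude = NULL)      # non-factor: NA gets its own category
table(x, useNA = "ifany")     # same result, and works for factors too

# The factor route from the quote: make NA an explicit level.
f <- factor(x, exclude = NULL)
levels(f)
```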

table() gives frequencies; if you want proportions of the vector, try table(Car$Type)/length(Car$Type)

Two variables reads like table(Car$Make, Car$Type).

Visually: barplot(table(mydata$Q9, mydata$Q7), legend.text=T)

summary() = summary is a generic function used to produce result summaries of the results of various model fitting functions; when run on a vector, gives either freq count for factors or means/etc. for numerics.

table() = table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels. Doesn’t have lots of options. Can do some labeling, and can run 2d chi-square

xtabs() = more advanced cross tab, uses formula notation.
ftable() = “flat” contingency tables.

You can also do stuff like prop.table() to see proportions, and addmargins() to include marginals. tapply() can be used to fake some of this manually, but it’s a lot of work.
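Pulling the table functions above together in one sketch. The Car data frame here is made up to match the Car$Make and Car$Type names used earlier.

```r
# Made-up data matching the Car$Make / Car$Type examples.
Car <- data.frame(Make = c("Ford", "Ford", "Fiat", "Fiat", "Fiat"),
                  Type = c("Small", "Large", "Small", "Small", "Large"))

counts <- table(Car$Make, Car$Type)    # two-way frequencies

type_props <- prop.table(table(Car$Type))  # same as table()/length()
row_props  <- prop.table(counts, margin = 1)  # proportions within each Make
with_sums  <- addmargins(counts)           # adds "Sum" row and column

xt <- xtabs(~ Make + Type, data = Car)     # formula version of the same table
ftable(xt)                                 # flat layout, handy with 3+ variables
```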

There is a CrossTable() function in the gregmisc package (now gmodels, I think; Greg split them out, but you also need gtools and gdata for it to work).

There is also crosstab() function in the ecodist package. xtabs() in the stats package is a simple solution, but package reshape is much more flexible (and complex and powerful).

ctab() is in the catspec package by John Hendrickx. This will do multi-way tables with summary row/col statistics. is the official doc page.

There is also the CrossTable() function in the ‘gmodels’ package on CRAN, though this will only do one way and two way tables, with summary row/col statistics.

For simple count with multi-way output, try ftable(). summary() gives range, mean, quantile data for continuous variables, and you can potentially format this to your liking.

Note that contrary to postings on r-help, help.search("crosstabulation") and help.search("crosstab") do nothing. help.search("cross") gives some help… but not much.

Simulating the SPSS CROSSTABS procedure: (adaptation of Gregory R. Warnes’ CrossTable() function in the gregmisc package)

This custom function is really, really nice. To add it to R permanently, you have to edit files like the system startup file $R_HOME/library/base/R/Rprofile or the .Rprofile in the current directory or in the user’s home directory.

Another one worth looking at is Thanks,!

There is also some info in Kickstarting R
with a special function

Interestingly enough, I posted about a multiple response question on the R-help group and got a very clever response.

Let’s say you have data like this:
id att1 att2 att3
1 1 1 0
2 1 0 0
3 0 1 1
4 1 1 1

People have checked some of 3 attributes. I want to know which hang together. Look at this clever approach using crossprod() (a matrix multiplication function: crossprod(X) is t(X) %*% X) to get the co-occurrence numbers, drop the diagonals, and then convert to percentages:

ratings <- data.frame(id = c(1,2,3,4), att1 = c(1,1,0,1), att2 = c(1,0,1,1), att3 = c(0,0,1,1))
tab <- crossprod(as.matrix(ratings[,-1]))
tab <- tab - diag(diag(tab)) # drop the diagonal
tab # pretty, isn't it?
tab / nrow(ratings) # divide by the number of people to get proportions


Consider str() over summary() when examining a data frame?

Task Views show combinations of packages to solve certain areas of statistical problems (AI, Econometrics, etc.)


* * *


R Packages · 5221 days ago, Analysis

(First off, if you found this page via a web search or bookmark, you may be much happier in the R Section of this site to see the multiple articles about R, including this one, but also about Packages, Data manipulation, etc.)

Packages are bundles of additional functionality. They can be analyses, datasets, or just tools. For the Unix side, they come as source code and get compiled on your system. For Windows, the R team has pre-compiled many of them, but sometimes they don’t work. (All together now: It’s open source. Get over it.)

Like CPAN is the home of all add-ins for Perl, CRAN is the home for all add-ins (packages) for R. While there are a few here and there not mirrored on CRAN... assume CRAN is the best place to start.

What’s on CRAN? Check out

library() lists what’s available on your current box
search() lists what’s loaded
library(packagename) loads it in
detach("package:packagename") unloads it
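A quick sketch of the load and unload cycle, using stats4 because it ships with R and so is always installed. In a script, .packages(all.available = TRUE) is a handy stand-in for library(), since it returns the names instead of opening a listing.

```r
available <- .packages(all.available = TRUE)  # what's installed on this box
head(available)

library(stats4)                                    # load it
loaded_before <- "package:stats4" %in% search()    # TRUE: it is on the search path

detach("package:stats4")                           # unload it
loaded_after <- "package:stats4" %in% search()     # FALSE
```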

Adding a package? First you have to get it, using install.packages(name); note the plural. Then, to activate it, use JGR’s package manager, or type the commands below; the path below is your default library dumping ground, but feel free to put your favorite path. Remember, the names are case sensitive, so Hmisc needs to be spelled this way.
install.packages(c("RODBC"),"C:/Program Files/R/rw2001/library");.refreshHelpFiles()

Don’t forget to either or .refreshHelpFiles() to refresh the help files and indexes; most of the packages include some these days.

You can set your nearest CRAN to be the default for your session (or put it in a startup file):
options(CRAN = "")
Then simply say
install.packages("foo")
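On current versions of R the option is called repos rather than CRAN, so the session default is set as in the sketch below. The mirror URL is the generic cloud mirror, and "foo" is just a placeholder package name, which is why the install line is commented out.

```r
# Set the default CRAN mirror for this session (or put this in a startup file).
options(repos = c(CRAN = "https://cloud.r-project.org"))

getOption("repos")[["CRAN"]]   # check what is set

# install.packages("foo")      # "foo" is a placeholder, not a real package
```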

If you’ve already downloaded the zip with the binary package for Windows, then argument pkgs can also be a character vector of file names of zip files if CRAN=NULL. The zip files are then unpacked directly.
install.packages(c("C:/Downloads/Downlods/R System/"),CRAN=NULL)

Packages can be removed in a number of ways. From a command prompt they can be removed by just deleting the package directory, or
remove.packages(c("pkg1", "pkg2"), lib = file.path("path", "to", "library"))

as in : remove.packages(c("DBI"))

Or just delete the directory. I have no idea if the help files are properly removed as well; perhaps run the refresh commands mentioned above to remove the un-needed help files.

Updating Packages?
summary(packageStatus()) lets you see what is new and not.
update.packages() walks through each new one to let you upgrade it.

x <- packageStatus(); print(x[["inst"]]["Status"])

Default Packages:

boot = Bootstrap functions, including some sample data
class = Classification, very handy, including k-nearest-neighbor and SOMs
cluster = Cluster analysis including plots, Clara/Diana/Agnes large data techniques
datasets = Tons of datasets for sample analyses
foreign = translators for Minitab, SPSS, S3, SAS, DBF, etc.
graphics = all the basic plots and some clever ones; lattice has more advanced ones
grDevices = Control over graphic display devices
grid = Low level graphics control, underlies lattice
KernSmooth = Kernel Smoothing algorithms (Kernel Density Estimate, etc)
lattice = Powerful visualization package, similar to the Trellis package from S-Plus; requires the grid package
MASS = Venables and Ripley’s MASS, including datasets, analyses, and examples linked to their book. Lots of good “utility” analyses here.
methods = Package to deal with R internals and programming
mgcv = GAMs with GCV smoothness estimation and GAMMs by REML/PQL (GAM = Generalized Additive Model)
nlme = Linear and nonlinear mixed effects models
nnet = Feed-forward Neural Networks and Multinomial Log-Linear Models, handy for categorical data analysis
rpart = Recursive Partitioning and Tree building. Handy for categorical analysis.
spatial = Kriging and Point Pattern Analysis. I have no idea what this does, so worth investigating. I assume it’s a geo-spatial analysis approach
splines = Regression Spline Functions and Classes
stats = All the stats you ever wanted, from Anovas to weighted means, and lots of stuff in between.
stats4 = Statistical functions using S4 classes. Looks like wrappers around the more advanced stat calculations
survival = Survival analysis (Cox model, etc.), including penalised likelihood. Useful for decay analyses. Includes some sample data
tcltk = Tcl/Tk Interface, a GUI toolkit popular on Unix but less accessible on Windows (hence the drive towards JGR and other “more cross platformy” approaches)
tools = a mixture of random stuff, more useful for R programmers than users
utils = a mixture of random stuff, but actually handy things. Worth reviewing the list of things here for quick saves.

VCD, “Visualizing Categorical Data” has been mentioned as a great package for data viz. has the JGR packages

Finally, for the ever popular clustering of binary data:

?dist (method=“binary”)
For distance based clustering methods see
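A sketch of binary-distance clustering in base R. The 0/1 answers matrix is made up, and hclust() with average linkage is simply one distance-based method picked for illustration.

```r
# Made-up 0/1 data: four people, three yes/no attributes.
answers <- rbind(p1 = c(1, 1, 0),
                 p2 = c(1, 0, 0),
                 p3 = c(0, 1, 1),
                 p4 = c(1, 1, 1))

# Binary distance: of the attributes where at least one person said yes,
# the share where only one of them did.
d <- dist(answers, method = "binary")
d

tree   <- hclust(d, method = "average")  # hierarchical clustering on d
groups <- cutree(tree, k = 2)            # cut the tree into two clusters
groups
```

p1 and p2 share one attribute and disagree on another, so their distance is 0.5. p2 and p3 share nothing, so theirs is 1.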

Recent Discovery:
sqldf is an R package for performing SQL select statements on R data frames, optimized for convenience.

It consists of a thin layer over the R packages RSQLite and RMySQL. (The code for accessing RSQLite has been tested but the code for accessing RMySQL has only been partly tested and only in the development version of sqldf). More information can be found from within R by installing and loading the sqldf package and then entering ?sqldf. A number of examples are at the end of this page and more examples are accessible from within R in the examples section of the ?sqldf help page.

So, for those times when you know exactly how the transform should go in SQL, but you don’t know all the R tricks to get it there… sqldf.
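A hedged sketch of what that looks like. The sales data frame is made up. Since sqldf is an add-on package, the code checks whether it is installed and otherwise falls back to a base R aggregate() that gives the same totals.

```r
# Made-up data for the example.
sales <- data.frame(region = c("east", "west", "east", "west"),
                    amount = c(10, 20, 30, 40))

if (requireNamespace("sqldf", quietly = TRUE)) {
  # The data frame is referred to by name, as if it were a table.
  totals <- sqldf::sqldf(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region")
} else {
  # Base R equivalent, used only when sqldf is not installed.
  totals <- aggregate(amount ~ region, data = sales, FUN = sum)
  names(totals)[2] <- "total"
}
totals
```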

Another good one: sqlitedf. Basically, this replaces your in-memory dataframe with a SQLite-backed version, allowing much larger data. As G. Grothendieck, the author of sqldf, pointed out in a comment, this doesn’t give you access to SQL itself, but can help you deal with larger datasets while staying in an R context and syntax.

Update: 2/6/2008: ff is a very exciting package that got its first big show at the 2007 user conference.
The ff package: “Handling Large Data Sets in R with Memory Mapped Pages of Binary Flat Files”. What’s great about it is that it appears to work without changing lots of R’s insides.

Andy Edmonds on the Web Analytics Group suggested highlighting the ODBC and SQLite connectors. Getting data in and out of databases and other tools is pretty important. Did you know you can control Excel through ODBC? And SQLite is a very small database that you can use when you just gotta do something in SQL that you can’t do easily in R (multi-dataframe joins, etc.). RODBC and RSQLite are good places to start.


* * *


R Links to examine... · 5236 days ago, Analysis

(If you’ve found this via searching, you may enjoy the entire series of R articles, found via the navigation link on the right, R Statistical System. These are all in “somewhat random notes” style, but they’ve been helpful to me in the past. Feel free to ping me with updates or suggestions.)

?Extract talks about how to pull out specific parts of a matrix/grid
Probably should move these to the R Wiki Links page

Besides all the fun stuff below, there is an R Wiki and the R Graph Gallery. The wiki is really weak and needs all the love it can get, while the Gallery is really eye-opening and could be one of the best things in the R Community (next to the R for SAS and SPSS Users book by Robert Muenchen). Additional graphing info can be seen at the RGraphExampleLibrary Package List.

There is a shell of a book at Wikibooks, and this could probably use some help as well.

Search Engines Dedicated to R
R-Seek has some nice features, including different tabs for different results types.

R MultiSearch gives 3 search boxes, one for Google, Swiki, and Rollyo, all aimed at R sites. BTW, this one searches Baron’s site (see below)

R Site Search by Jonathan Baron. Pretty good, but getting a bit long in the tooth. Includes most R documents, functions, and R-help mail archives including R-help, R-sig-geo, and R-sig-mixed-models messages. was ok, but gone already…

The Fantastic R for SAS and SPSS Users.

From Vanderbilt, R and S-Plus Packages, Functions, and Documentation including the very nice An Introduction to S and the Hmisc and Design Libraries by CF Alzola and FE Harrell (PDF).

Gmane has a collection of links to other lists, including as well as the overall listing,

In some ways, Nabble beats Gmane for ease of use. Check all of the Nabble Statistics Archive Lists including PSPP and R, and look at the combined R lists at

Data Mining With Rattle and R Book gives both good stats stuff, and tells how to use the Rattle GUI which is a cool little tool. This is an entire book, but a bit hard to read page to page because of too many “splits” of content.

Data Mining with R: learning by case studies by Luís Torgo is an entire book on data mining with R, including code. Not updated since 2002.


Quick-R is a good quick intro to R.


Using R for psychological research: A simple guide to an elegant package is a pretty nice start, pretty quick.

R code for graphics from the book R Graphics by Paul Murrell. In a nice gesture, he shows the graph with the code, which is very cool.

Producing Simple Graphs with R is a nice page for making graphs.

Handy list of commands to keep around: The R Reference Card by Tom Short and the rpad group. (Actually, this whole site has some interesting things, like docs with examples which run “in the browser” and other useful links). Other ref cards are linked at Baron’s page:

Probably not a big deal
R help for Statistics 371 gives some small tips, scripts. No biggie.
Applied Econ: e-Tutorial 2: A Brief Introduction to R also has some small tips, scripts.
R/S-Plus links

Still not sure what to do with via via

Slightly more advanced Notes on using R (PDF)

From the R Mailing List:
[R] Create a new var reflecting the order of subjects in existing var

Things I always forget about R now at

Programming in R last updated Jan 2007
Both by Tom Lumley, an R core developer. The PDF is 208 pretty impressive pages.

Data Mining With Rattle and R: Open Source Desktop Survival Guide More good stuff from Togaware.

Colors in R

R by Example Some good little snippets, last updated 2005

Quick Tutorial in Time Series with R. The whole site is pretty good: if you want to learn about Time Series analysis.

A Quick, Painless Tutorial on the R Statistical Package So, the quick and painless one is out of date, and the up to date one is really aimed at programmers, meaning it is neither quick nor painless.

From SPSS to R on the R Wiki.

R Cookbook Pretty impressive, so of course, it’s gone. He still has a short course.

R Tutorial by Kelly Black

Supposedly, this textbook of statistics with R is pretty good.

R cheatsheet from 1998 I guess things don’t change all that much; most of this still works.

R software by Henrik Bengtsson

Programming in R is a very programmer oriented discussion, but pretty handy.

Using R for Psychological Research

SQLDF for R This clever little tool lets you use SQL to manipulate data frames. Think of SAS’s Proc SQL. A somewhat similar project is sqlitedf SQLite Data Frames for R, which has not been updated since Dec 2007.

R and Postgres at Google Code

Cluster Validation with R

gsubfn is an R package used for string matching, substitution and parsing.

batchfiles Windows batchfiles for use with R Appears to be ways to start R adding proper paths, etc.

Rattle (the R Analytical Tool To Learn Easily) provides a simple and logical interface for data mining. and also at


Rory Winston’s Blog, The Research Kitchen, has some great samples of really optimized R code. Good to learn from.

Princeton Guide:



Apropos of nothing, this is really cool:


* * *


R Commentary · 5352 days ago,

Just links to a few posts I’ve made about R outside of the useful notes in this section.

R Graph Gallery
R doesn’t want ‘newbies’... and that’s a mistake.
A very early post: R: They just left the interface out…


* * *

