When doing math or numerical analysis, knowledge of the technique is far too often tied to the tool performing the calculation. Consider an engineer whose understanding of the Fast Fourier transform is inseparably tied to the fft function in Matlab. Of course this hypothetical engineer understands what the results mean (more or less) but may not be able to duplicate his analysis if Matlab were taken away.
In most cases, it is likely that no deeper understanding will be required. But what happens if the computer makes a mistake? Or the program becomes unavailable? Both situations are entirely possible. Computer algorithms aren’t perfect and occasionally arrive at results that make little sense; and hardware has been known to fail.
When the engineer understands how the computer arrived at the answer, however, he can recognize, understand, and ultimately correct those cases where the results are unexpected. This is an important reality check that can prevent costly disasters later down the line. Or, if the hardware is unavailable, he can use an alternative tool or software package to duplicate the analysis.
But while such a situation can arise with any type of numerical software, it’s most likely to happen to users of a statistical package. I find this extremely ironic, since a proper understanding of statistics is essential to living in the modern world. (Much more so than an understanding of the Fast Fourier transform, at any rate.) The rules of probability, the normal curve, correlation, and multivariate statistics can have a direct impact on how we live our lives. They are used in making important decisions in finance, medicine, science and government. A misunderstanding of stats and the methods of science (from which statistics is inseparable) underlies the most divisive issues of our day: abortion, stem cell research, and global warming.
Moreover, neither side has a monopoly on ignorance or misunderstanding. People fail to distinguish between correlation and causality, or insist on using the word “average” as a slur. Nearly as bad are those who, like the hypothetical engineer described above, only understand statistics within the narrow context of their stats package. Casual statisticians are nearly as dangerous as the wholly uninformed.
The Statistical Package for the Social Sciences (SPSS) is one of the biggest perpetrators of this crisis. Which is hugely ironic, because I happen to love SPSS. SPSS is probably the first statistical package to place advanced statistical methods within the grasp of the novice user. I’ve been a happy user for nearly a decade (ever since I was introduced to the program in high school). But there is no doubt that I’ve come to understand statistics within the context of SPSS and its GUI.
Please don’t misunderstand me: I have a pretty good grasp of basic statistics. I can sling probability with the best of them and relish describing when to use Fisher’s exact test instead of a Chi-Square; but advanced statistics are a completely different matter. Advanced stats scare me. I can certainly use these more complicated methods. I’ve analyzed and written about multivariate models and even ventured into Analysis of Variance (ANOVA). But I have to rely on SPSS and the aid of my institution’s biostatistician to help me recognize when there is a problem.
Which is why, in a time of tight budgets, losing the institution’s SPSS license has been a crushing blow to my productivity. (Whoever made that decision should be hauled out and shot!) Because I don’t have my statistics software any more, there are certain aspects of my job that are much more difficult to do. And unfortunately, there is only one logical conclusion to draw: I’ve become a victim of the statistical ease of SPSS.
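The basic tests, at least, can still be worked by hand when the package is gone. As a minimal sketch of the Fisher-versus-Chi-Square choice mentioned above, the following plain Python (not SPSS or R) computes a two-sided Fisher’s exact test for a 2x2 table from the hypergeometric distribution. The function names and the example table are made up for illustration, and the “expected count below 5” cutoff is the common rule of thumb for preferring the exact test, not a hard law.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts for a 2x2 table if rows and columns were independent."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    # The tiny tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical 2x2 table: rows are two groups, columns are outcome yes/no.
table = [[3, 1], [1, 3]]

# Rule of thumb (an assumption, not a law): small expected counts favour Fisher.
prefer_fisher = any(e < 5 for row in expected_counts(table) for e in row)
p_value = fisher_exact_two_sided(table)

print("Prefer Fisher's exact test:", prefer_fisher)
print(f"Two-sided p-value: {p_value:.4f}")
```

With only eight observations every expected count is 2, so the Chi-Square approximation is shaky and the exact test is the safer choice. Nothing here goes beyond counting combinations, which is exactly the sort of calculation that is easy to forget once a menu item does it for you.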
Open Source Alternatives
I went through a similar experience about a year ago. At the time, I had become increasingly frustrated with the restrictions, licensing fees, and limitations of the Matlab technical computing language. After one particularly infuriating meeting, I decided that I had had enough and was going to do something about it. In the months that followed, I spoke with friends and colleagues, and experimented with every alternative I could get my hands on. I looked at Octave (the “Open Source Matlab”) and Ruby, before eventually settling on a combination of Python and PyQt to meet my needs. The result of these changes has been tremendously positive. Python is both easier to use and far more powerful than Matlab could ever hope to be. Not only am I happier and more productive, but so are those who work with me.
It is, therefore, logical that when I lost my statistical language of choice, I would look to open source to provide an alternative. Fortunately, the Open Source community delivers not one alternative to SPSS, but two: GNU PSPP and R.
GNU PSPP
As the name implies, PSPP has one simple goal: to clone SPSS in every way that matters. It can perform descriptive statistics, t-tests, linear regression and non-parametric tests. It has an easy to use and relatively intuitive GUI. It can use SPSS syntax and read SPSS data files. It supports an obscene number of variables and cases (about a billion). It interoperates with Gnumeric and OpenOffice. Finally, it’s fast.
Aside from its horribly ugly icon, PSPP would appear to deliver exactly what I want and need. Except, you might have noticed that this article is titled “Statistics with R”, not “Statistics with PSPP”. Obviously, I chose to go with the second alternative. But why?
PSPP works as advertised. I found it able to deal with nearly all of the old SPSS data files and syntax that I threw its way. But the program suffers from the problem of all clones everywhere: its greatest aspiration is to be a copy of something else. That is to say, it seeks to be “Good Enough”, and therein lies the problem. I don’t want a tool that is good enough. I want to use excellent software, even if it’s different or requires me to learn new things. Even if I have to pay for it.
I’m not trying to pick on or be unfair to PSPP. It meets an important need in the free software landscape. It just doesn’t fit in with my desires or preferences very well.
The R Statistical Project
This is where R steps into the picture. Whereas PSPP is “aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data (emphasis added)”, R is the software that most statisticians actually use. I contacted the statistician at my institution to ask, “What statistical software should I use? I’m looking at R and PSPP.”
He responded, “Oh that’s easy. Use R. There will be a learning curve, but it’s much more powerful and capable than even SPSS or SAS.”
As I’ve started to explore the feature set and available modules, it has quickly become apparent why. R is a huge language. There are thousands of packages that cover every type of statistics I’ve ever heard of, and many more I haven’t.
Even better, people have gone to great lengths to incorporate R into other tools. It has a set of excellent Python bindings and interoperates very well with LyX and LaTeX. As just a single example, using Sweave, you can embed R code in reports and other documents that need to be updated on a very frequent basis. This allows these publications to be generated on demand with the most recent data. The only other place I’ve seen the equal of this feature is within the proprietary universe of Microsoft Office and SQL Server.
Easing Into R
Indeed, if R can be said to have a major weakness, it would be that it is too full featured and capable, particularly for someone who is a statistical novice. The sheer number of packages and options available is absolutely overwhelming. Moreover, the reference material is scattered across many sources. As with other open source tools, you can find answers to your questions; but you need to be intelligent about how you ask them.
As I’ve explored R, there have been quite a few painful moments. This isn’t because R is more difficult than SPSS or SAS, but rather because it is tremendously different. As an example, consider the differences in how the programs work with data.
Both SPSS and SAS use one main data structure, the data set. A data set can be thought of as a big spreadsheet where the variables are kept in columns and the individual observations are kept in rows (SPSS calls them cases). In contrast, R uses many types of data structures. It has a two-dimensional structure that is similar to the data set (the data frame), but it is also possible to use one-dimensional arrays of data (vectors), or three-dimensional arrays (which might contain extremely complex data). The added options raise very complicated questions: What sorts of statistical calculations are done on three-dimensional arrays? Why are they necessary? How do I need to code my data so that I can take advantage of R’s advanced features?
Those aren’t issues that a user of SPSS or SAS even needs to consider. But with R, they present themselves before you even begin to use the program; and there is no centralized source of information to help you figure out the answers. The result of too many options and too little information is unproductive agony.
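To make the three shapes of data described above a little less abstract, here is a rough sketch in plain Python rather than R. It only mimics the ideas behind R’s vector, data frame, and array using lists and a dictionary; the variable names and all of the numbers are invented for illustration. It also hints at one possible answer to the three-dimensional question: repeated measurements, where each case has several variables recorded over several visits.

```python
from statistics import mean

# One dimension: a single variable measured on five cases (a "vector").
weights = [61, 70, 59, 82, 75]

# Two dimensions: variables in columns, cases in rows, like an SPSS data set.
# Each key is a variable name; position i in every list belongs to case i.
dataset = {
    "age": [34, 29, 41, 52, 38],
    "weight": weights,
}

# Three dimensions: cases x variables x visits (hypothetical repeated measures).
# readings[case][variable][visit]; variable 0 is systolic, 1 is diastolic.
readings = [
    [[120, 118, 116], [80, 79, 78]],  # case 1, three visits
    [[135, 130, 128], [88, 85, 82]],  # case 2, three visits
]

# Summarising a vector gives one number.
mean_weight = mean(weights)

# Summarising a data set column by column gives one number per variable.
column_means = {name: mean(values) for name, values in dataset.items()}

# Collapsing the third dimension (visits) turns the 3-D array back into
# an ordinary cases-by-variables table of per-case averages.
per_case_means = [[mean(visits) for visits in case] for case in readings]

print("Mean weight:", mean_weight)
print("Column means:", column_means)
print("Per-case means over visits:", per_case_means)
```

The point of the sketch is that a three-dimensional array is usually not analysed directly. Some dimension gets summarised or sliced away first, and what remains is the familiar rows-and-columns table that an SPSS user already knows how to handle.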
Series Introduction
Which is why I decided to write this series. R does some spectacular things; it is excellent software. But there are some things you need to know before using it. Here are just a few examples:
- What user interfaces are available for R and which should you use?
- How can you use R to summarize data and do basic comparisons?
- How does R handle moderately advanced statistics like one-way ANOVA and non-parametric tests? What about multivariate statistics and regression analysis?
- How does R work with other programs? How can you format your output into publication quality figures?
- What support do LyX and LaTeX offer for users of R?
The purpose of these articles is to address the concerns of the novice statistician or scientist. I will try to avoid jargon and other indecipherable terms. I will ensure that the examples are interesting and relevant. But most importantly, I will try to help build a deeper statistical foundation. I know exactly what it feels like to be the “hypothetical” engineer who has become too reliant on his tools. As long as the tool is nearby, you’re fine. But when that tool is taken from you, be prepared for a world of hurt.
Until fairly recently, I’ve been in that world of hurt. You might just say that these articles are my way of explaining how I got out.
Comments
3 Responses to “Statistics With R – Part 1: An Old Dog Learns New Computing Tricks”
One other open-source statistics possibility is gretl, which has a GUI.
It is important to emphasize that R is more than just a statistical package; one can programme all sorts of things. Personally, I use it for producing great plots, some of which I struggle to do in other ways. One of my favourites is the triangular plotting from the plotrix library/package, which can be used to plot flammability charts, and I suspect phase diagrams, distillation maps etc. There are probably about half a dozen other packages covering triangular/ternary plotting, but I have chosen to use plotrix.
@Stephen: Thank you for the wonderful comment. When I was looking at alternatives, this was actually very important to me. When dealing with technology, it’s important to know what kind of dividend you can expect to receive. For a relatively small investment, R has a great return. Much better than PSPP, in my opinion.
@Livui, thank you for mentioning gretl. I will have to take a look at it.