Analyzing Statistics with GNU R
by Kevin Farnham
The R Project for Statistical Computing is an open source language and programming environment that provides a wide variety of data manipulation, statistical analysis, and data visualization capabilities. R is a successor to the S software originally developed at Bell Laboratories. R is a GNU project and is freely available under the terms of the GNU General Public License. The software runs on Windows, Mac OS, and a wide variety of Unix platforms.
Even for people who aren't expert statisticians, the power of R is alluring. Working interactively or from an R script, a user needs only a few lines of code to perform complex analyses of large data sets, produce graphics depicting the features and structure of the data, and quickly answer questions about the data. This article introduces R and demonstrates a small slice of its capabilities, using data from the stock market and real estate industry as input.
R User Interface Basics
I developed and ran the demonstrations in this article on a Vision Computers PC running Debian Linux (Sarge version, "stable"). I installed R Version 2.1 using the standard Debian Advanced Package Tool (APT) utility.
To run R interactively, open a terminal window, create a work directory, enter the directory, and execute R:
Figure 1. The R command window
R's command language takes some getting used to, but knowing just a few commands provides enough knowledge to perform basic statistical analyses and generate plots for any data set that can be formatted into a one-, two-, or three-dimensional rectangular grid.
R commands can be lengthy, because most functions have many optional parameters. Conveniently, the up and down arrow keys in the R command window recall previous command-line entries, which you can then edit. This lets you work toward your data analysis goal incrementally by modifying and extending previously entered command lines.
Data Import and x/y Line Plotting
A simple starting point is a vector or array of numbers. I downloaded historical Standard and Poor's (S&P) 500 stock index data and wrote a Perl program to convert the data into a simple two-column table. Figure 2 shows a snippet of the start of the file.
Figure 2. The SP500.txt file format
The read.table() function reads an external file into an R data frame variable. To read the S&P 500 data file and make a simple plot of the close price over time, enter the following commands at the R prompt (>):
> dv <- read.table("./SP500.txt", header=1)
> sp500value <- dv[,2]
> plot(sp500value, type="l")
The first line reads the file SP500.txt into the data frame variable dv. The header=1 setting tells R that the file begins with a one-line header of nondata text, which R uses for the column names rather than reading it as data. The data frame dv can be indexed like a two-dimensional array. The first index identifies the row number (starting with 1), and the second identifies the column number (starting with 1).
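The indexing rules can be tried without the original data file. The following is a minimal sketch that assumes a two-column layout like the one described for SP500.txt. The column names (Date, Close) and the three rows are invented purely for illustration and are not taken from the real file.

```r
# Hypothetical stand-in for SP500.txt: a header line plus three made-up rows.
sample_text <- "Date Close
2005-01-03 100.5
2005-01-04 101.25
2005-01-05 99.75"

# header = TRUE treats the first line as column names instead of data.
# The text= argument reads from a string rather than from a file on disk.
dv <- read.table(text = sample_text, header = TRUE)

dim(dv)       # number of rows, then number of columns
dv[1, 2]      # row 1, column 2: a single value
dv[, 2]       # every row of column 2, returned as a numeric vector
dv$Close      # the same column, selected by its header name
```

Leaving the row index empty, as in dv[, 2], selects all rows, which is how the article's second command pulls out the whole close-price column.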
The second command line assigns the values in the second column of the data frame to the variable sp500value. The third line tells R to plot these values, using a line (type="l") to connect the data points. The result appears in a new window (Figure 3).
Figure 3. The S&P 500 close price plot
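When plot() receives a single numeric vector, it plots the values against their position in the vector (1, 2, 3, and so on). The sketch below illustrates this with a few made-up numbers standing in for the close-price column, and writes the plot to a PDF file instead of a window so that it also works in a non-interactive script. The values, axis labels, title, and output file name are all assumptions chosen for illustration.

```r
# Hypothetical values standing in for the S&P 500 close-price column.
sp500value <- c(100.5, 101.25, 99.75, 102.0, 103.5)

# Send graphics output to a PDF file in the session's temporary directory.
out_file <- file.path(tempdir(), "sp500_sketch.pdf")
pdf(out_file)

# type = "l" joins the points with line segments; the x axis is the index
# of each value in the vector.
plot(sp500value, type = "l",
     xlab = "Observation number", ylab = "Close price",
     main = "Illustrative line plot")

# Close the graphics device so the file is completely written.
dev.off()
```

In an interactive session, omitting the pdf() and dev.off() calls displays the plot in a window, as in the article's example.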