The R Project for Statistical Computing is an open source language and programming environment that provides a wide variety of data manipulation, statistical analysis, and data visualization capabilities. R is a successor to the S software originally developed at Bell Laboratories. R is a GNU project and is freely available under the terms of the GNU General Public License. The software runs on Windows, Mac OS, and a wide variety of Unix platforms.
Even for people who aren't expert statisticians, the power of R is alluring. Working interactively or using an R script, with just a few lines of code a user can perform complex analyses of large data sets, produce graphics depicting the features and structure of the data, and perform statistical analyses that can quickly answer questions about the data. This article introduces R and demonstrates a small slice of its capabilities, using data from the stock market and real estate industry as input.
I developed and ran the demonstrations in this article on a Vision Computers PC running Debian Linux (Sarge version, "stable"). I installed R Version 2.1 using the standard Debian Advanced Package Tool (APT) utility.
To run R interactively, open a terminal window, create a work directory, enter the directory, and execute R:
Figure 1. The R command window
R's command language takes some getting used to, but just a few commands are enough to perform basic statistical analyses and generate plots for any data set that can be formatted into a one-, two-, or three-dimensional rectangular grid.
R commands can be lengthy, because most functions have many optional settings parameters. So it's very convenient that the up and down arrow keys in the R command window recall previous command-line entries, which you can then edit. This lets you work toward your data analysis goal incrementally by modifying and extending previously entered command lines.
A simple starting point is a vector or array of numbers. I downloaded historical Standard and Poor's (S&P) 500 stock index data and wrote a Perl program to convert the data into a simple two-column table. Figure 2 shows a snippet of the start of the file.
Figure 2. The SP500.txt file format
The read.table()
function reads an external file into an R data frame variable. To read the S&P 500 data file and make a simple plot of the close price over time, enter the following commands at the R prompt (>
):
> dv <- read.table("./SP500.txt", header=1)
> sp500value = dv[,2]
> plot(sp500value, type="l")
The first line reads the file SP500.txt into the data frame variable dv
. The header=1
setting identifies that the file has a one-line header consisting of non-data text, which R must ignore. The data frame dv
is a two-dimensional array. The first index identifies the row number (starting with 1), and the second identifies the column number (also starting with 1).
The second command line tells R to assign variable sp500value
to the values in the second column of the data frame. The third line tells R to plot the index data using a line (type="l"
) to connect the data points. The result appears in a new window (Figure 3).
Figure 3. The S&P 500 close price plot
When you pass only one vector of data to the R plot()
function, it makes an x/y plot using the point index as the x value and the specified data vector points as the y value. Because the data file lists the data in historical order, the plot shows the value of the S&P 500 index over time.
That's not bad for three lines of code! However, you can make a prettier, better annotated plot easily by using the plot()
function's optional arguments. For example, to add a title, a subtitle, and axis labels, enter:
> plot(sp500value, type="l", main="S&P 500",
sub="[3-Jan-1995 thru 29-Jul-2005]",
xlab="Day", ylab="Value")
This produces the plot shown in Figure 4.
Figure 4. An annotated S&P 500 close price plot
Proceeding further, you could use R's date
and axis
classes to produce an x-axis that uses the dates stored in column 1 of the data frame for labels. See the R documentation for details.
After generating a plot, R provides options for adding new data. The 90-day moving average is plotted on stock index graphs published in the Wall Street Journal. A moving average is the average value of the preceding n data items. How about displaying the moving average of the S&P 500 over the preceding 90 days?
In R's nomenclature, a "moving average" is a "filter" (an equation) applied to a "time series" (the S&P 500 index values). R's filter()
function is complex, providing many different data-processing options. Fortunately, the actual commands for creating a 90-day moving average data set are tiny compared with what standard programming languages might require:
> coef90 = 1/90
> ma90 = filter(sp500value, rep(coef90, 90), sides=1)
The first line defines a weighting factor for the data in the filter: each day's S&P 500 value will represent 1/90 of the moving average. The second line creates the moving average data set. The rep()
function "repeats" the 1/90 coefficient 90 times (including 90 days of S&P 500 data in the moving average). The sides=1
parameter specifies to include only the trailing data points in the moving average (which is how financial moving averages are always calculated, because we cannot foresee the future).
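The trailing-average behavior is easy to verify: with sides=1 and a filter of 90 equal weights, the first 89 entries of the result are NA (there isn't yet 90 days of trailing data), and entry 90 is just the plain mean of the first 90 values. A minimal sketch with a synthetic price series standing in for the real S&P 500 data:

```r
# Synthetic random-walk "prices" standing in for sp500value
prices <- cumsum(rnorm(200)) + 100

coef90 <- 1/90
ma90 <- filter(prices, rep(coef90, 90), sides = 1)

is.na(ma90[89])                                    # TRUE: not enough trailing data yet
all.equal(as.numeric(ma90[90]), mean(prices[1:90]))  # TRUE: trailing 90-day mean
```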
Add the moving average data (variable ma90
) to the existing plot as green line using the R lines()
function:
> lines(ma90, col="green")
Figure 5 shows the result.
Figure 5. S&P 500 close price and 90-day moving average

Histograms are plots that show the distribution of a set of values. It's easy to use R to look at the distribution of daily gains (and losses) for the S&P 500 in the sample data set.
First, you must calculate the daily percent change for each day. You can't do this for the first day in the data set, so the size of the percent gains array will be one unit less than the size of the sp500value array. Here's one method for directing R to create the daily percent gain array, based on knowledge (gained using a Unix wc
command) that the data set consists of 2,664 total points:
> yesterday = sp500value[1:2663]
> today = sp500value[2:2664]
> changePercent = 100 * (today / yesterday - 1.0)
Here, the new variable yesterday
is the set of S&P 500 values (excluding the last day), and the new variable today
is the set of S&P 500 values offset by one day (excluding the first day). Hence, the two variables are aligned such that yesterday[i]
and today[i]
represent yesterday's and today's S&P 500 prices. The third line then applies an equation to these aligned vectors: it calculates the percent by which the S&P 500 index changed from yesterday to today.
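The same shift-and-align calculation can be written without hard-coding the data set's length, using R's diff() and head() functions. A minimal sketch with a short synthetic series (the real analysis uses the 2,664-point sp500value vector):

```r
# Short synthetic series standing in for sp500value
v <- c(100, 101, 99.5, 102)
n <- length(v)

# The article's approach, with the length computed rather than hard-coded
yesterday <- v[1:(n - 1)]
today     <- v[2:n]
changePercent <- 100 * (today / yesterday - 1.0)

# Equivalent: diff() gives today - yesterday; head(v, -1) drops the last element
changePercent2 <- 100 * diff(v) / head(v, -1)
all.equal(changePercent, changePercent2)   # TRUE
```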
Now you can plot the histogram (Figure 6):
> hist(changePercent, breaks=10,
main="S&P 500 Daily Percent Change")
Figure 6. The S&P 500 daily percent change
The breaks
parameter tells R approximately how many bins to create when sorting the data. In this case, I asked for 10 bins, but R produced a plot with 13 bins spaced 1 percent apart. R treats the suggested value as a guideline; in its default mode it chooses a bin spacing and number of bins that yield an easily readable plot for the input data. Here, a bin spacing of 1 percent put the bin divisions on whole numbers, and the 13 resulting bins were close to the requested 10, so R plotted accordingly. Experiment with different breaks
values to see how this works.
For cases where you require a specific set of bin break points, specify a list of values for the breaks
parameter. In this case, R will produce bins bounded precisely at the specified values.
Looking at the histogram plot, you can see that in the past 10 years, on most days the S&P 500 either rose or declined by less than 1 percent; but it rose on more days than it declined.
An interesting question: does a correlation exist between the stock market's movement one day and its performance the next day? In other words, if the stock market rose yesterday, is it likely to rise today? To gain some insight on these questions, analyze the daily percent change data further using the following R commands:
> changePercentYesterday = changePercent[1:2662]
> changePercentToday = changePercent[2:2663]
> myDf <- data.frame(x=changePercentYesterday,
y=changePercentToday)
> myFm <- lm(y~x, data=myDf)
> plot(changePercentYesterday, changePercentToday,
main="Daily Change Correlation")
> abline(coef(myFm), col="red")
> summary(myFm)
This looks complicated, but it also illustrates how much work just a few lines of R code can do. The first two lines create yesterday
and today
percent change variables such that changePercentYesterday[i]
is aligned with changePercentToday[i]
, permitting calculations and plotting using yesterday's change and today's change. The third line creates a new data frame (myDf
) that has as its x
data the values stored in changePercentYesterday
and as its y
data the values stored in changePercentToday
. The fourth line uses R's lm()
"linear model" statistical function to perform a linear fit of the data in myDf
. Next, it plots the raw data (yesterday's change versus today's change) using plot()
, with yesterday's percent change plotted as the X value and today's percent change plotted as the Y value. The abline()
function adds a red "best fit" line to the graph based on the y-intercept and slope coefficients calculated by the lm()
function, yielding the plot in Figure 7.
Figure 7. The S&P 500 daily change correlation

Finally, the command summary(myFm)
produces a text summary of the linear regression analysis performed by lm()
:
Call:
lm(formula = y ~ x, data = myDf)

Residuals:
     Min       1Q   Median       3Q      Max
-6.92396 -0.59826  0.01474  0.59522  5.64824

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04402    0.02185   2.015    0.044 *
x           -0.01498    0.01939  -0.772    0.440
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.127 on 2660 degrees of freedom
Multiple R-Squared: 0.0002243, Adjusted R-squared: 0.0001516
F-statistic: 0.5967 on 1 and 2660 DF, p-value: 0.4399
You probably need a background in statistics to interpret all of this accurately, but looking at the graph, there does not appear to be a strong correlation between the S&P 500's change yesterday and its change today. What the market did yesterday doesn't seem to strongly affect what happens today, though the correlation is nonzero. The negative slope of the best-fit line suggests that, over the 10-year data period, the market had a slight tendency to reverse, or correct, a portion of the previous day's movement.
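As a cross-check on this kind of analysis, R's cor() function computes the correlation coefficient directly, and for a one-predictor fit the lm() slope relates to it by slope = cor(x, y) * sd(y) / sd(x). A minimal sketch with synthetic data (the variable names and values here are illustrative stand-ins, not the article's stock data):

```r
set.seed(42)
x <- rnorm(500)                  # stand-in for yesterday's percent changes
y <- -0.015 * x + rnorm(500)     # weak negative relationship plus noise

fm <- lm(y ~ x)
slope <- coef(fm)["x"]           # extract the fitted slope by name

# For simple linear regression, the slope and the correlation agree like so:
all.equal(as.numeric(slope), cor(x, y) * sd(y) / sd(x))   # TRUE
```

Checking cor(x, y) directly is a quick way to confirm what the regression summary's tiny R-squared already hints at: the linear relationship is very weak.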
R provides several options for graphical presentation of three-dimensional data, including 3D perspective plots, color-coded images, and contour maps. To demonstrate R's 3D capabilities, I downloaded United States house price index data published by the Office of Federal Housing Enterprise Oversight (OFHEO). The data set used for the analysis is the percent that house prices have increased over the past five years in the nine regions the OFHEO defines for the United States. The OFHEO presents its data in a table. However, a table doesn't clearly depict the geographical distribution of the "housing bubble."
To produce a 3D representation of the five-year house price index data, I created a 51-by-30-element grid that approximates the geographical size and position of the nine OFHEO regions. Then, using a Perl script, I created an input data set for R that assigns the price index for the region to each grid point. Hence, the R input file consists of 51 x 30 = 1,530 data points. For areas on the grid that are not part of the United States (for example, the ocean), the data values are NA, which tells R not to display data for that position in the 3D plot.
The R commands that produce an image representation of the house price data are:
> dv <- read.table("./ofheo5yr.gridFine", header=FALSE)
> z <- dv[,1]
> attr(z, "dim") = c(51,30)
> image(z, col=topo.colors(50), axes=FALSE)
The first line reads the 1,530point data file. The second line assigns the values read to variable z
. Next, the attr()
function alters the dimension ("dim
") attribute of the z
variable to arrange its data in the form of a twodimensional array with 51 rows, each containing 30 data values. Finally, the image()
function generates an image of the z
data, using 50 colors in the topo
color scale to represent the varying z
values. Figure 8 shows the results.
Figure 8. The plot of five-year regional house price changes
The topo
color scale assigns violet to low values, with blue, green, yellow, orange, and pink assigned to successively higher values. The image shows that the change in housing prices is not at all evenly distributed across the United States geographically.
The persp()
function provides another view of the same data:
> persp(z,theta=0,phi=60,box=FALSE,col="yellow")
This code tells R to generate a 3D perspective image of the z
data, viewing the image with no rotation (theta=0
) and tilted 60 degrees from the horizontal (phi=60
), with no box around the image, coloring the mesh yellow. Figure 9 shows the plot.
Figure 9. The perspective plot of five-year regional house price changes
Perhaps more clearly, the perspective plot shows that the large increases in house prices in the past five years have occurred on the West Coast and in the Northeast, while the middle of the United States hasn't seen significant home price increases overall. This backs up Alan Greenspan's assertions that there is not a national housing "bubble," but there appear to be "signs of froth in some local markets."

R provides several options for saving your work. First, when you execute the q()
function to quit your session, R asks:
Save workspace image? [y/n/c]:
Enter y
to save your current R session state (so that next time you run R in that same directory, your starting status will be identical to that of the current session). Enter n
to exit without saving the current session state, or c
to continue the current session.
R provides a capability to plot directly to PNG, JPEG, and PDF files through its display setting options. To plot to a file instead of to the R graphics window, execute the png()
, jpeg()
, or pdf()
function prior to any R plotting function. For example:
> png("myImage.png")
> persp(z,theta=0,phi=60,box=FALSE,col="yellow")
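One detail worth noting: file-based graphics devices buffer their output, so when plotting interactively you should close the device with dev.off() after your plotting commands, or the file may be empty or incomplete. A minimal sketch:

```r
png("myImage.png")       # open a PNG graphics device
plot(1:10, type = "l")   # any plotting commands now draw to the file
dev.off()                # close the device, flushing the image to disk
```

(In the script example below, q() ends the session and closes any open devices, which is why the script gets away without an explicit dev.off().)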
You can also invoke R by using scripts. This permits the integration of R into automated data analysis processes. Here's an R script that reads the S&P 500 data file and produces a PNG image of the data:
dv <- read.table("./SP500.txt", header=FALSE)
png("SP500.png")
plot(dv[,2], type="l")
q()
The command line to execute the script (named gspc.scr) is:
R --no-save < gspc.scr > gspc.log
The --no-save
flag tells R not to save the script's session. The text output produced by the script is sent to the file gspc.log.
R is a powerful, mature, and very well documented GNU open source project. Though the language syntax is a bit unusual, knowing just a few commands lets you quickly generate quite impressive data products.
R's base documentation consists of six manuals, the PDF versions of which comprise 1,700 total pages. The documents are included with the Windows R executable installation download, but they must be downloaded separately for Linux/Unix and Mac OS platforms. "An Introduction to R" is the best choice for those who are just starting out. More specialized and detailed references are also available. The R FAQ answers basic introductory questions about R and discusses licensing and usage policy.
To download R, visit The R Project for Statistical Computing (R-Project.org) site. Click on CRAN to access the Comprehensive R Archive Network of mirrored servers. Documentation is available using the Manuals link. Section 5.6 in the R FAQ provides information about contributing to the R project (reporting bugs, coding new extensions, porting additional features from S, and so on).
Kevin Farnham is the owner of Lyra Technical Systems, Inc, a small consulting and publishing company.
Copyright © 2009 O'Reilly Media, Inc.