Analyzing Statistics with GNU R

by Kevin Farnham
11/17/2005

The R Project for Statistical Computing is an open source language and programming environment that provides a wide variety of data manipulation, statistical analysis, and data visualization capabilities. R is a successor to the S software originally developed at Bell Laboratories. R is a GNU project and is freely available under the terms of the GNU General Public License. The software runs on Windows, Mac OS, and a wide variety of Unix platforms.

Even for people who aren't expert statisticians, the power of R is alluring. Working interactively or using an R script, with just a few lines of code a user can perform complex analyses of large data sets, produce graphics depicting the features and structure of the data, and perform statistical analyses that can quickly answer questions about the data. This article introduces R and demonstrates a small slice of its capabilities, using data from the stock market and real estate industry as input.

R User Interface Basics

I developed and ran the demonstrations in this article on a Vision Computers PC running Debian Linux (Sarge version, "stable"). I installed R Version 2.1 using the standard Debian Advanced Package Tool (APT) utility.

To run R interactively, open a terminal window, create a work directory, enter the directory, and execute R:

R command window
Figure 1. The R command window

R's command language takes some getting used to, but knowing just a few commands provides enough knowledge to perform basic statistical analyses and generate plots for any data set that can be formatted into a one-, two-, or three-dimensional rectangular grid.

R commands can be lengthy, because most functions accept many optional parameters. It's therefore convenient that the up and down arrow keys in the R command window recall previous command-line entries, which you can then edit. This lets you work toward your data analysis goal incrementally by modifying and extending previously entered command lines.

Data Import and x/y Line Plotting

A simple starting point is a vector or array of numbers. I downloaded historical Standard and Poor's (S&P) 500 stock index data and wrote a Perl program to convert the data into a simple two-column table. Figure 2 shows a snippet of the start of the file.

SP500.txt file format
Figure 2. The SP500.txt file format

The read.table() function reads an external file into an R data frame variable. To read the S&P 500 data file and make a simple plot of the close price over time, enter the following commands at the R prompt (>):

> dv <- read.table("./SP500.txt", header=1)
> sp500value = dv[,2]
> plot(sp500value, type="l")

The first line reads the file SP500.txt into the data frame variable dv. The header=1 setting (equivalent to header=TRUE) tells R that the file's first line contains column names rather than data. The data frame dv is a two-dimensional array: the first index identifies the row number (starting with 1), and the second identifies the column number (also starting with 1).
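
For example, leaving the row index empty selects an entire column, while supplying both indices selects a single value:

> dv[1,2]   # one value: row 1 of the close-price column
> dv[,2]    # the whole close-price column as a vector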

The second command line tells R to assign the values in the second column of the data frame to the variable sp500value. The third line tells R to plot the index data using a line (type="l") to connect the data points. The result appears in a new window (Figure 3).

S&P 500 close price plot
Figure 3. The S&P 500 close price plot

When you pass only one vector of data to the R plot() function, it makes an x/y plot using the point index as the x value and the specified data vector points as the y value. Because the data file lists the data in historical order, the plot shows the value of the S&P 500 index over time.
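
To see this equivalence explicitly, you can pass the point indices yourself; this call produces the same line (only the default axis labels differ):

> plot(1:length(sp500value), sp500value, type="l")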

That's not bad for three lines of code! However, you can make a prettier, better annotated plot easily by using the plot() function's optional arguments. For example, to add a title, a subtitle, and axis labels, enter:

> plot(sp500value, type="l", main="S&P 500", 
      sub="[3-Jan-1995 thru 29-Jul-2005]", 
      xlab="Day", ylab="Value")

This produces the plot shown in Figure 4.

Annotated S&P 500 close price plot
Figure 4. An annotated S&P 500 close price plot

Proceeding further, you could use R's date and axis classes to produce an x-axis that uses the dates stored in column 1 of the data frame for labels. See the R documentation for details.
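
As a minimal sketch of that approach, assuming the dates in the file's first column look like 3-Jan-1995 (matching the date range in the subtitle above), convert them with as.Date() and pass them as the x values; R then labels the x-axis with actual dates:

> dates <- as.Date(as.character(dv[,1]), format="%d-%b-%Y")  # assumed format
> plot(dates, sp500value, type="l", xlab="Date", ylab="Value")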

After you generate a plot, R provides options for adding new data to it. For example, stock index graphs published in the Wall Street Journal include a 90-day moving average, where a moving average is the average value of the preceding n data items. How about displaying the moving average of the S&P 500 over the preceding 90 days?

In R's nomenclature, a "moving average" is a "filter" (an equation) applied to a "time series" (the S&P 500 index values). R's filter() function is complex, providing many different data processing options. Fortunately, the actual commands for creating a 90-day moving average data set are tiny compared with what standard programming languages might require:

> coef90 = 1/90
> ma90 = filter(sp500value, rep(coef90, 90), sides=1)

The first line defines a weighting factor for the data in the filter: each day's S&P 500 value will represent 1/90 of the moving average. The second line creates the moving average data set. The rep() function "repeats" the 1/90 coefficient 90 times (including 90 days of S&P 500 data in the moving average). The sides=1 parameter specifies to include only the trailing data points in the moving average (which is how financial moving averages are always calculated, because we cannot foresee the future).
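
As a quick sanity check on the filter, the 90th element of ma90 should match a plain average of the first 90 closing values (the first 89 elements are NA, because fewer than 90 trailing points exist for them):

> all.equal(ma90[90], mean(sp500value[1:90]))  # should report TRUE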

Add the moving average data (variable ma90) to the existing plot as a green line using the R lines() function:

> lines(ma90, col="green")

Figure 5 shows the result.

S&P 500 close price and 90-day moving average plot
Figure 5. S&P 500 close price and 90-day moving average

Histograms

Histograms are plots that show the distribution of a set of values. It's easy to use R to look at the distribution of daily gains (and losses) for the S&P 500 in the sample data set.

First, you must calculate the daily percent change for each day. You can't do this for the first day in the data set, so the size of the percent gains array will be one unit less than the size of the sp500value array. Here's one method for directing R to create the daily percent gain array, based on knowledge (gained using a Unix wc command) that the data set consists of 2,664 total points:

> yesterday = sp500value[1:2663]
> today = sp500value[2:2664]
> changePercent = 100 * (today / yesterday - 1.0)

Here, the new variable yesterday is the set of S&P 500 values excluding the last day, and the new variable today is the same set offset by one day, excluding the first day. Hence, the two variables are aligned such that yesterday[i] and today[i] represent consecutive days' S&P 500 prices. The third line exploits this alignment, calculating the percent by which the S&P 500 index changed from each day to the next.
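
If you'd rather not hard-code the row count from wc, R's length() function can supply it; this sketch is equivalent:

> n = length(sp500value)
> yesterday = sp500value[1:(n-1)]
> today = sp500value[2:n]
> changePercent = 100 * (today / yesterday - 1.0)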

Now you can plot the histogram (Figure 6):

> hist(changePercent, breaks=10, 
       main="S&P 500 Daily Percent Change")

Histogram: S&P 500 daily percent change
Figure 6. The S&P 500 daily percent change

The breaks parameter suggests approximately how many bins R should create when sorting the data. In this case, I asked for 10 bins, but R produced a plot with 13 bins spaced 1 percent apart. R treats the suggested value as a guideline: in its default mode, it chooses a bin spacing and bin count that make the plot easy to comprehend for the given input data. Here, a 1 percent bin spacing put the bin divisions on whole numbers, and the resulting 13 bins were close to the requested 10, so R plotted accordingly. Experiment with different breaks values to see how this works.

For cases where you require a specific set of bin break points, pass a vector of values as the breaks parameter; R will then produce bins bounded precisely at the specified values.
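
For example, assuming every daily change falls between -8 and +8 percent (hist() stops with an error if any value lies outside the supplied break points), this call forces half-percent bins:

> hist(changePercent, breaks=seq(-8, 8, by=0.5),  # assumes |change| < 8%
       main="S&P 500 Daily Percent Change")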

Looking at the histogram plot, you can see that in the past 10 years, on most days the S&P 500 either rose or declined by less than 1 percent; but it rose on more days than it declined.

Correlation

An interesting question: does a correlation exist between the stock market's movement one day and its performance the next day? In other words, if the stock market rose yesterday, is it likely to rise today? To gain some insight into these questions, analyze the daily percent change data further using the following R commands:

> changePercentYesterday = changePercent[1:2662]
> changePercentToday = changePercent[2:2663]
> myDf <- data.frame(x=changePercentYesterday, 
       y=changePercentToday)
> myFm <- lm(y~x, data=myDf)
> plot(changePercentYesterday, changePercentToday, 
       main="Daily Change Correlation")
> abline(coef(myFm), col="red")
> summary(myFm)

This looks complicated, but it also illustrates how much work just a few lines of R code can do. The first two lines create yesterday and today percent change variables such that changePercentYesterday[i] is aligned with changePercentToday[i], permitting calculations and plotting using yesterday's change and today's change. The third line creates a new data frame (myDf) that has as its x data the values stored in changePercentYesterday and as its y data the values stored in changePercentToday. The fourth line uses R's lm() "linear model" statistical function to perform a linear fit of the data in myDf. Next, it plots the raw data (yesterday's change versus today's change) using plot(), with yesterday's percent change plotted as the X value and today's percent change plotted as the Y value. The abline() function adds a red "best fit" line to the graph based on the y-intercept and slope coefficients calculated by the lm() function, yielding the plot in Figure 7.

S&P 500 daily change correlation plot
Figure 7. The S&P 500 daily change correlation

Finally, the command summary(myFm) produces a text summary of the linear regression analysis performed by lm():

Call:
lm(formula = y ~ x, data = myDf)

Residuals:
     Min       1Q   Median       3Q      Max
-6.92396 -0.59826  0.01474  0.59522  5.64824

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04402    0.02185   2.015    0.044 *
x           -0.01498    0.01939  -0.772    0.440
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.127 on 2660 degrees of freedom
Multiple R-Squared: 0.0002243,  Adjusted R-squared: -0.0001516
F-statistic: 0.5967 on 1 and 2660 DF,  p-value: 0.4399

You probably need a background in statistics to interpret all of this accurately, but looking at the graph, there does not appear to be a strong correlation between the S&P 500's change on one day and its change the next. What the market did yesterday doesn't seem to strongly affect what happens today, though the fitted slope is slightly negative, suggesting that over the 10-year data period the market had a slight tendency to reverse, or correct, a portion of the previous day's movement. (Note, however, the slope's p-value of 0.440 in the summary above: that tendency is not statistically significant.)
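
R can also report the correlation coefficient directly through its cor() function; given the Multiple R-Squared of 0.0002243 above, the result works out to roughly -0.015, confirming the near-absence of correlation:

> cor(changePercentYesterday, changePercentToday)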

3-D Data Demonstration: Mapping the U.S. Housing "Bubble"

R provides several options for graphical presentation of three-dimensional data, including 3-D perspective plots, color-coded images, and contour maps. To demonstrate R's 3-D capabilities, I downloaded United States house price index data published by the Office of Federal Housing Enterprise Oversight (OFHEO). The data set used for the analysis is the percent that house prices have increased over the past five years in the nine regions the OFHEO defines for the United States. The OFHEO presents its data in a table. However, a table doesn't clearly depict the geographical distribution of the "housing bubble."

To produce a 3-D representation of the five-year house price index data, I created a 51-by-30-element grid that approximates the geographical size and position of the nine OFHEO regions. Then, using a Perl script, I created an input data set for R that assigns the price index for the region to each grid point. Hence, the R input file consists of 51 x 30 = 1,530 data points. For areas on the grid that are not part of the United States (for example, the ocean), the data values are NA, which tells R not to display data for that position in the 3-D plot.

The R commands that produce an image representation of the house price data are:

> dv <- read.table("./ofheo5yr.gridFine", header=FALSE)
> z <- dv[,1]
> attr(z, "dim") = c(51,30)
> image(z, col=topo.colors(50), axes=FALSE)

The first line reads the 1,530-point data file. The second line assigns the values read to variable z. Next, the attr() function alters the dimension ("dim") attribute of the z variable to arrange its data in the form of a two-dimensional array with 51 rows, each containing 30 data values. Finally, the image() function generates an image of the z data, using 50 colors in the topo color scale to represent the varying z values. Figure 8 shows the results.

5-year regional house price changes image plot
Figure 8. The plot of five-year regional house price changes
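
Incidentally, R offers a more idiomatic spelling for the dimension assignment in the third line: the dim() replacement function, which has exactly the same effect:

> dim(z) <- c(51,30)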

The topo color scale assigns violet to low values, with blue, green, yellow, orange, and pink assigned to successively higher values. The image shows that the change in housing prices is not at all evenly distributed across the United States geographically.
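
The contour maps mentioned at the start of this section offer another option; as a minimal sketch, you can overlay contour lines on the same image (NA regions are again left blank):

> image(z, col=topo.colors(50), axes=FALSE)
> contour(z, add=TRUE)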

The persp() function provides another view of the same data:

> persp(z,theta=0,phi=60,box=FALSE,col="yellow")

This code tells R to generate a 3-D perspective image of the z data, viewing the image with no rotation (theta=0) and tilted 60 degrees from the horizontal (phi=60), with no box around the image, coloring the mesh yellow. Figure 9 shows the plot.

Five-year regional house price changes perspective plot
Figure 9. The perspective plot of five-year regional house price changes

Perhaps more clearly, the perspective plot shows that the large increases in house prices in the past five years have occurred on the West Coast and in the Northeast, while the middle of the United States hasn't seen significant home price increases overall. This backs up Alan Greenspan's assertions that there is not a national housing "bubble," but there appear to be "signs of froth in some local markets."

Saving Your R Work

R provides several options for saving your work. First, when you execute the q() function to quit your session, R asks:

Save workspace image? [y/n/c]:

Enter y to save your current R session state (so that next time you run R in that same directory, your starting status will be identical to that of the current session). Enter n to exit without saving the current session state, or c to continue the current session.
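
You can also snapshot the workspace at any point without quitting, using the save.image() function; load() restores it in a later session (the filename here is arbitrary):

> save.image("mysession.RData")   # hypothetical filename
> load("mysession.RData")         # restore in a later session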

R can also plot directly to PNG, JPEG, and PDF files through its graphics device functions. To plot to a file instead of to the R graphics window, execute the png(), jpeg(), or pdf() function prior to any R plotting function, then call dev.off() to close the device and finish writing the file. For example:

> png("myImage.png")
> persp(z,theta=0,phi=60,box=FALSE,col="yellow")
> dev.off()

You can also invoke R by using scripts. This permits the integration of R into automated data analysis processes. Here's an R script that reads the S&P 500 data file and produces a PNG image of the data:

dv <- read.table("./SP500.txt", header=TRUE)
png("SP500.png")
plot(dv[,2], type="l")
dev.off()
q()

The command line to execute the script (named gspc.scr) is:

R --no-save < gspc.scr > gspc.log

The --no-save flag tells R not to save the script's session. The text output produced by the script is sent to file gspc.log.

Conclusions

R is a powerful, mature, and very well documented GNU open source project. Though the language syntax is a bit unusual, knowing just a few commands lets you quickly generate quite impressive data products.

R's base documentation consists of six manuals, whose PDF versions total about 1,700 pages. The documents are included with the Windows R installer, but must be downloaded separately for Linux/Unix and Mac OS platforms. "An Introduction to R" is the best choice for those who are just starting out. More specialized and detailed references are also available. The R FAQ answers basic introductory questions about R and discusses licensing and usage policy.

To download R, visit The R Project for Statistical Computing (R-Project.org) site. Click on CRAN to access the Comprehensive R Archive Network of mirrored servers. Documentation is available using the Manuals link. Section 5.6 in the R FAQ provides information about contributing to the R project (reporting bugs, coding new extensions, porting additional features from S, and so on).

Kevin Farnham is the owner of Lyra Technical Systems, Inc., a small consulting and publishing company.

