**Analyzing Statistics with GNU R**


### Histograms

Histograms are plots that show the distribution of a set of values. It's easy to use R to look at the distribution of daily gains (and losses) for the S&P 500 in the sample data set.

First, calculate the percent change for each day. The first day in the data set has no previous day to compare against, so the percent-change vector will be one element shorter than the `sp500value` vector. Here's one way to direct R to create the daily percent-change vector, based on knowledge (gained using the Unix `wc` command) that the data set consists of 2,664 total points:

```
> yesterday = sp500value[1:2663]
> today = sp500value[2:2664]
> changePercent = 100 * (today / yesterday - 1.0)
```

Here, the new variable `yesterday` is the set of S&P 500 values excluding the last day, and the new variable `today` is the same set offset by one day (excluding the first day). The two variables are therefore aligned so that `yesterday[i]` and `today[i]` hold the S&P 500 values for two consecutive days. This alignment lets the third line apply one formula element by element: the percent by which the S&P 500 index changed from yesterday to today.
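
The hard-coded indices above depend on knowing the length of the data set in advance. As a minimal sketch of the same alignment without that dependency, the following uses base R's `head()` and `tail()` with negative counts. The vector `prices` is a hypothetical stand-in for `sp500value`, filled with made-up values:

```r
# Hypothetical stand-in for sp500value: four made-up closing values.
prices <- c(100, 110, 99, 99)

# Drop the last element to get "yesterday"; drop the first to get "today".
yesterday <- head(prices, -1)
today <- tail(prices, -1)

# Same formula as above: percent change from yesterday to today.
changePercent <- 100 * (today / yesterday - 1.0)

# The result is one element shorter than the input.
print(length(changePercent) == length(prices) - 1)
print(changePercent)
```

Because the slices are derived from the vector itself, the same lines should keep working if the data set grows or shrinks.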

Now you can plot the histogram (Figure 6):

```
> hist(changePercent, breaks=10,
main="S&P 500 Daily Percent Change")
```

*Figure 6. The S&P 500 daily percent change*

The `breaks` parameter tells R approximately how many bins to create while sorting the data. In this case, I asked for 10 bins, but R produced a plot with 13 bins spaced 1 percent apart. R treats the suggested value as a guideline; in its default mode it chooses a bin spacing and number of bins that yield a plot that is easy to read for the input data. Here, a bin spacing of 1 percent put the bin boundaries on whole numbers, and the 13 bins this required were close to the requested 10, so R produced its plot accordingly. Experiment with different `breaks` values to see how this works.

For cases where you require a specific set of bin break points, specify a vector of values for the `breaks` parameter. In this case, R will produce bins bounded precisely at the specified values.
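
To compare the two forms of `breaks` without drawing anything, one option is to call `hist()` with `plot=FALSE` and inspect the break points it returns. This is only a sketch: the data is randomly generated as a stand-in for `changePercent`, and the names `h_suggested`, `h_exact`, and `exact_breaks` are illustrative:

```r
# Made-up data standing in for changePercent (not real S&P 500 figures).
set.seed(42)
x <- rnorm(500, mean = 0, sd = 1)

# A single number is only a suggestion; R chooses the break points itself.
h_suggested <- hist(x, breaks = 10, plot = FALSE)

# A numeric vector is used exactly; it must span the full range of x.
exact_breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_exact <- hist(x, breaks = exact_breaks, plot = FALSE)

# Compare the break points R actually used in each case.
print(h_suggested$breaks)
print(h_exact$breaks)
```

The second call should report the same break points that were passed in, while the first reports whatever spacing R chose.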

The histogram shows that on most days over the past 10 years, the S&P 500 rose or declined by less than 1 percent, and that it rose on more days than it declined.
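
Observations like these are read off the plot, but they can also be checked numerically. A small sketch, using a made-up vector named `change` in place of the real `changePercent` data:

```r
# Made-up daily percent changes standing in for changePercent.
change <- c(0.4, -0.2, 1.3, 0.1, -1.8, 0.6, 0.0, -0.5)

# Fraction of days on which the index moved by less than 1 percent.
small_move_fraction <- mean(abs(change) < 1)

# Number of up days versus down days.
up_days <- sum(change > 0)
down_days <- sum(change < 0)

print(small_move_fraction)
print(c(up = up_days, down = down_days))
```

Running the same three expressions against `changePercent` would give the corresponding figures for the actual data set.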

### Correlation

An interesting question: does a correlation exist between the stock market's movement one day and its performance the next day? In other words, if the stock market rose yesterday, is it likely to rise today? To gain some insight into these questions, analyze the daily percent change data further using the following R commands:

```
> changePercentYesterday = changePercent[1:2662]
> changePercentToday = changePercent[2:2663]
> myDf <- data.frame(x=changePercentYesterday,
y=changePercentToday)
> myFm <- lm(y~x, data=myDf)
> plot(changePercentYesterday, changePercentToday,
main="Daily Change Correlation")
> abline(coef(myFm), col="red")
> summary(myFm)
```

This looks complicated, but it also illustrates how much work just a few lines of R code can do. The first two lines create the "yesterday" and "today" percent change variables such that `changePercentYesterday[i]` is aligned with `changePercentToday[i]`, permitting calculations and plotting using yesterday's change and today's change. The third command creates a new data frame (`myDf`) that has as its `x` data the values stored in `changePercentYesterday` and as its `y` data the values stored in `changePercentToday`. The fourth command uses R's `lm()` "linear model" statistical function to perform a linear fit of the data in `myDf`. Next, `plot()` draws the raw data, with yesterday's percent change as the X value and today's percent change as the Y value. The `abline()` function then adds a red "best fit" line to the graph based on the y-intercept and slope coefficients calculated by the `lm()` function, yielding the plot in Figure 7.

*Figure 7. The S&P 500 daily change correlation*
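
The fitted line can also be related to a single correlation number. The following is a minimal sketch rather than part of the original analysis: it uses randomly generated values as a stand-in for `changePercent`, and the names `change`, `fit`, and `r` are illustrative. It computes the Pearson correlation with base R's `cor()` and checks the standard identity that, for a one-predictor least-squares fit, the slope equals the correlation scaled by `sd(y) / sd(x)`:

```r
# Made-up series standing in for changePercent (not real market data).
set.seed(1)
change <- rnorm(200)

# Align each day's change with the following day's change.
x <- head(change, -1)   # yesterday's change
y <- tail(change, -1)   # today's change

# Fit the same kind of linear model as above.
fit <- lm(y ~ x, data = data.frame(x = x, y = y))
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

# Pearson correlation between yesterday's and today's change.
r <- cor(x, y)

# For a one-predictor least-squares fit, slope = r * sd(y) / sd(x).
print(isTRUE(all.equal(slope, r * sd(y) / sd(x))))
```

A correlation near zero goes with a nearly flat best-fit line, while values closer to 1 or -1 indicate a stronger linear relationship between one day's change and the next.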