R provides a useful function, acf, for calculating the autocorrelation function (ACF) of a time series. The call below computes and plots the ACF of an environmental time series with a seasonal cycle:

corr <- acf(series, lag.max=288,type="correlation",plot=TRUE,na.action=na.pass)

(My data set has some missing values, hence the na.action=na.pass parameter.)

As well as the calculated ACF, we can see two blue dashed lines across the plot. These lines indicate the point of statistical significance - values between these lines and zero are not statistically significant, while those above and below the lines (towards one and minus one) are significant.

In some cases, it would be useful to know what this value is, so we can determine whether individual values from the ACF are significant or not. Unfortunately, the output of R's acf function doesn't make it available to us.

A hunt through the source code of the acf function gives us the information we need. We can calculate the significance level as follows:

corr <- acf(series, lag.max=288,type="correlation",plot=TRUE,na.action=na.pass)
significance_level <- qnorm((1 + 0.95)/2)/sqrt(sum(!is.na(series)))

The 0.95 parameter indicates that we calculate the correlation at which values are significant to the 95% level - you can change this as you see fit.
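To make the confidence level easy to vary, the calculation can be wrapped in a small function. This is only a sketch: acf_threshold is a hypothetical helper name (it is not part of base R), and the series is made-up random data with a couple of missing values.

```r
# Hypothetical helper (not part of base R): threshold for a chosen confidence level
acf_threshold <- function(x, ci = 0.95) {
  n <- sum(!is.na(x))               # number of non-missing observations
  qnorm((1 + ci) / 2) / sqrt(n)
}

set.seed(1)
series <- rnorm(400)                # made-up data for illustration
series[c(5, 50)] <- NA              # a couple of missing values

# Thresholds at the 90%, 95% and 99% levels
thresholds <- sapply(c(0.90, 0.95, 0.99), function(ci) acf_threshold(series, ci))
thresholds
```

A higher confidence level gives a threshold further from zero, so fewer lags count as significant.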

Note that the significance is a function of the number of data points you have - the more points, the closer to zero the significance level will be, and the more confidence you can have in your ACF.
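As a quick illustration of that dependence, here is the 95% threshold for a few arbitrary sample sizes (the sizes are chosen only for illustration):

```r
# 95% threshold for several (arbitrary) numbers of data points
n_points <- c(50, 200, 1000, 5000)
threshold <- qnorm((1 + 0.95) / 2) / sqrt(n_points)

data.frame(n_points, threshold = round(threshold, 3))
# thresholds: 0.277, 0.139, 0.062, 0.028
```

Because the threshold scales with 1/sqrt(n), you need four times as many points to halve it.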

Thanks – quite useful

Very handy, thanks for this.

Just one question: shouldn’t it be sqrt(length(series))

Sum seems to give me an inappropriately low value for the significance.

@SimonH: It does use the length of the series, but with the NA values removed since they don’t contribute to the ACF.

The portion sum(!is.na(series)) does not add the values of the series, but counts the number of non-NA values within it.

If you want to total all the values of a series, you have to do sum(series, na.rm=TRUE).
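A tiny made-up example shows the difference between counting and summing. It also prints the n.used component of the acf object, which, as far as I can tell from the source, is the count plot.acf uses for the dashed lines. With na.action=na.pass that count may include the NA positions, so it is worth comparing with sum(!is.na(series)) on your own data.

```r
series <- c(2.5, NA, 4.0, 3.5, NA, 1.0)   # made-up values

length(series)               # 6: every position, NA included
sum(!is.na(series))          # 4: the number of non-NA values
sum(series, na.rm = TRUE)    # 11: the total of the non-NA values

corr <- acf(series, lag.max = 2, plot = FALSE, na.action = na.pass)
corr$n.used                  # the count stored in the acf object
```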

Hope that makes things clearer.

Steve.

is it true for pacf too?

The pacf function calls exactly the same plotting function as the acf function (namely plot.acf). Therefore, if it prints the blue lines for the significance threshold (I can’t test it from where I am right now), the calculation for them will be exactly the same.
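For anyone who wants to check this, a minimal sketch follows. It uses simulated AR(1) data with no missing values (an assumption made for illustration) and applies the same threshold formula to the output of pacf.

```r
set.seed(7)
# Made-up AR(1) series, purely for illustration
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 500))

p <- pacf(x, lag.max = 20, plot = FALSE)
class(p)    # pacf() returns an object of class "acf", as acf() does
p$type      # "partial"

threshold <- qnorm((1 + 0.95) / 2) / sqrt(sum(!is.na(x)))

# pacf output starts at lag 1, so there is no lag-0 value to drop
partial <- as.vector(p$acf)
which(abs(partial) > threshold)   # lags whose partial autocorrelation exceeds the threshold
```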

Useful

Thank you for this!

Just a couple of more questions.

(1) How does this formula change when ccf() is used instead of acf()? Since it’s possible to compute ccf() between sequences of different lengths, what would be the length of series in the formula?

(2) What are these “confidence” limits called? What is the theory behind the computation of these limits? Are these formulas applicable only in the case of acf and ccf? Could you point to any resource if you know?

Thanks for this, it is super useful. I have a question about the na.pass parameter. How does acf() calculate autocorrelation for existing data points if they are adjacent to NA values?

E.g. 1 , 2 , 3, NA, 4 , NA

How does it calculate 4?

The ACF is a measure of how related the values are at different distances (or lags, as they are known). So, at lag 0 (the first value of the ACF) the correlation is always 1 because all values are the same as themselves.

For the next value (lag = 1), the function compares pairs of values that are 1 step apart. If there’s an NA next to a value, that pair won’t be processed. So in your example, the lags and the pairs processed are:

LAG 1: 1-2, 2-3

LAG 2: 1-3, 3-4

LAG 3: 2-4

LAG 4: 1-4

Hope that makes sense.
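The pairing can be reproduced with a few lines of R. This only lists which pairs are complete at each lag; it is a sketch of the idea rather than the code acf() runs internally, and pairs_at_lag is a hypothetical helper name.

```r
x <- c(1, 2, 3, NA, 4, NA)

# Hypothetical helper: list the complete (non-NA) pairs that are `lag` steps apart
pairs_at_lag <- function(x, lag) {
  i <- seq_len(length(x) - lag)
  first <- x[i]
  second <- x[i + lag]
  keep <- !is.na(first) & !is.na(second)   # drop any pair that contains an NA
  paste(first[keep], second[keep], sep = "-")
}

for (lag in 1:4) {
  cat(sprintf("LAG %d: %s\n", lag, paste(pairs_at_lag(x, lag), collapse = ", ")))
}
# LAG 1: 1-2, 2-3
# LAG 2: 1-3, 3-4
# LAG 3: 2-4
# LAG 4: 1-4
```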

Can you elaborate on the significance of the two dashed lines?

I’m not really sure what I can say beyond what’s in the text:

“values between these lines and zero are not statistically significant, while those above and below the lines (towards one and minus one) are significant.”

Hi, thanks for the info. I have a question concerning the two commands for computing the significance level. If I get for example 0.25 for the third lag, does it really mean that it is significant only at 25% sign. level or do I have to use another test to find that out? Thanks!

@Hocus Pocus: The level of statistical significance (significance_level) is the same for every lag. The 0.95 in that command is the level you’re setting.

So if the absolute value of your correlation (0.25 in your example) is greater than significance_level, it is statistically significant at the 95% level. If it is less, it is not significant.
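In code, the check amounts to comparing every lag against the single threshold. The sketch below uses a synthetic seasonal series (made up for illustration, with a period of 48 steps) and takes the absolute value so that strong negative correlations are flagged too.

```r
set.seed(42)
n <- 600
# Made-up seasonal series with a period of 48 steps, plus noise
series <- sin(2 * pi * seq_len(n) / 48) + rnorm(n, sd = 0.5)
series[c(10, 11, 250)] <- NA            # a few missing values

corr <- acf(series, lag.max = 288, type = "correlation",
            plot = FALSE, na.action = na.pass)
significance_level <- qnorm((1 + 0.95) / 2) / sqrt(sum(!is.na(series)))

# Drop lag 0, which is always 1, before comparing
lags <- as.vector(corr$lag)[-1]
acf_values <- as.vector(corr$acf)[-1]
significant <- abs(acf_values) > significance_level

head(data.frame(lag = lags, acf = acf_values, significant = significant))
```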

I hope that clarifies it.

Thanks a lot, very useful.

I have a further question.

Can I use acf(), and how, if I have several similar time series but am interested only in the autocorrelation within individuals?

i.e. 10 measures (every 15 min) in n=30 humans.

The autocorrelation between individuals is of no real interest. But I would like to avoid running acf() 30 times, because the autocorrelation might be seen more clearly by combining all the data.

Your 30 individuals represent independent data sets, so combining them into one data set and performing the correlation on that won’t necessarily be meaningful.

Your best bet might be to calculate the ACF for each individual in turn, and then plot all the ACF curves on the same graph. You can make an ‘average’ ACF curve from those 30 curves, but much more interesting will be to compare them all against each other. Is there an overall pattern in the ACFs? Are there any outliers that represent either bad data or interesting deviations of one member of the trial from all the others?
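One way to do this is sketched below. The data are random numbers standing in for 30 individuals with 10 measurements each, stored one column per individual (an assumed layout). The averaged curve is drawn only as a visual guide, not as a statistically validated combined ACF.

```r
set.seed(123)
n_people <- 30
n_measures <- 10
# Made-up data: one column per individual, one row per 15-minute measurement
measurements <- matrix(rnorm(n_measures * n_people), nrow = n_measures)

max_lag <- 5
# One ACF per individual; the result has one row per lag (0..max_lag) and one column per person
acf_matrix <- apply(measurements, 2, function(v) {
  as.vector(acf(v, lag.max = max_lag, plot = FALSE)$acf)
})

threshold <- qnorm((1 + 0.95) / 2) / sqrt(n_measures)

# All individual curves on one graph, with the average and the 95% lines
matplot(0:max_lag, acf_matrix, type = "l", lty = 1, col = "grey60",
        xlab = "Lag", ylab = "ACF")
lines(0:max_lag, rowMeans(acf_matrix), lwd = 2)              # visual average only
abline(h = c(-threshold, threshold), lty = 2, col = "blue")

# Individuals with at least one lag (excluding lag 0) beyond the threshold
flagged <- which(apply(abs(acf_matrix[-1, , drop = FALSE]) > threshold, 2, any))
flagged
```

With only 10 points per individual the threshold is wide (about 0.62), so few lags will stand out for any single person.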

None of my 30 patients are outliers, but in 2 of them (patients 4 and 5) a lag falls outside the 95% confidence interval.

How can we combine several acf() in one?

And how can we compare them all against each other.

Many thanks for your advice

I don’t know the best way to combine several ACFs into one that would still be statistically valid, I’m afraid.

The Box.test() function gave different results than acf().

Which one should be considered as a reference?

No idea. I suggest you move to the R mailing list or IRC channel for more help.