Time Series Modeling - Part II (Theoretical Background)
A regression model, such as linear regression, models an output value based on a linear combination of input values.
yhat = b0 + b1*X1
Where yhat is the prediction, b0 and b1 are coefficients found by optimizing the model on training data, and X1 is an input value.
This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.
For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:
X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)
Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).
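To make the idea concrete, here is a minimal sketch of fitting the lag equation above with ordinary least squares, using only the Python standard library. The helper names (make_lagged_rows, solve_linear_system, fit_least_squares), the synthetic series, and the "true" coefficients are all assumptions made up for illustration; in practice you would likely reach for a statistics library instead.

```python
def make_lagged_rows(series):
    """Build rows with inputs [X(t-1), X(t-2)] and target X(t+1)."""
    inputs, targets = [], []
    # t must satisfy t-2 >= 0 and t+1 <= len(series)-1
    for t in range(2, len(series) - 1):
        inputs.append([series[t - 1], series[t - 2]])
        targets.append(series[t + 1])
    return inputs, targets


def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(inputs, targets):
    """Ordinary least squares via the normal equations; returns [b0, b1, b2]."""
    rows = [[1.0] + list(r) for r in inputs]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free series generated from assumed coefficients,
# so the fit should recover them almost exactly.
b0_true, b1_true, b2_true = 1.0, 0.5, -0.3
series = [1.0, 2.0, 0.5]
while len(series) < 30:
    series.append(b0_true + b1_true * series[-2] + b2_true * series[-3])

inputs, targets = make_lagged_rows(series)
coefficients = fit_least_squares(inputs, targets)
print("fitted b0, b1, b2:", coefficients)
```

Because the series here contains no noise, the fitted coefficients should match the assumed ones up to floating-point error. On real data they would only be estimates.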
An autoregression model assumes that the observations at previous time steps are useful for predicting the value at the next time step.
This relationship between variables is called correlation.
If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.
We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight the autoregression model can put on that variable when modeling.
Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.
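As a rough sketch of how such a statistic could be computed, the snippet below measures Pearson's correlation between a series and a copy of itself shifted by a given lag. The function names and the two toy series are assumptions chosen for illustration. Note also that this is only one way to estimate autocorrelation: library implementations typically use a single overall mean and variance rather than treating the two shifted slices separately.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lag_correlation(series, lag):
    """Correlation between each value and the value `lag` steps earlier."""
    if lag < 1 or lag > len(series) - 2:
        raise ValueError("lag must be at least 1 and leave two or more pairs")
    return pearson_correlation(series[lag:], series[:-lag])


# A steadily rising series: each value moves with the previous one.
rising = [float(i) for i in range(1, 11)]
# An alternating series: each value moves against the previous one.
alternating = [1.0, -1.0] * 5

print("rising, lag 1:     ", lag_correlation(rising, 1))
print("alternating, lag 1:", lag_correlation(alternating, 1))
print("alternating, lag 2:", lag_correlation(alternating, 2))
```

The rising series shows a positive lag-1 correlation, while the alternating series is negatively correlated at lag 1 and positively correlated at lag 2, since every second value repeats.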
The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.
Interestingly, if all lag variables show low or no correlation with the output variable, this suggests that the time series may not be predictable from its own past values. This can be a very useful check when getting started on a new dataset.
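The following sketch illustrates this kind of first check by comparing random noise with a series that has a repeating pattern. It uses a common sample autocorrelation estimate (one overall mean, normalized by the total variance). The function names, the two synthetic series, and the cutoff of 0.2 are assumptions for illustration only, not recommended defaults; a suitable threshold depends on the length of the series.

```python
import math
import random


def autocorrelation(series, lag):
    """Sample autocorrelation at `lag`, using the overall mean and variance."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


def useful_lags(series, max_lag, threshold):
    """Lags from 1 to max_lag whose absolute autocorrelation reaches the threshold."""
    return [
        lag
        for lag in range(1, max_lag + 1)
        if abs(autocorrelation(series, lag)) >= threshold
    ]


# Gaussian noise: past values carry no information about future values.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# A sine wave with a period of 20 steps: strongly dependent on its own past.
seasonal = [math.sin(2 * math.pi * t / 20) for t in range(200)]

noise_lags = useful_lags(noise, max_lag=5, threshold=0.2)
seasonal_lags = useful_lags(seasonal, max_lag=5, threshold=0.2)
print("noise, candidate lags:   ", noise_lags)
print("seasonal, candidate lags:", seasonal_lags)
```

For the noise series no lag should clear the cutoff, which is the warning sign described above. For the sine wave several short lags stand out as candidate inputs for an autoregression model.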