The AR(1) model \[x_{t} = a x_{t-1} + \sigma \xi_t,\] where the noise $\xi_t$ has zero mean and unit variance, has two parameters: the coefficient $a$ and the noise amplitude $\sigma$
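A minimal NumPy sketch of simulating this AR(1) process; the parameter values and series length are illustrative choices, not from the lecture.

```python
import numpy as np

def simulate_ar1(a, sigma, T, seed=0):
    """Simulate x_t = a*x_{t-1} + sigma*xi_t with zero-mean, unit-variance Gaussian noise."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(T)              # zero mean, unit variance
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + sigma * xi[t]
    return x

x = simulate_ar1(a=0.8, sigma=1.0, T=1000)   # illustrative parameter values
```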
Bias: bias is the error of a fit that is unavoidable when the model does not have enough parameters to represent the full complexity of the data
Variance: variance is the error of the fit that arises because the fitted parameters follow the random fluctuations of the particular data sample; it grows with the number of free parameters
The squared error of the model can be decomposed into a bias and a variance contribution
More information on the fundamental concepts of learning and fitting models can be found at https://work.caltech.edu/telecourse
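A hedged numerical sketch of the decomposition: repeatedly fit a low-degree and a high-degree polynomial to noisy samples of an assumed target function and estimate the squared bias and the variance of the predictions. The target function, noise level, polynomial degrees, and number of runs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 1.0, 50)
f_true = np.sin(2 * np.pi * x_grid)            # assumed target function
n_runs, noise = 500, 0.3

for degree in (1, 9):                          # too few vs. many free parameters
    preds = np.empty((n_runs, x_grid.size))
    for r in range(n_runs):
        y = f_true + noise * rng.standard_normal(x_grid.size)
        preds[r] = np.polyval(np.polyfit(x_grid, y, degree), x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # error from insufficient model complexity
    variance = np.mean(preds.var(axis=0))                  # error from following the noise
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The low-degree fit shows large bias and small variance; the high-degree fit shows the reverse.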
Remember: Mean squared errors
Data: $z(k,l)$; model: $x(k,l)=ak+bl+c$. For the best model, minimize \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 \]
Ridge Regression: favor small $a,b$ (small impact of each input - especially realistic for models with many dimensions) by adding a term that penalizes the slopes \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 + \lambda (a^2 + b^2) \] with a suitable weight $\lambda$; see the sketch below
Additional advantage: ridge regression can fit models with more dimensions (parameters) than data points, because the penalty term keeps the solution well defined
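A minimal sketch of the penalized fit above for the plane model $z(k,l)\approx ak+bl+c$, using the closed-form normal equations with the penalty applied only to the slopes $a,b$; the grid size, true parameters, noise, and $\lambda$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L, lam = 10, 10, 1.0
k, l = np.meshgrid(np.arange(K), np.arange(L), indexing="ij")
z = 0.5 * k - 0.2 * l + 3.0 + rng.standard_normal((K, L))   # noisy plane data (illustrative)

# Design matrix with columns (k, l, 1); the penalty matrix shrinks only a and b.
X = np.column_stack([k.ravel(), l.ravel(), np.ones(K * L)])
P = lam * np.diag([1.0, 1.0, 0.0])
a, b, c = np.linalg.solve(X.T @ X + P, X.T @ z.ravel())
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

For $\lambda \to 0$ this reduces to ordinary least squares; larger $\lambda$ shrinks the slopes toward zero.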
Real data is noisy
Extreme case: a model that learns every situation from the past exactly is 'memorizing' these situations. However, predictions are about situations outside the past data, so such a model is not helpful.
If the model has too many free parameters, it 'overfits' - it fits the noise rather than the underlying structure
Overfitting is increased by noise in the data and by a large number of free parameters relative to the number of data points
Backshift operator \[ Bx_t = x_{t-1} \Rightarrow B^jx_t = x_{t-j} \]
The AR(p) process can be written as \[ x_t=\sum_{j=1}^p a_jB^j x_t + \xi_t \]
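A small sketch simulating an AR(p) process from given coefficients via the recursion above (the backshift form written out explicitly); the coefficients, series length, and zero pre-sample values are illustrative assumptions.

```python
import numpy as np

def simulate_arp(a, T, seed=3):
    """Simulate x_t = sum_j a_j x_{t-j} + xi_t with unit-variance Gaussian noise."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    rng = np.random.default_rng(seed)
    x = np.zeros(T + p)                      # p zeros as pre-sample values (assumption)
    xi = rng.standard_normal(T + p)
    for t in range(p, T + p):
        x[t] = a @ x[t - p:t][::-1] + xi[t]  # x_{t-1}, ..., x_{t-p} in the order of a_1, ..., a_p
    return x[p:]

x = simulate_arp(a=[0.5, -0.3], T=1000)      # an AR(2) example
```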
The standard reference on this topic is the book 'Time Series Analysis: Forecasting and Control' by Box and Jenkins.
Solve the characteristic equation, here for the AR(2) process,
\[ 1-a_1G^{-1}-a_2{G^{-1}}^2 = 0 \] \[ G_{1,2} = \frac{a_1\pm \sqrt{a_1^2+4a_2}}{2} \;\; [{\Rightarrow}\; a_1=G_1+G_2, \;\;\; a_2=-G_1G_2] \]The $G_{1,2}$ are real or complex; complex values lead to oscillating autocorrelations
Autocorrelations have form: \[ c_\Delta = h_1G_1^\Delta+h_2G_2^\Delta \]
Stationarity: \[ |G_i| < 1 \;\; \forall i \]
From the AR(p) form we define
\[ \xi_t = \Phi(B)x_t. \]The term can be rewritten as
\[ \Phi(B) = 1 - \sum_{j=1}^p a_jB^j = \prod_{j=1}^p (1-G_jB), \]where the $G_j^{-1}$ are the roots of \[ \Phi(B)=0 \]
The process is stationary if $|G_j|<1$ for all $j$, i.e. if all roots of $\Phi(B)=0$ lie outside the unit circle
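A hedged sketch of this stationarity check for a general AR(p): the $G_j$ are the reciprocals of the roots of $\Phi(B)$, computed here with `numpy.roots`; the example coefficients are illustrative.

```python
import numpy as np

def ar_G(a):
    """Return G_1, ..., G_p for Phi(B) = 1 - a_1 B - ... - a_p B^p."""
    a = np.asarray(a, dtype=float)
    coeffs = np.concatenate((-a[::-1], [1.0]))   # polynomial coefficients, highest power of B first
    roots_B = np.roots(coeffs)                   # roots of Phi(B) = 0, i.e. the G_j^{-1}
    return 1.0 / roots_B

G = ar_G([0.5, -0.8])                            # AR(2) with a_1^2 + 4 a_2 < 0
print(G, "stationary:", bool(np.all(np.abs(G) < 1)))
# Complex G_j correspond to oscillating autocorrelations.
```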
Partial autocorrelation: the correlation at some lag $k$ that remains after removing the effect of the correlations due to the terms at shorter lags.
Yule-Walker equations
\[ \left(\begin{array}{c c c c c}1&c_1&c_2&\cdots &c_{k-1}\\ c_1&1&c_1&\cdots &c_{k-2}\\ \vdots &\vdots &\vdots &\ddots &\vdots\\ c_{k-1}&c_{k-2}&c_{k-3}&\cdots &1\end{array}\right)\left(\begin{array}{c}\Phi_{k1}\\ \Phi_{k2}\\ \vdots\\ \Phi_{kk}\end{array}\right) = \left(\begin{array}{c}c_{1}\\ c_{2}\\ \vdots\\ c_{k}\end{array}\right) \]The last component $\Phi_{kk}$ of the solution is the partial autocorrelation at lag $k$
\[ \Phi_{kk} = Corr(x_t,x_{t-k}|x_{t-1},x_{t-2},...,x_{t-k+1}) \]If the process is of order p, the partial autocorrelation function has cutoff after lag p \[\Phi_{kk}=0 \;\; \forall \; k>p\]
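A minimal sketch that builds the Yule-Walker system from sample autocorrelations and solves it lag by lag, returning the last component $\Phi_{kk}$ of each solution; the AR(2) series used for the check is an illustrative assumption.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations c_1, ..., c_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / c0 for k in range(1, max_lag + 1)])

def pacf_yule_walker(x, max_lag):
    """Partial autocorrelations Phi_kk from the Yule-Walker equations."""
    c = sample_autocorr(x, max_lag)
    pacf = np.empty(max_lag)
    for k in range(1, max_lag + 1):
        # Toeplitz matrix with 1 on the diagonal and c_|i-j| off the diagonal.
        R = np.array([[1.0 if i == j else c[abs(i - j) - 1] for j in range(k)]
                      for i in range(k)])
        pacf[k - 1] = np.linalg.solve(R, c[:k])[-1]   # last component is Phi_kk
    return pacf

rng = np.random.default_rng(4)
x = np.zeros(2000)
for t in range(2, 2000):                              # an AR(2) test series
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(np.round(pacf_yule_walker(x, 5), 3))            # roughly zero beyond lag 2
```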
Negative conditional log-likelihood
\[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-1}\sum_{t=0}^{T-1} (x_t-\sum_{j=1}^p a_jx_{t-j})^2}{2v}, \]is a useful approximation to the true negative log-likelihood.
A more accurate approximation is given by \[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-1}\sum_{t=-\infty}^{T-1} \langle\tilde{\xi}_t^2(a|x,\xi)\rangle}{2v}, \]
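A sketch evaluating the conditional form of the negative log-likelihood for given $(a, v)$, summing the squared one-step prediction errors over the time points where all $p$ lagged values are available; normalizing by $T-p$ instead of $T-1$ is a simplifying assumption that matters little for long series.

```python
import numpy as np

def neg_cond_loglik(a, v, x):
    """Conditional negative log-likelihood L(a, v) of an AR(p) model for series x."""
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    p, T = len(a), len(x)
    resid = np.array([x[t] - a @ x[t - p:t][::-1] for t in range(p, T)])
    return 0.5 * np.log(v) + np.mean(resid ** 2) / (2.0 * v)

rng = np.random.default_rng(5)
x = np.zeros(1000)
for t in range(2, 1000):                          # an illustrative AR(2) series
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(neg_cond_loglik([0.5, -0.3], 1.0, x))
```

Minimizing this expression over $a$ is an ordinary least-squares problem on the lagged values; minimizing over $v$ gives $v$ equal to the mean squared residual.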
A moving-average process of order $q$, MA($q$), can be defined as \[ x_t = (1-\sum_{j=1}^{q} m_jB^j) \xi_t \]
Has the autocorrelation function \[ c_k = \left\lbrace \begin{array}{l l} \frac{-m_k + m_1m_{k+1}+...+ m_{q-k} m_q}{1 + m_1^2 + ... + m_q^2} & k=1,2,...,q\\ 0 & k>q \end{array} \right. \]
Find order q: last non-zero lag of the autocorrelation function
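A hedged check of the cutoff property: simulate an MA(2) process with the convention above and compare its sample autocorrelations with the theoretical formula; the coefficients and series length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
m = np.array([0.6, -0.3])                        # m_1, m_2 of an illustrative MA(2)
q, T = len(m), 100_000

xi = rng.standard_normal(T + q)
# x_t = xi_t - m_1 xi_{t-1} - ... - m_q xi_{t-q}
x = xi[q:] - sum(m[j] * xi[q - 1 - j:T + q - 1 - j] for j in range(q))

denom = 1.0 + np.sum(m ** 2)
c_theory = [(-m[k - 1] + np.sum(m[:q - k] * m[k:])) / denom for k in range(1, q + 1)]

xc = x - x.mean()
c0 = np.dot(xc, xc)
c_sample = [np.dot(xc[k:], xc[:-k]) / c0 for k in range(1, q + 2)]
print("theory:", np.round(c_theory, 3))
print("sample:", np.round(c_sample, 3))          # the lag q+1 value should be close to zero
```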
AR and MA models can be mixed; this is called an ARMA(p,q) model \[ x_t = \sum_{j=1}^p a_jB^j x_t - \sum_{j=1}^q m_jB^j \xi_t + \xi_t \]
Autoregressive representation of a fractionally integrated process with differencing parameter $d$: \[ x_t=\sum_{k=1}^\infty (-1)^{k+1}\frac{\prod_{a=0}^{k-1}(d-a)}{k!}x_{t-k}+\xi_t \]
Equivalently, in operator form \[ (1-B)^d x_t = \xi_t, \]which corresponds to the moving-average representation $x_t = (1-B)^{-d}\xi_t$
ARFIMA(p,d,q): \[ (1-\sum_{j=1}^p a_jB^j)(1-B)^d x_t = (1-\sum_{j=1}^q m_j B^j) \xi_t \]
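A small sketch computing the weights of the fractional difference $(1-B)^d$ by the standard binomial recursion, which reproduces the autoregressive weights in the expansion above; the value of $d$ and the truncation lag are illustrative.

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients pi_k of (1 - B)^d = sum_{k>=0} pi_k B^k, truncated at lag n."""
    pi = np.empty(n + 1)
    pi[0] = 1.0
    for k in range(1, n + 1):
        pi[k] = pi[k - 1] * (k - 1 - d) / k      # recursion for the binomial expansion
    return pi

d = 0.3                                          # illustrative fractional order
pi = frac_diff_weights(d, 5)
# The autoregressive weights in the expansion above are w_k = -pi_k for k >= 1:
print(np.round(-pi[1:], 4))                      # w_1 = d, w_2 = d(1-d)/2, ...
```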