The AR(1) model \[x_{t} = a x_{t-1} + \sigma \xi_t,\] where the noise $\xi_t$ has zero mean and unit variance, has two parameters: the coefficient $a$ and the noise amplitude $\sigma$
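A minimal NumPy sketch of simulating this AR(1) process; the parameter values and series length are illustrative choices, not from the lecture.

```python
import numpy as np

def simulate_ar1(a, sigma, T, seed=0):
    """Simulate x_t = a*x_{t-1} + sigma*xi_t with zero-mean, unit-variance Gaussian noise."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal(T)              # zero mean, unit variance
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + sigma * xi[t]
    return x

x = simulate_ar1(a=0.8, sigma=1.0, T=1000)   # illustrative parameter values
```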
Bias: bias is the error of a fit that is unavoidable when the model does not have enough parameters to represent the full complexity of the data
Variance: variance is the error of the fit that arises because the fitted parameters follow the random fluctuations of the particular data sample; it grows with the number of free parameters
The squared error of the model can be decomposed into a bias and a variance contribution
More information on the fundamental concepts of learning and fitting models can be found at https://work.caltech.edu/telecourse
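A hedged numerical sketch of the decomposition: repeatedly fit a low-degree and a high-degree polynomial to noisy samples of an assumed target function and estimate the squared bias and the variance of the predictions. The target function, noise level, polynomial degrees, and number of runs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x_grid = np.linspace(0.0, 1.0, 50)
f_true = np.sin(2 * np.pi * x_grid)            # assumed target function
n_runs, noise = 500, 0.3

for degree in (1, 9):                          # too few vs. many free parameters
    preds = np.empty((n_runs, x_grid.size))
    for r in range(n_runs):
        y = f_true + noise * rng.standard_normal(x_grid.size)
        preds[r] = np.polyval(np.polyfit(x_grid, y, degree), x_grid)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # error from insufficient model complexity
    variance = np.mean(preds.var(axis=0))                  # error from following the noise
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The low-degree fit shows large bias and small variance; the high-degree fit shows the reverse.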
Remember: Mean squared errors
Data: $z(k,l)$; model: $x(k,l)=ak+bl+c$. For the best model, minimize \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 \]
Ridge Regression: favor small $a,b$ (small impact of each input - especially realistic for models with many dimensions) by adding a term that penalizes the slopes \[ \sum_k\sum_l (ak+bl+c-z(k,l))^2 + \lambda (a^2 + b^2) \] with a suitable weight $\lambda$; see the sketch below
Additional advantage: ridge regression can fit models with more dimensions (parameters) than data points, because the penalty term keeps the solution well defined
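A minimal sketch of the penalized fit above for the plane model $z(k,l)\approx ak+bl+c$, using the closed-form normal equations with the penalty applied only to the slopes $a,b$; the grid size, true parameters, noise, and $\lambda$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L, lam = 10, 10, 1.0
k, l = np.meshgrid(np.arange(K), np.arange(L), indexing="ij")
z = 0.5 * k - 0.2 * l + 3.0 + rng.standard_normal((K, L))   # noisy plane data (illustrative)

# Design matrix with columns (k, l, 1); the penalty matrix shrinks only a and b.
X = np.column_stack([k.ravel(), l.ravel(), np.ones(K * L)])
P = lam * np.diag([1.0, 1.0, 0.0])
a, b, c = np.linalg.solve(X.T @ X + P, X.T @ z.ravel())
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

For $\lambda \to 0$ this reduces to ordinary least squares; larger $\lambda$ shrinks the slopes toward zero.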
Real data is noisy
Extreme case: a model that learns every situation from the past exactly is 'memorizing' these situations. However, predictions are about situations outside the past data, so such a model is not helpful.
If the model has too many free parameters, it 'overfits' - it fits the noise rather than the underlying structure
Overfitting is increased by noise in the data and by a large number of free parameters relative to the number of data points
Backshift operator \[ Bx_t = x_{t-1} \Rightarrow B^jx_t = x_{t-j} \]
The AR(p) process can be written as \[ x_t=\sum_{j=1}^p a_jB^j x_t + \xi_t \]
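A small sketch simulating an AR(p) process from given coefficients via the recursion above (the backshift form written out explicitly); the coefficients, series length, and zero pre-sample values are illustrative assumptions.

```python
import numpy as np

def simulate_arp(a, T, seed=3):
    """Simulate x_t = sum_j a_j x_{t-j} + xi_t with unit-variance Gaussian noise."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    rng = np.random.default_rng(seed)
    x = np.zeros(T + p)                      # p zeros as pre-sample values (assumption)
    xi = rng.standard_normal(T + p)
    for t in range(p, T + p):
        x[t] = a @ x[t - p:t][::-1] + xi[t]  # x_{t-1}, ..., x_{t-p} in the order of a_1, ..., a_p
    return x[p:]

x = simulate_arp(a=[0.5, -0.3], T=1000)      # an AR(2) example
```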
The standard reference on this topic is the book 'Time Series Analysis: Forecasting and Control' by Box and Jenkins.
Solve the characteristic equation, here for the AR(2) process,
\[ 1-a_1G^{-1}-a_2{G^{-1}}^2 = 0 \] \[ G_{1,2} = \frac{a_1\pm \sqrt{a_1^2+4a_2}}{2} \;\; [{\Rightarrow}\; a_1=G_1+G_2, \;\;\; a_2=-G_1G_2] \]The $G_{1,2}$ are real or complex; complex values lead to oscillating autocorrelations
Autocorrelations have form: \[ c_\Delta = h_1G_1^\Delta+h_2G_2^\Delta \]
Stationarity: \[ |G_i| < 1 \;\; \forall i \]
From the AR(p) form we define
\[ \xi_t = \Phi(B)x_t. \]The term can be rewritten as
\[ \Phi(B) = 1 - \sum_{j=1}^p a_jB^j = \prod_{j=1}^p (1-G_jB), \]where the $G_j^{-1}$ are the roots of \[ \Phi(B)=0 \]
The process is stationary if $|G_j|<1$ for all $j$, i.e. if all roots of $\Phi(B)=0$ lie outside the unit circle
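A hedged sketch of this stationarity check for a general AR(p): the $G_j$ are the reciprocals of the roots of $\Phi(B)$, computed here with `numpy.roots`; the example coefficients are illustrative.

```python
import numpy as np

def ar_G(a):
    """Return G_1, ..., G_p for Phi(B) = 1 - a_1 B - ... - a_p B^p."""
    a = np.asarray(a, dtype=float)
    coeffs = np.concatenate((-a[::-1], [1.0]))   # polynomial coefficients, highest power of B first
    roots_B = np.roots(coeffs)                   # roots of Phi(B) = 0, i.e. the G_j^{-1}
    return 1.0 / roots_B

G = ar_G([0.5, -0.8])                            # AR(2) with a_1^2 + 4 a_2 < 0
print(G, "stationary:", bool(np.all(np.abs(G) < 1)))
# Complex G_j correspond to oscillating autocorrelations.
```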
Partial autocorrelation: the correlation at some lag $k$ that remains after removing the effect of the correlations due to the terms at shorter lags.
Yule-Walker equations
\[ \left(\begin{array}{c c c c c}1&c_1&c_2&\cdots &c_{k-1}\\ c_1&1&c_1&\cdots &c_{k-2}\\ \vdots &\vdots &\vdots &\ddots &\vdots\\ c_{k-1}&c_{k-2}&c_{k-3}&\cdots &1\end{array}\right)\left(\begin{array}{c}\Phi_{k1}\\ \Phi_{k2}\\ \vdots\\ \Phi_{kk}\end{array}\right) = \left(\begin{array}{c}c_{1}\\ c_{2}\\ \vdots\\ c_{k}\end{array}\right) \]The last component $\Phi_{kk}$ of the solution is the partial autocorrelation at lag $k$
\[ \Phi_{kk} = Corr(x_t,x_{t-k}|x_{t-1},x_{t-2},...,x_{t-k+1}) \]If the process is of order p, the partial autocorrelation function has cutoff after lag p \[\Phi_{kk}=0 \;\; \forall \; k>p\]
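A minimal sketch that builds the Yule-Walker system from sample autocorrelations and solves it lag by lag, returning the last component $\Phi_{kk}$ of each solution; the AR(2) series used for the check is an illustrative assumption.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations c_1, ..., c_max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:-k]) / c0 for k in range(1, max_lag + 1)])

def pacf_yule_walker(x, max_lag):
    """Partial autocorrelations Phi_kk from the Yule-Walker equations."""
    c = sample_autocorr(x, max_lag)
    pacf = np.empty(max_lag)
    for k in range(1, max_lag + 1):
        # Toeplitz matrix with 1 on the diagonal and c_|i-j| off the diagonal.
        R = np.array([[1.0 if i == j else c[abs(i - j) - 1] for j in range(k)]
                      for i in range(k)])
        pacf[k - 1] = np.linalg.solve(R, c[:k])[-1]   # last component is Phi_kk
    return pacf

rng = np.random.default_rng(4)
x = np.zeros(2000)
for t in range(2, 2000):                              # an AR(2) test series
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(np.round(pacf_yule_walker(x, 5), 3))            # roughly zero beyond lag 2
```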
Negative conditional log-likelihood
\[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-1}\sum_{t=0}^{T-1} (x_t-\sum_{j=1}^p a_jx_{t-j})^2}{2v}, \]is a useful approximation to the true negative log-likelihood.
A more accurate approximation is given by \[ L(a,v)=\frac{1}{2}\log(v) + \frac{\frac{1}{T-1}\sum_{t=-\infty}^{T-1} \langle\tilde{\xi}_t^2(a|x,\xi)\rangle}{2v}, \]
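A sketch evaluating the conditional form of the negative log-likelihood for given $(a, v)$, summing the squared one-step prediction errors over the time points where all $p$ lagged values are available; normalizing by $T-p$ instead of $T-1$ is a simplifying assumption that matters little for long series.

```python
import numpy as np

def neg_cond_loglik(a, v, x):
    """Conditional negative log-likelihood L(a, v) of an AR(p) model for series x."""
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    p, T = len(a), len(x)
    resid = np.array([x[t] - a @ x[t - p:t][::-1] for t in range(p, T)])
    return 0.5 * np.log(v) + np.mean(resid ** 2) / (2.0 * v)

rng = np.random.default_rng(5)
x = np.zeros(1000)
for t in range(2, 1000):                          # an illustrative AR(2) series
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(neg_cond_loglik([0.5, -0.3], 1.0, x))
```

Minimizing this expression over $a$ is an ordinary least-squares problem on the lagged values; minimizing over $v$ gives $v$ equal to the mean squared residual.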
A moving-average process of order $q$, MA($q$), can be defined as \[ x_t = (1-\sum_{j=1}^{q} m_jB^j) \xi_t \]
Has the autocorrelation function \[ c_k = \left\lbrace \begin{array}{l l} \frac{-m_k + m_1m_{k+1}+...+ m_{q-k} m_q}{1 + m_1^2 + ... + m_q^2} & k=1,2,...,q\\ 0 & k>q \end{array} \right. \]
Find order q: last non-zero lag of the autocorrelation function
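A hedged check of the cutoff property: simulate an MA(2) process with the convention above and compare its sample autocorrelations with the theoretical formula; the coefficients and series length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
m = np.array([0.6, -0.3])                        # m_1, m_2 of an illustrative MA(2)
q, T = len(m), 100_000

xi = rng.standard_normal(T + q)
# x_t = xi_t - m_1 xi_{t-1} - ... - m_q xi_{t-q}
x = xi[q:] - sum(m[j] * xi[q - 1 - j:T + q - 1 - j] for j in range(q))

denom = 1.0 + np.sum(m ** 2)
c_theory = [(-m[k - 1] + np.sum(m[:q - k] * m[k:])) / denom for k in range(1, q + 1)]

xc = x - x.mean()
c0 = np.dot(xc, xc)
c_sample = [np.dot(xc[k:], xc[:-k]) / c0 for k in range(1, q + 2)]
print("theory:", np.round(c_theory, 3))
print("sample:", np.round(c_sample, 3))          # the lag q+1 value should be close to zero
```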
AR and MA models can be mixed; this is called an ARMA(p,q) model \[ x_t = \sum_{j=1}^p a_jB^j x_t - \sum_{j=1}^q m_jB^j \xi_t + \xi_t \]
Autoregressive representation of a fractionally integrated process with differencing parameter $d$: \[ x_t=\sum_{k=1}^\infty (-1)^{k+1}\frac{\prod_{a=0}^{k-1}(d-a)}{k!}x_{t-k}+\xi_t \]
Equivalently, in operator form \[ (1-B)^d x_t = \xi_t, \]which corresponds to the moving-average representation $x_t = (1-B)^{-d}\xi_t$
ARFIMA(p,d,q): \[ (1-\sum_{j=1}^p a_jB^j)(1-B)^d x_t = (1-\sum_{j=1}^q m_j B^j) \xi_t \]
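A small sketch computing the weights of the fractional difference $(1-B)^d$ by the standard binomial recursion, which reproduces the autoregressive weights in the expansion above; the value of $d$ and the truncation lag are illustrative.

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients pi_k of (1 - B)^d = sum_{k>=0} pi_k B^k, truncated at lag n."""
    pi = np.empty(n + 1)
    pi[0] = 1.0
    for k in range(1, n + 1):
        pi[k] = pi[k - 1] * (k - 1 - d) / k      # recursion for the binomial expansion
    return pi

d = 0.3                                          # illustrative fractional order
pi = frac_diff_weights(d, 5)
# The autoregressive weights in the expansion above are w_k = -pi_k for k >= 1:
print(np.round(-pi[1:], 4))                      # w_1 = d, w_2 = d(1-d)/2, ...
```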