Training and Testing
- Goodness of fit does not mean that the model has good predictive power
- How to ensure the model 'generalizes' to new data?
- Divide the data into a training set and a test set (see the sketch below)
- Model has to be fitted without including information from the test set
- Only run training so long as the performance on the test set still improves
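A minimal sketch of such a split in Python/numpy, with toy data and an ordinary least-squares fit standing in as the model; all names and numbers are illustrative only.
```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 100 samples, 2 features (stand-in for real measurements)
X = rng.normal(size=(100, 2))
z = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# random 80/20 split into training and test set
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
X_train, z_train = X[idx[:n_train]], z[idx[:n_train]]
X_test, z_test = X[idx[n_train:]], z[idx[n_train:]]

# fit a linear model on the training set only ...
w, *_ = np.linalg.lstsq(np.c_[X_train, np.ones(n_train)], z_train, rcond=None)

# ... and judge generalization by the error on the unseen test set
pred = np.c_[X_test, np.ones(len(X_test))] @ w
print("test MSE:", np.mean((pred - z_test) ** 2))
```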
Linear Models
\[ \begin{array}{c c}x_1 & \searrow \\ x_2& \nearrow \end{array}\; w_1 x_1 + w_2 x_2 + b_1 \]
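As a trivial code sketch of the linear model above (the weights w1, w2 and bias b1 are arbitrary example values):
```python
def linear_model(x1, x2, w1=0.5, w2=-1.2, b1=0.3):
    # w1*x1 + w2*x2 + b1
    return w1 * x1 + w2 * x2 + b1

print(linear_model(1.0, 2.0))  # 0.5 - 2.4 + 0.3 = -1.6
```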
Gradient Descent
- There are different ways of fitting a model
- Example from the exercise: Nelder-Mead, an algorithm for finding a local optimum of a nonlinear function of several parameters: iterate an ensemble of points (a simplex) - reflect, expand, or contract the worst point - shrink the simplex if needed
Minimizing a function F with respect to a parameter a: \[ a_{n+1} = a_n - l\,\nabla F(a_n) \] with learning rate l
F is the loss function comparing the prediction x to the measurement z, e.g. for a mean squared error: \[ a_{n+1} = a_n - 2l\,[x(a_n)-z]\frac{\partial x}{\partial a} \]
- stochastic: at each update step, minimize with only one sample or a small subset of samples (batch); see the sketch below
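A sketch of (stochastic) gradient descent on a mean squared error, assuming a hypothetical one-parameter model x(a) = a·t and toy data; the learning rate and batch size are arbitrary example values.
```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: measurements z of a one-parameter model x(a) = a * t
t = np.linspace(0.0, 1.0, 50)
z = 2.5 * t + rng.normal(scale=0.05, size=t.size)

a = 0.0            # initial parameter
l = 0.1            # learning rate
batch_size = 10

for epoch in range(200):
    # stochastic variant: one update per mini-batch instead of per full data set
    order = rng.permutation(t.size)
    for start in range(0, t.size, batch_size):
        idx = order[start:start + batch_size]
        x = a * t[idx]              # prediction x(a)
        dx_da = t[idx]              # derivative of the prediction w.r.t. a
        grad = np.mean(2.0 * (x - z[idx]) * dx_da)
        a = a - l * grad            # a_{n+1} = a_n - l * grad F(a_n)

print("fitted a:", a)               # should end up close to 2.5
```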
The Perceptron
\[ x\;\begin{array}{c c c}\nearrow & \Theta (w_1 x + b_1) & \searrow \\ \searrow & \Theta(w_2 x + b_2) & \nearrow \end{array}\; w_3 \Theta(w_1 x + b_1) + w_4 \Theta(w_2 x + b_2) +b_3 \]
- There can be multiple inputs and outputs
\[ \begin{array}{c c c c}x_1 &{-\rightarrow \atop \diagdown\nearrow} & \Theta (w_1 x_1 + w_2 x_2 + b_1) & \searrow \\ x_2& {\diagup\searrow \atop -\rightarrow} & \Theta(w_3 x_1 + w_4 x_2 + b_2) & \nearrow \end{array}\; w_5 \Theta(w_1 x_1 + w_2 x_2 + b_1) + w_6 \Theta(w_3 x_1 + w_4 x_2 + b_2) +b_3 \]
- Tasks: regression and classification
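A sketch of the forward pass of the two-input perceptron drawn above, with Θ the Heaviside step function; the weights and biases are arbitrary example values.
```python
import numpy as np

def theta(x):
    # Heaviside step activation
    return np.where(x > 0, 1.0, 0.0)

def perceptron(x1, x2, w=(1.0, 1.0, 1.0, -1.0, 1.0, 1.0), b=(-0.5, 0.5, 0.0)):
    w1, w2, w3, w4, w5, w6 = w
    b1, b2, b3 = b
    h1 = theta(w1 * x1 + w2 * x2 + b1)   # first hidden unit
    h2 = theta(w3 * x1 + w4 * x2 + b2)   # second hidden unit
    return w5 * h1 + w6 * h2 + b3        # linear output

print(perceptron(1.0, 0.0))
```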
The Neural Network
Different activation functions \[ \Theta \rightarrow h(x) \]
- tanh \[ h(x)=\tanh(x) \]
- relu \[ h(x)=\left\lbrace\begin{array}{ll}0 & x\leq 0 \\ x & x>0 \end{array}\right. \]
- softplus \[ h(x)=\log(1+e^x) \]
- ...
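The three activation functions above, sketched in numpy (softplus written via logaddexp for numerical stability):
```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def relu(x):
    # 0 for x <= 0, x for x > 0
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x)
    return np.logaddexp(0.0, x)

x = np.linspace(-3, 3, 7)
for h in (tanh, relu, softplus):
    print(h.__name__, h(x))
```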
More Layers
\[ x\;\begin{array}{c c c c c c c} & h(w_1 x + b_1) &-\rightarrow & h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) \\ \nearrow & & \diagdown\nearrow & & \searrow \\ \searrow & &\diagup\searrow & & \nearrow \\ & h(w_2 x + b_2) &-\rightarrow & h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) & \end{array}\; w_7 h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) + w_8 h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) + b_5 \]
- Input layer - hidden layers - output layer
- Here: feedforward neural network in contrast to recurrent neural networks
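A sketch of the forward pass through the small network drawn above (one input, two hidden layers with two units each, one output), assuming tanh as the activation h and random example weights:
```python
import numpy as np

def h(x):
    # activation function; tanh chosen as an example
    return np.tanh(x)

def forward(x, params):
    """Forward pass: input layer -> two hidden layers -> output layer."""
    w1, w2, w3, w4, w5, w6, w7, w8 = params["w"]
    b1, b2, b3, b4, b5 = params["b"]
    # first hidden layer
    a1, a2 = h(w1 * x + b1), h(w2 * x + b2)
    # second hidden layer
    a3 = h(w3 * a1 + w4 * a2 + b3)
    a4 = h(w5 * a1 + w6 * a2 + b4)
    # linear output layer
    return w7 * a3 + w8 * a4 + b5

rng = np.random.default_rng(2)
params = {"w": rng.normal(size=8), "b": np.zeros(5)}
print(forward(0.7, params))
```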
Back-Propagation
Initialize the weights (e.g. from a Gaussian distribution) and the biases (e.g. as 0).
Perform stochastic gradient descent for the network. Update the parameters sequentially, working backwards from the output layer.
Example:\[ y_i=w_3 h(w_1 x_i + b_1) + w_4 h(w_2 x_i + b_2) +b_3 \]
Error: \[ E=\sum_{i}(y_i-z_i)^2 \]
\[ \frac{\partial E}{\partial b_3} = \sum_{i}2(y_i-z_i) \]
\[ \frac{\partial E}{\partial w_4} = \sum_{i}2(y_i-z_i)\,h(w_2 x_i + b_2) \]
\[ \frac{\partial E}{\partial w_3} = \sum_{i}2(y_i-z_i)\,h(w_1 x_i + b_1) \]
\[ \frac{\partial E}{\partial b_2} = \sum_{i}2(y_i-z_i)\,w_4\frac{\partial h(w_2 x_i + b_2)}{\partial b_2} \]
\[ ... \]
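A sketch of the worked example as code: the gradients are exactly the derivatives listed above, assuming h = tanh so that the derivative of h can be written explicitly; the toy data, learning rate and iteration count are arbitrary choices.
```python
import numpy as np

def h(x):
    return np.tanh(x)

def dh(x):
    # derivative of tanh
    return 1.0 - np.tanh(x) ** 2

def backprop_step(p, x, z, l=0.001):
    """One gradient-descent step for
    y_i = w3*h(w1*x_i + b1) + w4*h(w2*x_i + b2) + b3 with squared error."""
    w1, w2, w3, w4, b1, b2, b3 = p
    a1, a2 = w1 * x + b1, w2 * x + b2     # pre-activations
    y = w3 * h(a1) + w4 * h(a2) + b3      # prediction
    e = 2.0 * (y - z)                     # common factor 2(y_i - z_i)
    grads = {
        "b3": np.sum(e),
        "w3": np.sum(e * h(a1)),
        "w4": np.sum(e * h(a2)),
        "b1": np.sum(e * w3 * dh(a1)),
        "b2": np.sum(e * w4 * dh(a2)),
        "w1": np.sum(e * w3 * dh(a1) * x),
        "w2": np.sum(e * w4 * dh(a2) * x),
    }
    return np.array([w1 - l * grads["w1"], w2 - l * grads["w2"],
                     w3 - l * grads["w3"], w4 - l * grads["w4"],
                     b1 - l * grads["b1"], b2 - l * grads["b2"],
                     b3 - l * grads["b3"]])

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
z = np.sin(2 * x)                                # toy target
p = np.append(rng.normal(size=4), np.zeros(3))   # Gaussian weights, zero biases
for _ in range(5000):
    p = backprop_step(p, x, z)

y = p[2] * h(p[0] * x + p[4]) + p[3] * h(p[1] * x + p[5]) + p[6]
print("final error:", np.sum((y - z) ** 2))
```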
Epochs and early stopping
- Batch: the subset of the data used for one update step
- Epoch: one full pass through the training data
- Early stopping: instead of training for a fixed number of epochs, stop once a condition is fulfilled (e.g. the error on a validation set starts to increase); see the sketch below
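A sketch of early stopping for the simple one-parameter fit used in the gradient-descent example, assuming a held-out validation set and a patience counter; all numbers are illustrative.
```python
import numpy as np

rng = np.random.default_rng(4)

# toy data split into training and validation set
t = rng.uniform(-1, 1, 120)
z = 1.8 * t + rng.normal(scale=0.2, size=t.size)
t_train, z_train = t[:80], z[:80]
t_val, z_val = t[80:], z[80:]

a, l, batch_size = 0.0, 0.05, 16
best_a, best_val, patience, bad_epochs = a, np.inf, 5, 0

for epoch in range(500):
    # one epoch = one full pass over the training data in mini-batches
    order = rng.permutation(t_train.size)
    for start in range(0, t_train.size, batch_size):
        b = order[start:start + batch_size]
        grad = np.mean(2 * (a * t_train[b] - z_train[b]) * t_train[b])
        a -= l * grad
    # early stopping: monitor the error on the validation set
    val = np.mean((a * t_val - z_val) ** 2)
    if val < best_val:
        best_val, best_a, bad_epochs = val, a, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping early after epoch {epoch}")
            break

print("best a:", best_a, "validation MSE:", best_val)
```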
Deep Learning
- Neural networks with many layers
- Typically, not all layers are of the same type
Convolutional Neural Networks
- Sometimes it is beneficial not to connect all nodes, but only neighboring ones
- Convolutional layer: a small filter is slid over the input data
- This establishes a context between nodes, e.g. for images, pixels that are close to each other should be connected
- Architectures for images typically also include max-pooling layers; see the sketch below
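A sketch of a convolutional layer and max pooling on a 1D signal, numpy only; the edge-detecting filter is an arbitrary example and not part of the original notes. Each output value depends only on a local neighborhood of the input, which is the "only neighboring nodes" idea above.
```python
import numpy as np

def conv1d(x, kernel):
    # slide the filter over the input; each output node only sees
    # a local neighborhood of the input ("valid" convolution)
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def maxpool1d(x, size=2):
    # keep only the maximum of each non-overlapping window
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

signal = np.array([0., 0., 1., 2., 3., 2., 1., 0., 0., 0.])
edge_filter = np.array([-1., 0., 1.])   # example filter detecting rising edges

features = conv1d(signal, edge_filter)
print("conv:   ", features)
print("pooled: ", maxpool1d(features))
```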
Recurrent Neural Networks
\[ \begin{array}{rcl}\mathrm{Input}_1 \rightarrow \times w_1 \rightarrow +b_1 \rightarrow&h(\bullet)&\rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_1 \\&{| \atop w_2}&\\ \mathrm{Input}_2 \rightarrow \times w_1 \rightarrow&\stackrel{\downarrow}{+}&b_1 \rightarrow h(\bullet) \rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_2 \end{array} \]
- Vanishing/exploding gradient problem: the recurrent weight enters the gradient multiplicatively at every time step
- Similar to an AR(1) process
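The recurrence drawn above written out as code: the hidden state of the previous step is fed back with weight w2, and the same weights are reused at every time step; the weights are arbitrary example values.
```python
import numpy as np

def rnn(inputs, w1=0.8, w2=0.5, w3=1.2, b1=0.0, b2=0.1, h=np.tanh):
    """Simple recurrent network: state_t = h(w1*x_t + w2*state_{t-1} + b1),
    output_t = w3*state_t + b2."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = h(w1 * x + w2 * state + b1)
        outputs.append(w3 * state + b2)
    return np.array(outputs)

print(rnn([1.0, 0.0, 0.0, 0.0]))
```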
Long Short-Term Memory
Long-term memory L and short-term memory (= output) S; the current input is I_i; h is the sigmoid function
Forget gate
\[ L\rightarrow L h( w_1S+w_2I_i+b_1 ) \]
Input gate
\[ L\rightarrow L+ h( w_3S+w_4I_i+b_2 ) \tanh( w_5S+w_6I_i+b_3 ) \]
Output gate
\[ S\rightarrow h( w_7S+w_8I_i+b_4 ) \tanh(L) \]
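One step of the cell described by the three gates above, assuming h is the logistic sigmoid; the weights are random example values and the loop only illustrates the state being carried along a short input sequence.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(L, S, I_i, w, b):
    """One LSTM step with long-term memory L, short-term memory (= output) S
    and current input I_i, following the three gates above."""
    w1, w2, w3, w4, w5, w6, w7, w8 = w
    b1, b2, b3, b4 = b
    # forget gate: scale down the long-term memory
    L = L * sigmoid(w1 * S + w2 * I_i + b1)
    # input gate: add new (gated) information to the long-term memory
    L = L + sigmoid(w3 * S + w4 * I_i + b2) * np.tanh(w5 * S + w6 * I_i + b3)
    # output gate: produce the new short-term memory / output
    S = sigmoid(w7 * S + w8 * I_i + b4) * np.tanh(L)
    return L, S

rng = np.random.default_rng(5)
w, b = rng.normal(size=8), np.zeros(4)
L, S = 0.0, 0.0
for I_i in [0.5, -1.0, 0.2]:
    L, S = lstm_step(L, S, I_i, w, b)
    print(f"L={L:.3f}  S={S:.3f}")
```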
Application to classification of diffusion trajectories