Training and Testing
- A good fit to the data does not imply good predictive power
- How to ensure the model 'generalizes' to new data?
- Divide data into training set and test set
- Model has to be fitted without including information from the test set
- Only run training as long as the performance on a held-out validation set still improves; the test set stays untouched until the final evaluation
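The split described above can be sketched as follows. The function name, the fractions, and the use of a third (validation) part for monitoring training are assumptions made for illustration, not something the notes prescribe.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and cut it into three disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # only used for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used to monitor training
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

data = list(range(100))
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

Any preprocessing that is fitted to data (e.g. normalization constants) would likewise be computed on the training part only.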
Linear Models
\[ \begin{array}{c c}x_1 & \searrow \\ x_2& \nearrow \end{array}\; w_1 x_1 + w_2 x_2 + b_1 \]
Gradient Descent
- There are different ways of fitting a model
- Example from exercise: Nelder-Mead: algorithm for finding a local minimum of a nonlinear function of several parameters: iterate an ensemble of points (a simplex): reflect, expand, or contract the worst point, or shrink the simplex
Minimizing a function F with respect to the parameter a: \[ a_{n+1} = a_n - l \nabla F(a_n) \] with learning rate l
F is the loss function comparing the prediction x to the measurement z. E.g. for a squared error: \[ a_{n+1} = a_n - 2l [x(a_n)-z]\frac{\partial x}{\partial a} \]
- stochastic: at each update step, minimize with one sample or a subset of samples (batch) instead of the full data set
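A minimal sketch of the update rule with mini-batches, assuming the one-parameter model x(a) = a u for an input u (so that the derivative of x with respect to a is u). The model, the function name `sgd_fit`, and all numerical values are illustrative assumptions.

```python
import random

def sgd_fit(inputs, targets, lr=0.01, epochs=200, batch_size=2, seed=0):
    """Fit x(a) = a * u by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    a = 0.0
    indices = list(range(len(inputs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches in every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Batch average of 2 * [x(a) - z] * dx/da, with dx/da = u
            grad = sum(2.0 * (a * inputs[i] - targets[i]) * inputs[i]
                       for i in batch) / len(batch)
            a -= lr * grad
    return a

inputs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
targets = [3.0 * u for u in inputs]  # noise-free data generated with a = 3
a_fit = sgd_fit(inputs, targets)
print(a_fit)  # close to 3
```

With a learning rate that is too large the iteration diverges; the value used here was chosen to be stable for this particular data set.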
The Perceptron
\[ x\;\begin{array}{c c c}\nearrow & \Theta (w_1 x + b_1) & \searrow \\ \searrow & \Theta(w_2 x + b_2) & \nearrow \end{array}\; w_3 \Theta(w_1 x + b_1) + w_4 \Theta(w_2 x + b_2) +b_3 \]
- There can be multiple inputs and outputs
\[ \begin{array}{c c c c}x_1 &{-\rightarrow \atop \diagdown\nearrow} & \Theta (w_1 x_1 + w_2 x_2 + b_1) & \searrow \\ x_2& {\diagup\searrow \atop -\rightarrow} & \Theta(w_3 x_1 + w_4 x_2 + b_2) & \nearrow \end{array}\; w_5 \Theta(w_1 x_1 + w_2 x_2 + b_1) + w_6 \Theta(w_3 x_1 + w_4 x_2 + b_2) +b_3 \]
- Tasks: regression and classification
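As a sketch of the two-input network above, assuming that both hidden units read the same two inputs x_1 and x_2 and that Θ is the Heaviside step function. The weights are hand-picked (not trained) so that the network classifies XOR; they are an illustrative assumption.

```python
def step(u):
    """Heaviside step function: 1 for positive input, else 0 (value at 0 is a convention)."""
    return 1.0 if u > 0 else 0.0

def perceptron(x1, x2, w, b):
    """Two inputs, two hidden step units, one linear output."""
    hidden_1 = step(w[0] * x1 + w[1] * x2 + b[0])
    hidden_2 = step(w[2] * x1 + w[3] * x2 + b[1])
    return w[4] * hidden_1 + w[5] * hidden_2 + b[2]

# hidden_1 acts as OR, hidden_2 as AND; the output is OR minus AND, i.e. XOR
w = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0]
b = [-0.5, -1.5, 0.0]
xor_table = [perceptron(x1, x2, w, b) for x1 in (0, 1) for x2 in (0, 1)]
print(xor_table)  # [0.0, 1.0, 1.0, 0.0]
```

A single step unit cannot represent XOR, which is why the hidden layer is needed in this example.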
The Neural Network
Different activation functions \[ \Theta \rightarrow h(x) \]
- tanh \[ h(x)=\tanh(x) \]
- relu \[ h(x)=\left\lbrace\begin{array}{ll}0 & x\leq 0 \\ x & x>0 \end{array}\right. \]
- softplus \[ h(x)=\log(1+e^x) \]
- ...
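The three activation functions listed above, written as plain Python functions. The softplus is implemented in a rearranged but mathematically equivalent form to avoid overflow for large x; that rearrangement is a choice made here, not part of the notes.

```python
import math

def tanh(x):
    return math.tanh(x)

def relu(x):
    return x if x > 0 else 0.0

def softplus(x):
    # log(1 + e^x) rewritten as max(x, 0) + log(1 + e^(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in (-2.0, 0.0, 2.0):
    print(x, tanh(x), relu(x), softplus(x))
```

Softplus can be read as a smooth version of relu: the two agree for large |x| and differ most around x = 0.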
More Layers
\[ x\;\begin{array}{c c c c c c c} & h(w_1 x + b_1) &-\rightarrow & h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) \\ \nearrow & & \diagdown\nearrow & & \searrow \\ \searrow & &\diagup\searrow & & \nearrow \\ & h(w_2 x + b_2) &-\rightarrow & h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) & \end{array}\; w_7 h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) + w_8 h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) + b_5 \]
- Input layer - hidden layers - output layer
- Here: feedforward neural network in contrast to recurrent neural networks
Initialize weights (e.g. with Gaussian distribution) and biases (e.g. as 0).
Perform stochastic gradient descent for the network. Compute the gradients sequentially, starting from the output layer and moving towards the input (backpropagation).
Example:\[ y_i=w_3 h(w_1 x_i + b_1) + w_4 h(w_2 x_i + b_2) +b_3 \]
Error: \[ E=\sum_{i}(y_i-z_i)^2 \]
\[ \frac{\partial E}{\partial b_3} = \sum_{i} 2(y_i-z_i) \]
\[ \frac{\partial E}{\partial w_4} = \sum_{i} 2(y_i-z_i)h(w_2 x_i + b_2) \]
\[ \frac{\partial E}{\partial w_3} = \sum_{i} 2(y_i-z_i)h(w_1 x_i + b_1) \]
\[ \frac{\partial E}{\partial b_2} = \sum_{i} 2(y_i-z_i)w_4\frac{\partial h(w_2 x_i + b_2)}{\partial b_2} \]
\[ ... \]
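The derivatives of the example network can be checked numerically. The sketch below assumes h = tanh (so h' = 1 - h^2) and sums over all samples, matching the definition of E; the parameter values and data are arbitrary illustrative numbers.

```python
import math

def forward(p, x):
    h1 = math.tanh(p["w1"] * x + p["b1"])
    h2 = math.tanh(p["w2"] * x + p["b2"])
    return p["w3"] * h1 + p["w4"] * h2 + p["b3"], h1, h2

def loss(p, xs, zs):
    return sum((forward(p, x)[0] - z) ** 2 for x, z in zip(xs, zs))

def gradients(p, xs, zs):
    """Analytic gradients of E, computed from the output layer backwards."""
    g = {name: 0.0 for name in p}
    for x, z in zip(xs, zs):
        y, h1, h2 = forward(p, x)
        d = 2.0 * (y - z)                      # dE/dy for this sample
        g["b3"] += d
        g["w3"] += d * h1
        g["w4"] += d * h2
        d1 = d * p["w3"] * (1.0 - h1 ** 2)     # chain rule through hidden unit 1
        d2 = d * p["w4"] * (1.0 - h2 ** 2)     # chain rule through hidden unit 2
        g["b1"] += d1
        g["w1"] += d1 * x
        g["b2"] += d2
        g["w2"] += d2 * x
    return g

def numerical_gradient(p, xs, zs, name, eps=1e-6):
    """Central finite difference of E with respect to one parameter."""
    up, down = dict(p), dict(p)
    up[name] += eps
    down[name] -= eps
    return (loss(up, xs, zs) - loss(down, xs, zs)) / (2.0 * eps)

p = {"w1": 0.5, "w2": -0.3, "w3": 0.8, "w4": 1.2, "b1": 0.1, "b2": -0.2, "b3": 0.05}
xs = [-1.0, 0.0, 0.5, 2.0]
zs = [0.2, -0.1, 0.4, 1.0]
analytic = gradients(p, xs, zs)
max_diff = max(abs(analytic[n] - numerical_gradient(p, xs, zs, n)) for n in p)
print(max_diff)  # small: analytic and numerical gradients agree
```

Note how the factors d1 and d2 are reused for both the weight and the bias of a hidden unit; this reuse is what makes the backward pass cheap.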
Epochs and early stopping
- Batch: share of data considered for each update step
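- Epoch: one full pass through the training data (this is a single update step only if the batch is the whole data set)
- Early stopping: instead of running a fixed number of epochs, stop once a condition is fulfilled (e.g. the error on the validation set increases)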
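A possible early-stopping loop, sketched under the assumption that one epoch is one pass over the training data and that training stops once the validation error has not improved for `patience` epochs. The callback names and the patience criterion are illustrative choices; the error curve is simulated, not measured.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=3):
    """Run epochs until the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error

# Simulated validation errors: improving first, then rising (overfitting)
simulated_errors = iter([5.0, 4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0])
epochs_run = []
best_epoch, best_error = train_with_early_stopping(
    train_one_epoch=lambda: epochs_run.append(1),
    validation_error=lambda: next(simulated_errors),
    patience=3,
)
print(best_epoch, best_error, len(epochs_run))  # 4 2.5 7
```

In practice one would also keep a copy of the parameters from the best epoch and restore them after stopping.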
Deep Learning
- Neural networks with many layers
- Typically, not all layers are of the same type
Convolutional Neural Networks
- Sometimes it is beneficial to link not all nodes but only neighboring nodes
- Convolutional layer - filter over input data
- This establishes a context between nodes, e.g. for images, pixels should be connected if they are close to each other
- Networks for images also typically include max-pooling layers
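To illustrate the filter idea in one dimension: each output value only depends on neighboring input values, and max pooling then keeps the largest value in each window. The edge-detecting filter and the signal are made-up examples; as in most deep learning libraries, the "convolution" here does not flip the kernel (strictly a cross-correlation).

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal without padding."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping window of length `size`."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

signal = [0, 0, 1, 1, 1, 0, 0, 0]
edge_filter = [-1, 1]          # responds to a step between neighbors
features = conv1d(signal, edge_filter)
pooled = max_pool1d(features, 2)
print(features)  # [0, 1, 0, 0, -1, 0, 0]
print(pooled)    # [1, 0, 0]
```

The same two filter weights are applied at every position, so the layer has far fewer parameters than a fully connected one.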
Recurrent Neural Networks
\[ \begin{array}{rcl}\mathrm{Input}_1 \rightarrow \times w_1 \rightarrow +b_1 \rightarrow&h(\bullet)&\rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_1 \\&{| \atop w_2}&\\ \mathrm{Input}_2 \rightarrow \times w_1 \rightarrow&\stackrel{\downarrow}{+}&b_1 \rightarrow h(\bullet) \rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_2 \end{array} \]
- Vanishing/exploding gradient problem
- Similar to AR(1) process
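A sketch of the recurrence in the diagram, read here as hidden_t = h(w_1 Input_t + w_2 hidden_{t-1} + b_1) and Output_t = w_3 hidden_t + b_2, with the initial hidden state assumed to be 0. With the identity as activation the hidden state follows an AR(1)-type recursion, and the influence of the first input on the last output scales like w_2 to the power of the number of steps in between, which illustrates the vanishing/exploding behavior.

```python
import math

def rnn_forward(inputs, w1, w2, w3, b1, b2, h=math.tanh):
    """Unrolled recurrent network with shared weights at every time step."""
    hidden = 0.0
    outputs = []
    for value in inputs:
        hidden = h(w1 * value + w2 * hidden + b1)
        outputs.append(w3 * hidden + b2)
    return outputs

def first_input_sensitivity(w2, steps):
    """Change of the last output when the first input changes from 0 to 1 (linear case)."""
    identity = lambda u: u
    base = rnn_forward([0.0] * steps, 1.0, w2, 1.0, 0.0, 0.0, h=identity)[-1]
    bumped = rnn_forward([1.0] + [0.0] * (steps - 1), 1.0, w2, 1.0, 0.0, 0.0, h=identity)[-1]
    return bumped - base

print(first_input_sensitivity(0.5, 11))  # 0.5**10, vanishing
print(first_input_sensitivity(2.0, 11))  # 2.0**10, exploding
```

With a saturating activation such as tanh, the derivative of h adds further factors below 1, which makes the vanishing case even more pronounced.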
Long Short-Term Memory
Long-term memory L and short-term memory (= output) S; the current input is I_i; h is the sigmoid function
Forget gate
\[ L\rightarrow L h( w_1S+w_2I_i+b_1 ) \]
Input gate
\[ L\rightarrow L+ h( w_3S+w_4I_i+b_2 ) \tanh( w_5S+w_6I_i+b_3 ) \]
Output gate
\[ S\rightarrow h( w_7S+w_8I_i+b_4 ) \tanh(L) \]
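The three gates can be combined into one update step. The sketch below uses scalar weights as in the formulas, stores them in lists whose index 0 is unused so that w[1]..w[8] and b[1]..b[4] match the notation, and assumes that all gates see the short-term memory S from the previous step. The function name `lstm_step` is illustrative.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def lstm_step(L, S, I, w, b):
    """One LSTM step: returns the updated long-term and short-term memory."""
    # Forget gate: keep a fraction of the long-term memory
    L = L * sigmoid(w[1] * S + w[2] * I + b[1])
    # Input gate: add a gated candidate value to the long-term memory
    L = L + sigmoid(w[3] * S + w[4] * I + b[2]) * math.tanh(w[5] * S + w[6] * I + b[3])
    # Output gate: new short-term memory, computed from the previous S and the new L
    S = sigmoid(w[7] * S + w[8] * I + b[4]) * math.tanh(L)
    return L, S

w = [0.0] * 9   # index 0 unused
b = [0.0] * 5   # index 0 unused
L, S = 1.0, 0.0
for current_input in (0.3, -0.1, 0.7):
    L, S = lstm_step(L, S, current_input, w, b)
print(L, S)
```

With all weights and biases set to 0, every gate equals 0.5, so the long-term memory is halved at each step regardless of the input.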
Application to classification of diffusion trajectories