Datenanalyse und Stochastische Modellierung
11. Supervised Learning

Training and Testing

  • Goodness of fit does not mean that the model has good predictive power
  • How to ensure the model 'generalizes' to new data?
  • Divide data into training set and test set (a minimal split sketch follows after this list)
  • Model has to be fitted without including information from the test set
  • Only continue training as long as the performance on the test set still improves
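
A minimal sketch of such a split, here with synthetic data and an arbitrary 80/20 ratio:

```python
# Hedged sketch: random train/test split with NumPy; data and split ratio are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
z = 2.0 * x + 0.1 * rng.normal(size=x.size)   # synthetic measurements

idx = rng.permutation(x.size)                 # shuffle the sample indices
n_train = int(0.8 * x.size)                   # 80/20 split (arbitrary choice)
x_train, z_train = x[idx[:n_train]], z[idx[:n_train]]
x_test,  z_test  = x[idx[n_train:]], z[idx[n_train:]]
print(x_train.size, x_test.size)              # 80 20
```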

Linear Models

\[ \begin{array}{c c}x_1 & \searrow \\ x_2& \nearrow \end{array}\; w_1 x_1 + w_2 x_2 + b_1 \]
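
A small numerical sketch of this linear model (weights, bias, and inputs are arbitrary illustrative values):

```python
# Hedged sketch of the linear model above: a weighted sum of the inputs plus a bias.
import numpy as np

def linear_model(x, w, b):
    """Return w_1*x_1 + w_2*x_2 + ... + b for an input vector x."""
    return np.dot(w, x) + b

print(linear_model(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 0.5*1 - 0.25*2 + 0.1 = 0.1
```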

Gradient Descent

  • There are different ways of fitting a model
  • Example from exercise: Nelder-Mead: algorithm for finding a local optimum of a nonlinear function with several parameters: iterate an ensemble of points (a simplex) - reflect, expand, or contract the worst point - shrink the simplex (a usage sketch follows below)
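
A hedged usage sketch of Nelder-Mead via scipy.optimize.minimize (it searches for a local minimum; a maximum can be found by negating the objective); the toy objective is an assumption:

```python
# Hedged sketch: Nelder-Mead via SciPy on a toy 2-parameter objective with minimum at (1, -2).
import numpy as np
from scipy.optimize import minimize

def f(p):
    a, b = p
    return (a - 1.0) ** 2 + (b + 2.0) ** 2

res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x)   # approximately [1, -2]
```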

Minimizing a function F with respect to a parameter a: \[ a_{n+1} = a_n - l \nabla F(a_n) \] with learning rate l

F is the loss function comparing the prediction x to the measurement z. E.g. for the squared error \[ F=[x(a)-z]^2 \] the update becomes \[ a_{n+1} = a_n - 2l [x(a)-z]\frac{\partial x}{\partial a} \]

  • stochastic: in each update step, use only one sample or a subset of samples (batch) to evaluate the gradient (see the sketch below)
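
A minimal sketch of this update rule, fitting a one-parameter linear prediction x(a) = a·t to synthetic measurements (all values are illustrative):

```python
# Hedged sketch: stochastic gradient descent with the update a <- a - 2 l [x(a) - z] dx/da.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
z = 3.0 * t + 0.05 * rng.normal(size=t.size)   # synthetic measurements, true slope 3

a, l = 0.0, 0.1                                 # initial parameter and learning rate
for epoch in range(50):                         # one epoch = one pass over all samples
    for i in rng.permutation(t.size):           # stochastic: one sample per update step
        x = a * t[i]                            # prediction for this sample
        a -= 2 * l * (x - z[i]) * t[i]          # dx/da = t[i] for the linear model
print(a)                                        # close to 3
```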

The Perceptron

\[ x\;\begin{array}{c c c}\nearrow & \Theta (w_1 x + b_1) & \searrow \\ \searrow & \Theta(w_2 x + b_2) & \nearrow \end{array}\; w_3 \Theta(w_1 x + b_1) + w_4 \Theta(w_2 x + b_2) +b_3 \]
  • There can be multiple inputs and outputs
  • \[ \begin{array}{c c c c}x_1 &{-\rightarrow \atop \diagdown\nearrow} & \Theta (w_1 x_1 + w_2 x_2 + b_1) & \searrow \\ x_2& {\diagup\searrow \atop -\rightarrow} & \Theta(w_3 x_1 + w_4 x_2 + b_2) & \nearrow \end{array}\; w_5 \Theta(w_1 x_1 + w_2 x_2 + b_1) + w_6 \Theta(w_3 x_1 + w_4 x_2 + b_2) +b_3 \]
  • Tasks: regression and classification (a numerical sketch of the forward pass follows below)
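
A numerical sketch of the two-input perceptron above, with a Heaviside step function for Θ and arbitrary placeholder weights and biases:

```python
# Hedged sketch of the two-input perceptron with step activation Theta.
import numpy as np

def theta(u):
    return np.heaviside(u, 0.0)    # step function: 0 for u < 0, 1 for u > 0

def perceptron(x1, x2, w, b):
    h1 = theta(w[0] * x1 + w[1] * x2 + b[0])    # first hidden unit
    h2 = theta(w[2] * x1 + w[3] * x2 + b[1])    # second hidden unit
    return w[4] * h1 + w[5] * h2 + b[2]         # linear output

w = [1.0, -1.0, 0.5, 0.5, 2.0, -1.0]            # w_1 ... w_6, arbitrary values
b = [0.0, -0.5, 0.1]                            # b_1, b_2, b_3
print(perceptron(0.7, 0.2, w, b))               # hidden units 1 and 0 -> 2*1 - 1*0 + 0.1 = 2.1
```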

The Neural Network

Nice and very simple introduction

Different activation functions \[ \Theta \rightarrow h(x) \]

  • tanh \[ h(x)=\tanh(x) \]
  • relu \[ h(x)=\left\lbrace\begin{array}{ll}0 & x\leq 0 \\ x & x>0 \end{array}\right. \]
  • softplus \[ h(x)=\log(1+e^x) \]
  • ...
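
The activation functions listed above as NumPy one-liners (a small sketch, not a library API):

```python
# Hedged sketch: the activation functions above, implemented with NumPy.
import numpy as np

def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0.0, x)
def softplus(x): return np.log1p(np.exp(x))    # log(1 + e^x); log1p improves numerical accuracy

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), relu(x), softplus(x))
```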

More Layers

\[ x\;\begin{array}{c c c c c c c} & h(w_1 x + b_1) &-\rightarrow & h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) \\ \nearrow & & \diagdown\nearrow & & \searrow \\ \searrow & &\diagup\searrow & & \nearrow \\ & h(w_2 x + b_2) &-\rightarrow & h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) & \end{array}\; w_7 h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) + w_8 h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) + b_5 \]
  • Input layer - hidden layers - output layer
  • Here: a feedforward neural network, in contrast to recurrent neural networks (a forward-pass sketch follows below)
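
A forward-pass sketch of the two-hidden-layer network above, assuming h = tanh and arbitrary parameter values:

```python
# Hedged sketch: forward pass of the network above (1 input, 2 hidden layers of 2 units, 1 output).
import numpy as np

def forward(x, w, b, h=np.tanh):
    a1 = h(w[0] * x + b[0])                    # first hidden layer
    a2 = h(w[1] * x + b[1])
    c1 = h(w[2] * a1 + w[3] * a2 + b[2])       # second hidden layer
    c2 = h(w[4] * a1 + w[5] * a2 + b[3])
    return w[6] * c1 + w[7] * c2 + b[4]        # linear output layer

w = [0.5, -0.3, 1.0, 0.2, -0.7, 0.4, 1.5, -1.0]   # w_1 ... w_8, arbitrary values
b = [0.0, 0.1, -0.1, 0.2, 0.05]                   # b_1 ... b_5
print(forward(0.8, w, b))
```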

Back-Propagation

Initialize the weights (e.g. from a Gaussian distribution) and the biases (e.g. as 0).

Perform stochastic gradient descent for the network. Update the parameters sequentially, starting from the output layer and working backwards.

Example:\[ y_i=w_3 h(w_1 x_i + b_1) + w_4 h(w_2 x_i + b_2) +b_3 \]

Error: \[ E=\sum_{i}(y_i-z_i)^2 \] \[ \frac{\partial E}{\partial b_3} = \sum_i 2(y_i-z_i) \] \[ \frac{\partial E}{\partial w_4} = \sum_i 2(y_i-z_i)h(w_2 x_i + b_2) \] \[ \frac{\partial E}{\partial w_3} = \sum_i 2(y_i-z_i)h(w_1 x_i + b_1) \] \[ \frac{\partial E}{\partial b_2} = \sum_i 2(y_i-z_i)w_4\frac{\partial h(w_2 x_i + b_2)}{\partial b_2} \] \[ ... \]
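
A sketch of plain (full-batch) gradient descent using exactly these gradients, assuming h = tanh (so the derivative of h(u) is 1 - tanh(u)^2) and a toy data set:

```python
# Hedged sketch: gradient descent with the back-propagated gradients above, h = tanh.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)
z = np.sin(2 * x)                                   # toy targets z_i

w = rng.normal(size=4)                              # weights w_1 ... w_4: Gaussian initialization
b = np.zeros(3)                                     # biases b_1 ... b_3: initialized as 0
l = 0.002                                           # learning rate

for epoch in range(1000):
    h1, h2 = np.tanh(w[0] * x + b[0]), np.tanh(w[1] * x + b[1])
    y = w[2] * h1 + w[3] * h2 + b[2]                # predictions y_i
    d = 2 * (y - z)                                 # dE/dy_i for the squared error
    grad_w = np.array([np.sum(d * w[2] * (1 - h1**2) * x),   # dE/dw_1
                       np.sum(d * w[3] * (1 - h2**2) * x),   # dE/dw_2
                       np.sum(d * h1),                       # dE/dw_3
                       np.sum(d * h2)])                      # dE/dw_4
    grad_b = np.array([np.sum(d * w[2] * (1 - h1**2)),       # dE/db_1
                       np.sum(d * w[3] * (1 - h2**2)),       # dE/db_2
                       np.sum(d)])                           # dE/db_3
    w -= l * grad_w
    b -= l * grad_b

y = w[2] * np.tanh(w[0] * x + b[0]) + w[3] * np.tanh(w[1] * x + b[1]) + b[2]
print(np.mean((y - z) ** 2))                        # mean squared training error after fitting
```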

Epochs and Early Stopping

  • Batch: the share of the data used in each update step
  • Epoch: one pass over the full training data set
  • Early stopping: instead of training for a fixed number of epochs, stop once a condition is fulfilled (e.g. the error on a validation set increases); see the sketch below
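
A sketch of such an early-stopping rule; train_one_epoch and validation_error are hypothetical placeholders for the model-specific steps, and patience is the number of epochs without improvement that is tolerated:

```python
# Hedged sketch: stop training once the validation error has not improved for `patience` epochs.
import numpy as np

def fit_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=10):
    best_err, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()                       # one pass over the training data
        err = validation_error()                # error on a held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch   # record the best epoch so far
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    return best_err, best_epoch

# toy usage: validation errors that first improve, then get worse again
errs = iter([1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.85, 0.9])
print(fit_with_early_stopping(lambda: None, lambda: next(errs), max_epochs=8, patience=3))
```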

Deep Learning

  • Neural networks with many layers
  • Typically, not all layers are of the same type

Convolutional Neural Networks

  • Sometimes it is beneficial not to link all the nodes but only neighboring nodes
  • Convolutional layer - filter over input data
  • This establishes a context between nodes, e.g. for images, pixels should be connected if they are close to each other
  • Algorithms for images typically also include max-pooling layers (see the sketch below)
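
A 1D sketch of a convolutional layer as a small filter slid over the input, followed by max pooling (filter values and pool size are arbitrary):

```python
# Hedged sketch: 1D "valid" convolution (strictly a cross-correlation) followed by max pooling.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])   # input signal
kernel = np.array([1.0, 0.0, -1.0])                          # local filter shared across positions

conv = np.array([np.sum(kernel * x[i:i + kernel.size])       # filter response at each position
                 for i in range(x.size - kernel.size + 1)])
pooled = conv.reshape(-1, 2).max(axis=1)                     # max pooling with window size 2
print(conv, pooled)
```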

Recurrent Neural Networks

\[ \begin{array}{rcl}\mathrm{Input}_1 \rightarrow \times w_1 \rightarrow +b_1 \rightarrow&h(\bullet)&\rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_1 \\&{| \atop w_2}&\\ \mathrm{Input}_2 \rightarrow \times w_1 \rightarrow&\stackrel{\downarrow}{+}&b_1 \rightarrow h(\bullet) \rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_2 \end{array} \]
  • Vanishing/exploding gradient problem
  • Similar to an AR(1) process (see the recursion sketch below)
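
A sketch of the recurrence above: the hidden state is fed back with weight w_2, so the same parameters are reused at every time step (all values are illustrative):

```python
# Hedged sketch: forward pass of a simple recurrent network with h = tanh.
import numpy as np

def rnn(inputs, w1, w2, w3, b1, b2, h=np.tanh):
    s, outputs = 0.0, []
    for x in inputs:
        s = h(w1 * x + w2 * s + b1)     # hidden state mixes current input and previous state
        outputs.append(w3 * s + b2)     # output at this time step
    return np.array(outputs)

print(rnn([0.5, -0.2, 0.1], w1=1.0, w2=0.8, w3=2.0, b1=0.0, b2=0.1))
```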

Long Short-Term Memory

Long-term memory L and short-term memory (= output) S; the current input is I_i; h is the sigmoid function

Forget gate

\[ L\rightarrow L h( w_1S+w_2I_i+b_1 ) \]

Input gate

\[ L\rightarrow L+ h( w_3S+w_4I_i+b_2 ) \tanh( w_5S+w_6I_i+b_3 ) \]

Output gate

\[ S\rightarrow h( w_7S+w_8I_i+b_4 ) \tanh(L) \]
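
A sketch combining the three gate equations into a single update step, with h the sigmoid as stated above; the scalar weights w_1 ... w_8 and biases b_1 ... b_4 are placeholders (real implementations use weight matrices):

```python
# Hedged sketch: one LSTM step applying the forget, input, and output gates above.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(L, S, I, w, b):
    L = L * sigmoid(w[0] * S + w[1] * I + b[0])      # forget gate: damp the long-term memory
    L = L + sigmoid(w[2] * S + w[3] * I + b[1]) * np.tanh(w[4] * S + w[5] * I + b[2])  # input gate
    S = sigmoid(w[6] * S + w[7] * I + b[3]) * np.tanh(L)                               # output gate
    return L, S

L, S = 0.0, 0.0                                      # initial long- and short-term memory
for I in [0.5, -0.3, 0.8]:                           # sequence of inputs I_i
    L, S = lstm_step(L, S, I, w=[0.5] * 8, b=[0.0] * 4)
print(L, S)
```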

Application to classification of diffusion trajectories