Datenanalyse und Stochastische Modellierung
11. Supervised Learning

Training and Testing

  • Goodness of fit does not mean that the model has good predictive power
  • How to ensure the model 'generalizes' to new data?
  • Divide data into training set and test set (a minimal split sketch follows after this list)
  • Model has to be fitted without including information from the test set
  • Only continue training as long as the performance on the test set still improves
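
A minimal sketch of such a split, here with synthetic data and an arbitrary 80/20 ratio:

```python
# Hedged sketch: random train/test split with NumPy; data and split ratio are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
z = 2.0 * x + 0.1 * rng.normal(size=x.size)   # synthetic measurements

idx = rng.permutation(x.size)                 # shuffle the sample indices
n_train = int(0.8 * x.size)                   # 80/20 split (arbitrary choice)
x_train, z_train = x[idx[:n_train]], z[idx[:n_train]]
x_test,  z_test  = x[idx[n_train:]], z[idx[n_train:]]
print(x_train.size, x_test.size)              # 80 20
```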

Linear Models

\[ \begin{array}{c c}x_1 & \searrow \\ x_2& \nearrow \end{array}\; w_1 x_1 + w_2 x_2 + b_1 \]
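
A small numerical sketch of this linear model (weights, bias, and inputs are arbitrary illustrative values):

```python
# Hedged sketch of the linear model above: a weighted sum of the inputs plus a bias.
import numpy as np

def linear_model(x, w, b):
    """Return w_1*x_1 + w_2*x_2 + ... + b for an input vector x."""
    return np.dot(w, x) + b

print(linear_model(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 0.5*1 - 0.25*2 + 0.1 = 0.1
```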

Gradient Descent

  • There are different ways of fitting a model
  • Example from exercise: Nelder-Mead: algorithm for finding a local optimum of a nonlinear function with several parameters: iterate an ensemble of points (a simplex) - reflect, expand, or contract the worst point - shrink the simplex (a usage sketch follows below)
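
A hedged usage sketch of Nelder-Mead via scipy.optimize.minimize (it searches for a local minimum; a maximum can be found by negating the objective); the toy objective is an assumption:

```python
# Hedged sketch: Nelder-Mead via SciPy on a toy 2-parameter objective with minimum at (1, -2).
import numpy as np
from scipy.optimize import minimize

def f(p):
    a, b = p
    return (a - 1.0) ** 2 + (b + 2.0) ** 2

res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x)   # approximately [1, -2]
```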

Minimizing a function F with respect to a parameter a: \[ a_{n+1} = a_n - l \nabla F(a_n) \] with learning rate l

F is the loss function comparing the prediction x to the measurement z. E.g. for the squared error \[ F=[x(a)-z]^2 \] the update becomes \[ a_{n+1} = a_n - 2l [x(a)-z]\frac{\partial x}{\partial a} \]

  • stochastic: in each update step, use only one sample or a subset of samples (batch) to evaluate the gradient (see the sketch below)
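
A minimal sketch of this update rule, fitting a one-parameter linear prediction x(a) = a·t to synthetic measurements (all values are illustrative):

```python
# Hedged sketch: stochastic gradient descent with the update a <- a - 2 l [x(a) - z] dx/da.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
z = 3.0 * t + 0.05 * rng.normal(size=t.size)   # synthetic measurements, true slope 3

a, l = 0.0, 0.1                                 # initial parameter and learning rate
for epoch in range(50):                         # one epoch = one pass over all samples
    for i in rng.permutation(t.size):           # stochastic: one sample per update step
        x = a * t[i]                            # prediction for this sample
        a -= 2 * l * (x - z[i]) * t[i]          # dx/da = t[i] for the linear model
print(a)                                        # close to 3
```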

The Perceptron

\[ x\;\begin{array}{c c c}\nearrow & \Theta (w_1 x + b_1) & \searrow \\ \searrow & \Theta(w_2 x + b_2) & \nearrow \end{array}\; w_3 \Theta(w_1 x + b_1) + w_4 \Theta(w_2 x + b_2) +b_3 \]
  • There can be multiple inputs and outputs
  • \[ \begin{array}{c c c c}x_1 &{-\rightarrow \atop \diagdown\nearrow} & \Theta (w_1 x_1 + w_2 x_2 + b_1) & \searrow \\ x_2& {\diagup\searrow \atop -\rightarrow} & \Theta(w_3 x_1 + w_4 x_2 + b_2) & \nearrow \end{array}\; w_5 \Theta(w_1 x_1 + w_2 x_2 + b_1) + w_6 \Theta(w_3 x_1 + w_4 x_2 + b_2) +b_3 \]
  • Tasks: regression and classification (a numerical sketch of the forward pass follows below)
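
A numerical sketch of the two-input perceptron above, with a Heaviside step function for Θ and arbitrary placeholder weights and biases:

```python
# Hedged sketch of the two-input perceptron with step activation Theta.
import numpy as np

def theta(u):
    return np.heaviside(u, 0.0)    # step function: 0 for u < 0, 1 for u > 0

def perceptron(x1, x2, w, b):
    h1 = theta(w[0] * x1 + w[1] * x2 + b[0])    # first hidden unit
    h2 = theta(w[2] * x1 + w[3] * x2 + b[1])    # second hidden unit
    return w[4] * h1 + w[5] * h2 + b[2]         # linear output

w = [1.0, -1.0, 0.5, 0.5, 2.0, -1.0]            # w_1 ... w_6, arbitrary values
b = [0.0, -0.5, 0.1]                            # b_1, b_2, b_3
print(perceptron(0.7, 0.2, w, b))               # hidden units 1 and 0 -> 2*1 - 1*0 + 0.1 = 2.1
```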

The Neural Network

Nice and very simple introduction

Different activation functions \[ \Theta \rightarrow h(x) \]

  • tanh \[ h(x)=\tanh(x) \]
  • relu \[ h(x)=\left\lbrace\begin{array}{ll}0 & x\leq 0 \\ x & x>0 \end{array}\right. \]
  • softplus \[ h(x)=\log(1+e^x) \]
  • ...
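
The activation functions listed above as NumPy one-liners (a small sketch, not a library API):

```python
# Hedged sketch: the activation functions above, implemented with NumPy.
import numpy as np

def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0.0, x)
def softplus(x): return np.log1p(np.exp(x))    # log(1 + e^x); log1p improves numerical accuracy

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), relu(x), softplus(x))
```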

More Layers

\[ x\;\begin{array}{c c c c c c c} & h(w_1 x + b_1) &-\rightarrow & h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) \\ \nearrow & & \diagdown\nearrow & & \searrow \\ \searrow & &\diagup\searrow & & \nearrow \\ & h(w_2 x + b_2) &-\rightarrow & h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) & \end{array}\; w_7 h( w_3 h(w_1 x + b_1) + w_4 h(w_2 x + b_2) +b_3) + w_8 h( w_5 h(w_1 x + b_1) + w_6 h(w_2 x + b_2) +b_4) + b_5 \]
  • Input layer - hidden layers - output layer
  • Here: a feedforward neural network, in contrast to recurrent neural networks (a forward-pass sketch follows below)
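
A forward-pass sketch of the two-hidden-layer network above, assuming h = tanh and arbitrary parameter values:

```python
# Hedged sketch: forward pass of the network above (1 input, 2 hidden layers of 2 units, 1 output).
import numpy as np

def forward(x, w, b, h=np.tanh):
    a1 = h(w[0] * x + b[0])                    # first hidden layer
    a2 = h(w[1] * x + b[1])
    c1 = h(w[2] * a1 + w[3] * a2 + b[2])       # second hidden layer
    c2 = h(w[4] * a1 + w[5] * a2 + b[3])
    return w[6] * c1 + w[7] * c2 + b[4]        # linear output layer

w = [0.5, -0.3, 1.0, 0.2, -0.7, 0.4, 1.5, -1.0]   # w_1 ... w_8, arbitrary values
b = [0.0, 0.1, -0.1, 0.2, 0.05]                   # b_1 ... b_5
print(forward(0.8, w, b))
```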

Back-Propagation

Initialize the weights (e.g. from a Gaussian distribution) and the biases (e.g. as 0).

Perform stochastic gradient descent for the network. Update the parameters sequentially, starting from the output layer and working backwards.

Example:\[ y_i=w_3 h(w_1 x_i + b_1) + w_4 h(w_2 x_i + b_2) +b_3 \]

Error: \[ E=\sum_{i}(y_i-z_i)^2 \] \[ \frac{\partial E}{\partial b_3} = \sum_i 2(y_i-z_i) \] \[ \frac{\partial E}{\partial w_4} = \sum_i 2(y_i-z_i)h(w_2 x_i + b_2) \] \[ \frac{\partial E}{\partial w_3} = \sum_i 2(y_i-z_i)h(w_1 x_i + b_1) \] \[ \frac{\partial E}{\partial b_2} = \sum_i 2(y_i-z_i)w_4\frac{\partial h(w_2 x_i + b_2)}{\partial b_2} \] \[ ... \]
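
A sketch of plain (full-batch) gradient descent using exactly these gradients, assuming h = tanh (so the derivative of h(u) is 1 - tanh(u)^2) and a toy data set:

```python
# Hedged sketch: gradient descent with the back-propagated gradients above, h = tanh.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 50)
z = np.sin(2 * x)                                   # toy targets z_i

w = rng.normal(size=4)                              # weights w_1 ... w_4: Gaussian initialization
b = np.zeros(3)                                     # biases b_1 ... b_3: initialized as 0
l = 0.002                                           # learning rate

for epoch in range(1000):
    h1, h2 = np.tanh(w[0] * x + b[0]), np.tanh(w[1] * x + b[1])
    y = w[2] * h1 + w[3] * h2 + b[2]                # predictions y_i
    d = 2 * (y - z)                                 # dE/dy_i for the squared error
    grad_w = np.array([np.sum(d * w[2] * (1 - h1**2) * x),   # dE/dw_1
                       np.sum(d * w[3] * (1 - h2**2) * x),   # dE/dw_2
                       np.sum(d * h1),                       # dE/dw_3
                       np.sum(d * h2)])                      # dE/dw_4
    grad_b = np.array([np.sum(d * w[2] * (1 - h1**2)),       # dE/db_1
                       np.sum(d * w[3] * (1 - h2**2)),       # dE/db_2
                       np.sum(d)])                           # dE/db_3
    w -= l * grad_w
    b -= l * grad_b

y = w[2] * np.tanh(w[0] * x + b[0]) + w[3] * np.tanh(w[1] * x + b[1]) + b[2]
print(np.mean((y - z) ** 2))                        # mean squared training error after fitting
```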

Epochs and Early Stopping

  • Batch: the share of the data used in each update step
  • Epoch: one pass over the full training data set
  • Early stopping: instead of training for a fixed number of epochs, stop once a condition is fulfilled (e.g. the error on a validation set increases); see the sketch below
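
A sketch of such an early-stopping rule; train_one_epoch and validation_error are hypothetical placeholders for the model-specific steps, and patience is the number of epochs without improvement that is tolerated:

```python
# Hedged sketch: stop training once the validation error has not improved for `patience` epochs.
import numpy as np

def fit_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=10):
    best_err, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()                       # one pass over the training data
        err = validation_error()                # error on a held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch   # record the best epoch so far
        elif epoch - best_epoch >= patience:
            break                               # no improvement for `patience` epochs
    return best_err, best_epoch

# toy usage: validation errors that first improve, then get worse again
errs = iter([1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.85, 0.9])
print(fit_with_early_stopping(lambda: None, lambda: next(errs), max_epochs=8, patience=3))
```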

Deep Learning

  • Neural networks with many layers
  • Typically, not all layers are of the same type

Convolutional Neural Networks

  • Sometimes it is beneficial not to link all the nodes but only neighboring nodes
  • Convolutional layer - filter over input data
  • This establishes a context between nodes, e.g. for images, pixels should be connected if they are close to each other
  • Algorithms for images typically also include max-pooling layers (see the sketch below)
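
A 1D sketch of a convolutional layer as a small filter slid over the input, followed by max pooling (filter values and pool size are arbitrary):

```python
# Hedged sketch: 1D "valid" convolution (strictly a cross-correlation) followed by max pooling.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0])   # input signal
kernel = np.array([1.0, 0.0, -1.0])                          # local filter shared across positions

conv = np.array([np.sum(kernel * x[i:i + kernel.size])       # filter response at each position
                 for i in range(x.size - kernel.size + 1)])
pooled = conv.reshape(-1, 2).max(axis=1)                     # max pooling with window size 2
print(conv, pooled)
```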

Recurrent Neural Networks

\[ \begin{array}{rcl}\mathrm{Input}_1 \rightarrow \times w_1 \rightarrow +b_1 \rightarrow&h(\bullet)&\rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_1 \\&{| \atop w_2}&\\ \mathrm{Input}_2 \rightarrow \times w_1 \rightarrow&\stackrel{\downarrow}{+}&b_1 \rightarrow h(\bullet) \rightarrow \times w_3 \rightarrow +b_2 \rightarrow \mathrm{Output}_2 \end{array} \]
  • Vanishing/exploding gradient problem
  • Similar to an AR(1) process (see the recursion sketch below)
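
A sketch of the recurrence above: the hidden state is fed back with weight w_2, so the same parameters are reused at every time step (all values are illustrative):

```python
# Hedged sketch: forward pass of a simple recurrent network with h = tanh.
import numpy as np

def rnn(inputs, w1, w2, w3, b1, b2, h=np.tanh):
    s, outputs = 0.0, []
    for x in inputs:
        s = h(w1 * x + w2 * s + b1)     # hidden state mixes current input and previous state
        outputs.append(w3 * s + b2)     # output at this time step
    return np.array(outputs)

print(rnn([0.5, -0.2, 0.1], w1=1.0, w2=0.8, w3=2.0, b1=0.0, b2=0.1))
```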

Long Short-Term Memory

Long-term memory L and short-term memory (= output) S; the current input is I_i; h is the sigmoid function

Forget gate

\[ L\rightarrow L h( w_1S+w_2I_i+b_1 ) \]

Input gate

\[ L\rightarrow L+ h( w_3S+w_4I_i+b_2 ) \tanh( w_5S+w_6I_i+b_3 ) \]

Output gate

\[ S\rightarrow h( w_7S+w_8I_i+b_4 ) \tanh(L) \]
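
A sketch combining the three gate equations into a single update step, with h the sigmoid as stated above; the scalar weights w_1 ... w_8 and biases b_1 ... b_4 are placeholders (real implementations use weight matrices):

```python
# Hedged sketch: one LSTM step applying the forget, input, and output gates above.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(L, S, I, w, b):
    L = L * sigmoid(w[0] * S + w[1] * I + b[0])      # forget gate: damp the long-term memory
    L = L + sigmoid(w[2] * S + w[3] * I + b[1]) * np.tanh(w[4] * S + w[5] * I + b[2])  # input gate
    S = sigmoid(w[6] * S + w[7] * I + b[3]) * np.tanh(L)                               # output gate
    return L, S

L, S = 0.0, 0.0                                      # initial long- and short-term memory
for I in [0.5, -0.3, 0.8]:                           # sequence of inputs I_i
    L, S = lstm_step(L, S, I, w=[0.5] * 8, b=[0.0] * 4)
print(L, S)
```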

Application to classification of diffusion trajectories