by Siyu Tang, 2023-10-11,12

ComputerVision_week_4_Wednesday_2023.pdf

Sequence models

Recurrent neural networks

The models we saw last week (MLP, conv nets) were feedforward models (input → hidden layers → output).

How do we process inputs that are sequences of variable length, for example a sentence, or the frames of a video when tracking an object?

recurrent neural networks (RNN)

An RNN is a neural network that maps from an input space of sequences to an output space of sequences in a stateful way. That is, the prediction of the output $\mathrm{y}_t$ depends not only on the input $\mathrm{x}_t$, but also on the hidden state of the system, $\mathrm{h}_t$ which gets updated over time, as the sequence is processed.

→ it has a (theoretically infinite) “memory”: the hidden state $h$ is a function of all previous inputs in the sequence, enabling long-term dependencies

In diagrams, an RNN is sometimes drawn with a loop edge and sometimes in its unrolled version.

(In the example on the right, there is only one hidden layer, but there can be multiple.)

[Figure: RNN diagram, drawn with a loop edge and unrolled over time]
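As a minimal sketch of this statefulness (the $\tanh$ nonlinearity and the weight shapes are illustrative choices, not fixed by the slides), processing a variable-length sequence just means looping over its elements while carrying the hidden state along:

```python
import numpy as np

rng = np.random.default_rng(0)

D_x, D_h = 3, 4                                 # input and hidden sizes (arbitrary)
W_h = rng.normal(scale=0.5, size=(D_h, D_h))    # hidden-to-hidden weights
W_x = rng.normal(scale=0.5, size=(D_h, D_x))    # input-to-hidden weights

def rnn_forward(xs, h=None):
    """Process a variable-length sequence, carrying the hidden state along."""
    h = np.zeros(D_h) if h is None else h
    hidden_states = []
    for x_t in xs:                              # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x_t)        # h_t depends on x_t AND on h_{t-1}
        hidden_states.append(h)
    return hidden_states

# The same network handles any sequence length:
print(len(rnn_forward(rng.normal(size=(5, D_x)))))   # 5 hidden states
print(len(rnn_forward(rng.normal(size=(9, D_x)))))   # 9 hidden states
```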

Applications of RNNs

[Figure: object tracking in video as an example application of RNNs]

Training RNNs

Backprop in an RNN is called backpropagation through time, since it goes through all the sequence elements and corresponding hidden layers.

[Figure: unrolled RNN with per-step outputs and losses]

$$ \text{Hidden layer:} \hspace{2em} h_t = f(W_h h_{t-1} + W_x x_t) $$

The weights are shared over time (same $W_h$ and $W_x$ for each $t$)!
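Unrolling a few steps makes this sharing explicit, e.g. starting from an initial state $h_0$:

$$ h_3 = f\big(W_h\, f(W_h\, f(W_h h_0 + W_x x_1) + W_x x_2) + W_x x_3\big) $$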

For a gradient update of $W_h$, we need to add up the gradients through every path from $W_h$ to $L$ → need to sum over $L_j$ (because the overall loss is the sum of the sequence item losses) and over $h_k$ (because $W_h$ is used in multiple places):

$$ \frac{\partial L}{\partial W_h} = \sum_{j=0}^{T-1}\frac{\partial L_j}{\partial W_h} = \sum_{j=0}^{T-1} \sum_{k=1}^{j} \frac{\partial L_j}{\partial h_k} \frac{\partial h_k}{\partial W_h} $$
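For instance, for $T = 3$ (the $j=0$ term contributes nothing under this indexing, since its inner sum is empty), the double sum expands to:

$$ \frac{\partial L}{\partial W_h} = \frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial W_h} + \frac{\partial L_2}{\partial h_1}\frac{\partial h_1}{\partial W_h} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial W_h} $$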

On each path (each item in this sum), the derivative $\frac{\partial L_j}{\partial h_k}$ can be written as a product using the chain rule:

$$ \frac{\partial L_j}{\partial h_k} = \frac{\partial L_j}{\partial y_j} \frac{\partial y_j}{\partial h_j} \frac{\partial h_j}{\partial h_k} $$

$$ \text{And, } \hspace{1em} \frac{\partial h_j}{\partial h_k} = \prod_{m=k+1}^j \frac{\partial h_m}{\partial h_{m-1}} $$

Putting everything together:

$$ \frac{\partial L}{\partial W_h} = \sum_{j=0}^{T-1} \sum_{k=1}^{j} \frac{\partial L_j}{\partial y_j} \frac{\partial y_j}{\partial h_j} \left( \prod_{m=k+1}^{j} \frac{\partial h_m}{\partial h_{m-1}} \right) \frac{\partial h_k}{\partial W_h} $$
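In code, the whole derivation becomes a backward loop over the sequence that accumulates the gradients of the shared weights at every step. Below is a minimal NumPy sketch of backpropagation through time; the linear readout $y_t = W_y h_t$ and the per-step squared-error loss are assumptions made for this example, not something fixed in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D_x, D_h, D_y, T = 3, 4, 2, 5                 # arbitrary small sizes

# Shared weights: the same W_h, W_x, W_y are used at every time step
W_h = rng.normal(scale=0.5, size=(D_h, D_h))
W_x = rng.normal(scale=0.5, size=(D_h, D_x))
W_y = rng.normal(scale=0.5, size=(D_y, D_h))  # assumed linear readout (not in the slides)

xs = rng.normal(size=(T, D_x))                # input sequence
targets = rng.normal(size=(T, D_y))           # per-step targets (assumed loss setup)

# ---- Forward pass: unroll over the sequence ----
hs, ys = [np.zeros(D_h)], []                  # hs[0] is the initial state h_0
for t in range(T):
    hs.append(np.tanh(W_h @ hs[-1] + W_x @ xs[t]))
    ys.append(W_y @ hs[-1])
loss = 0.5 * sum(np.sum((ys[t] - targets[t]) ** 2) for t in range(T))

# ---- Backward pass (BPTT): loop backwards, accumulate gradients at every step ----
dW_h, dW_x, dW_y = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(W_y)
dh_next = np.zeros(D_h)                       # gradient flowing back from step t+1
for t in reversed(range(T)):
    dy = ys[t] - targets[t]                   # dL_t / dy_t for the squared-error loss
    dW_y += np.outer(dy, hs[t + 1])
    dh = W_y.T @ dy + dh_next                 # dL/dh_t: direct path + paths through the future
    da = dh * (1.0 - hs[t + 1] ** 2)          # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    dW_h += np.outer(da, hs[t])               # shared W_h -> sum of gradients over time
    dW_x += np.outer(da, xs[t])               # shared W_x -> sum of gradients over time
    dh_next = W_h.T @ da                      # pass the gradient on to step t-1

print("loss:", loss, " |dL/dW_h|:", np.linalg.norm(dW_h))
```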

As an example, take the state update $h_m = \tanh(W_h h_{m-1} + W_x x_m)$ and approximate $\tanh(x) \approx x$. Then $\frac{\partial h_m}{\partial h_{m-1}} \approx W_h$, so $\frac{\partial h_j}{\partial h_k} \approx W_h^{\,j-k}$ (one factor of $W_h$ per step from $k$ to $j$).

→ Problem:

Vanishing / exploding gradients

Repeated matrix multiplication leads to vanishing and exploding gradients. Why? Look at the eigendecomposition $W_h = Q^{-1} \Lambda Q$ (where $\Lambda$ is the diagonal matrix of eigenvalues); then $W_h^n = Q^{-1} \Lambda^n Q$. → For an eigenvalue with absolute value < 1, the $n$-th power becomes very small (⇒ vanishing gradients); for an eigenvalue with absolute value > 1, it becomes very large (⇒ exploding gradients).
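A quick numerical illustration of this effect (the eigenvalues 0.5, 0.9 and 1.1, 1.5 below are arbitrary example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 2))                              # a random basis (invertible w.p. 1)

# Two toy recurrent weight matrices with controlled eigenvalues, W = Q^{-1} Λ Q
W_vanish  = np.linalg.inv(Q) @ np.diag([0.5, 0.9]) @ Q   # all |eigenvalues| < 1
W_explode = np.linalg.inv(Q) @ np.diag([1.1, 1.5]) @ Q   # |eigenvalues| > 1

for n in (1, 10, 50):
    print(n,
          np.linalg.norm(np.linalg.matrix_power(W_vanish, n)),    # shrinks toward 0
          np.linalg.norm(np.linalg.matrix_power(W_explode, n)))   # grows without bound
```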

Many solutions exist: gradient clipping, gated recurrent units (GRU), etc.
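For reference, gradient clipping in its clip-by-norm form is only a few lines (the threshold here is an arbitrary example value); it bounds the magnitude of the update without changing its direction, which addresses exploding, but not vanishing, gradients:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """If the gradient's norm exceeds max_norm, rescale it to have norm max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```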

Transformers

Transformers are the foundation of all the large language models that exist today, and these days they are also used a lot in computer vision.