by Siyu Tang, 2023-10-11,12

ComputerVision_week_4_Wednesday_2023.pdf

Sequence models

Recurrent neural networks

The models we saw last week (MLP, conv nets) were feedforward models (input → hidden layers → output).

How do we process inputs that are sequences of variable length, for example a sentence, or the frames of a video when tracking an object?

recurrent neural networks (RNN)

An RNN is a neural network that maps from an input space of sequences to an output space of sequences in a stateful way. That is, the prediction of the output $\mathrm{y}_t$ depends not only on the input $\mathrm{x}_t$, but also on the hidden state of the system, $\mathrm{h}_t$ which gets updated over time, as the sequence is processed.

→ it has a (theoretically infinite) “memory”: the hidden state $h$ is a function of all previous inputs in the sequence, enabling long-term dependencies

In diagrams, an RNN is sometimes drawn with a loop edge and sometimes in its unrolled version.

(In the example on the right, there is only one hidden layer, but there can be multiple.)

[Figure: RNN diagram, drawn with a loop edge and unrolled over time]
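As a minimal sketch of this statefulness (the $\tanh$ nonlinearity and the weight shapes are illustrative choices, not fixed by the slides), processing a variable-length sequence just means looping over its elements while carrying the hidden state along:

```python
import numpy as np

rng = np.random.default_rng(0)

D_x, D_h = 3, 4                                 # input and hidden sizes (arbitrary)
W_h = rng.normal(scale=0.5, size=(D_h, D_h))    # hidden-to-hidden weights
W_x = rng.normal(scale=0.5, size=(D_h, D_x))    # input-to-hidden weights

def rnn_forward(xs, h=None):
    """Process a variable-length sequence, carrying the hidden state along."""
    h = np.zeros(D_h) if h is None else h
    hidden_states = []
    for x_t in xs:                              # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x_t)        # h_t depends on x_t AND on h_{t-1}
        hidden_states.append(h)
    return hidden_states

# The same network handles any sequence length:
print(len(rnn_forward(rng.normal(size=(5, D_x)))))   # 5 hidden states
print(len(rnn_forward(rng.normal(size=(9, D_x)))))   # 9 hidden states
```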

Applications of RNNs

[Figure: object tracking in video as an example application of RNNs]

Training RNNs

Backprop in an RNN is called backpropagation through time, since it goes through all the sequence elements and corresponding hidden layers.

[Figure: unrolled RNN with per-step outputs and losses]

$$ \text{Hidden layer:} \hspace{2em} h_t = f(W_h h_{t-1} + W_x x_t) $$

The weights are shared over time (same $W_h$ and $W_x$ for each $t$)!
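Unrolling a few steps makes this sharing explicit, e.g. starting from an initial state $h_0$:

$$ h_3 = f\big(W_h\, f(W_h\, f(W_h h_0 + W_x x_1) + W_x x_2) + W_x x_3\big) $$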

For a gradient update of $W_h$, we need to add up the gradients through every path from $W_h$ to $L$ → need to sum over $L_j$ (because the overall loss is the sum of the sequence item losses) and over $h_k$ (because $W_h$ is used in multiple places):

$$ \frac{\partial L}{\partial W_h} = \sum_{j=0}^{T-1}\frac{\partial L_j}{\partial W_h} = \sum_{j=0}^{T-1} \sum_{k=1}^{j} \frac{\partial L_j}{\partial h_k} \frac{\partial h_k}{\partial W_h} $$
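For instance, for $T = 3$ (the $j=0$ term contributes nothing under this indexing, since its inner sum is empty), the double sum expands to:

$$ \frac{\partial L}{\partial W_h} = \frac{\partial L_1}{\partial h_1}\frac{\partial h_1}{\partial W_h} + \frac{\partial L_2}{\partial h_1}\frac{\partial h_1}{\partial W_h} + \frac{\partial L_2}{\partial h_2}\frac{\partial h_2}{\partial W_h} $$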

On each path (each item in this sum), the derivative $\frac{\partial L_j}{\partial h_k}$ can be written as a product using the chain rule:

$$ \frac{\partial L_j}{\partial h_k} = \frac{\partial L_j}{\partial y_j} \frac{\partial y_j}{\partial h_j} \frac{\partial h_j}{\partial h_k} $$

$$ \text{And, } \hspace{1em} \frac{\partial h_j}{\partial h_k} = \prod_{m=k+1}^j \frac{\partial h_m}{\partial h_{m-1}} $$

Putting everything together:

$$ \frac{\partial L}{\partial W_h} = \sum_{j=0}^{T-1} \sum_{k=1}^{j} \frac{\partial L_j}{\partial y_j} \frac{\partial y_j}{\partial h_j} \left( \prod_{m=k+1}^{j} \frac{\partial h_m}{\partial h_{m-1}} \right) \frac{\partial h_k}{\partial W_h} $$
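In code, the whole derivation becomes a backward loop over the sequence that accumulates the gradients of the shared weights at every step. Below is a minimal NumPy sketch of backpropagation through time; the linear readout $y_t = W_y h_t$ and the per-step squared-error loss are assumptions made for this example, not something fixed in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D_x, D_h, D_y, T = 3, 4, 2, 5                 # arbitrary small sizes

# Shared weights: the same W_h, W_x, W_y are used at every time step
W_h = rng.normal(scale=0.5, size=(D_h, D_h))
W_x = rng.normal(scale=0.5, size=(D_h, D_x))
W_y = rng.normal(scale=0.5, size=(D_y, D_h))  # assumed linear readout (not in the slides)

xs = rng.normal(size=(T, D_x))                # input sequence
targets = rng.normal(size=(T, D_y))           # per-step targets (assumed loss setup)

# ---- Forward pass: unroll over the sequence ----
hs, ys = [np.zeros(D_h)], []                  # hs[0] is the initial state h_0
for t in range(T):
    hs.append(np.tanh(W_h @ hs[-1] + W_x @ xs[t]))
    ys.append(W_y @ hs[-1])
loss = 0.5 * sum(np.sum((ys[t] - targets[t]) ** 2) for t in range(T))

# ---- Backward pass (BPTT): loop backwards, accumulate gradients at every step ----
dW_h, dW_x, dW_y = np.zeros_like(W_h), np.zeros_like(W_x), np.zeros_like(W_y)
dh_next = np.zeros(D_h)                       # gradient flowing back from step t+1
for t in reversed(range(T)):
    dy = ys[t] - targets[t]                   # dL_t / dy_t for the squared-error loss
    dW_y += np.outer(dy, hs[t + 1])
    dh = W_y.T @ dy + dh_next                 # dL/dh_t: direct path + paths through the future
    da = dh * (1.0 - hs[t + 1] ** 2)          # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    dW_h += np.outer(da, hs[t])               # shared W_h -> sum of gradients over time
    dW_x += np.outer(da, xs[t])               # shared W_x -> sum of gradients over time
    dh_next = W_h.T @ da                      # pass the gradient on to step t-1

print("loss:", loss, " |dL/dW_h|:", np.linalg.norm(dW_h))
```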

As an example, take the state update $h_m = \tanh(W_h h_{m-1} + W_x x_m)$ and approximate $\tanh(x) \approx x$. Then $\frac{\partial h_m}{\partial h_{m-1}} \approx W_h$, so $\frac{\partial h_j}{\partial h_k} \approx W_h^{\,j-k}$ (one factor of $W_h$ per step from $k$ to $j$).

→ Problem:

Vanishing / exploding gradients

Repeated matrix multiplication leads to vanishing and exploding gradients. Why? Look at the eigendecomposition $W_h = Q^{-1} \Lambda Q$ (where $\Lambda$ is the diagonal matrix of eigenvalues); then $W_h^n = Q^{-1} \Lambda^n Q$. → For an eigenvalue with absolute value < 1, the $n$-th power becomes very small (⇒ vanishing gradients); for an eigenvalue with absolute value > 1, it becomes very large (⇒ exploding gradients).
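A quick numerical illustration of this effect (the eigenvalues 0.5, 0.9 and 1.1, 1.5 below are arbitrary example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 2))                              # a random basis (invertible w.p. 1)

# Two toy recurrent weight matrices with controlled eigenvalues, W = Q^{-1} Λ Q
W_vanish  = np.linalg.inv(Q) @ np.diag([0.5, 0.9]) @ Q   # all |eigenvalues| < 1
W_explode = np.linalg.inv(Q) @ np.diag([1.1, 1.5]) @ Q   # |eigenvalues| > 1

for n in (1, 10, 50):
    print(n,
          np.linalg.norm(np.linalg.matrix_power(W_vanish, n)),    # shrinks toward 0
          np.linalg.norm(np.linalg.matrix_power(W_explode, n)))   # grows without bound
```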

Many solutions exist: gradient clipping, gated recurrent units (GRU), etc.
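For reference, gradient clipping in its clip-by-norm form is only a few lines (the threshold here is an arbitrary example value); it bounds the magnitude of the update without changing its direction, which addresses exploding, but not vanishing, gradients:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """If the gradient's norm exceeds max_norm, rescale it to have norm max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```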

Transformers

Transformers are the foundation of all the large language models that exist today, and these days they are also used a lot in computer vision.