by Theodora Kontogianni (replacing Siyu Tang), 2023-10-04 and 2023-10-05

ComputerVision_week_3_Wedesday__2023__ComputerVision_week_4_2022.pdf

Machine Learning basics / recap

Learning: tune parameters $\mathrm{w}$ from training data $\{ (x_i, y_i) \}_{i=1}^{N}$ (or ground truth)

Inference: make novel predictions: $y=f_{\mathrm{w}}(x)$

Tasks: classification (e.g. cat vs. dog), regression (e.g. predicting umbrella sales), etc.


Linear regression

Here $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ (the task is regression).

Model: a linear function:

$$ f(\mathrm{x}, \mathrm{w}) = \mathrm{w}^{\mathsf{T}} \mathrm{x} $$

Here $\mathrm{x} = [1, x]^{\mathsf{T}}$, so $f(\mathrm{x}, \mathrm{w}) = w_0 + w_1 x$; prepending the constant 1 lets the model have a constant (bias) term.
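A minimal sketch of this (NumPy, made-up numbers): the bias is absorbed into the weight vector by augmenting the input with a constant 1.

```python
import numpy as np

# Hypothetical values: w0 is the bias, w1 the slope.
x = 3.0
w = np.array([0.5, 2.0])    # w = [w0, w1]

# Augment the input with a constant 1 so the bias is just another weight.
x_aug = np.array([1.0, x])  # x = [1, x]^T

# f(x, w) = w^T x = w0 + w1 * x
y_pred = w @ x_aug
print(y_pred)               # 0.5 + 2.0 * 3.0 = 6.5
```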

Error function: sum of squared errors (a.k.a. least squares; closely related to the mean squared error and the squared L2 distance). In ML this is usually called a loss function rather than an error function:

$$ E(\mathrm{w}) = \sum_{i=1}^N (f(\mathrm{x}_i, \mathrm{w}) - y_i)^2 \\ = \sum_{i=1}^N (\mathrm{x}_i^{\mathsf{T}}\mathrm{w} - y_i)^2 \\ = ||\mathrm{Xw} - \mathrm{y}||_2^2 $$
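A quick sketch (NumPy, random made-up data; variable names are mine) checking that the per-sample sum and the squared-norm form of the loss agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # rows are x_i^T (bias column first)
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

# Per-sample squared residuals summed up ...
E_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(N))
# ... equal the squared L2 norm of the stacked residual vector Xw - y.
E_norm = np.linalg.norm(X @ w - y) ** 2
print(np.isclose(E_sum, E_norm))  # True
```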


Minimize the loss by setting the gradient to zero. The gradient of the loss function with respect to the parameters $\mathrm{w}$ is:

$$ \nabla_\mathrm{w} E(\mathrm{w}) = \nabla_\mathrm{w} ||\mathrm{Xw}-\mathrm{y}||_2^2 \\ = \nabla_\mathrm{w} (\mathrm{Xw}-\mathrm{y})^\mathsf{T}(\mathrm{Xw}-\mathrm{y}) \\ = \nabla_\mathrm{w} (\mathrm{w}^\mathsf{T}\mathrm{X}^\mathsf{T}\mathrm{Xw} - 2 \mathrm{w}^\mathsf{T}\mathrm{X}^\mathsf{T}\mathrm{y} + \mathrm{y}^\mathsf{T}\mathrm{y}) \\ = 2\mathrm{X}^\mathsf{T}\mathrm{Xw} - 2 \mathrm{X}^\mathsf{T}\mathrm{y} $$
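A sketch (NumPy, random made-up data) comparing this analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
w = rng.normal(size=3)

def E(w):
    return np.sum((X @ w - y) ** 2)

# Analytic gradient: 2 X^T X w - 2 X^T y
grad = 2 * X.T @ X @ w - 2 * X.T @ y

# Central finite differences along each coordinate direction.
eps = 1e-6
grad_num = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, grad_num))  # True up to numerical error
```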

As $E(\mathrm{w})$ is quadratic and convex, this has a closed-form solution (no gradient descent needed):

$$ \nabla_{\mathrm{w}} E(\mathrm{w}) = 0 \hspace{1em} \Rightarrow \hspace{1em} \mathrm{w} = \mathrm{X}^+ \mathrm{y} $$

where $\mathrm{X}^+$ is the Moore-Penrose inverse (or pseudoinverse) of $\mathrm{X}$; if $\mathrm{X}$ has linearly independent columns, then $\mathrm{X}^+ = (\mathrm{X}^\mathsf{T}\mathrm{X})^{-1}\mathrm{X}^\mathsf{T}$.
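A sketch of the closed-form solution on made-up data: `np.linalg.pinv` gives the pseudoinverse, the normal equations give the same answer when the columns are independent, and `np.linalg.lstsq` solves the same least-squares problem in a numerically preferred way.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 1))])   # rows [1, x]
w_true = np.array([1.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=20)                    # noisy linear targets

w_pinv = np.linalg.pinv(X) @ y                     # w = X^+ y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)       # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares solver

print(np.allclose(w_pinv, w_normal), np.allclose(w_pinv, w_lstsq))  # True True
```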

Note that assuming the function to be approximated is linear (or approximately linear) is a very strong assumption, so the model (and the resulting predictor) is quite weak. A more flexible model:

Polynomial regression / curve fitting

Suppose that the original data is only 1 dimensional.

Let us choose a polynomial of order M to model the dataset $\mathcal{X}$:

$f(\mathrm{x}, \mathrm{w}) = \sum_{j=0}^M w_j x^j$, with features $(1, x, x^2, ..., x^M)^\mathsf{T}$

This can be seen as a special case of linear regression: we just extend the 1-d data to (M+1)-dimensional feature vectors.
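A sketch (NumPy, synthetic noisy-sine data, names are mine) of building the feature vectors $(1, x, x^2, ..., x^M)$ and reusing the linear least-squares machinery:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=10)                          # 1-d inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # noisy targets

M = 3                                                   # polynomial order
Phi = np.vander(x, M + 1, increasing=True)              # columns: 1, x, x^2, ..., x^M

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # same least-squares solution as before
y_pred = Phi @ w                                        # fitted values on the training inputs
```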

Polynomial fitting is a good example for explaining the notions of underfitting & overfitting…

Capacity, overfitting, underfitting

Plots of polynomials of various degrees M (red) fitted to the data (green). We observe underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
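A sketch reproducing the effect numerically (NumPy, synthetic noisy-sine data, all names mine): training error keeps shrinking as M grows, while the error on held-out data typically blows up for M = 9.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_tr, y_tr = make_data(10)    # small training set
x_te, y_te = make_data(100)   # held-out data

for M in [0, 1, 3, 9]:
    Phi_tr = np.vander(x_tr, M + 1, increasing=True)
    Phi_te = np.vander(x_te, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    rmse = lambda Phi, y: np.sqrt(np.mean((Phi @ w - y) ** 2))
    print(f"M={M}: train RMSE {rmse(Phi_tr, y_tr):.3f}  test RMSE {rmse(Phi_te, y_te):.3f}")
```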
