# Regression
Almost everything seems to grow out of the linear model; whether in regression or classification, linear models extend to many of the learning models we encounter in research. So let us start from a very simple linear regression problem: given a training set $\mathcal{D} = \left\{ \boldsymbol{x}_i, y_i \right\}_{i=1}^n$ with $n$ samples, where the responses $y_i$ are numerical values, a linear regression model finds a weight vector that minimizes a certain type of empirical error. From the perspective of statistical learning theory, this is defined as,
$
\arg\min_{\boldsymbol{w}}\frac{1}{n}\sum_{i=1}^n \| \boldsymbol{w}^T \boldsymbol{x}_i + b - y_i\|^2
$
where the empirical error introduced by each individual sample is equally weighted by $1/n$. Since we are using a straight line to fit a function of possibly arbitrary shape, the summed or mean error can obviously be large due to noise or the intrinsic nonlinearity of the function. However, a typical misunderstanding of linear models among beginners is that the term *linear* refers to $\boldsymbol{x}$; it in fact refers to $\boldsymbol{w}$. That is to say, we expect the model to be linear in the parameters $\boldsymbol{w}$, while the feature vectors built from $\boldsymbol{x}$ can take essentially any shape. Therefore, in the literature we mostly see a feature mapping $\phi(\boldsymbol{x})$ of the input $\boldsymbol{x}$, which does not break the linearity of the model. Hence, we have our linear model as,
$
\arg\min_{\boldsymbol{w}} \frac{1}{n}\sum_{i=1}^n \| \boldsymbol{w}^T \phi(\boldsymbol{x}_i) + b - y_i\|^2
$
A typical feature mapping is the polynomial feature mapping. Suppose we have a 2-dimensional input sample $\left( x_1, x_2 \right)$; we can define a feature mapping as follows,
$ \left( x_1, x_2 \right) \rightarrow \left( x_1, x_2, x_1^2+x_2^2 \right), $ where the 2-d plane is mapped onto a paraboloid in 3-d space. Substituting this back into the previous linear function turns it into a polynomial curve-fitting problem, but it is still linear in $\boldsymbol{w}$.
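As a small sketch (a hypothetical helper, assuming NumPy), this mapping can be written directly:

```python
import numpy as np

def phi(x):
    """Map a 2-d point (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2 + x2**2])

# A point on the 2-d plane is lifted onto a paraboloid in 3-d space;
# the model w^T phi(x) + b is nonlinear in x but still linear in w.
print(phi((1.0, 2.0)))  # -> [1. 2. 5.]
```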
To solve the least-squares problem defined above and obtain an optimal parameter estimate $\boldsymbol{w}$, we take the gradient of the loss with respect to $\boldsymbol{w}$ and set it to zero. Note that the intercept $b$ can be folded into the vector $\boldsymbol{w}$ by appending an additional entry $1$ to the feature vector; for simplicity, we omit it in the formulation. The least-squares solution is,
$
\boldsymbol{w}^* = \left( \Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X}) \right)^{-1}\Phi(\boldsymbol{X})^T\boldsymbol{Y}
$
As long as $\Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X})$ is not singular, an analytical solution exists. We call $\Phi(\boldsymbol{X})$ the design matrix, whose $i$-th row is the feature mapping of $\boldsymbol{x}_i$. We will see shortly that the inner product of feature mappings can be expressed by a certain kernel function, which establishes a very important chapter of learning theory, i.e., $\textit{Kernel Methods}$. To predict the response for a new sample $\boldsymbol{x}^*$, we have,
$\boldsymbol{y}^* = \phi(\boldsymbol{x}^*)^T \left( \Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X}) \right)^{-1}\Phi(\boldsymbol{X})^T\boldsymbol{Y}$
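A minimal NumPy sketch of this closed-form solution and prediction (the helper names are illustrative assumptions, the bias is folded in by appending a constant feature, and $\Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X})$ is assumed invertible):

```python
import numpy as np

def design_matrix(X, phi):
    """Stack feature mappings as rows; append a constant 1 to fold in the bias b."""
    return np.array([np.append(phi(x), 1.0) for x in X])

def fit_least_squares(X, y, phi):
    """w* = (Phi^T Phi)^{-1} Phi^T Y, via a linear solve instead of an explicit inverse."""
    Phi = design_matrix(X, phi)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def predict(x_new, w, phi):
    """y* = phi(x*)^T w*."""
    return np.append(phi(x_new), 1.0) @ w

# Example with the paraboloid mapping from above (redefined here for completeness).
phi = lambda x: np.array([x[0], x[1], x[0]**2 + x[1]**2])
X = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (1.5, 1.5)]
y = np.array([1.0, 2.0, 0.0, 1.5])
w = fit_least_squares(X, y, phi)
print(predict((1.0, 1.0), w, phi))
```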
## From Probabilistic View
Instead of minimizing empirical errors over the observations, we can examine the whole problem from a probabilistic view, that is, minimize the uncertainty given the observations. If we reformulate the model as the sum of a deterministic function and a stochastic noise term drawn from some probability distribution, we can write the linear model as,
$ \boldsymbol{y} = \boldsymbol{w}^T \phi(\boldsymbol{x}) + \epsilon $
where $\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)$ is Gaussian noise, and the bias term $b$ is again folded into $\boldsymbol{w}$. The responses $\boldsymbol{Y}$ then follow a Gaussian distribution as well, factorizing over the samples,
$ p\left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{w} \right) = \prod_{i=1}^n \mathcal{N}\left( y_i \mid \boldsymbol{w}^T\phi(\boldsymbol{x}_i), \sigma^2 \right)$
In order to get the optimal parameters $\boldsymbol{w}$, we maximize the likelihood $p\left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{w} \right)$. This is equivalent to maximizing the log-likelihood, which is,
$
\ln p\left( \boldsymbol{Y}\mid\boldsymbol{X}, \boldsymbol{w}\right) = -\frac{n}{2}\ln(2\pi)-n\ln(\sigma) - \frac{1}{2\sigma^2}\|\boldsymbol{Y}- \Phi(\boldsymbol{X})\boldsymbol{w}\|^2
$
From the equation, we can see that maximizing the log-likelihood ($\textit{MLE}$) is equivalent to minimizing the least-squares error, so we obtain exactly the same solution as the least-squares estimate.
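To check this equivalence numerically, one can minimize the negative Gaussian log-likelihood with a generic optimizer and compare it with the closed-form least-squares estimate (a sketch on synthetic 1-d data with an assumed noise level; not part of the derivation above):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, sigma = 50, 0.3
X = rng.uniform(-1, 1, size=n)
Phi = np.column_stack([X, np.ones(n)])           # linear features plus a bias column
y = 2.0 * X + 1.0 + sigma * rng.normal(size=n)   # noisy linear responses

def neg_log_lik(w):
    r = y - Phi @ w
    return 0.5 * n * np.log(2 * np.pi) + n * np.log(sigma) + r @ r / (2 * sigma**2)

w_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.allclose(w_mle, w_ls, atol=1e-4))       # the two estimates coincide up to optimizer tolerance
```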
### Overfitting
Now suppose we have three 1-dimensional observations $\boldsymbol{X}$ and define an arbitrary feature mapping $\phi$; we expect to find a linear model that minimizes the least-squares error, as we saw in \eqref{eq:linear_reg}.
$\begin{bmatrix}2 \\ 1 \\-1\end{bmatrix} = \phi\left( \begin{bmatrix}1.5 \\ 0.5 \\ 2.5 \end{bmatrix} \right) \boldsymbol{w}$
If we define the feature mapping $\phi$ as
$\begin{bmatrix}1.5 \\ 0.5 \\ 2.5 \end{bmatrix} \xrightarrow{\phi} \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}$
We can see that $\boldsymbol{w}$ is exactly the response vector $\boldsymbol{y}$, and the least-squares error is driven to zero. If there are no constraints at all, a feature mapping can always be defined to achieve zero error. But obviously there is no benefit in using such a linear model for any prediction task; most likely we will get a very high prediction error from it. When a model performs well on the training dataset but poorly on unseen data, we call this situation overfitting. And certainly, overfitting is a central problem that machine learning attempts to address.
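The same effect can be reproduced with any feature mapping rich enough to interpolate the training points, e.g. fitting a cubic polynomial to four noisy samples (a hypothetical illustration, separate from the matrix example above): the training error is exactly zero, yet predictions on unseen inputs deviate from the underlying function.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = np.linspace(0, 1, 4)
y_train = np.sin(2 * np.pi * X_train) + 0.3 * rng.normal(size=4)

# Degree-3 polynomial features give a 4x4 design matrix: 4 parameters for
# 4 samples, so the training error can be driven to exactly zero.
Phi = np.vander(X_train, 4)
w = np.linalg.solve(Phi, y_train)
print(np.abs(Phi @ w - y_train).max())   # ~0: the interpolant fits the noise perfectly

# On unseen inputs the interpolant deviates from the noise-free function.
X_test = np.linspace(0, 1, 100)
pred = np.vander(X_test, 4) @ w
print(np.abs(pred - np.sin(2 * np.pi * X_test)).max())  # noticeably larger than the training error
```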
To avoid overfitting, we can first introduce a constraint by adding a penalty term on the complexity of $\boldsymbol{w}$, which is known as $\textit{regularization}$. For example, by penalizing the 2-norm of $\boldsymbol{w}$, we can generalize the least-squares problem as,
$\arg\min_{\boldsymbol{w}} \frac{1}{n}\sum_{i=1}^n \| \boldsymbol{w}^T \phi(\boldsymbol{x}_i) + b - y_i\|^2 + \frac{\lambda}{2}\boldsymbol{w}^T\boldsymbol{w}$
Similarly, taking the derivative w.r.t. $\boldsymbol{w}$ and setting it to zero, we get the optimal parameters as,
$\boldsymbol{w}^* = \left( \Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X}) + \lambda \boldsymbol{I}\right)^{-1}\Phi(\boldsymbol{X})^T\boldsymbol{Y}$
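A sketch of this regularized solution, which only adds $\lambda\boldsymbol{I}$ to the linear system (the design matrix and data below are illustrative placeholders):

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """w* = (Phi^T Phi + lambda I)^{-1} Phi^T Y for a given design matrix Phi."""
    d = Phi.shape[1]
    # In practice the bias column is often excluded from the penalty;
    # here every coefficient is penalized for simplicity.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Example: ridge fit on a small placeholder design matrix with a bias column.
Phi = np.array([[1.5, 1.0], [0.5, 1.0], [2.5, 1.0]])
y = np.array([2.0, 1.0, -1.0])
print(fit_ridge(Phi, y, lam=0.1))
```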
Again from the probabilistic view, we introduce uncertainty on $\boldsymbol{w}$ instead of only considering uncertainty on the response $\boldsymbol{y}$. Define a prior on $\boldsymbol{w}$ following a $d$-dimensional Gaussian distribution,
$\boldsymbol{w}\sim \mathcal{N}\left( \boldsymbol{0}, \Sigma_{\boldsymbol{w}} \right)$
We want to capture the posterior distribution of $\boldsymbol{w}$ after observing $\left( \boldsymbol{X}, \boldsymbol{Y}\right)$; that is, the objective is to maximize the posterior given by Bayes' theorem,
$\max p\left( \boldsymbol{w}\mid\boldsymbol{X}, \boldsymbol{Y}\right) = \frac{p\left( \boldsymbol{Y}\mid \boldsymbol{X}, \boldsymbol{w}\right)p\left( \boldsymbol{w}\right)}{\int p\left( \boldsymbol{Y}\mid \boldsymbol{X}, \boldsymbol{w}\right)p\left( \boldsymbol{w}\right)d\boldsymbol{w}}$
The denominator of the posterior is also called the marginal likelihood and is independent of $\boldsymbol{w}$. Taking the logarithm of the posterior, we have,
$\ln p\left( \boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{Y}\right) = -\frac{n+d}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2}\ln|\Sigma_{\boldsymbol{w}}| - \frac{1}{2\sigma^{2}}\|\boldsymbol{Y} - \Phi(\boldsymbol{X})\boldsymbol{w}\|^2 - \frac{1}{2}\boldsymbol{w}^T\Sigma_{\boldsymbol{w}}^{-1}\boldsymbol{w} + \text{const}$
Taking the gradient w.r.t. $\boldsymbol{w}$ and setting it to zero, we get a result very similar to the least-squares solution.
$\boldsymbol{w}_{map} = \left( \Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X}) + \sigma^2\Sigma_{\boldsymbol{w}}^{-1} \right)^{-1}\Phi(\boldsymbol{X})^T\boldsymbol{Y}$
If the prior is an isotropic Gaussian, we see that the $\textit{MAP}$ solution is equivalent to the $\ell_2$-regularized form. Now let us look back at the posterior; note that,
$\begin{align}
p\left( \boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{Y} \right) & \propto p\left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{w} \right)p\left( \boldsymbol{w} \right) \nonumber\\
& \propto \exp \left\lbrack -\frac{1}{2\sigma^2}\left( \Phi(\boldsymbol{X})\boldsymbol{w} - \boldsymbol{Y}\right)^T\left( \Phi(\boldsymbol{X})\boldsymbol{w} - \boldsymbol{Y}\right) \right\rbrack \exp \left( -\frac{1}{2}\boldsymbol{w}^T\Sigma_{\boldsymbol{w}}^{-1}\boldsymbol{w} \right) \nonumber\\
& \propto \exp\left\{ -\frac{1}{2}\boldsymbol{w}^T\left( \frac{1}{\sigma^2}\Phi(\boldsymbol{X})^T \Phi(\boldsymbol{X}) + \Sigma_{\boldsymbol{w}}^{-1}\right)\boldsymbol{w} + \frac{1}{\sigma^2}\boldsymbol{Y}^T\Phi(\boldsymbol{X})\boldsymbol{w} \right\}
\end{align}$
By completing the square we can get the mean and covariance of the posterior, that is,
$\begin{align}
\boldsymbol{w}^* &\sim \mathcal{N} \left( \boldsymbol{m}^*, \boldsymbol{A}^{-1} \right) \nonumber \\
\boldsymbol{m}^* &= \frac{1}{\sigma^2} \boldsymbol{A}^{-1} \Phi(\boldsymbol{X})^T\boldsymbol{Y} \\
\boldsymbol{A} &= \frac{1}{\sigma^2}\Phi(\boldsymbol{X})^T\Phi(\boldsymbol{X})+\Sigma_{\boldsymbol{w}}^{-1}
\end{align}$
And we see that the $\textit{MAP}$ solution is exactly the mode of the posterior. To predict for a new input sample $\boldsymbol{x}^*$, we can derive a predictive distribution instead of a single value at the mode. It is again Gaussian: its mean is the posterior mean projected onto the test feature vector, and its variance is a quadratic form of the test feature vector with the posterior covariance, which grows with the magnitude of the test features.
$\begin{align}
p(\boldsymbol{y}^* \mid \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{x}^*) & = \int p(\boldsymbol{y}^*\mid \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{w}, \boldsymbol{x}^*)p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{Y})d\boldsymbol{w} \nonumber \\
& \sim \mathcal{N}\left( \phi(\boldsymbol{x}^*)^T \boldsymbol{m}^*, \phi(\boldsymbol{x}^*)^T \boldsymbol{A}^{-1} \phi(\boldsymbol{x}^*) \right)
\end{align}$
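Both the posterior and the predictive moments follow directly from the formulas above; here is a sketch assuming an isotropic prior $\Sigma_{\boldsymbol{w}} = \tau^2\boldsymbol{I}$, a simple feature map $\phi(x) = (x, 1)$, and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, tau = 30, 0.2, 1.0
X = rng.uniform(-1, 1, size=n)
Phi = np.column_stack([X, np.ones(n)])                 # rows are phi(x_i) = (x_i, 1)
y = -1.0 * X + 0.5 + sigma * rng.normal(size=n)

# Posterior over w: N(m*, A^{-1}) with A = Phi^T Phi / sigma^2 + Sigma_w^{-1}.
Sigma_w_inv = np.eye(2) / tau**2
A = Phi.T @ Phi / sigma**2 + Sigma_w_inv
m = np.linalg.solve(A, Phi.T @ y) / sigma**2

# Predictive distribution at a new input x*: mean phi(x*)^T m*,
# variance phi(x*)^T A^{-1} phi(x*) (posterior uncertainty only).
phi_star = np.array([0.3, 1.0])
pred_mean = phi_star @ m
pred_var = phi_star @ np.linalg.solve(A, phi_star)
print(pred_mean, pred_var)
```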
# Classification
Now let us consider the classification problem, where we expect a function $\pi(\boldsymbol{x})$ that maps the input $\boldsymbol{x}$ to class labels, in the binary case $y \in \left\{ +1, -1 \right\}$. More commonly, we can define a Bernoulli distribution, $p(y=+1 \mid \boldsymbol{x})$ for one class and $1-p\left( y=+1 \mid \boldsymbol{x} \right)$ for the other. Typically, we choose a sigmoid function, e.g., the logistic function or $\tanh$, to squash a possibly unbounded value into a bounded interval, e.g., $\left\lbrack 0, 1 \right\rbrack$ for the logistic function. This is a desired behaviour, since any function value is transformed into a probability. The logistic function is defined as $\sigma(a) = \frac{1}{1+\exp(-a)}$.
Obviously, we also have,
$\begin{align*}
\sigma(-a) = 1-\sigma(a) \\
\tanh(a) = 2\sigma(2a) -1 \\
\frac{\partial \sigma(a)}{\partial a} = \left( 1-\sigma(a) \right)\sigma(a)
\end{align*}$
Thus, the conditional class distribution given the input can be defined as a sigmoid applied to the linear model, $p\left( y \mid \boldsymbol{x} \right) = \sigma(yf(\boldsymbol{x}))$ with $f(\boldsymbol{x}) = \boldsymbol{w}^T\phi(\boldsymbol{x})$; again, we fold the bias term $b$ into the parameters $\boldsymbol{w}$.
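A minimal sketch of the sigmoid warping and the resulting class probability (the weights below are arbitrary placeholders, not a fitted model):

```python
import numpy as np

def sigmoid(a):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Numerical check of the identities listed above.
a = 0.7
assert np.isclose(sigmoid(-a), 1 - sigmoid(a))
assert np.isclose(np.tanh(a), 2 * sigmoid(2 * a) - 1)

def class_prob(x_feat, w, y):
    """p(y | x) = sigma(y * w^T phi(x)) for y in {+1, -1}; the bias is folded into w."""
    return sigmoid(y * (w @ x_feat))

w = np.array([1.2, -0.4])          # placeholder parameters
x_feat = np.array([0.5, 1.0])      # phi(x) with a constant 1 appended
print(class_prob(x_feat, w, +1), class_prob(x_feat, w, -1))  # the two probabilities sum to 1
```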
> *to be continued*