The following are the most common probability distributions in machine learning. We list some details and important facts about each that could come up in an interview.
# Bernoulli
The probability that a binary event occurs, i.e., $p(x=1\,|\,\mu)=\mu$. We have,
$
x\sim Bern(x\,|\,\mu)=\mu^x(1-\mu)^{1-x}, \text{where}\,\,\, x\in\{0,1\}
$
Moreover, we can get its mean and variance,
$
\begin{array}{c}
\mathbb{E}(x) = \mu \\ Var(x)=\mu(1-\mu)
\end{array}
$
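As a quick sanity check of these moments, here is a minimal sketch (assuming `scipy` is available):
```python
# Check that the closed-form mean and variance of Bern(mu) match the formulas above.
from scipy import stats

mu = 0.3
x = stats.bernoulli(mu)          # Bernoulli distribution with p(x=1) = mu
print(x.mean(), mu)              # E[x]   = mu             -> 0.3
print(x.var(), mu * (1 - mu))    # Var[x] = mu * (1 - mu)  -> 0.21
```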
# Binomial
Given a dataset of $N$ binary variables, the binomial distribution describes the total number of successes, $p(x=m\,|\,N, \mu)$, where each trial follows a Bernoulli distribution parameterised by $\mu$,
$x\sim\begin{pmatrix}N\\m\end{pmatrix}\mu^m(1-\mu)^{N-m}$
Likewise, we have the mean and variance. The derivation follows by viewing the observations as $N$ independent Bernoulli variables: $x=(x_1,x_2,\ldots,x_N), \text{where}\,\, x_i\in\{0,1\} \,\,\text{and}\,\, x_i \sim Bern(\mu)$
$
\begin{array}{c}
\mathbb{E}(x)=\mu N\\Var(x)=\mu(1-\mu)N
\end{array}
$
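A short simulation sketch of this derivation (assuming `numpy`): summing $N$ independent Bernoulli draws recovers these moments.
```python
# Summing N independent Bernoulli(mu) draws gives a Binomial(N, mu) variable,
# with mean ~ N*mu and variance ~ N*mu*(1-mu).
import numpy as np

rng = np.random.default_rng(0)
N, mu, trials = 20, 0.3, 100_000

# Each row is one dataset of N Bernoulli variables; x counts the successes per row.
x = rng.binomial(1, mu, size=(trials, N)).sum(axis=1)

print(x.mean(), N * mu)              # ~ 6.0
print(x.var(), N * mu * (1 - mu))    # ~ 4.2
```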
# Beta distribution
The Beta distribution is a conjugate prior to the binomial distribution, as we shall see: the posterior is again a Beta distribution. The hyperparameters $a$ and $b$ are called effective numbers of observations, since $a-1$ plays the same role as $m$ in the binomial distribution. In practice, $a$ and $b$ encode prior knowledge about how many successes and failures we assume a priori.
$\begin{array}{c}
x\sim Beta(x\,|\,a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1} \\ \mathbb{E}(x)=\frac{a}{a+b} \\
Var(x)=\frac{ab}{(a+b)^2(a+b+1)}
\end{array}
$
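Before looking at the normaliser, here is a small sketch of the conjugate update (parameter values are illustrative): a $Beta(a,b)$ prior combined with $m$ successes out of $N$ binomial trials gives a $Beta(a+m,\, b+N-m)$ posterior.
```python
# Conjugate update sketch: Beta(a, b) prior + Binomial likelihood
# (m successes out of N trials) -> Beta(a + m, b + N - m) posterior.
a, b = 2, 2          # prior effective successes/failures (illustrative values)
m, N = 7, 10         # observed successes out of N trials

a_post, b_post = a + m, b + (N - m)
posterior_mean = a_post / (a_post + b_post)    # E[x] = a / (a + b), applied to the posterior
print(a_post, b_post, posterior_mean)          # 9, 5, ~0.643
```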
The gamma function forms the normalising constant that makes sure the density integrates to one.
$
\Gamma(x)=\int_0^{\infty}u^{x-1}e^{-u}du
$
When $x$ is a non-negative integer, the gamma function reduces to the factorial: $\Gamma(x+1)=x!$.
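For instance, a tiny check with Python's standard `math` module:
```python
import math

# Gamma(x + 1) equals x! for non-negative integers x.
print(math.gamma(5 + 1), math.factorial(5))   # 120.0, 120
```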
# Multinomial
Consider a dataset of $N$ observations, each coded in the 1-of-K scheme: $\mathbf{x}=(0,0,\ldots,1,\ldots,0)$ with $x_k=1$ and $x_{j\neq k}=0$. Suppose the observed counts are $\mathbf{m}=(m_1,m_2,\ldots,m_K)$ with $\sum_km_k=N$,
$
\begin{array}{c}
\text{Multi}(\mathbf{m}\,|\, \mathbf{\mu},N)=\begin{pmatrix}N\\m_1\,m_2\ldots m_K\end{pmatrix}\prod_k \mu_k^{m_k}\\ \text{where}\,\, \mathbf{\mu}=(\mu_1,\mu_2,\ldots,\mu_K) \,\, \text{and} \,\, \sum_k \mu_k = 1
\end{array}
$
For each class $k$,
$
\begin{array}{c}
\mathbb{E}(m_k)=\mu_kN\\
Var(m_k)=\mu_k(1-\mu_k)N\\
cov(m_k, m_j)=-N\mu_k\mu_j
\end{array}
$
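A simulation sketch of these per-class moments (assuming `numpy`), including the negative covariance between class counts:
```python
import numpy as np

rng = np.random.default_rng(0)
N, mu = 30, np.array([0.5, 0.3, 0.2])
m = rng.multinomial(N, mu, size=100_000)    # each row is one draw of counts (m_1, m_2, m_3)

print(m.mean(axis=0), N * mu)               # E[m_k]   = N * mu_k
print(m.var(axis=0), N * mu * (1 - mu))     # Var[m_k] = N * mu_k * (1 - mu_k)
print(np.cov(m[:, 0], m[:, 1])[0, 1], -N * mu[0] * mu[1])   # cov(m_k, m_j) = -N * mu_k * mu_j
```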
# Dirichlet
The Dirichlet distribution is a conjugate prior to the multinomial distribution; it describes the class probabilities $\mu_k$ of the multinomial. You can think of it as the multivariate generalisation of the Beta distribution.
$
\text{Dir}(\mathbf{\mu}\,|\,\mathbf{\alpha})=\frac{\Gamma(\sum_k \alpha_k)}{\Gamma(\alpha_1)\ldots\Gamma(\alpha_K)}\prod_k\mu_k^{\alpha_k-1}
$
Letting $\hat{\alpha}=\sum_k\alpha_k$, we have:
$
\begin{array}{c}
\mathbb{E}(\mu_k)=\frac{\alpha_k}{\hat{\alpha}} \\
Var(\mu_k)=\frac{\alpha_k (\hat{\alpha}-\alpha_k)}{\hat{\alpha}^2(\hat{\alpha}+1)}\\
cov(\mu_k,\mu_j)=-\frac{\alpha_k \alpha_j}{\hat{\alpha}^2(\hat{\alpha}+1)}
\end{array}
$
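A sketch (assuming `numpy`) checking the Dirichlet mean, plus the conjugate update that mirrors the Beta/binomial case: observing counts $\mathbf{m}$ simply adds them to $\mathbf{\alpha}$.
```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=100_000)

print(samples.mean(axis=0), alpha / alpha.sum())   # E[mu_k] = alpha_k / alpha_hat

# Conjugate update sketch: Dir(alpha) prior + multinomial counts m -> Dir(alpha + m).
m = np.array([4, 1, 5])
alpha_post = alpha + m
print(alpha_post / alpha_post.sum())               # posterior mean of mu
```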
# Normal distribution
It is probably the most widely used distribution in ML, and there are both univariate and multivariate versions. For univariate $x$,
$
\begin{array}{c}
x\sim \mathcal{N}(m, \sigma^2)\\
p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-m)^2}{2\sigma^2}}
\end{array}
$
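A minimal density check of this formula (assuming `scipy.stats.norm`; the point $x$ and parameters are illustrative):
```python
import numpy as np
from scipy.stats import norm

m, sigma, x = 0.5, 2.0, 1.3
# Evaluate the density directly from the formula above.
p_manual = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
print(p_manual, norm(m, sigma).pdf(x))   # the two values agree (~0.184)
```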
Similarly, for multivariate $\mathbf{x}\in \mathbb{R}^D$,
$
\begin{array}{c}
\mathbf{x}\sim\mathcal{N}(\mu, \Sigma)\\
p(\mathbf{x})=(2\pi)^{-D/2}|\Sigma|^{-1/2}\text{exp}\{-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\}
\end{array}
$
Sometimes it is simply more convenient to use the inverse covariance (precision) matrix $\beta=\Sigma^{-1}$.
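The multivariate density can also be checked numerically; here is a sketch (assuming `scipy.stats.multivariate_normal`, with illustrative values):
```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.5])

# Density from the formula: (2*pi)^(-D/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)).
# np.linalg.solve applies the precision matrix Sigma^{-1} to (x - mu).
D = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)
p_manual = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** -0.5 * np.exp(-0.5 * quad)

print(p_manual, multivariate_normal(mu, Sigma).pdf(x))   # the two values agree
```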
Since the normal distribution is so important, you may be asked to write down both the marginal and the conditional Gaussians. If we have $m$ independent variables $x_i \sim N(\mu_i, \sigma_i^2), \,\,\, i=1,2,\ldots,m$, then $z=\sum_i x_i\sim N(\sum_i \mu_i, \sum_i \sigma_i^2)$; note how the means and variances add. The following properties may be useful (a numerical sketch follows the list):
1. If $Z\sim N(0, I)$ then $X=\mu+\Sigma^{1/2}Z \sim N(\mu, \Sigma)$
2. Define the partition $X=(X_a, X_b)\sim N(\mu, \Sigma)$ with $\mu = \begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}$, $\Sigma=\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}$
    1. We have the marginal $X_a\sim N(\mu_a, \Sigma_{aa})$
    2. We also have the conditional $X_b\,|\,X_a \sim N(\mu_b+\Sigma_{ba}\Sigma_{aa}^{-1}(x_a - \mu_a),\; \Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab})$
3. If $a$ is a vector then $a^TX\sim N(a^T\mu, a^T\Sigma a)$
4. $V=(X-\mu)^T\Sigma^{-1}(X-\mu)\sim \chi_D^2$
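A sketch (assuming `numpy`) of properties 1 and 2: samples drawn via $X=\mu+\Sigma^{1/2}Z$ have the stated mean and covariance, and the conditional parameters follow the formulas above (values are illustrative).
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Property 1: X = mu + L Z with L L^T = Sigma (a Cholesky factor plays the
# role of Sigma^{1/2} for sampling) gives X ~ N(mu, Sigma).
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T
print(X.mean(axis=0), np.cov(X.T))          # ~ mu and ~ Sigma

# Property 2: conditional of X_b given X_a = x_a, where a and b are the
# first and second coordinates respectively.
x_a = 2.0
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x_a - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]
print(cond_mean, cond_var)                  # -0.6, 0.68
```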