Following are the most common probability distributions in machine learning. We list some of the details and important facts about them which could come up in an interview.

# Bernoulli

The probability that a binary event occurs, i.e., $p(x=1\,|\,\mu)=\mu$. We have

$x\sim \text{Bern}(x\,|\,\mu)=\mu^x(1-\mu)^{1-x}, \quad\text{where}\,\,\, x\in\{0,1\}$

Moreover, we can get its mean and variance,

$\begin{array}{c} \mathbb{E}(x) = \mu \\ Var(x)=\mu(1-\mu) \end{array}$

# Binomial

Given a dataset of $N$ binary variables, the binomial distribution describes the total number of successes, $p(x=m\,|\,N, \mu)$, where each trial follows a Bernoulli distribution parameterised by $\mu$,

$x\sim \text{Bin}(m\,|\,N,\mu)=\begin{pmatrix}N\\m\end{pmatrix}\mu^m(1-\mu)^{N-m}$

Likewise, we have the mean and variance. The derivation follows when we view the observations as $N$ independent Bernoulli variables:

$x=(x_1,x_2,\ldots,x_N), \quad\text{where}\,\, x_i\in\{0,1\} \,\,\text{and}\,\, x_i \sim \text{Bern}(\mu)$

$\begin{array}{c} \mathbb{E}(x)=\mu N \\ Var(x)=\mu(1-\mu)N \end{array}$

# Beta distribution

The Beta distribution is a conjugate prior to the binomial distribution: as we shall see, the posterior is again a Beta distribution. The hyperparameters $a$ and $b$ can be interpreted as effective numbers of observations, since $a-1$ plays a role similar to $m$ in the binomial distribution. In practice, $a$ and $b$ encode prior knowledge about how many successes and failures we assume before seeing data.

$\begin{array}{c} x\sim \text{Beta}(x\,|\,a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1} \\ \mathbb{E}(x)=\frac{a}{a+b} \\ Var(x)=\frac{ab}{(a+b)^2(a+b+1)} \end{array}$

The gamma function provides the normalising constant that makes the density integrate to one,

$\Gamma(x)=\int_0^{\infty}u^{x-1}e^{-u}du$

When $x$ is a positive integer, it reduces to the factorial: $\Gamma(x+1)=x!$.

# Multinomial

Given a dataset of $N$ multivariate observations, each coded in the 1-of-K scheme as $\mathbf{x}=(0,0,\ldots,1,\ldots,0)$ with $x_k=1$ and $x_{j\neq k}=0$, suppose we observe the counts $\mathbf{m}=(m_1,m_2,\ldots,m_K)$ with $\sum_k m_k=N$,

$\begin{array}{c} \text{Multi}(\mathbf{m}\,|\, \mathbf{\mu},N)=\begin{pmatrix}N\\m_1\,m_2\ldots m_K\end{pmatrix}\prod_k \mu_k^{m_k} \\ \text{where}\,\, \mathbf{\mu}=(\mu_1,\mu_2,\ldots,\mu_K) \,\, \text{and} \,\, \sum_k \mu_k = 1 \end{array}$

For each class $k$,

$\begin{array}{c} \mathbb{E}(m_k)=\mu_k N \\ Var(m_k)=\mu_k(1-\mu_k)N \\ cov(m_k, m_j)=-N\mu_k\mu_j \end{array}$

# Dirichlet

The Dirichlet distribution is a conjugate prior to the multinomial distribution, and it describes the individual parameters $\mu_k$ of the multinomial. You can think of it as a multivariate version of the Beta distribution.

$\text{Dir}(\mathbf{\mu}\,|\,\mathbf{\alpha})=\frac{\Gamma(\sum_k \alpha_k)}{\Gamma(\alpha_1)\ldots\Gamma(\alpha_K)}\prod_k\mu_k^{\alpha_k-1}$

Let $\hat{\alpha}=\sum_k\alpha_k$; we have:

$\begin{array}{c} \mathbb{E}(\mu_k)=\frac{\alpha_k}{\hat{\alpha}} \\ Var(\mu_k)=\frac{\alpha_k (\hat{\alpha}-\alpha_k)}{\hat{\alpha}^2(\hat{\alpha}+1)} \\ cov(\mu_k,\mu_j)=-\frac{\alpha_k \alpha_j}{\hat{\alpha}^2(\hat{\alpha}+1)} \end{array}$
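To make the conjugacy concrete, here is a minimal sketch (assuming NumPy and SciPy; the values of $a$, $b$, $N$, $\mu$, and $\mathbf{\alpha}$ below are made up for illustration). Observing binomial or multinomial counts simply adds the counts to the prior parameters, giving a Beta or Dirichlet posterior respectively.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# --- Beta-Binomial conjugacy ---
# Prior Beta(a, b); observing m successes out of N trials gives
# posterior Beta(a + m, b + N - m).  All numbers here are made-up examples.
a, b = 2.0, 2.0
mu_true, N = 0.7, 100
m = rng.binomial(N, mu_true)                 # observed number of successes
post = stats.beta(a + m, b + N - m)          # closed-form posterior
print("Beta posterior mean:", post.mean())   # close to mu_true for large N
print("Beta posterior var :", post.var())

# --- Dirichlet-Multinomial conjugacy ---
# Prior Dir(alpha); observing counts m_k gives posterior Dir(alpha + m).
alpha = np.array([1.0, 1.0, 1.0])
mu_vec = np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(N, mu_vec)
alpha_post = alpha + counts
alpha_hat = alpha_post.sum()
print("Dirichlet posterior mean:", alpha_post / alpha_hat)
print("Dirichlet posterior var :",
      alpha_post * (alpha_hat - alpha_post) / (alpha_hat**2 * (alpha_hat + 1)))
```

The printed moments follow the formulas above, with the prior parameters replaced by the posterior ones.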
# Normal distribution

It is probably the most widely used distribution in ML, and there are univariate and multivariate versions of it.

For univariate $x$,

$\begin{array}{c} x\sim \mathcal{N}(\mu, \sigma^2) \\ p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \end{array}$

Similarly, for multivariate $\mathbf{x}\in \mathcal{R}^D$,

$\begin{array}{c} \mathbf{x}\sim\mathcal{N}(\mathbf{\mu}, \Sigma) \\ p(\mathbf{x})=(2\pi)^{-D/2}|\Sigma|^{-1/2}\text{exp}\{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\} \end{array}$

Sometimes it is more convenient to work with the inverse covariance (precision) $\Lambda=\Sigma^{-1}$.

Since the normal distribution is so important, you may be asked to write down both the marginal and the conditional Gaussians. If we have $n$ independent variables $x_i \sim N(\mu_i, \sigma_i^2), \,\, i=1,2,\ldots,n$, then their sum satisfies $z=\sum_i x_i\sim N(\sum_i \mu_i, \sum_i \sigma_i^2)$; note how the means and variances add.

The following properties may be useful (properties 1 and 2 are checked numerically in the sketch below):

1. If $Z\sim N(0, I)$ then $X=\mu+\Sigma^{1/2}Z \sim N(\mu, \Sigma)$.
2. Define the partition $X=(X_a, X_b)\sim N(\mu, \Sigma)$ with $\mu = \begin{bmatrix}\mu_a\\ \mu_b\end{bmatrix}$ and $\Sigma=\begin{bmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{bmatrix}$. Then:
    1. The marginal is $X_a\sim N(\mu_a, \Sigma_{aa})$.
    2. The conditional is $X_b\,|\,X_a \sim N(\mu_b+\Sigma_{ba}\Sigma_{aa}^{-1}(x_a - \mu_a),\, \Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab})$.
3. If $a$ is a constant vector then $a^TX\sim N(a^T\mu, a^T\Sigma a)$.
4. $V=(X-\mu)^T\Sigma^{-1}(X-\mu)\sim \chi_D^2$, the chi-squared distribution with $D$ degrees of freedom.
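As a quick numerical sanity check of properties 1 and 2, here is a minimal sketch assuming NumPy and SciPy; the dimension, mean, covariance, and partition are made-up values. It draws samples via the Cholesky factor of $\Sigma$ (one valid choice of $\Sigma^{1/2}$) and verifies that the joint density factorises as the marginal of $X_a$ times the conditional of $X_b$ given $X_a$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D, n_samples = 4, 200_000                    # made-up dimension and sample count

# Made-up mean and a random positive-definite covariance.
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)

# Property 1: X = mu + L Z with Z ~ N(0, I) and L L^T = Sigma gives X ~ N(mu, Sigma).
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((n_samples, D))
X = mu + Z @ L.T
print(np.allclose(X.mean(axis=0), mu, atol=0.05))   # empirical mean  ~ mu
print(np.allclose(np.cov(X.T), Sigma, atol=0.1))    # empirical cov   ~ Sigma

# Property 2: marginal / conditional of a partitioned Gaussian.
# Partition indices a = {0, 1}, b = {2, 3}; check p(x) = p(x_a) * p(x_b | x_a).
a, b = slice(0, 2), slice(2, 4)
x = rng.normal(size=D)                               # an arbitrary test point
mu_a, mu_b = mu[a], mu[b]
S_aa, S_ab, S_ba, S_bb = Sigma[a, a], Sigma[a, b], Sigma[b, a], Sigma[b, b]
cond_mean = mu_b + S_ba @ np.linalg.solve(S_aa, x[a] - mu_a)
cond_cov = S_bb - S_ba @ np.linalg.solve(S_aa, S_ab)
lhs = stats.multivariate_normal(mu, Sigma).logpdf(x)
rhs = (stats.multivariate_normal(mu_a, S_aa).logpdf(x[a])
       + stats.multivariate_normal(cond_mean, cond_cov).logpdf(x[b]))
print(np.isclose(lhs, rhs))                          # True: the densities factorise
```

Using the Cholesky factor for $\Sigma^{1/2}$ and `np.linalg.solve` instead of explicitly inverting $\Sigma_{aa}$ is the usual numerically stable choice.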