1. Basic Probability Rules

Probability theory is a branch of mathematics concerned with analyzing random events. Below are the fundamental rules and concepts in probability:

Probability of an Event

The probability of an event $A$, denoted $P(A)$, represents the likelihood of that event occurring. The probability of any event always lies within the range $0 \leq P(A) \leq 1$.

Sample Space and Events

  • Sample Space $S$: The set of all possible outcomes of a random experiment.
  • Event $A$: A subset of the sample space.

Complement Rule

The complement of an event $A$, denoted $A^c$, is the event that $A$ does not occur. Its probability is given by: $P(A^c) = 1 - P(A)$.

Addition Rule of Probability

If $A$ and $B$ are two mutually exclusive events (they share no outcomes, so $A \cap B = \emptyset$), the probability of either event $A$ or $B$ occurring is: $P(A \cup B) = P(A) + P(B)$.

If $A$ and $B$ are not mutually exclusive, the general addition rule is: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Multiplication Rule of Probability

If $A$ and $B$ are two independent events (where the occurrence of one event does not affect the probability of the other), the probability of both events $A$ and $B$ occurring is: $P(A \cap B) = P(A) \cdot P(B)$.

In general, whether or not $A$ and $B$ are independent, the probability of both events occurring is: $P(A \cap B) = P(A) \cdot P(B|A)$, where $P(B|A)$ is the conditional probability of $B$ given that $A$ has occurred. When the events are independent, $P(B|A) = P(B)$ and this reduces to the previous formula.
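As a quick check of the rules above, the following sketch enumerates the sample space of one fair six-sided die. The die, the events `A` and `B`, and the helper `prob` are assumptions chosen for illustration; they do not come from the text.

```python
from fractions import Fraction

# Sample space of one fair six-sided die (an assumed example).
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of S) when all outcomes are equally likely."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is greater than 3"

# Complement rule: P(A^c) = 1 - P(A)
assert prob(S - A) == 1 - prob(A)

# General addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Multiplication rule: P(A ∩ B) = P(A) * P(B|A)
p_B_given_A = Fraction(len(A & B), len(A))
assert prob(A & B) == prob(A) * p_B_given_A

print(prob(A | B), prob(A & B), p_B_given_A)  # prints: 2/3 1/3 2/3
```

Exact fractions are used instead of floats so that the identities can be compared with `==`.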

Addition Rule

$$ p(X) = \sum_Y p(X, Y) $$

This rule calculates the probability of a random variable $X$ by summing the joint probabilities $p(X, Y)$ over all possible values of a random variable $Y$.

Multiplication Rule

$$ p(X, Y) = p(Y | X) p(X) $$

This rule helps determine the joint probability of two random variables $X$ and $Y$.
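The addition and multiplication rules for random variables can be illustrated on a small joint table. The distribution below is hypothetical, and the names `joint`, `marginal_x`, and `cond_y_given_x` are introduced only for this sketch.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(X, Y) with X in {0, 1} and Y in {"a", "b"}.
joint = {
    (0, "a"): F(1, 10), (0, "b"): F(3, 10),
    (1, "a"): F(2, 10), (1, "b"): F(4, 10),
}

def marginal_x(x):
    """Addition rule: p(X) = sum over Y of p(X, Y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_y_given_x(y, x):
    """Conditional probability p(Y | X) = p(X, Y) / p(X)."""
    return joint[(x, y)] / marginal_x(x)

# Multiplication rule: p(X, Y) = p(Y | X) p(X) holds for every cell of the table.
for (x, y), p in joint.items():
    assert p == cond_y_given_x(y, x) * marginal_x(x)

print(marginal_x(0), marginal_x(1))  # prints: 2/5 3/5
```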

Conditional Probability

From the multiplication rule, and the symmetry $P(X, Y) = P(Y, X)$, we can establish a relationship between conditional distributions:

$$ P(X | Y) = \frac{P(Y | X) P(X)}{P(Y)} \tag{1.1} $$

This is known as Bayes’ Rule.

By combining the multiplication and addition rules, the denominator can be expressed as:

$$ P(Y) = \sum_X P(Y | X) P(X) \tag{1.2} $$

In Bayes’ Rule, the denominator acts as a normalization constant. It ensures that the conditional probabilities $P(X | Y)$ on the left side of equation (1.1), summed over all values of $X$, equal 1, so that $P(X | Y)$ is itself a valid probability distribution.

To verify, assume we have two possibilities for $X$: $X_1$ and $X_2$. Applying Bayes’ Rule to both cases, we have:

$$ \begin{align*}
P(X_1 | Y) &= \frac{P(Y | X_1) P(X_1)}{P(Y)} \tag{1.3} \\
P(X_2 | Y) &= \frac{P(Y | X_2) P(X_2)}{P(Y)} \tag{1.4}
\end{align*} $$

Since $P(X_1 | Y) + P(X_2 | Y) = 1$, we get:

$$ \begin{align*}
& \frac{P(Y | X_1) P(X_1)}{P(Y)} + \frac{P(Y | X_2) P(X_2)}{P(Y)} = 1 \\
\Leftrightarrow \; & P(Y) = P(Y | X_1) P(X_1) + P(Y | X_2) P(X_2)
\end{align*} $$

which is the denominator shown in (1.2).
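A minimal numerical sketch of Bayes' Rule for two possibilities $X_1$ and $X_2$ follows. The prior and likelihood values are made up for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood):
    """Bayes' Rule for a discrete X and one observed Y.

    prior:      dict mapping each value of X to P(X)
    likelihood: dict mapping each value of X to P(Y | X)
    Returns the pair (dict of P(X | Y), P(Y)).
    """
    # Denominator, equation (1.2): P(Y) = sum over X of P(Y | X) P(X)
    evidence = sum(likelihood[x] * prior[x] for x in prior)
    # Numerator over denominator, equation (1.1)
    post = {x: likelihood[x] * prior[x] / evidence for x in prior}
    return post, evidence

# Assumed numbers: X1 is rare, but Y is much more likely under X1 than under X2.
prior = {"X1": 0.01, "X2": 0.99}
likelihood = {"X1": 0.95, "X2": 0.05}

post, p_y = posterior(prior, likelihood)
print(post, p_y)
```

Dividing by `evidence` is what makes the values in `post` sum to 1, which is the normalization role of the denominator.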


2. Probability Density Function

In addition to finding the probability of discrete variables, we also need to consider the case of continuous variables.

If the probability of a real variable $x$ falling within an interval $(x, x + \delta x)$ is defined by $p(x) \delta x$ as $\delta x \rightarrow 0$, then $p(x)$ is called the probability density function of $x$.

The probability that the value of $x$ lies in the interval $(a, b)$ is defined by:

$$ p(x \in (a, b)) = \int_a^b p(x) dx $$

Since the value of a probability must be non-negative, and since the value of $x$ must lie somewhere within the real number range, the probability density function $p(x)$ must satisfy two conditions:

$$ \begin{cases} p(x) \geq 0, \\ \int_{-\infty}^{\infty} p(x) \, dx = 1 \end{cases} $$

The addition and multiplication rules of probability, as well as Bayes’ Rule, also apply to probability densities, or combinations of continuous and discrete variables. For example, if $x$ and $y$ are two continuous variables, their addition and multiplication rules take the form:

$$ \begin{align*}
p(x) &= \int p(x, y) \, dy, \\
p(x, y) &= p(x | y) p(y)
\end{align*} $$
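The two conditions on a density, and the interval probability $\int_a^b p(x) \, dx$, can be checked numerically. The sketch below assumes an example density $p(x) = 2x$ on $(0, 1)$ and uses a simple midpoint rule; both choices are illustrative only.

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def p(x):
    """Assumed example density: p(x) = 2x on (0, 1), zero elsewhere."""
    return 2.0 * x if 0.0 < x < 1.0 else 0.0

total = integrate(p, 0.0, 1.0)       # normalization: should be close to 1
prob_ab = integrate(p, 0.25, 0.5)    # exact value: 0.5**2 - 0.25**2 = 0.1875

print(total, prob_ab)
```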


3. Expectation and Variance

Expectation

An important calculation in probability is the weighted average, in which each value is multiplied by a weight (here, its probability) before the values are summed.

The weighted average of a function $f(x)$ following a probability distribution $p(x)$ is called the expectation of $f(x)$, denoted $\mathbf{E}[f]$.

For a discrete distribution, its expectation is:

$$ \mathbf{E}[f] = \sum_x p(x) f(x) $$

This represents a weighted average based on the relative probabilities of different values of $x$.

In the case of continuous variables, the expectation is given by an integral with respect to the probability density function:

$$ \mathbf{E}[f] = \int p(x) f(x) dx $$

Alternatively, if we have a finite set of $N$ points generated from a probability distribution or probability density, the expectation can be approximated by a finite sum over these points:

$$ \mathbf{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) $$
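The finite-sum approximation can be tried on a case where the exact answer is known. This sketch assumes $x$ is uniform on $(0, 1)$ and $f(x) = x^2$, for which the exact expectation is $1/3$; the function name `mc_expectation` and the fixed seed are arbitrary choices.

```python
import random

def mc_expectation(f, sampler, n):
    """Approximate E[f] by (1/N) * sum of f(x_n) over N samples drawn by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so that the run is reproducible
estimate = mc_expectation(lambda x: x * x, rng.random, 100_000)

print(estimate)  # close to the exact value 1/3
```

The estimate fluctuates from seed to seed, and its error typically shrinks in proportion to $1/\sqrt{N}$.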

At times, we need to consider the expectation of multivariable functions, for example, $f(x, y)$. Here, we use subscripts to indicate the variable being averaged. For instance:

$$ \mathbf{E}_x[f(x, y)] $$

indicates the average of the function $f(x, y)$ with respect to the distribution of $x$.

We can also take the expectation with respect to a conditional distribution, which gives the conditional expectation:

$$ \mathbf{E}[f | y] = \sum_x p(x | y) f(x) $$

For continuous variables, this becomes:

$$ \mathbf{E}[f | y] = \int p(x | y) f(x) \, dx $$
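As a small worked case, the sketch below computes a conditional expectation as the average of $f(x)$ weighted by $p(x | y)$, that is, $\sum_x p(x | y) f(x)$. The joint table and the name `conditional_expectation` are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y) with x in {1, 2} and y in {"a", "b"}.
joint = {
    (1, "a"): F(1, 8), (2, "a"): F(3, 8),
    (1, "b"): F(2, 8), (2, "b"): F(2, 8),
}

def conditional_expectation(f, y):
    """E[f | y] = sum over x of p(x | y) f(x), with p(x | y) = p(x, y) / p(y)."""
    p_y = sum(p for (_, yi), p in joint.items() if yi == y)
    return sum(f(x) * p / p_y for (x, yi), p in joint.items() if yi == y)

print(conditional_expectation(lambda x: x, "a"))  # prints: 7/4
print(conditional_expectation(lambda x: x, "b"))  # prints: 3/2
```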

Variance

Variance measures how widely the values of the random variable $x$ are spread around its expected value $E[x]$. The variance of a variable following a probability distribution is defined as:

$$ \begin{align*}
Var(x) &= E[(x - E[x])^2] \\
&= E[x^2 - 2xE[x] + E[x]^2] \\
&= E[x^2] - 2E[x]E[x] + E[x]^2 \quad \text{(linearity of expectation; $E[x]$ is a constant)} \\
&= E[x^2] - 2E[x]^2 + E[x]^2 \\
&= E[x^2] - E[x]^2
\end{align*} $$

We see that variance can be calculated by taking the expected value of $x^2$ and subtracting the square of the expected value of $x$.
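The identity $Var(x) = E[x^2] - E[x]^2$ can be confirmed on a concrete distribution. The fair six-sided die used here is an assumed example.

```python
from fractions import Fraction as F

values = range(1, 7)   # faces of a fair die (assumed example)
p = F(1, 6)            # each face has probability 1/6

mean = sum(p * x for x in values)                      # E[x]
var_def = sum(p * (x - mean) ** 2 for x in values)     # E[(x - E[x])^2]
var_short = sum(p * x * x for x in values) - mean ** 2  # E[x^2] - E[x]^2

assert var_def == var_short
print(mean, var_def)  # prints: 7/2 35/12
```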

Covariance

Covariance between two random variables $x$ and $y$ measures the degree to which these variables change together.

The formula for covariance between $x$ and $y$ is:

$$ \begin{align*}
Cov(x, y) &= E_{x, y}[(x - E[x])(y - E[y])] \\
&= E_{x, y}[xy - xE[y] - yE[x] + E[x]E[y]] \\
&= E_{x,y}[xy] - E_{x,y}[xE[y]] - E_{x,y}[yE[x]] + E_{x,y}[E[x]E[y]] \\
&= E_{x,y}[xy] - E[x]E[y] - E[x]E[y] + E[x]E[y] \\
&= E_{x,y}[xy] - E[x]E[y] \tag{1}
\end{align*} $$

  • If the covariance is positive, $x$ and $y$ are positively correlated: when $x$ increases, $y$ tends to increase, and vice versa.
  • If the covariance is negative, $x$ and $y$ are negatively correlated: when $x$ increases, $y$ tends to decrease, and vice versa.
  • If $x$ and $y$ are independent, the covariance is zero. The converse does not hold in general: zero covariance only means that $x$ and $y$ are uncorrelated, and they may still be dependent.

The proof of the third case is straightforward, because when $x$ and $y$ are independent, the expectation of their product factorizes:

$$ E_{x,y}[xy] = E[x]E[y] $$

Thus, equation $(1)$ becomes $E[x]E[y] - E[x]E[y] = 0$.
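The covariance formula $E_{x,y}[xy] - E[x]E[y]$ can be evaluated directly on a small joint table. The example below is an assumption chosen to make one point concrete: $x$ is uniform on $\{-1, 0, 1\}$ and $y = x^2$, so $y$ is fully determined by $x$, yet the covariance is zero. Zero covariance alone therefore does not establish independence.

```python
from fractions import Fraction as F

# Assumed joint distribution: x uniform on {-1, 0, 1} and y = x**2.
joint = {(-1, 1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}

def expect(g):
    """Expectation of g(x, y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - e_x * e_y   # E[xy] - E[x]E[y]

# Independence would require p(x, y) = p(x) p(y) for every pair; test it at (0, 0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
independent_at_00 = joint[(0, 0)] == p_x0 * p_y0

print(cov, independent_at_00)  # prints: 0 False
```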


4. Gaussian Distribution

The Gaussian distribution is one of the most important probability distributions for continuous variables, also known as the normal distribution.

For the case of a real variable $x$, the Gaussian distribution is defined by:

$$ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) \tag{2} $$

This distribution is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, denoted as $\sigma$, is called the standard deviation, and the inverse of the variance, written as $\beta = 1/\sigma^2$, is called the precision.

From the form of equation $(2)$, we observe that the Gaussian distribution satisfies:

$$ \mathcal{N}(x|\mu, \sigma^2) > 0 $$

This is because:

  • $\frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} > 0$. This is a positive constant, because the Gaussian distribution requires the variance to be strictly positive, $\sigma^2 > 0$, not merely non-negative. If the variance were zero, all values of the random variable would equal the mean $\mu$ exactly, with no dispersion, and the density in $(2)$ would be undefined because of the division by $\sigma^2$.
  • $\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) > 0$, because the exponential function of any real number is always greater than zero, even if the number is negative.

For $(2)$ to be a valid probability distribution, we need to prove that:

$$ \begin{align*}
& \int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2) \, dx = 1 \\
\Leftrightarrow \; & \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) \, dx = 1 \tag{3}
\end{align*} $$

From $(3)$, we set $z = \frac{x - \mu}{\sigma}$, so that $x - \mu = \sigma z$ and $dx = \sigma \, dz$, and the expression becomes:

$$ \begin{align*}
& \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{(\sigma z)^2}{2\sigma^2}\right) \sigma \, dz = 1 \\
\Leftrightarrow \; & \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{z^2}{2}\right) \sigma \, dz = 1 \\
\Leftrightarrow \; & \int_{-\infty}^{\infty} \frac{\sigma}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left(-\frac{z^2}{2}\right) \, dz = 1 \\
\Leftrightarrow \; & \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) \, dz = 1 \\
\Leftrightarrow \; & \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{z^2}{2}\right) \, dz = 1 \tag{4}
\end{align*} $$

Equation $(4)$ holds because of the Gaussian integral $\int_{-\infty}^{\infty} \exp\left(-\frac{z^2}{2}\right) \, dz = \sqrt{2\pi}$. Its integrand, $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$, is the density of the standard normal distribution, the special case of the normal distribution with mean $= 0$ and variance $= 1$. Since every step above is an equivalence, $(3)$ holds as well, and hence $(2)$ is a valid probability density.
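The normalization of the Gaussian density can also be checked numerically. The sketch below integrates $\mathcal{N}(x|\mu, \sigma^2)$ over $\mu \pm 10\sigma$ with a midpoint rule; the parameter values, the truncation of the infinite range, and the helper names are assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density N(x | mu, sigma^2) as in equation (2)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def integrate(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

mu, sigma2 = 1.5, 4.0           # arbitrary example parameters
sigma = math.sqrt(sigma2)

# The tails beyond 10 standard deviations contribute a negligible amount.
area = integrate(lambda x: gaussian_pdf(x, mu, sigma2), mu - 10 * sigma, mu + 10 * sigma)
print(area)  # very close to 1
```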

Expectation and Variance

The expectation of a variable $x$ following a Gaussian distribution is given by:

$$ \mathbf{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2) \, x \, dx = \mu \tag{5} $$

Similarly, the second moment is:

$$ \mathbf{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2) \, x^2 \, dx = \mu^2 + \sigma^2 \tag{6} $$

From equations $(5)$ and $(6)$, the variance of the Gaussian distribution is:

$$ Var(x) = \mathbf{E}[x^2] - \mathbf{E}[x]^2 = \sigma^2 $$
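These moments can be checked empirically by drawing samples and averaging. The sketch below uses `random.gauss` from the Python standard library with arbitrary example parameters $\mu = 1.5$ and $\sigma^2 = 4$; the sample size and seed are also arbitrary.

```python
import random

mu, sigma = 1.5, 2.0          # example parameters, so sigma^2 = 4
rng = random.Random(0)        # fixed seed so that the run is reproducible
n = 200_000

samples = [rng.gauss(mu, sigma) for _ in range(n)]

mean_est = sum(samples) / n                       # estimates E[x] = mu
second_est = sum(x * x for x in samples) / n      # estimates E[x^2] = mu^2 + sigma^2
var_est = second_est - mean_est ** 2              # estimates Var(x) = sigma^2

print(mean_est, second_est, var_est)  # close to 1.5, 6.25, and 4.0
```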