Entropy
Consider the example of the event where a UFO appears in the sky. We have never seen anything like it, and we would be very surprised if such an event occurred one day because it is not expected based on our current knowledge. Therefore, the amount of information can be thought of as the level of surprise when knowing the value of $x$. The measure of the amount of information depends on the probability distribution $p(x)$ and is specifically given by the logarithm of $p(x)$ as follows:
$$ \mathbb{H}(x) = -\log_2 p(x) $$
The negative sign appears because for all $x \in [0,1]$, $\log_2(x) \leq 0$ (the case when $0$ occurs is when $x=1$). Since we are using entropy to measure, the value needs to be positive to reflect the magnitude of the value. Thus, we have the negative sign in front. When the base of the logarithm is 2, the unit of $h(x)$ is bit.
The average information is calculated by taking the expectation over the amount of information, given by:
$$ \mathbb{H}[x] = -\sum_{x} p(x)h(x) = -\sum_{x} p(x)\log_2 p(x) $$
This is called the entropy of the random variable $X$, representing the average amount of information (or uncertainty) associated with a random variable. It measures the level of uncertainty of a probability distribution. For example, if we observe a sequence ${X}_n \sim p$ generated from the distribution $p$. If the entropy of ${X}_n$ is high, predicting each character ${X}_n$ will be difficult; conversely, if the entropy is 0, then every ${X}_n$ has the same value, and the dataset $\mathbf{D} = (X_1, X_2, …, X_n)$ is considered to contain little information. Therefore, if the entropy is high, it means that the dataset $\mathbf{D}$ contains more information, as the value of each $X_n$ differs, leading to data diversity in $\mathbf{D}$.
It is important to note that the discrete distribution with the largest entropy is the uniform distribution. This, semantically, makes perfect sense: we know that in a uniform distribution, the probability that the value of the variable $X$ lies within the range $[a, b]$ is $\frac{1}{b-a}$, meaning the probability that $X$ takes any value is the same. Therefore, it is very difficult to predict the value of each $X$. Hence, for a random variable $x$ that can take $K$ values, the entropy of the distribution is maximized when the probability that $x = k$, i.e., $p(x=k) = \frac{1}{K}$. In this case, $\mathbb{H}(X) = \log_2 K$.
On the other hand, distributions with small entropy are those where the probability mass function concentrates on one or a few states. A distribution with minimal entropy (i.e., equal to 0) is any delta function that places all the probability on a single point.
Consider the example: a random variable $X$ with $p(X = a) = 1$, $p(X \neq a) = 0$. The distribution of $X$ can be described through the delta function as follows:
$$ P(X=a) = \delta(x-a) $$
This means that for any value other than $a$ that $X$ could take, the probability is 0 (note: $\delta(y) = 0$ when $y \neq 0$, this is the property of the delta function). Since $X$ can only take the value $a$, the entropy of the distribution of $X$ is:
$$ \mathbb{H}(X) = - P(X=a) \log_2(P(X=a)) $$
Since $P(X=a)=1$, and $\log_2(1) = 0$, we have $\mathbb{H}(X)=0$. When the entropy is minimal, the distribution has no uncertainty because we are certain that only one value exists for the random variables in that distribution.
Entropy is also similarly defined for continuous random variables and is called differential entropy:
$$ \mathbb{H}[X] = -\int p(X) \log_2 p(X) dX $$
Cross-entropy
Cross-entropy measures the loss when we use a distribution $q$ to represent the distribution $p$. When the cross-entropy is small, it means that the distributions $p$ and $q$ are closer to each other, more similar, compared to when the cross-entropy is large:
$$ \mathbb{H}(X) \triangleq p_k \log_2 q_k $$
Joint entropy
The joint entropy of two random variables $X$ and $Y$ is given by:
$$ \mathbb{H}(X, Y) = - \sum_{x,y} p(x,y) \log_2 p(x,y) $$
In the case of continuous variables (differential entropy):
$$ \mathbb{H}(X, Y) = - \int \int p(x,y) \log_2 p(x,y) , dx , dy $$
It is important to note that when $X$ and $Y$ are independent, the joint entropy $\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y)$. This is the upper bound of the joint entropy. Intuitively, this makes sense: when $X$ and $Y$ are correlated, they reduce the “degree of freedom” of the system, so the entropy is smaller than when $X$ and $Y$ are independent.
What about the lower bound of joint entropy? The lower bound appears when one of the variables is completely dependent on the other. This is the opposite case to the upper bound, where $X$ and $Y$ are completely independent.
It is easy to see that for $Y$ to be completely dependent on $X$, $Y$ must be a deterministic function of $X$, for example $y = 2x$. In this case, $\mathbb{H}(X, Y) = \mathbb{H}(X)$. Why? Because to describe both $X$ and $Y$, we can describe $Y$ from $X$, so we only need to know the uncertainty of $X$ to describe the joint entropy of $X$ and $Y$. Thus, the lower bound of joint entropy is:
$$ \mathbb{H}(X, Y) \geq \max{(\mathbb{H}(X), \mathbb{H}(Y))} \geq 0 $$
This is a demonstration of a reality in entropy: When we add more random variables to the system, the entropy of the system cannot decrease, it can only stay the same or increase. The reason is that, from the lower bound, we can see that the uncertainty of the overall system cannot be smaller than the uncertainty of any individual variable. Adding more “unknowns” does not make the system clearer, and to reduce the uncertainty of the system, we need to increase the amount of observed data.
Conditional entropy
In the case of the joint distribution $p(X,Y)$, the additional average information required to determine $Y$ when the value of $X$ is known is called conditional entropy, and is given by:
$$ \mathbb{H}[Y|X] = - \int p(X,Y) \log_2 p(Y|X) , dX , dY $$
Additionally, it can be easily seen that by using the product rule, conditional entropy satisfies the following relation:
$$ \begin{align*} \mathbb{H}[X,Y] &= - \int \int p(X,Y) \log_2 p(X,Y) , dX , dY \\ &= - \int \int p(X,Y) \log_2 \big(p(Y|X) p(X)\big) , dX , dY \\ &= - \int \int p(X,Y) \big( \log_2 p(Y|X) + \log_2 p(X) \big) , dX , dY \\ &= - \int \int p(X,Y) \log_2 p(Y|X) , dX , dY - \int \int p(X,Y) \log_2 p(X) , dX , dY \\ &= - \int \int p(X,Y) \log_2 p(Y|X) , dX , dY - \int \int p(X,Y) , dY \log_2 p(X) , dX \\ &= - \int \int p(X,Y) \log_2 p(Y|X) , dX , dY - \int p(X) \log_2 p(X) , dX \\ &= \mathbb{H}[Y|X] + \mathbb{H}[X] \end{align*} $$
This equation shows that the amount of information needed to describe both $X$ and $Y$ is given by the amount of information needed to describe $X$ alone, plus the additional information required to determine $Y$.
If $Y$ is a deterministic function of $X$, knowing $X$ will help us determine $Y$, so $\mathbb{H}(Y | X) = 0$ (there is no uncertainty about the output of $Y$). But if the two variables are independent, knowing $X$ does not provide any information to determine $Y$, so $\mathbb{H}(Y | X) = \mathbb{H}(Y)$.
Since $\mathbb{H}(X, Y) \leq \mathbb{H}(X) + \mathbb{H}(Y)$, and $\mathbb{H}(X, Y) = \mathbb{H}(Y | X) + \mathbb{H}(X)$, we have:
$$ \mathbb{H}(Y | X) \leq \mathbb{H}(Y) $$
Equality holds if and only if $X$ and $Y$ are independent. This means that, on average, knowing information about $X$ will reduce the uncertainty of $Y$, or at least the uncertainty of $Y$ will not increase. This is not always true, as there are cases where knowing information about $X$ increases the uncertainty of $Y$, i.e., $\mathbb{H}(Y | X) > \mathbb{H}(Y)$. However, we are speaking about the expected average case, where having more information is always beneficial in reducing uncertainty.
Relative entropy and mutual information
Consider a distribution $p(X)$ that is unknown, and we approximate it with a distribution $q(X)$. The average additional information required to determine the value of $X$ when using $q(X)$ instead of $p(X)$ is given by:
$$ \begin{align*} \text{KL}(p||q) &= - \int p(X) \log_2 q(X) , dX - \left( - \int p(X) \log_2 p(X) , dX \right) \\ &= - \int p(X) \left( \log_2 q(X) - \log_2 p(X) \right) , dX \\ &= - \int p(X) \log_2 \left( \frac{q(X)}{p(X)} \right) , dX. \end{align*} $$
This is called the relative entropy or Kullback-Leibler (KL) divergence between the distributions $p(X)$ and $q(X)$. Note that this is not a symmetric quantity, i.e., $\text{KL}(p||q) \neq \text{KL}(q||p)$. Also, the KL divergence satisfies $\text{KL}(p||q) \geq 0$, with equality occurring if and only if $p(X) = q(X)$. Therefore, we can interpret KL divergence as a measure of the difference between two distributions.
Let’s further analyze KL divergence. In the first line, we can rewrite it as:
$$ \begin{align*} \text{KL}(p||q) &= - \int p(X) \log_2 q(X) , dX - \left( - \int p(X) \log_2 p(X) , dX \right) \\ &= \mathbb{H}_{ce}(p, q) - \mathbb{H}(p) \end{align*} $$
We recognize that the first term is the cross-entropy (CE) between $p$ and $q$, and the second term is the entropy of $p$, which is the actual distribution we are trying to represent. The CE of $p$ and $q$ is the lower bound of the number of bits needed to compress data generated from the distribution $p$ if we are representing $p$ using $q$. Thus, we can interpret the KL divergence between $p$ and $q$ as the extra bits you need to use if you want to represent data from the true distribution $p$ with the distribution $q$ instead of using $p$.
Since optimal data compression is achieved when we know the true distribution, if we do not, we need to account for the additional information. Hence, there is an important relationship between data compression and density estimation.
Suppose we are trying to approximate the unknown distribution using a parametric distribution $q(X|\mathbf{\theta})$, controlled by a set of adjustable parameters $\mathbf{\theta}$, such as a multivariate Gaussian distribution. One way to determine $\mathbf{\theta}$ is to minimize the KL divergence between $p(X)$ and $q(X|\mathbf{\theta})$. This is not possible since we do not know $p(X)$. However, if we have observed a finite set of training points $X_n$ drawn from $p(X)$, then the expectation with respect to $p(X)$ can be approximated by a finite sum over these points:
$$ \text{KL}(p||q) \approx \frac{1}{N} \sum_{n=1}^{N} \left( - \log_2 q(X_n|\mathbf{\theta}) + \log_2 p(X_n) \right) $$
Since the second term does not depend on $\mathbf{\theta}$, minimizing the KL divergence approximately is equivalent to minimizing the log-likelihood function of $q(X_n|\mathbf{\theta})$ evaluated on the training set.
Mutual Information
Consider the joint distribution $p(X,Y)$ between two sets of variables $X$ and $Y$. If these sets are independent, then $p(X,Y) = p(X)p(Y)$. If the variables are not independent, we can check how “close to independence” they are by considering the Kullback-Leibler divergence between the joint distribution and the product of their marginal distributions, given by:
$$ \text{I}[X,Y] = \text{KL}(p(X,Y) || p(X)p(Y)) = -\int \int p(X,Y) \log_2 \left( \frac{p(X)p(Y)}{p(X,Y)} \right) , dX , dY $$
This is called the mutual information between the variables $X$ and $Y$. Mutual information satisfies $\text{I}[X,Y] \geq 0$, with equality occurring if and only if $X$ and $Y$ are independent. Furthermore, by using the sum and product rules of probability, we can show that mutual information is related to conditional entropy:
$$ \begin{align*} \text{I}[X,Y] &= - \int \int p(X,Y) \log_2 \left( \frac{p(X)p(Y)}{p(X,Y)} \right) , dX , dY \\ &= - \int \int p(X,Y) \log_2 \left( \frac{p(X)p(Y)}{p(X|Y)p(Y)} \right) , dX , dY \\ &= - \int \int p(X,Y) \log_2 \left( \frac{p(X)}{p(X|Y)} \right) , dX , dY \\ &= - \int \int p(X,Y) \left( \log_2 p(X) - \log_2 p(X|Y) \right) , dX , dY \\ &= - \int \int p(X,Y) \log_2 p(X) , dX , dY + \int \int p(X,Y) \log_2 p(X|Y) , dX , dY \\ &= - \int \int p(X,Y) , dY \log_2 p(X) , dX + \int \int p(X,Y) \log_2 p(X|Y) , dX , dY \\ &= - \int p(X) \log_2 p(X) , dX + \int \int p(X,Y) \log_2 p(X|Y) , dX , dY \\ &= \mathbb{H}[X] - \mathbb{H}[X|Y] = \mathbb{H}[Y] - \mathbb{H}[Y|X] \end{align*} $$