1. Quantifying the Uncertainty of Events
We are probably familiar with the classic examples of probability: two boxes, one green and one red, each containing balls of different colors. From these we might calculate the probability of drawing three red balls in a row from the green box, or of drawing a yellow ball from the red box. In such examples, we view probability in terms of the frequencies of random, repeatable events.
Now consider an uncertain event such as whether the moon was once in its own orbit around the sun, or whether the Arctic ice will disappear by the end of this century. These are not events that can be repeated many times in order to define a notion of probability, as we did with the boxes of balls. Nevertheless, we usually have some initial idea, for example of how quickly the Arctic ice is melting. If we obtain new evidence, for instance from a newly launched Earth observation satellite gathering new diagnostic data, we can revise our view of the rate of ice loss. Our assessment of such matters influences the actions we take, such as the extent to which we try to reduce greenhouse gas emissions.
In such situations, we want to quantify our uncertainty, revise it precisely in light of new evidence, and then take optimal actions or decisions as a consequence. All of this can be achieved through the Bayesian interpretation of probability.
2. Bayesian Inference
The term “inference” refers to the “act of moving from sample data to generalizations, typically with a calculated degree of certainty.”
The term “Bayesian” refers to methods of inference that express “degree of certainty” using probability theory and use Bayes’ rule to update that degree of certainty in light of the given data.
Bayes’ rule is very simple: it is a formula used to calculate the probability distribution of possible values of an unknown (or hidden) quantity $ \mathbf{X} $ based on some observed data $ Y = y $.
$$ p(\mathbf{X}=x|Y=y) = \frac{p(Y=y|\mathbf{X}=x)p(\mathbf{X}=x)}{p(Y=y)} \tag{2.1} $$
which follows from the product rule: $ p(x,y) = p(x|y)p(y) = p(y|x)p(x) $. In expression $ (2.1) $, $ p(\mathbf{X}=x) $ is the prior distribution: our belief about $ \mathbf{X} $ before observing the data. $ p(Y=y|\mathbf{X}=x) $ is the probability of observing the value $ Y = y $ given $ \mathbf{X} = x $; viewed as a distribution over $ y $, it is called the observation distribution. When it is evaluated at the actual observed value $ y $ and viewed as a function of $ x $, it is called the likelihood function. The likelihood measures how well each candidate value $ x $ explains the observed data $ y $. Note that it is a function of $ x $, not a probability distribution over $ x $, so it need not integrate to one. In Bayesian statistics, the likelihood is used to update the prior distribution into the posterior distribution $ p(\mathbf{X}=x|Y=y) $, which reflects our knowledge about $ \mathbf{X} $ after observing the data.
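By multiplying the prior distribution $ p(\mathbf{X}=x) $ by the likelihood $ p(Y=y|\mathbf{X}=x) $ for each $ x $, we obtain the joint distribution $ p(\mathbf{X}=x,Y=y) $, which, with $ y $ held fixed, is the unnormalized posterior over $ x $. To normalize it, we divide by the marginal likelihood $ p(Y=y) $, which is obtained by integrating the joint distribution over all possible values of $ x $: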
$$ p(Y=y) = \int_{x \in \mathbf{X}} p(Y=y|\mathbf{X}=x)p(\mathbf{X}=x)dx $$
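As a small illustration of expression $ (2.1) $, the sketch below applies Bayes' rule to a hidden quantity that takes only two values, so the integral in the marginal likelihood becomes a sum. The scenario and all numbers (a 1% prior and the two likelihood values) are hypothetical and chosen only for illustration, and the function name `posterior` is an assumption of this sketch.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule (2.1) to a discrete hidden quantity X.

    prior[x]      = p(X = x)
    likelihood[x] = p(Y = y | X = x), evaluated at the single observed y
    Returns the posterior p(X = x | Y = y) and the marginal likelihood p(Y = y).
    """
    # Prior times likelihood for each x: the unnormalized posterior.
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # Marginal likelihood p(Y = y): sum over x (discrete analogue of the integral).
    evidence = sum(unnormalized)
    # Dividing by the evidence makes the posterior sum to one.
    return [u / evidence for u in unnormalized], evidence


# Hypothetical numbers: X = 0 or 1, and we observed Y = y.
prior = [0.99, 0.01]        # p(X = 0), p(X = 1)
likelihood = [0.05, 0.90]   # p(Y = y | X = 0), p(Y = y | X = 1)

post, evidence = posterior(prior, likelihood)
print("posterior:", post)
print("marginal likelihood:", evidence)
```

Even though the observation is far more likely under $ \mathbf{X} = 1 $, the small prior keeps the posterior probability of $ \mathbf{X} = 1 $ well below one half, which shows how the prior and the likelihood combine.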
3. Marginal Probability
In Bayesian statistics, marginal probability (also called the marginal likelihood) is an important component used to compare models and estimate parameters. It is the probability of the observed data under a specific statistical model, obtained by integrating over the parameter space of the model. It can be understood as the probability of the data given the model as a whole, with the parameters averaged out, and is therefore often referred to as model evidence, or simply evidence.
Since marginal probability is calculated by integrating over the entire space of possible parameter values, it does not depend on any specific parameter value. It does depend on the dimensionality and size of the parameter space, as well as on the model and the prior.
Mathematically, marginal probability is represented as:
$$ p(\mathbf{X}|\alpha) = \int_\theta p(\mathbf{X}|\theta)p(\theta|\alpha)d\theta $$
where $ \mathbf{X} $ is the dataset $ \mathbf{X} = (x_1, x_2, \ldots, x_n) $ (note that in this section $ \mathbf{X} $ denotes the observed data, not the hidden quantity of section 2), each $ x_i \sim p(x|\theta) $ is drawn from a distribution parameterized by $ \theta $, and $ \theta $ is itself a random variable with distribution $ \theta \sim p(\theta|\alpha) $. The marginal probability $ p(\mathbf{X}|\alpha) $ is the probability of the data once $ \theta $ has been integrated out. Integrating out means that we compute $ p(\mathbf{X}|\alpha) $ by integrating over the entire space of values of $ \theta $, so that $ \theta $ does not appear in the final result.
During the integration, $ \theta $ is only an intermediate variable used to compute the value of $ p(\mathbf{X} \mid \alpha) $. After the integration, the result no longer contains $ \theta $ and therefore does not depend on any particular value of it. This is because the integral averages the likelihood $ p(\mathbf{X}|\theta) $ over all values of $ \theta $, weighted by the prior $ p(\theta|\alpha) $, into a single overall value: the probability of the data under the given model.
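To make "integrating out $ \theta $" concrete, the following sketch computes the marginal probability numerically for one simple, assumed model: each $ x_i $ is a coin flip with $ p(x_i = 1|\theta) = \theta $, and the prior $ p(\theta|\alpha) $ is a Beta distribution with hyperparameters $ \alpha = (a, b) $. This model, the data, and the function names are illustrative assumptions, not part of the text above. For this particular model the integral also has a closed form, which serves as a check on the numerical result.

```python
import math


def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def marginal_numeric(data, a, b, steps=20000):
    """Approximate p(X | alpha) = integral of p(X | theta) p(theta | alpha) dtheta
    with the midpoint rule on theta in (0, 1)."""
    k, n = sum(data), len(data)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h  # midpoint of the i-th sub-interval
        likelihood = theta ** k * (1.0 - theta) ** (n - k)                       # p(X | theta)
        prior = theta ** (a - 1) * (1.0 - theta) ** (b - 1) / beta_fn(a, b)      # p(theta | alpha)
        total += likelihood * prior * h
    return total  # theta no longer appears in the result


def marginal_exact(data, a, b):
    """Closed form for the Beta-Bernoulli model: B(a + k, b + n - k) / B(a, b)."""
    k, n = sum(data), len(data)
    return beta_fn(a + k, b + n - k) / beta_fn(a, b)


data = [1, 1, 0]  # hypothetical observations: two ones, one zero
numeric = marginal_numeric(data, a=2.0, b=2.0)
exact = marginal_exact(data, a=2.0, b=2.0)
print("numeric:", numeric)
print("exact:  ", exact)
```

Both functions take only the data and the hyperparameters $ (a, b) $ as input; $ \theta $ exists only inside the loop, which mirrors the statement that the result depends on the model and the prior but not on any specific parameter value.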